We are excited to present "Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models," a pioneering study of trustworthiness in LLMs during pre-training. We explore five key dimensions of trustworthiness: reliability, privacy, toxicity, fairness, and robustness. By applying linear probing to LLMs' pre-training checkpoints and extracting steering vectors from them, the study aims to uncover the potential of pre-training for enhancing LLMs' trustworthiness. Furthermore, we investigate the dynamics of trustworthiness during pre-training through mutual information estimation, observing a two-phase phenomenon: fitting and compression. Our findings unveil new insights and encourage further work on improving the trustworthiness of LLMs from an early stage.
Constructing steering vectors from the pre-training checkpoints and intervening in the SFT model
Performance of various models across four general capabilities and five trustworthiness capabilities
In this work, we take an initial and illuminating step toward understanding how trustworthiness-related concepts emerge during pre-training.
First, by linearly probing LLMs across reliability, privacy, toxicity, fairness, and robustness, we investigate the ability of LLM representations to discern opposing concepts within each trustworthiness dimension throughout the pre-training period.
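To make the probing setup concrete, below is a minimal sketch of fitting a linear probe on hidden states extracted from a single layer of one checkpoint. The `activations`/`labels` arrays, the scikit-learn classifier, and the train/test split are our illustrative choices, not the paper's exact pipeline.

```python
# Minimal linear-probing sketch (illustrative, not the authors' exact code).
# Assumes `activations` is an (n_samples, hidden_dim) array of hidden states
# from one layer of a pre-training checkpoint, and `labels` marks which side
# of a trustworthiness concept each sample expresses (e.g., truthful = 1 vs.
# untruthful = 0 for the reliability dimension).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(activations: np.ndarray, labels: np.ndarray) -> float:
    """Fit a linear probe and report held-out accuracy.

    High accuracy suggests the checkpoint's representations already
    linearly separate the two opposing concepts.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        activations, labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)
    return probe.score(X_test, y_test)
```

Repeating this for every checkpoint and layer traces how the linear separability of each trustworthiness concept evolves over pre-training.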
Furthermore, motivated by the probing results, we conduct extensive experiments revealing that representations from an LLM's earlier pre-training checkpoints can be used to enhance the LLM's own trustworthiness.
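As an illustration of one common recipe for this idea, the sketch below builds a steering vector as the difference of mean activations between concept-positive and concept-negative inputs, then shifts an SFT model's hidden states along it via a forward hook. The LLaMA-style `model.model.layers` path, the scaling factor `alpha`, and all names are assumptions for illustration; the paper's exact construction may differ.

```python
import torch

@torch.no_grad()
def build_steering_vector(pos_acts: torch.Tensor,
                          neg_acts: torch.Tensor) -> torch.Tensor:
    """pos_acts / neg_acts: (n, hidden_dim) activations collected at one
    layer of a pre-training checkpoint, for inputs expressing the positive
    vs. negative side of a trustworthiness concept."""
    return pos_acts.mean(dim=0) - neg_acts.mean(dim=0)

def add_steering_hook(model, layer_idx: int, vector: torch.Tensor,
                      alpha: float = 1.0):
    """Register a forward hook that shifts the SFT model's hidden states
    along `vector` at inference time (the intervention step)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * vector.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    # LLaMA-style layer path; adjust for other architectures (assumption).
    return model.model.layers[layer_idx].register_forward_hook(hook)
```

A typical usage pattern would be to keep the returned handle, generate with the steered model, and call `handle.remove()` to restore the original behavior.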
Finally, we use mutual information to probe LLMs during pre-training, revealing similarities between the learning mechanisms of LLMs and traditional DNNs.
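For readers unfamiliar with how such estimates are computed, here is a simple binning-based estimator in the spirit of classic information-plane analyses of DNNs; the paper's estimator may differ, and the bin count and discretization scheme below are illustrative.

```python
# Illustrative binning estimator for mutual information between discretized
# hidden representations T and discrete targets Y.
import numpy as np

def discretize(acts: np.ndarray, n_bins: int = 30) -> np.ndarray:
    """Bin each activation dimension, then map each binned row to an id."""
    edges = np.linspace(acts.min(), acts.max(), n_bins + 1)
    binned = np.digitize(acts, edges)
    _, ids = np.unique(binned, axis=0, return_inverse=True)
    return ids

def mutual_information(t_ids: np.ndarray, y: np.ndarray) -> float:
    """I(T; Y) = H(T) + H(Y) - H(T, Y) from empirical frequencies."""
    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())
    joint = t_ids.astype(np.int64) * (int(y.max()) + 1) + y
    return entropy(t_ids) + entropy(y) - entropy(joint)
```

The same estimator applies to I(T; X) by passing discretized input ids in place of `y`; tracking I(T; X) and I(T; Y) across checkpoints is one way to surface the fitting phase (both rise) followed by compression (I(T; X) falls).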
Taken together, the empirical study presented in this work not only demonstrates the potential of improving the trustworthiness of LLMs using their own pre-training checkpoints but also deepens our understanding of the dynamics of LLM representations, especially for trustworthiness-related concepts.
@article{qian2024towards,
title={Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models},
author={Qian, Chen and Zhang, Jie and Yao, Wei and Liu, Dongrui and Yin, Zhenfei and Qiao, Yu and Liu, Yong and Shao, Jing},
journal={arXiv preprint arXiv:2402.19465},
year={2024}
}