We are excited to present "Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models," a pioneering study of trustworthiness in LLMs during pre-training. We explore five key dimensions of trustworthiness: reliability, privacy, toxicity, fairness, and robustness. By applying linear probing to LLMs' pre-training checkpoints and extracting steering vectors from them, the study aims to uncover the potential of pre-training for enhancing LLMs' trustworthiness. Furthermore, we investigate the dynamics of trustworthiness during pre-training through mutual information estimation, observing a two-phase phenomenon: fitting and compression. Our findings unveil new insights and encourage further work on improving the trustworthiness of LLMs from an early stage.
Constructing steering vectors from the pre-training checkpoints and intervening in the SFT model
Performance of various models across four general capabilities and five trustworthiness capabilities
In this work, we take an initial and illuminating step toward understanding how trustworthiness-related concepts emerge during pre-training.
First, by linearly probing LLMs across reliability, privacy, toxicity, fairness, and robustness, we investigate the ability of LLM representations to discern opposing concepts within each trustworthiness dimension throughout the pre-training period.
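To make the probing setup concrete, below is a minimal sketch of fitting a linear probe on hidden states extracted from a single layer of one checkpoint. The `activations`/`labels` arrays, the scikit-learn classifier, and the train/test split are our illustrative choices, not the paper's exact pipeline.

```python
# Minimal linear-probing sketch (illustrative, not the authors' exact code).
# Assumes `activations` is an (n_samples, hidden_dim) array of hidden states
# from one layer of a pre-training checkpoint, and `labels` marks which side
# of a trustworthiness concept each sample expresses (e.g., truthful = 1 vs.
# untruthful = 0 for the reliability dimension).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(activations: np.ndarray, labels: np.ndarray) -> float:
    """Fit a linear probe and report held-out accuracy.

    High accuracy suggests the checkpoint's representations already
    linearly separate the two opposing concepts.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        activations, labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)
    return probe.score(X_test, y_test)
```

Repeating this for every checkpoint and layer traces how the linear separability of each trustworthiness concept evolves over pre-training.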
Furthermore, motivated by the probing results, we conduct extensive experiments revealing that representations from an LLM's earlier pre-training checkpoints can be used to enhance the LLM's own trustworthiness.
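As an illustration of one common recipe for this idea, the sketch below builds a steering vector as the difference of mean activations between concept-positive and concept-negative inputs, then shifts an SFT model's hidden states along it via a forward hook. The LLaMA-style `model.model.layers` path, the scaling factor `alpha`, and all names are assumptions for illustration; the paper's exact construction may differ.

```python
import torch

@torch.no_grad()
def build_steering_vector(pos_acts: torch.Tensor,
                          neg_acts: torch.Tensor) -> torch.Tensor:
    """pos_acts / neg_acts: (n, hidden_dim) activations collected at one
    layer of a pre-training checkpoint, for inputs expressing the positive
    vs. negative side of a trustworthiness concept."""
    return pos_acts.mean(dim=0) - neg_acts.mean(dim=0)

def add_steering_hook(model, layer_idx: int, vector: torch.Tensor,
                      alpha: float = 1.0):
    """Register a forward hook that shifts the SFT model's hidden states
    along `vector` at inference time (the intervention step)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * vector.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    # LLaMA-style layer path; adjust for other architectures (assumption).
    return model.model.layers[layer_idx].register_forward_hook(hook)
```

A typical usage pattern would be to keep the returned handle, generate with the steered model, and call `handle.remove()` to restore the original behavior.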
Finally, we use mutual information to probe LLMs during pre-training, revealing similarities between the learning mechanisms of LLMs and traditional DNNs.
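For readers unfamiliar with how such estimates are computed, here is a simple binning-based estimator in the spirit of classic information-plane analyses of DNNs; the paper's estimator may differ, and the bin count and discretization scheme below are illustrative.

```python
# Illustrative binning estimator for mutual information between discretized
# hidden representations T and discrete targets Y.
import numpy as np

def discretize(acts: np.ndarray, n_bins: int = 30) -> np.ndarray:
    """Bin each activation dimension, then map each binned row to an id."""
    edges = np.linspace(acts.min(), acts.max(), n_bins + 1)
    binned = np.digitize(acts, edges)
    _, ids = np.unique(binned, axis=0, return_inverse=True)
    return ids

def mutual_information(t_ids: np.ndarray, y: np.ndarray) -> float:
    """I(T; Y) = H(T) + H(Y) - H(T, Y) from empirical frequencies."""
    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())
    joint = t_ids.astype(np.int64) * (int(y.max()) + 1) + y
    return entropy(t_ids) + entropy(y) - entropy(joint)
```

The same estimator applies to I(T; X) by passing discretized input ids in place of `y`; tracking I(T; X) and I(T; Y) across checkpoints is one way to surface the fitting phase (both rise) followed by compression (I(T; X) falls).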
Taken together, the empirical study presented in this work not only demonstrates the potential of improving the trustworthiness of LLMs using their own pre-training checkpoints but also deepens our understanding of the dynamics of LLM representations, especially for trustworthiness-related concepts.
@article{qian2024towards,
title={Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models},
author={Qian, Chen and Zhang, Jie and Yao, Wei and Liu, Dongrui and Yin, Zhenfei and Qiao, Yu and Liu, Yong and Shao, Jing},
journal={arXiv preprint arXiv:2402.19465},
year={2024}
}