Investigating Transferability in Pretrained Language Models

Alex Tamkin, Trisha Singh, Davide Giovanardi, Noah Goodman

While probing is a common technique for identifying knowledge in the representations of pretrained models, it is unclear whether this technique can explain the downstream success of models like BERT, which are trained end-to-end during finetuning. To address this question, we compare probing with a different measure of transferability: the decrease in finetuning performance of a partially reinitialized model. This technique reveals that in BERT, layers with high probing accuracy on downstream GLUE tasks are neither necessary nor sufficient for high accuracy on those tasks. In addition, dataset size impacts layer transferability: the less finetuning data one has, the more important the middle and later layers of BERT become. Furthermore, BERT does not simply find a better initializer for individual layers; instead, interactions between layers matter, and reordering BERT's layers prior to finetuning significantly harms evaluation metrics. These results provide a way of understanding the transferability of parameters in pretrained language models, revealing the fluidity and complexity of transfer learning in these models.
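To make the two interventions named in the abstract concrete, here is a minimal toy sketch of partial reinitialization and layer reordering. It is not the authors' code: the model is a plain list of parameter lists rather than BERT, and the names `partially_reinitialize`, `shuffle_layers`, `keep`, and `scale` are hypothetical, as is the choice of a Gaussian initializer. In an actual experiment, each modified model would be finetuned and its score compared against the fully pretrained baseline.

```python
import random


def partially_reinitialize(layers, keep, seed=0, scale=0.02):
    """Return a copy of `layers` in which every layer whose index is not
    in `keep` has its parameters redrawn from a zero-mean Gaussian.
    The Gaussian and its `scale` are assumptions made for illustration."""
    rng = random.Random(seed)
    result = []
    for index, params in enumerate(layers):
        if index in keep:
            # Keep the pretrained parameters for this layer.
            result.append(list(params))
        else:
            # Discard the pretrained parameters and draw fresh ones.
            result.append([rng.gauss(0.0, scale) for _ in params])
    return result


def shuffle_layers(layers, seed=0):
    """Return a copy of `layers` in a random order, plus the permutation used."""
    rng = random.Random(seed)
    order = list(range(len(layers)))
    rng.shuffle(order)
    return [list(layers[i]) for i in order], order


# Toy stand-in for a pretrained 12-layer stack: layer i holds the value i + 1.
pretrained = [[float(i + 1)] * 4 for i in range(12)]

# Keep the first six layers, reinitialize the rest.
partial = partially_reinitialize(pretrained, keep=set(range(6)))

# Keep every layer's parameters but scramble their order.
reordered, order = shuffle_layers(pretrained)
```

Both helpers return copies, so `pretrained` stays available as the baseline for comparison.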
