Initially, a variant of the LSTM known as the AWD-LSTM was pre-trained (unsupervised pre-training) on a language-modelling task using Wikipedia articles. In the next step, the output layer was replaced with a classifier head and the model was fine-tuned on various datasets such as IMDB and Yelp. When the model was tested on unseen data, state-of-the-art results were obtained. The paper further claimed that fine-tuning this pre-trained model (transfer learning) on only 100 rows would give much better results than building a model from scratch on 10,000 rows. The only thing to keep in mind is that they did not use a Transformer in their architecture. Both concepts (Transformers and transfer learning) were being researched in parallel, so the researchers on each side had no idea what the other was working on. The Transformers paper came out in 2017, and the ULMFiT paper (transfer learning) came out in early 2018.
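As a rough illustration of this two-stage workflow, here is a minimal sketch using the fastai library, which bundles the Wikipedia-pretrained AWD-LSTM from ULMFiT. The dataset choice, number of epochs and other hyperparameters below are my own assumptions for demonstration, not the exact recipe from the paper.

```python
# Sketch of the ULMFiT-style transfer-learning workflow with fastai
# (fastai ships an AWD-LSTM pretrained on Wikipedia / Wikitext-103).
from fastai.text.all import *

path = untar_data(URLs.IMDB)  # IMDB movie-review dataset

# Stage 1: fine-tune the pretrained language model on the target corpus
dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)
learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=accuracy)
learn_lm.fine_tune(1)                 # epoch count chosen only for illustration
learn_lm.save_encoder('ft_encoder')   # keep the fine-tuned encoder weights

# Stage 2: swap the LM head for a classifier head and fine-tune on labelled data
dls_clas = TextDataLoaders.from_folder(path, valid='test', text_vocab=dls_lm.vocab)
learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
learn_clas.load_encoder('ft_encoder') # reuse the language-model encoder
learn_clas.fine_tune(4)
learn_clas.show_results()             # predictions on held-out reviews
```

The key point the sketch captures is that the encoder learned during language modelling is reused as-is, and only a small classifier head is trained (then gradually unfrozen) on the labelled data.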
Now, architecture-wise we had a state-of-the-art architecture, the Transformer, and training-wise we had the beautiful and elegant concept of transfer learning. LLMs were the outcome of combining these two ideas.