The groundbreaking capabilities of transformer models have created an explosion of alternatives. It can be daunting to sift through all the research papers to understand how the versions differ and what they can do.
This post will cover DistilBERT, a “distilled” version of BERT, one of the leading transformer models in use today. DistilBERT is a smaller, faster version of BERT that retains about 95% of BERT’s performance while running 60% faster. That speedup, together with the smaller size, means it can run on less powerful hardware.
This year, Apple announced that they would allow developers to use transformers on Apple devices. But if we want powerful transformer models to run on our phones and watches, we need to make them smaller. The largest transformer models available today require clusters of some of the most powerful hardware in the world – not well suited for running on your phone. As we find ways to create even smaller models, we can run more powerful models on our phones, watches, and other hardware-constrained environments.
To understand DistilBERT, we first need to understand what distillation is. Distillation is a family of techniques that allow us to make models smaller. Smaller models are generally faster, less expensive, and less energy-intensive than larger models, but the tradeoff is often decreased accuracy.
In this case, DistilBERT has 40% fewer parameters than BERT but retains 95% of BERT’s performance in language understanding. This was achieved by having BERT “teach” a smaller version of itself. The strategy is effective because a trained model can provide more information per example than a typical labeled dataset can.
For example, let’s say we have a dataset with one label per row of data, and we’re trying to get the model to predict that label. When training the first model, we can only tell the model “yes” or “no,” but when training a new model using an already trained model, we can also tell it “almost.” Technically, this is done by training on the probability distribution over labels produced by the teacher model rather than on the single label from the dataset.
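To make the “almost” idea concrete, here is a minimal sketch in plain Python that compares a hard dataset label with a teacher's soft distribution. The three labels, the logit values, and the `temperature` parameter (a common way to soften the teacher's distribution in distillation) are assumptions chosen for illustration; they are not taken from this post or from the actual DistilBERT training setup.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature gives a flatter distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy between a target distribution and the student's prediction."""
    return -sum(t * math.log(p) for t, p in zip(target_probs, predicted_probs) if t > 0)

# Hypothetical logits for three labels: ["positive", "neutral", "negative"]
teacher_logits = [4.0, 3.0, -2.0]
student_logits = [2.0, 0.5, 0.0]

# The dataset can only say "yes" to one label and "no" to the rest.
hard_label = [1.0, 0.0, 0.0]

# The teacher also says that "neutral" is almost right and "negative" is far off.
soft_label = softmax(teacher_logits, temperature=2.0)

# Loss against the dataset label vs. loss against the teacher's distribution.
hard_loss = cross_entropy(hard_label, softmax(student_logits))
soft_loss = cross_entropy(soft_label, softmax(student_logits, temperature=2.0))

print("teacher soft label:", [round(p, 3) for p in soft_label])
print("hard loss:", round(hard_loss, 3), "soft loss:", round(soft_loss, 3))
```

The soft label gives the student a training signal for every class on every example, which is the extra information the hard label lacks. In practice the two losses are often combined, but the weighting is a design choice this sketch does not cover.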
Additionally, a large portion of the complexity of many models is due to relatively rare edge cases. The output of a pre-trained model is often “smoother” than the real world, making it easier for the small model to learn. As a side benefit, the distilled model is in some cases more robust, because it focuses on the more meaningful signals rather than on potential noise in the real-world dataset.
Training a distilled model can also be greatly accelerated by reusing the original model’s weights. DistilBERT reduces its parameter count by removing half of BERT’s layers. Rather than training the remaining layers from scratch, we can initialize them with the corresponding values from BERT. This starting point is not perfect, but it is much better than a random one.
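The initialization step can be sketched without any deep learning library. In the sketch below, each “layer” is only a dictionary of made-up numbers, and the helper name `init_student_from_teacher` is hypothetical. Keeping every second layer is an assumption for illustration; the post only says that half of the layers are removed.

```python
import copy

def init_student_from_teacher(teacher_layers, keep_every=2):
    """Copy every `keep_every`-th teacher layer into a new, shallower student."""
    return [copy.deepcopy(layer) for layer in teacher_layers[::keep_every]]

# Stand-in for a 12-layer teacher: each "layer" is a dict of made-up weights.
teacher_layers = [{"index": i, "weights": [0.1 * i, 0.2 * i]} for i in range(12)]

# The student starts with half as many layers, each holding a copy of teacher values.
student_layers = init_student_from_teacher(teacher_layers)
kept = [layer["index"] for layer in student_layers]

print("teacher layers:", len(teacher_layers))
print("student layers:", len(student_layers), "copied from teacher layers", kept)
```

The copies are deep copies, so later training of the student would not change the teacher's values. The student would then be trained against the teacher's outputs, starting from these inherited weights rather than from random ones.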
DistilBERT, like BERT, is a bidirectional model. It uses the text on both sides of a word to build its representation of that word. This makes it poorly suited to generation tasks, where the following words are not available because they have not been generated yet. Instead, DistilBERT and BERT are better suited to other NLP tasks like classification or named entity recognition.
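One way to picture the difference is to write down which positions each token is allowed to look at. The following is a simplified sketch of that idea, not the attention code of either model; the function name `visible_positions` and the example sentence are made up for illustration.

```python
def visible_positions(num_tokens, bidirectional):
    """Return a matrix where row i marks the positions token i may look at (1) or not (0)."""
    return [
        [1 if bidirectional or j <= i else 0 for j in range(num_tokens)]
        for i in range(num_tokens)
    ]

tokens = ["the", "cat", "sat", "down"]

# Bidirectional (BERT-style): every token sees the whole sentence.
bidirectional_view = visible_positions(len(tokens), bidirectional=True)

# Left-to-right (generation-style): a token sees only itself and earlier tokens.
left_to_right_view = visible_positions(len(tokens), bidirectional=False)

for token, both, left in zip(tokens, bidirectional_view, left_to_right_view):
    print(f"{token:>5}  bidirectional={both}  left_to_right={left}")
```

In the left-to-right view, the first token sees nothing after it, which matches how text is generated one word at a time. A bidirectional model expects the full row of context, which is available for tasks like classification but not during generation.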
Note that the linked model is uncased, which means that “DistilBERT” and “distilbert” are treated as the same word by the model. Intuitively, it seems like this might hurt performance when dealing with proper nouns (for example, “Joy” could be someone’s name, and “joy” could be an emotion), but it turns out that the performance is almost exactly the same.
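The effect of an uncased model can be illustrated with a few lines of Python. This is a deliberately simplified stand-in that only lowercases its input. A real uncased tokenizer performs further normalization and splits text into subword pieces, which this sketch does not attempt, and the function name `uncased` is hypothetical.

```python
def uncased(text):
    """Lowercase text, a simplified stand-in for an uncased tokenizer's normalization."""
    return text.lower()

# Both spellings collapse to the same string before the model ever sees them.
same_model_name = uncased("DistilBERT") == uncased("distilbert")

# The name "Joy" and the emotion "joy" become indistinguishable by spelling alone.
name_vs_emotion = uncased("Joy") == uncased("joy")

print(uncased("DistilBERT"), same_model_name, name_vs_emotion)
```

Once casing is gone, the model has to rely on the surrounding words to tell a name from an emotion.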