Transformer Model DistilBERT: All You Need to Know

By Peter Klipfel

July 19, 2022   ·   5 min read

The groundbreaking capabilities of transformer models have created an explosion of alternatives. It can be daunting to sift through all the research papers to understand how the versions differ and what they can do.

This post covers DistilBERT, a “distilled” version of BERT, one of the leading transformer models in use today. DistilBERT is a smaller, faster version of BERT that retains about 95% of BERT’s performance while running 60% faster. This speed improvement means that it can run on less powerful hardware.

This year, Apple announced that they would allow developers to use transformers on Apple devices. But if we want powerful transformer models to run on our phones and watches, we need to make them smaller. The largest transformer models available today require clusters of some of the most powerful hardware in the world – not well suited for running on your phone. As we find ways to create even smaller models, we can run more powerful models on our phones, watches, and other hardware-constrained environments.

Understanding Distillation

To understand DistilBERT, we first need to understand what distillation is. Distillation is a family of techniques that allow us to make models smaller. Smaller models are generally faster, less expensive, and less energy-intensive than larger models, but the tradeoff is often decreased accuracy.

In this case, DistilBERT has 40% fewer parameters than BERT but retains 95% of BERT’s performance in language understanding. This was achieved by having BERT “teach” a smaller version of itself. The strategy is effective because a trained model can provide more information per example than an ordinary labeled dataset can.
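To put the 40% figure in concrete terms, here is a small arithmetic sketch. The parameter counts are the approximate figures commonly reported for BERT-base and DistilBERT; they are an assumption here, not numbers stated in this post.

```python
# Approximate parameter counts (assumed figures, not taken from this post).
BERT_BASE_PARAMS = 110_000_000
DISTILBERT_PARAMS = 66_000_000


def fraction_removed(teacher_params: int, student_params: int) -> float:
    """Return the fraction of the teacher's parameters that the student drops."""
    return 1 - student_params / teacher_params


reduction = fraction_removed(BERT_BASE_PARAMS, DISTILBERT_PARAMS)
print(f"Parameters removed: {reduction:.0%}")
```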

[Figure: Transformer models data]

For example, let’s say we have a dataset with one label per row of data, and we’re trying to get the model to predict that label. When training the first model, we can only tell the model “yes” or “no,” but when training a new model using an already trained model, we can also tell it “almost.” Technically, this is done by training on the probability distribution over labels produced by the teacher model rather than on a single label from the dataset.
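The difference between a single label and a teacher’s probability distribution can be sketched in a few lines of plain Python. The class names, the logits, and the temperature value below are all made up for illustration, and the sketch shows only the soft-target part of a distillation loss; the objective actually used to train DistilBERT combines more terms than this.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(
        t * math.log(p) for t, p in zip(target_probs, predicted_probs) if t > 0
    )


# Hypothetical three-class example: (positive, neutral, negative).
hard_label = [1.0, 0.0, 0.0]       # the dataset only says "positive"
teacher_logits = [3.0, 2.0, -1.0]  # the teacher thinks "neutral" is almost right
student_logits = [1.0, 0.5, 0.2]   # the student's current guess

temperature = 2.0  # assumed value; softens both distributions
soft_targets = softmax(teacher_logits, temperature)
student_probs = softmax(student_logits, temperature)

# Training on the dataset label: only "yes"/"no" information.
hard_loss = cross_entropy(hard_label, softmax(student_logits))
# Training on the teacher's distribution: "almost" is visible too.
soft_loss = cross_entropy(soft_targets, student_probs)

print("soft targets:", [round(p, 3) for p in soft_targets])
print("hard-label loss:", round(hard_loss, 3))
print("soft-target loss:", round(soft_loss, 3))
```

Note that the soft targets assign a substantial probability to the second class, which is exactly the “almost” signal a one-hot label cannot carry.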

Additionally, a large portion of the complexity of many models is due to relatively rare edge cases. The output of a pre-trained model is often “smoother” than the real world, making it easier for the small model to learn. A side benefit is that the distilled model is sometimes more robust, because it focuses on more meaningful signals rather than on potential noise in the real-world dataset.

Training a distilled model can also be accelerated considerably by making further use of the original model. DistilBERT reduces its parameter count by removing half of BERT’s layers. Rather than training the remaining layers from scratch, they can be initialized with the corresponding values from BERT. This is not perfect, but it is much better than starting from random weights.
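The initialization idea can be sketched as follows. The layers here are stand-in dictionaries rather than real model weights, and keeping every second layer is an assumption made for illustration; the post only says that half of the layers are removed.

```python
import copy


def init_student_from_teacher(teacher_layers, keep_every=2):
    """Copy every `keep_every`-th teacher layer as the student's starting weights."""
    return [copy.deepcopy(layer) for layer in teacher_layers[::keep_every]]


# Hypothetical stand-in for a 12-layer teacher: each "layer" is a dict of weights.
teacher_layers = [
    {"name": f"layer_{i}", "weights": [0.1 * i, 0.2 * i]} for i in range(12)
]

student_layers = init_student_from_teacher(teacher_layers)

print(len(student_layers))
print([layer["name"] for layer in student_layers])
```

The deep copy matters: the student’s weights start out equal to the teacher’s but must be free to change during training without altering the teacher.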

A few additional things to note

DistilBERT, like BERT, is a bidirectional model. It uses all of the text in a sentence, both before and after a given position, to determine which word belongs there. This makes it poorly suited to generation tasks, because during generation the following words do not exist yet. Instead, DistilBERT and BERT are better suited to other NLP tasks such as classification or named entity recognition.
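One way to picture the difference is through attention masks, which record the positions each word is allowed to look at. The sketch below is a simplified illustration with hypothetical helper names, not code from any particular library.

```python
def bidirectional_mask(length):
    """Every position may attend to every other position (BERT-style)."""
    return [[1] * length for _ in range(length)]


def causal_mask(length):
    """Position i may attend only to positions 0..i (generation-style)."""
    return [[1 if j <= i else 0 for j in range(length)] for i in range(length)]


tokens = ["the", "cat", "sat", "down"]
masks = {
    "bidirectional": bidirectional_mask(len(tokens)),
    "causal": causal_mask(len(tokens)),
}

for name, mask in masks.items():
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>5}: {row}")
```

In the bidirectional mask, “the” can already see “down”, which is only possible when the whole sentence is available up front. A generative model has to work with the causal mask, because later words have not been produced yet.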

One thing to note about the linked model is that it is uncased, which means that “DistilBERT” and “distilbert” are treated as the same word by the model. Intuitively, it seems like this might affect performance when dealing with proper nouns – for example, “Joy” could be someone’s name, and “joy” could be an emotion – but it turns out that the performance is almost exactly the same.
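As a rough sketch of what “uncased” means in practice, the text is normalized before the model ever sees it. The function below is a hypothetical stand-in for that step; the accent stripping is an assumption about how uncased tokenizers typically behave, not something stated in this post.

```python
import unicodedata


def uncased_normalize(text):
    """Lowercase the text and strip accents before tokenization."""
    lowered = text.lower()
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop combining marks (Unicode category "Mn"), e.g. the accent in "é".
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(uncased_normalize("DistilBERT") == uncased_normalize("distilbert"))
print(uncased_normalize("Joy"), uncased_normalize("joy"))
```

After this step the name “Joy” and the emotion “joy” are the same string, so the model can only tell them apart from the surrounding context.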


Peter Klipfel
Senior Evangelist Mantium
Peter is an AI Evangelist at Mantium, where he helps others discover the power and flexibility of AI systems through Mantium. He loves seeing AI make people's lives easier and is passionate about the space. Peter has worked in big data, machine learning, and AI for the last ten years, including as a software engineer and product leader.
