
Natural Language Processing: How did we get here? Where do large language models fit?

By Madison Van Horn

June 14, 2022   ·   7 min read

As the field of Natural Language Processing (NLP) takes the stage to transform businesses and enterprises around the globe, it’s natural to ask where NLP came from and how it got here. NLP spans linguistics, computer science, and AI, and focuses on developing computational systems that can understand natural language. However, while NLP suddenly seems to be everywhere these days, it isn’t as shiny and new as you might think. Today’s NLP technologies are built on a foundation of research assembled over decades of exploration and discovery. Let’s take a tour through the origins of NLP to better understand modern NLP systems like Transformers and Large Language Models.

Understanding the history of NLP starts with understanding what NLP can do.

NLP allows computers to process large amounts of natural language data and to characterize and interpret text or audio in something like the way humans do. The goal is to make computers capable of understanding the contents of data by capturing as much meaningful semantic and contextual information as possible. Fully comprehending the meaning that underlies an instance of language is difficult even for humans, let alone for a computer.

A walk down memory lane with NLP.

Symbolic NLP (1950s – 1980s)

NLP first emerged in the aftermath of WWII, when there was enormous pressure to increase the speed at which foreign-language documents could be translated into English, a task that was time-consuming and difficult for humans to accomplish. The push to automate tasks like this drove the first generation of NLP, in which collections of hand-built rules representing the knowledge of a system were used to analyze and categorize language. We refer to this rule-based phase as Symbolic NLP. The primary focus was building sets of rules centered on syntax.

Statistical NLP (1990s-2010s)

The use of rules to understand language proved unsustainable as we recognized that language is dynamic, culturally contingent, and indefinitely extensible. The emergence of new algorithms, increased computational power, and the availability of relevant data played a heavy role in the transition from Symbolic NLP to Statistical NLP. New NLP systems exploited patterns in language to perform complicated tasks with the assistance of statistical models such as Markov models, Bayesian networks, and Support Vector Machines.

Supervised learning also contributed to the shift toward Statistical NLP. Supervised learning uses an algorithm to learn a mapping between an input and a target output, and therefore requires large, annotated datasets. This reliance on annotated data quickly became a pain point: producing it is difficult, time-consuming, and expensive.

Deep Architectures (2010s – current)

Continuous advancements in computational power and algorithms set off a rapid technological explosion in NLP. Statistical NLP relied heavily on feature engineering to feed its statistical and probabilistic models. Deep Neural Networks (DNNs) largely removed that dependence, because they learn to map raw data directly to outputs.

The popularity of DNNs led to a proliferation of architectures and rapid improvements on various NLP tasks. However, as each new architecture emerged and climbed to the top of the leaderboards, its shortcomings were usually revealed soon after. Exploring the evolution as well as the pros and cons of DNNs can help us understand where we are today. 

Timeline of the Evolution of Deep Neural Networks for Language Modeling:

  1. Feed-Forward Neural Networks (Bengio et al., 2001)
  • Learn real-valued representations of words, mapping words into a semantic space
  • Do not encode word-order information
  2. Recurrent Neural Networks (RNN; Mikolov et al., 2010)
  • Model language as a sequence, allowing for sequential dependencies (encodes word order)
  • Not good at encoding long sequences
  3. Long Short-Term Memory Networks (LSTM; Graves, 2013)
  • Model language as a sequence, but with the ability to selectively remember or forget different parts of the sequence (encodes long- and short-range dependencies)
  • Better at encoding longer sequences, but still at high risk of information loss
  4. Attention Mechanism (Bahdanau et al., 2015)
  • Improves a model’s ability to selectively attend to different parts of a sequence (see the sketch after this list)
  • LSTM + attention requires a lot of parameters

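To make the attention idea concrete, below is a minimal sketch of scaled dot-product attention in plain NumPy. It is an illustration of the mechanism only; real models add learned query/key/value projections, multiple heads, and masking.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(queries, keys, values):
    # Each output position is a weighted average of `values`, where the
    # weights describe how strongly that position attends to every other
    # position in the sequence.
    d_k = queries.shape[-1]
    scores = queries @ keys.swapaxes(-2, -1) / np.sqrt(d_k)  # pairwise similarities
    weights = softmax(scores, axis=-1)                        # attention distribution
    return weights @ values

# Toy example: self-attention over a sequence of 4 tokens with 8-dimensional vectors.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(x, x, x).shape)  # (4, 8)
```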
The introduction of the attention mechanism led to the breakthrough of the Transformer model in 2017. The influential paper “Attention Is All You Need” (Vaswani et al., 2017) highlights the benefits of the Transformer model, such as better modeling of longer sequences with the attention mechanism and improved efficiency. Almost immediately, Transformer models improved many commercial technologies, including Google Translate and Google Search (first with BERT, and later MUM).

Transformers remain a dominant, state-of-the-art architecture within the NLP world. Check out our blog for more information about Transformers and their business use cases.

Training and Scaling (current)

Since the advent of Transformers, researchers have focused heavily on increasing both the efficiency and the size of these models. Most Transformers are pre-trained models: they are first trained on large quantities of unlabeled data to learn grammatical and semantic patterns of language. Pre-training addresses the shortage of large labeled datasets; because unlabeled data is so plentiful, pre-trained models grasp patterns in language far better than earlier approaches. It also allows anyone with access to these models to apply them to downstream tasks.
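As a small illustration of what pre-training buys you, the snippet below loads a publicly available pre-trained masked language model through the Hugging Face transformers library and asks it to fill in a blanked-out word. The checkpoint name is just one example, and no task-specific training is involved.

```python
# Requires: pip install transformers
from transformers import pipeline

# "bert-base-uncased" is one example of a publicly available pre-trained checkpoint.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT was pre-trained by predicting masked-out words, so it can complete
# this sentence using only what it learned from unlabeled text.
for prediction in fill_mask("Natural language processing lets computers [MASK] human language."):
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")
```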

The downside of pre-training is the computational cost. External providers such as OpenAI and Hugging Face pre-train models and give researchers access to them. This access allows users to further fine-tune a model with much smaller, labeled datasets. Fine-tuning adapts the pre-trained model to a specific supervised task by performing additional updates to the model’s weights, as shown in the sketch below. This makes it possible to reuse the same pre-trained large language model across different tasks while achieving high performance.
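Here is a hedged sketch of what fine-tuning can look like with the Hugging Face Trainer API. The checkpoint, dataset, and hyperparameters are illustrative placeholders rather than a recommended recipe.

```python
# Requires: pip install transformers datasets
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "distilbert-base-uncased"  # a small pre-trained model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# A small labeled dataset: fine-tuning needs far less data than pre-training.
dataset = load_dataset("imdb")
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=tokenized["test"].select(range(500)),
)
trainer.train()  # additional weight updates adapt the pre-trained model to the new task
```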

Scaling these models for better performance has been a massive effort within the research community. Scaling involves adjusting the number of parameters, the dataset size, and the total compute used for optimal training; the empirical relationships between these factors and a model’s performance are known as Scaling Laws. OpenAI scaled up its architecture significantly (GPT-3 has 175 billion parameters, compared to BERT’s 340 million) to remove the need for fine-tuning on labeled datasets. However, it has recently been discovered that many of these large models are severely undertrained relative to the compute spent on them. And scaling doesn’t solve prior issues with these models, such as hallucination, grounding, and the tyranny of the majority.
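As a rough back-of-the-envelope illustration of why scaled-up models can end up undertrained, the sketch below applies the widely cited heuristic of roughly 20 training tokens per parameter from recent compute-optimal training results. Every number here is an approximation for illustration only.

```python
# Back-of-the-envelope scaling arithmetic; all numbers are approximate.

def compute_optimal_tokens(n_parameters, tokens_per_parameter=20):
    # Rough rule of thumb for how many training tokens a model of this size "wants".
    return n_parameters * tokens_per_parameter

gpt3_params = 175e9   # a GPT-3-scale model
tokens_used = 300e9   # on the order of what was reportedly used for training

optimal = compute_optimal_tokens(gpt3_params)
print(f"Compute-optimal tokens: ~{optimal:.1e}")     # ~3.5e+12
print(f"Tokens actually used:   ~{tokens_used:.1e}") # ~3.0e+11
print(f"Undertrained by roughly {optimal / tokens_used:.0f}x")
```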

Another consequence of scaling model size is the cost of developing and deploying fine-tuned versions of a model. For instance, fine-tuning a 175-billion-parameter model like OpenAI’s GPT-3 Davinci is very computationally expensive, and even after fine-tuning you’re left with a new 175-billion-parameter model, which requires enormous computational power to use. Accordingly, a new branch of research has emerged around so-called parameter-efficient tuning (PET) methods, which aim to adapt these large, scaled models by updating only a small set of parameters. Results have been quite promising: in some instances, fine-tuning with a PET method yields performance comparable to fine-tuning the full model. Nonetheless, performance varies in unexpected ways, and there is not yet a universally accepted PET method.
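To show the core idea behind parameter-efficient tuning, the sketch below freezes every pre-trained weight and trains only a tiny new head on top. Real PET methods (adapters, prompt tuning, and similar) differ in the details, but all of them update only a small fraction of the parameters. The checkpoint name is illustrative.

```python
# Requires: pip install torch transformers
import torch
from transformers import AutoModel

backbone = AutoModel.from_pretrained("bert-base-uncased")

# Freeze every pre-trained weight so training never touches them.
for param in backbone.parameters():
    param.requires_grad = False

# A tiny trainable head: the only new parameters.
head = torch.nn.Linear(backbone.config.hidden_size, 2)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)  # optimizes only the head

trainable = sum(p.numel() for p in head.parameters())
frozen = sum(p.numel() for p in backbone.parameters())
print(f"Training {trainable:,} parameters; keeping {frozen:,} frozen.")
```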

Now we are at the point where we can take these large language models and fine-tune them for whatever task we choose. What’s next?

We see new directions for Transformers that make use of in-context learning. In-context learning uses zero to a few examples (also known as few-shot demonstrations) passed at inference time to guide the model’s predictions through text alone, with no additional training. These examples provide context that conditions the model to adapt to new tasks, as seen in GPT-3; crafting them is often referred to as prompt engineering. While in-context learning performs surprisingly well at conditioning a single large model for downstream tasks, there is still uncertainty about what it means for a model to understand a given input.
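Below is a hedged sketch of a few-shot prompt: the labeled examples live entirely inside the text the model sees at inference time, and no weights are updated. The call uses OpenAI’s completion API as it existed around 2022; the model name and settings are illustrative, and an API key is assumed to be configured.

```python
# Requires: pip install openai (and openai.api_key set, e.g. from an environment variable)
import openai

prompt = """Classify the sentiment of each review as Positive or Negative.

Review: The support team resolved my issue in minutes.
Sentiment: Positive

Review: The app crashes every time I open it.
Sentiment: Negative

Review: Setup was painless and the documentation is excellent.
Sentiment:"""

response = openai.Completion.create(
    model="text-davinci-002",  # a GPT-3-family model
    prompt=prompt,
    max_tokens=1,
    temperature=0,             # deterministic output for a classification-style prompt
)
print(response.choices[0].text.strip())  # expected: "Positive"
```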

ABOUT THE AUTHOR

Madison Van Horn
NLP Engineer, Mantium
Madison Van Horn is an NLP Engineer at Mantium, where she develops enterprise AI solutions, from use case discovery and state-of-the-art implementation all the way through to safe deployment of large language models. She holds a Master’s degree in Artificial Intelligence from the University of Edinburgh. Madison is passionate about making AI accessible, especially to those who are less familiar with AI technology.
