Workbook 1

Technology of large language models

Learn the fundamental technical concepts behind today's leading large AI models

1.1 What are large language models?

Large Language Models (LLMs), such as OpenAI’s ChatGPT or Google’s Bard, are large neural networks trained on vast amounts of data that can perform a variety of natural language generation tasks with unprecedented quality. How do they work and what makes them so powerful? This chapter of the LLM Workbook aims to provide an understanding of key underlying components and techniques such as language modelling and transformers. Along the way, we will highlight key characteristics of contemporary LLMs, such as their “foundational” nature for a wide range of tasks. Don’t worry if any of these terms seem unfamiliar – we’ll go through them one by one in this chapter.

Key Terminology

  • Language Modelling

    “You shall know a word by the company it keeps”, Firth, J. R. 1957

    This phrase by the linguist John Rupert Firth has proved to capture the most fundamental principle behind the success of today’s LLMs: the meaning of a word can be inferred from the totality of the contexts in which it occurs. The statistical approach used to do this is known as language modelling. In practice, this means processing large amounts of text to determine how likely each word is to occur in a given context, i.e. the distributional information of words. If these learned probabilities are sufficiently accurate, it is possible to generate new text of the quality that can be seen in today’s LLMs.

    Take the following example:

    “Yesterday was a beautiful day since the sky was <>.”

    A likely completion of this sentence by a human, based on our perception of the world, would be “blue” rather than, say, “cloudy”, making the former word much more likely. To obtain this information, all that is needed is a text that contains as many aspects of a word’s meaning as possible, reflecting all the contexts in which it occurs. The advent of the internet was the key to unlocking the true power of this approach by making vast amounts of textual data readily available, leading to an ever-increasing quality of language models (LMs).

  • Artificial Neural Networks

    Inspired by their biological counterparts, artificial neural networks (or simply “neural networks” in the context of artificial intelligence) consist of small computational units called neurons. The main function of these neurons is to process and weight their input and then output a new signal when a certain threshold is passed. These units can be grouped together in interconnected layers to form a network, the complexity of which increases with the number of layers. These networks have been shown to be able to perform tasks that are far too complex for traditional rule-based programs to solve, and are very effective and efficient at storing distributional information about the data they are trained on in their weights (also called parameters).

  • Deep Learning

    Deep learning is the process of gradually adjusting the weights of the neurons in a stack of artificial neural network layers over time, so that the whole network improves at a given task. In other words, such a network is “trained” by repeatedly exposing it to many examples of an input and the desired output, and allowed to “learn” by adjusting its weights each time so that the outputs it produces move closer to the desired ones.

  • Neural Language Modelling

    Neural language modelling refers to language modelling techniques based on neural networks. Remember that the goal of language modelling is to find out how likely each word is to occur in a given context. How can this kind of distributional information be determined practically and efficiently, especially as the number of contexts increases? Estimating this information by simply counting the occurrences of a word in all possible scenarios is impractical, because even a single word can occur in an exponentially large number of different contexts. Nevertheless, learning from a large amount of text is desirable, as it provides richer information than a small amount of text. With this in mind, researchers came up with the idea of harnessing the power of neural networks and deep learning. Since efficient storage of distributional information is exactly what neural networks are good at, they naturally lend themselves to the task of language modelling. This is why neural language modelling has become ubiquitous in natural language processing today.
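To make the counting idea above concrete, here is a toy Python sketch that estimates next-word probabilities purely from word-pair counts in a tiny invented corpus. It is only an illustration of the basic approach; neural language models store this kind of distributional information implicitly in their weights rather than in explicit count tables.

```python
# Toy count-based language model: estimate how likely each word is to follow
# another word, purely from co-occurrence counts in a tiny invented corpus.
from collections import Counter, defaultdict

corpus = [
    "yesterday the sky was blue",
    "today the sky was blue again",
    "last week the sky was cloudy",
]

bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigram_counts[prev][nxt] += 1       # count each observed word pair

def next_word_probabilities(word):
    """Relative frequency of every word seen directly after `word`."""
    total = sum(bigram_counts[word].values())
    return {w: count / total for w, count in bigram_counts[word].items()}

print(next_word_probabilities("was"))       # {'blue': 0.67, 'cloudy': 0.33} (rounded)
```

Even this tiny model assigns a higher probability to “blue” than to “cloudy” after “was”. The difficulty, as noted above, is that simple counting breaks down once contexts become longer and more varied, which is exactly where neural networks come in.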

1.2 Why develop large language models?

Above all, LLMs are extremely versatile and can be used for a wide range of tasks. Current models such as GPT-3.5, which powers ChatGPT, can answer questions, write poems or essays, translate between languages, or even generate code, all while being trained only on raw text. While some of these capabilities are still surprising, it is clear that language modelling is much more useful than it seemed just a few years ago. It has been shown to be suitable for learning a range of linguistic regularities, such as sentence structure, the interdependence of word forms, and their roles in a sentence. 

LLMs are also efficient in that they provide an end-to-end solution to the above tasks within a single model. Previously, a number of different methods were required, each responsible for a particular aspect of language generation for a specific task. Furthermore, the data required to learn all this information does not require tedious manual processing, but can simply be scraped from the internet or other sources of textual data, as we will see in the next section. It is clear, therefore, that in the near future LLMs have the potential to transform the way people create or design content in a wide range of fields. 

1.3 How do large language models work?

LLM Training Techniques

In order for LLMs to become powerful, they need to be trained on large amounts of data. Today, this is mainly done using a technique called self-supervised learning. But before we explain this, we first need to look at a closely related underlying concept: supervised learning.

  • Supervised Learning

    In supervised learning, a neural network is presented with, say, a sentence and asked to predict whether the sentence expresses a positive or negative sentiment. To learn the relationship between the words in a sentence and the sentiment they express, the neural network is presented with many (from thousands to millions of) example sentence-sentiment pairs, where the sentiment serves as the label.

    Sentence: “Pretty good. I didn’t know any of the comedians, but the first time I watched it it put a smile on my face. I’ll be watching the next season soon.”

    Sentiment to be predicted: positive

    Assigning labels to examples is called annotation, and it usually has to be done by humans. This annotation work has been a serious bottleneck for model training in the past, as it is expensive to use humans to scale up datasets.

  • Self-Supervised Learning

    Given the shortcomings of supervised learning and the need to train on large amounts of data, self-supervised learning allows models to learn from data that is not annotated. The key feature of self-supervised learning is that the labels are inherent in the data. Sentences are used as training data in the following way:

    Sentence: “The sun is slowly rising.”

    Input to the model: “The sun is slowly _”

    Target word to be predicted: “rising”

    As can be seen from the example, the desired output word (label) in this case is simply the next word, which means that this type of training allows the use of “raw” (as in unannotated) data that can be easily scraped in large quantities from any source of digitised text (internet, digital libraries, etc.). A short sketch of how such input-target pairs can be derived automatically follows this list.
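As a rough illustration, the following Python sketch splits a plain sentence into (context, next word) training examples. This is a simplification: real LLM training pipelines operate on sub-word tokens and huge text collections, but the principle of deriving the labels from the data itself is the same.

```python
# Derive self-supervised training pairs from raw text: the "label" is simply
# the next word, so no human annotation is required.
def make_training_pairs(sentence):
    words = sentence.split()
    pairs = []
    for i in range(1, len(words)):
        context = " ".join(words[:i])   # input shown to the model
        target = words[i]               # word the model should predict
        pairs.append((context, target))
    return pairs

for context, target in make_training_pairs("The sun is slowly rising."):
    print(f"{context!r:30} -> {target!r}")
```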

Transformer and Attention

The transformer is a neural network architecture introduced by Vaswani et al. in 2017. Its key capability lies in its self-attention mechanism, which allows it to use contextual information more effectively and flexibly than previous types of artificial neural networks. Self-attention in transformers is loosely analogous to attention in human cognition, where attention plays a critical role in flexibly directing our awareness to selected aspects of information. Similarly, in artificial neural networks such as transformers, the attention mechanism allows the model to flexibly relate each word to all other words by weighting their relative importance to each other. This “transforms” the input into something meaningful for downstream tasks by highlighting important contextual information and associating it with the relevant words.

For example, in the following sentence, some words are more important than others to the overall meaning:

“Alice crossed the road as she was in a hurry.”

Here the pronoun “she” refers to “Alice”. Identifying this type of association is known as coreference resolution. The self-attention mechanism allows transformers to learn that some words (e.g. “Alice”) in this sentence are more closely related to other words (e.g. “she”) and carry important information for them. In terms of the meaning of “she”, “Alice” may be the most important word, whereas “the” is quite unimportant. From a computational point of view, this means that “Alice” is given more weight in relation to the word “she”, so that a strong association is maintained between them.

Obviously, a huge amount of computation is required to establish these word-to-word associations given a large amount of data. Fortunately, a key advantage of transformers is their efficiency in dealing with these complex associations, because they are designed to allow computations to be performed independently and thus simultaneously for each word (parallelisability). This is important because training is a very time-consuming and resource-intensive process.
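To give a rough sense of these mechanics, here is a minimal numpy sketch of scaled dot-product self-attention over the example sentence. The word vectors and projection matrices are random placeholders (a trained model learns them), so the printed weights are meaningless; the sketch only shows how every word is weighted against every other word.

```python
# Minimal self-attention sketch (scaled dot-product attention) over one sentence.
import numpy as np

words = ["Alice", "crossed", "the", "road", "as", "she", "was", "in", "a", "hurry"]
d = 8                                    # toy embedding size
rng = np.random.default_rng(0)
X = rng.normal(size=(len(words), d))     # one random placeholder vector per word

# Queries, keys and values are learned linear projections; here they are random.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d)            # how strongly each word attends to each other word
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
context = weights @ V                    # each word becomes a weighted mix of all words

print(weights[words.index("she")].round(2))   # attention of "she" over the whole sentence
```

In a trained transformer, the row of weights for “she” would typically place noticeable weight on “Alice”, which is exactly the kind of association described above.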

  • BERT (Bidirectional Encoder Representations from Transformers)

    In 2018, Devlin et al. introduced the BERT model and a novel language modelling technique called masked language modelling. This technique is designed in particular to exploit the ability of transformers to produce rich, context-aware representations of words, i.e. to help them take into account information from a word’s left and right neighbours during language modelling. To achieve this, they preprocess a sentence like this:

    “The sky was <MASK> and the sun was shining.”

    Now the model has to infer based on all the other words what the original word in the masked position might have been. Each example can also have a certain number of randomly masked words, which prevents the model from learning only certain parts of a sentence. A model trained in this way can be used for a wide range of tasks that require a profound understanding of natural language, such as inferring the meaning of an expression in relation to a previous sentence.

    A prominent aspect of BERT that gave it such influence on subsequent work is its approach to model training: The model is trained in two stages. First, a “base” version is trained using language modelling on a large amount of data in a stage called pre-training. This is computationally intensive and time consuming. In a second stage, this pre-trained model is fine-tuned, i.e. adapted to a different task using relatively little data. The advantage of this is that the first, “costly” stage only needs to be performed once. The general linguistic information learned can then be reused for other tasks to achieve state-of-the-art performance.

    As described above, BERT is designed and trained to look “both ways” at the left and right neighbours of the masked words. This makes it particularly suitable for tasks that require information from the whole context, such as sentiment analysis or part-of-speech tagging, where correct predictions (e.g. whether a sentence has a positive sentiment) depend on information from the full input sentences. This architectural feature distinguishes it from “generative” models such as GPT.

  • GPT (Generative Pre-trained Transformer)

    Another dominant family of contemporary models is the GPT family, with ChatGPT as its most famous recent member. GPT models also use transformers and are trained to be good at just predicting the next word in a sentence. This means that they look only to the left of the input (whereas BERT can look at the whole context), a technique known as causal language modelling. This makes them good at producing coherent and linguistically correct output, which is why they are called “generative” models.

    Notably, “just predicting the next word” does some heavy lifting here. It turns out that this scheme allows a neural network that is large enough (above a certain number of parameters, typically billions in recent years) to make sense of the input “on the left” (known as the prompt) to the extent that it can perform a wide range of unexpected and surprising tasks, even without ever having been exposed to task-specific data.
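The difference between the two language modelling styles can be tried out directly, for example with the Hugging Face transformers library. This sketch assumes that library and the named example checkpoints are available; any compatible masked and causal models would do.

```python
from transformers import pipeline

# Masked language modelling (BERT-style): fill in a hidden word using context
# from both the left and the right.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The sky was [MASK] and the sun was shining."):
    print(candidate["token_str"], round(candidate["score"], 3))

# Causal language modelling (GPT-style): continue the prompt from left to right.
generate = pipeline("text-generation", model="gpt2")
print(generate("The sun is slowly", max_new_tokens=5)[0]["generated_text"])
```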

1.4 What makes large language models so powerful?

Foundation Models

A key feature of large language models is their ability to generalise. This simply means that the knowledge that these models extract from their training data, such as the meaning of words, regularities about sentence structure, etc., can be applied to many different contexts. This feature links LLMs to the concept of foundation models. The term foundation models, coined in 2021, is motivated by a number of observations about the properties and capabilities of large neural network models trained on a wide range of data at scale. 

The power of foundation models lies in their ability to capture relevant information in such a way that it can be used as a common “foundation” for a wide variety of downstream applications, which is what distinguishes foundation models from traditional approaches. 

For example, when building a chatbot, a traditional natural language understanding (NLU) engine may have to assemble a complicated set of methods to eventually help the chatbot “understand” the user’s input. If a user asks, “Can you describe the development of quantum computing over the last decade?”, the engine may first need to recognise that the user is asking for a description, and then identify relevant time and entity information in the input. It then needs to retrieve relevant data from the database and form a complete answer. These methods need to be integrated and can lead to considerable complexity (see figure below). 

A classic natural language understanding (NLU) engine may involve the integration of multiple components such as intent classification and slot filling, whereas large language models provide an end-to-end, unified approach.

Today, however, large language models are able to capture all this information from simple user input and directly output the desired answers using a single large neural network that has been pre-trained on relevant data. Furthermore, because of its generality, the knowledge representation extracted by these large language models can even be reused to process information in other modalities, such as speech, audio, images or even source code. Famous examples include text-to-code models such as OpenAI Codex and text-to-image models such as DALL-E 2 or Stable Diffusion, which make extensive use of general linguistic information captured by LLMs.
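As a purely illustrative sketch of this contrast, the snippet below mocks up both approaches. Every function and the `llm` callable are hypothetical stubs standing in for whole components or models, not real APIs.

```python
# Hypothetical stubs standing in for the components of a classic NLU pipeline.
def classify_intent(text):                  # e.g. a dedicated intent classifier
    return "describe_development"

def fill_slots(text):                       # e.g. a dedicated slot-filling model
    return {"topic": "quantum computing", "period": "the last decade"}

def query_knowledge_base(intent, slots):    # e.g. a retrieval component
    return ["milestone 1", "milestone 2"]

def compose_answer(intent, slots, facts):   # e.g. a template-based generator
    return f"Over {slots['period']}, {slots['topic']} saw: " + ", ".join(facts)

def traditional_pipeline(question):
    # Several specialised components have to be chained and kept consistent.
    intent = classify_intent(question)
    slots = fill_slots(question)
    facts = query_knowledge_base(intent, slots)
    return compose_answer(intent, slots, facts)

def llm_pipeline(question, llm):
    # A single pre-trained model maps the question directly to an answer.
    return llm(question)

print(traditional_pipeline("Can you describe the development of quantum computing over the last decade?"))
```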

Foundation models extract generalised knowledge representations from data across different modalities, which can be used for a variety of downstream tasks. (From “The feasibility study on LEAM – Large European AI Models”.)

The relationship between machine learning, deep learning (after its resurgence around 2012), transformer-based models and foundation models. The term “foundation models” covers a wide range of models that are generalisable and can be applied to multiple modalities.

Since LLMs are neural networks that store generalised representations of knowledge in their parameters, the question of model size is closely related to the level of generalisation that can be achieved. Put simply, the more data available, the more information there is to extract and the more capacity (parameters) a model may need to store it. This relationship has led to rapid growth in model size, from a few hundred million to around a trillion parameters.

Foundation models grew significantly in size between 2018 and 2022, though it remains to be seen whether future models will continue this trend.

Homogenisation

The number of modelling techniques used today to achieve generalisation has decreased significantly. The transformer is used as an essential component in various model architectures (mostly variants of either BERT or GPT) due to its efficient and powerful attention mechanism. Training is done in one or two stages: general knowledge is first learned during pre-training on raw data, and then, if task-specific data is available, the model is fine-tuned on it to give it more specific capabilities.

The homogenisation of methodological approaches has also greatly facilitated research across different application areas. For example, LLMs can be used for protein sequence modelling as well as for speech processing or image generation.   

Emergence of capabilities

When a neural network is trained, there is usually a very specific task that the model learns to do. So when models are pre-trained on a classic language modelling task, they are expected to be good at producing coherent, linguistically correct text. It turns out that LLMs, when sufficiently scaled, can solve a surprisingly wide range of additional tasks from a natural language description (prompt) alone. These tasks can include machine translation, performing arithmetic, generating code or answering general questions.

Typically, the input to the pre-trained model is provided by the user in natural language. Based on this input, the model generates the most likely next word, appends it to the input, and continues until some stopping criterion is reached (e.g. an artificial word indicating the end of the sentence has been generated). Simply providing a task description as input is known as zero-shot learning, while adding a few examples of what a correct output might look like is known as few-shot learning.

Example of few-shot and zero-shot learning with machine translation from English to French (adapted from Brown et al, 2020). In the case of few-shot learning, users first describe the task to be performed and then provide a few examples of the completed task, followed by a prompt. In the case of zero-shot learning, only a task description is provided, followed immediately by a prompt.
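The sketch below shows the two prompting styles as plain strings (the translation examples are adapted from Brown et al., 2020) together with a simplified greedy generation loop. `predict_next_word` is a hypothetical stand-in for a trained LLM that returns the most likely next word.

```python
# Zero-shot prompt: a task description followed directly by the input to complete.
zero_shot_prompt = "Translate English to French:\ncheese =>"

# Few-shot prompt: the same description plus a few worked examples first.
few_shot_prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)

# Simplified greedy decoding loop as described above; `predict_next_word` is a
# hypothetical function standing in for the model.
def generate(prompt, predict_next_word, max_words=50, end_token="<end>"):
    words = prompt.split()
    for _ in range(max_words):
        next_word = predict_next_word(words)
        if next_word == end_token:       # artificial end-of-text marker
            break
        words.append(next_word)          # append the word and feed it back in
    return " ".join(words)

# Dummy stand-in so the sketch runs end to end: it always predicts the end token.
print(generate(zero_shot_prompt, lambda words: "<end>"))
```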

1.5 What’s next?

This chapter has provided an overview of the fundamental concepts, techniques and features of large language models. We covered the idea behind language modelling and the proliferation of large neural network architectures based on BERT and GPT. We also introduced the training technique of self-supervised learning, as well as the transformer architecture and its powerful attention mechanism. Finally, we described crucial features of LLMs such as ChatGPT, including the generalised knowledge they extract from raw text, which enables their emergent capabilities.

The wider adoption of generative LLMs has just begun with the integration of ChatGPT into Microsoft’s Bing and Bard into Google Search. Code generation based on natural language prompts, a notable emergent feature of GPT-3 and used by GitHub’s Copilot (legal issues notwithstanding), has already proven to be a powerful assistant for software engineers. Potential and real use cases are explored in more detail in LLM Workbook chapter 2.

The compelling ability of LLMs to produce highly coherent texts raises ethical questions about the expectations that humans can have when interacting with a machine. Moreover, even the best LLMs have been shown to generate falsehoods (hallucinations). Considerations of the societal impact of LLMs must therefore include the dangers of large-scale misinformation and the reproduction of biases in their training data. There are also environmental and sustainability issues associated with the training and use of LLMs. These challenging and largely unresolved issues are addressed in LLM Workbook chapter 3.

Test your knowledge

01

Which of the following statements best describes the difference between supervised learning and self-supervised learning?

It typically requires much less human labour for labelling and annotation to prepare training data for supervised learning than for self-supervised learning.
In terms of label preparation, it is easier to scale up the training dataset for supervised learning than for self-supervised learning.
Self-supervised learning uses labels that are inherent in the data, whereas supervised learning requires additional labels or annotations.
Using the next word as the target model output would be a typical case of supervised learning rather than self-supervised learning.
02

What is a key feature of the transformer neural network architecture introduced by Vaswani et al. in 2017?

It cannot handle coreference resolution efficiently.
It cannot perform computations for each word independently and simultaneously.
It can transform input into meaningful representations for downstream tasks.
It cannot determine the relative importance of words in a sentence.
03

What is the main difference between GPT and BERT models?

GPT models typically look only to the left of the input, whereas BERT models can look at the whole context.
GPT models use transformers whereas BERT models use recurrent neural networks.
GPT models are known as "discriminative" models, whereas BERT models are known as "generative" models.
GPT models are always trained on task-specific data, while BERT models can always perform tasks without task-specific data.
04

What is the difference between zero-shot learning and few-shot learning in the context of language models?

Zero-shot learning is used only for machine translation tasks, while few-shot learning is used only for code generation tasks.
Zero-shot learning uses only a task description as input, while few-shot learning also uses examples of correct answers.
Zero-shot learning requires a large amount of training data, while few-shot learning requires only a small amount of training data.
Zero-shot learning involves training a model on a specific task, while few-shot learning does not involve any training.
