Know About Transformer Machine Learning Model at a Glance

The transformer machine learning model has evolved and branched out into many different variants

In recent years, the transformer machine learning model has become one of the main highlights of advances in deep learning and deep neural networks. It is mainly used for advanced applications in natural language processing. Google is using it to enhance its search engine results. OpenAI has used transformers to create its famous GPT-2 and GPT-3 models.

Since its debut in 2017, the transformer architecture has evolved and branched out into many different variants, expanding beyond language tasks into other areas. Transformers have been used for time series forecasting. They are the key innovation behind AlphaFold, DeepMind's protein structure prediction model. Codex, OpenAI's source code–generation model, is based on transformers. More recently, transformers have found their way into computer vision, where they are gradually replacing convolutional neural networks (CNNs) in complex tasks.

About the Transformer Architecture

The Transformer architecture follows an encoder-decoder structure but does not rely on recurrence or convolutions to generate its output.

In a nutshell, the task of the encoder, on the left half of the Transformer architecture, is to map an input sequence to a sequence of continuous representations, which is then fed into a decoder.

The decoder, on the right half of the architecture, receives the output of the encoder together with its own output from the previous time step to generate an output sequence.
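
The sketch below wires these two halves together using PyTorch's built-in Transformer module; the layer counts, dimensions, and random tensors are illustrative only, not the exact settings of any particular model.

```python
# A minimal sketch of the encoder-decoder wiring, assuming PyTorch;
# the sizes below are illustrative.
import torch
import torch.nn as nn

d_model = 512          # size of each token's continuous representation
model = nn.Transformer(d_model=d_model, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.rand(10, 1, d_model)  # (source length, batch, d_model): input sequence
tgt = torch.rand(7, 1, d_model)   # (target length, batch, d_model): decoder input so far

# The encoder maps `src` to continuous representations; the decoder attends
# to them together with `tgt` and returns one vector per target position.
out = model(src, tgt)
print(out.shape)  # torch.Size([7, 1, 512])
```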

What Makes Transformers Exciting and How They Work

The classic feed-forward neural network is not designed to keep track of sequential data: it maps each input to an output independently. This works for tasks such as classifying images but fails on sequential data such as text. A machine learning model that processes text must not only compute every word but also take into account how words appear in sequence and relate to each other. The meaning of a word can change depending on the words that come before and after it in the sentence.

Attention Layers in the Transformer Model

Once the sentence is transformed into a list of word embeddings, it is fed into the transformer's encoder module. Unlike RNN and LSTM models, the transformer does not receive one input at a time. It can receive an entire sentence's worth of embedding values and process them in parallel. This makes transformers more compute-efficient than their predecessors and also enables them to examine the context of the text in both the forward and backward directions.

To preserve the sequential nature of the words in the sentence, the transformer applies "positional encoding," which basically means that it modifies the values of each embedding vector to represent its location in the text.
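
As a rough sketch, the sinusoidal positional encoding used in the original Transformer paper can be written in a few lines of NumPy; the sequence length and embedding size below are arbitrary illustrative values.

```python
# A sketch of sinusoidal positional encoding, assuming NumPy.
import numpy as np

def positional_encoding(seq_len, d_model):
    # one position index per token, and a scaling term for each pair of dimensions
    positions = np.arange(seq_len)[:, np.newaxis]                      # (seq_len, 1)
    div_terms = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions * div_terms)  # even dimensions
    pe[:, 1::2] = np.cos(positions * div_terms)  # odd dimensions
    return pe

embeddings = np.random.rand(12, 512)                     # 12 tokens, 512-dim embeddings
embeddings = embeddings + positional_encoding(12, 512)   # inject position information
```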

Next, the input is passed to the first encoder block, which processes it through an "attention layer." The attention layer tries to capture the relations between the words in the sentence. For example, consider the sentence "The big black cat crossed the road after it dropped a bottle on its side." Here, the model must associate "it" with "cat" and "its" with "bottle." Likewise, it should establish other associations such as "big" and "cat" or "crossed" and "cat." Put differently, the attention layer receives a list of word embeddings that represent the values of individual words and produces a list of vectors that represent both individual words and their relations to each other. The attention layer contains multiple "attention heads," each of which can capture a different kind of relation between words.
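
A minimal NumPy sketch of a single attention head is shown below, assuming randomly initialized projection matrices (in a trained model these weights are learned); a multi-head layer simply runs several such heads in parallel and concatenates their outputs.

```python
# A sketch of scaled dot-product attention for one head, assuming NumPy.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # how strongly each word attends to the others
    weights = softmax(scores, axis=-1)           # one row of attention weights per word
    return weights @ V                           # each output mixes information from all words

d_model, d_head = 512, 64
X = np.random.rand(14, d_model)                  # 14 word embeddings with positions added
W_q = np.random.rand(d_model, d_head)
W_k = np.random.rand(d_model, d_head)
W_v = np.random.rand(d_model, d_head)
contextual = attention(X, W_q, W_k, W_v)         # (14, 64): words enriched with their relations
```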

The output of the attention layer is fed to a feed-forward neural network that transforms it into a vector representation and sends it to the next attention layer. Transformers contain several blocks of attention and feed-forward layers to gradually capture more complicated relationships.
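
Using PyTorch's built-in encoder layers, stacking these attention-plus-feed-forward blocks might look like the following sketch; again, the layer count and sizes are illustrative.

```python
# A sketch of stacking attention + feed-forward blocks, assuming PyTorch.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
encoder = nn.TransformerEncoder(layer, num_layers=6)   # six attention + feed-forward blocks

tokens = torch.rand(14, 1, 512)   # (sequence length, batch, d_model)
encoded = encoder(tokens)         # same shape; each block refines the relations further
```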

The task of the decoder module is to translate the encoder's attention vector into the output data (e.g., the translated version of the input text). During the training phase, the decoder has access to both the attention vector produced by the encoder and the expected outcome (e.g., the translated string).

The decoder uses the same tokenization, word embedding, and attention mechanism to process the expected outcome and create attention vectors. It then passes this attention vector, along with the encoder's output, to another attention layer, which establishes relations between the input and output values. In the translation application, this is the part where the words from the source and destination languages are mapped to each other. As in the encoder module, the decoder's attention vector is passed through a feed-forward layer. Its result is then mapped to a very large vector, the size of the target vocabulary (in the case of language translation, this can span tens of thousands of words).
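
As a rough sketch of that final step, assuming PyTorch and a made-up vocabulary size, the decoder output can be projected onto the target vocabulary and turned into word probabilities like this:

```python
# A sketch of the decoder's output projection, assuming PyTorch;
# the vocabulary size here is a placeholder.
import torch
import torch.nn as nn

d_model, vocab_size = 512, 32000
to_vocab = nn.Linear(d_model, vocab_size)       # one score per word in the target language

decoder_out = torch.rand(7, 1, d_model)         # one vector per target position so far
logits = to_vocab(decoder_out)                  # (7, 1, 32000)
probs = torch.softmax(logits, dim=-1)           # probability of each candidate next word
next_word = probs[-1, 0].argmax()               # most likely word for the latest position
```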
