In the field of natural language processing, it’s crucial to understand how words and sentences are encoded and processed within neural networks. This involves transforming sentences into vectorized representations using methods like Word2Vec, which is essential for training models like Transformers. In this blog post, we will explore key concepts such as word embeddings, both absolute and relative position encodings, and the idea of relative attention. Together, these elements help the network recognize words and the connections between them in various positions, leading to a more nuanced comprehension of language.
Consider the following two sentences: ‘Bengio likes neural networks’ and ‘Hinton said that Bengio likes neural networks’. Before these sentences are fed into a language model (such as a Transformer), they undergo a two-step preparation process.
The initial step involves mapping each word to its corresponding word embedding. Essentially, each word is assigned a d-dimensional vector (5-dimensional in this example). For the sake of this explanation, let’s assume that we have a simple function that receives a word as input and returns its corresponding 5-dimensional vector.
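As a concrete sketch of such a lookup (the vocabulary and vectors below are hypothetical placeholders, not the actual embeddings behind the figures):

```python
import numpy as np

# Hypothetical stand-in for a trained embedding table: each word in a tiny
# vocabulary maps to a fixed 5-dimensional vector.
d = 5
vocab = ["Bengio", "likes", "neural", "networks", "Hinton", "said", "that"]
rng = np.random.default_rng(42)
embedding_table = {word: rng.normal(size=d) for word in vocab}

def embed(word):
    """Return the d-dimensional embedding for a single word."""
    return embedding_table[word]

sentence_1 = "Bengio likes neural networks".split()
sentence_2 = "Hinton said that Bengio likes neural networks".split()

E1 = np.stack([embed(w) for w in sentence_1])  # shape (4, 5)
E2 = np.stack([embed(w) for w in sentence_2])  # shape (7, 5)

# Without position information, the first sentence is an exact sub-block
# of the second (columns 3 to 6 in the figures).
assert np.allclose(E1, E2[3:7])
```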
We can visually represent these vectors using a color-coding system for the individual elements in the 5-dimensional vector. This approach allows us to create the following figures (Figure 1) for both sentences.
In the top figure, the token at position 0 (on the x-axis) corresponds to the word ‘Bengio’, the token at position 1 corresponds to ‘likes’, and so forth. The specifics of the mapping matter less than the observation that columns 3 to 6 of the bottom image perfectly replicate the top image. This shows us, and importantly the neural network, that the first sentence is embedded within the second. In other words, the information in the top figure is contained within the bottom figure.
The second step in preparing our data involves adding position encodings to the word embeddings. This step is crucial to counteract the permutation invariance inherent in the attention mechanism. For the context of this discussion, the specifics of the encoding scheme aren’t vital. So, let’s imagine a straightforward scheme as an example:
We can obtain the complete encoding by adding the position encodings at each position to the corresponding word embedding. Once done, the figures from the previous section can be updated as follows:
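To make the two steps concrete, here is one toy scheme, with a deterministic pseudo-random vector per position standing in for the encoding function (an illustrative assumption, not the scheme used in the figures):

```python
import numpy as np

d = 5

def abs_pos_encoding(pos, d=5):
    """Toy absolute position encoding: a fixed pseudo-random vector per position."""
    return np.random.default_rng(pos).normal(size=d)

# Stand-in word embeddings, as before (hypothetical vectors).
vocab = ["Bengio", "likes", "neural", "networks", "Hinton", "said", "that"]
rng = np.random.default_rng(42)
embed = {w: rng.normal(size=d) for w in vocab}

sentence_1 = "Bengio likes neural networks".split()
sentence_2 = "Hinton said that Bengio likes neural networks".split()

# Complete encoding = word embedding + position encoding, per position.
X1 = np.stack([embed[w] + abs_pos_encoding(i) for i, w in enumerate(sentence_1)])
X2 = np.stack([embed[w] + abs_pos_encoding(i) for i, w in enumerate(sentence_2)])

# The shared words now sit at different absolute positions (0..3 vs 3..6),
# so the sub-block similarity from Figure 1 is destroyed.
assert not np.allclose(X1, X2[3:7])
```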
One critical aspect to observe here is that since the words ‘Bengio’, ‘likes’, ‘neural’, and ‘networks’ occupy different absolute positions in the two sentences ((0, 1, 2, 3) in the first and (3, 4, 5, 6) in the second), we add different position encodings to the word embeddings. As a result, the final vector sequences no longer share the similarity that the bare word embeddings had. This raises an interesting question: can we do something to simplify this situation for our neural networks?
While carefully designed encoding schemes, such as the sinusoidal position encodings proposed in the “Attention Is All You Need” paper, represent a distinct approach to addressing this issue, our focus here is on relative position encodings. See the aside for a brief overview of sinusoidal encoding.
To implement relative position encodings, we introduce a minor modification: our encoding function now takes two parameters—the absolute position of a word in a sentence, and the position of the word for which we’re creating an encoding, in relation to the former:
For example, in the second sentence, if we are computing encodings relative to the word ‘Bengio’, then the word ‘Hinton’ has a relative position of 0 − 3 = −3. Hence, we also need to account for negative indices when generating position encodings.
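In code, the modified function might look like the following sketch (again a toy pseudo-random scheme; only the signature change matters here):

```python
import numpy as np

def rel_pos_encoding(pos, ref, d=5):
    """Toy relative position encoding.

    `pos` is the absolute position of the word being encoded, and `ref` is the
    absolute position of the reference word; only their offset matters.
    """
    offset = pos - ref           # e.g. 'Hinton' relative to 'Bengio': 0 - 3 = -3
    seed = offset + 10_000       # shift so that negative offsets give a valid seed
    return np.random.default_rng(seed).normal(size=d)

# The encoding depends only on the offset, not on the absolute positions:
# ('Hinton' vs 'Bengio') and ('said' vs 'likes') both have offset -3.
assert np.allclose(rel_pos_encoding(0, 3), rel_pos_encoding(1, 4))
```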
By picking a reference word, computing the relative position encodings for all other words in relation to this reference, and then adding these encodings to the word embeddings, the resulting figures closely resemble those in Figure 1. We can clearly identify the containment of the first sentence within the second.
There is a catch, however. In the case of absolute position encodings, given a sentence of N words, there is only one way to compute the complete sentence embedding—by adding the word embeddings to the absolute position encodings, leading to a tensor of shape (N, 5). Given this encoded sequence, we can calculate the output vectors in a single step:
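In pseudo-NumPy, with `model` standing in for the network (here just a hypothetical linear map, for illustration), the absolute case is one call:

```python
import numpy as np

N, d = 7, 5
rng = np.random.default_rng(0)

E = rng.normal(size=(N, d))    # word embeddings
P = rng.normal(size=(N, d))    # absolute position encodings
X = E + P                      # complete encoding, shape (N, 5)

W = rng.normal(size=(d, d))
def model(X):
    """Hypothetical stand-in for the network; maps (N, d) -> (N, d)."""
    return X @ W

Y = model(X)                   # all N output vectors in a single step
assert Y.shape == (N, d)
```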
For relative position encodings, as we compute the position embeddings relative to all words, there exist N distinct sentence embeddings. As a result, we must handle a tensor of shape (N, N, 5). The way we process the sentence in the relative position case differs from the absolute case, necessitating a novel approach.
One method to compute the model outputs, albeit at the cost of increased computational resources—specifically, N times more—is as follows:
That is, we first compute the relative position encodings relative to the first word, use them to compute the full encodings, and then feed that to the model to compute the first output vector. Then, we repeat the process for the second word, and so on.
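This brute-force procedure can be sketched as follows (same hypothetical stand-ins for the model and the encoding scheme as above):

```python
import numpy as np

N, d = 7, 5
rng = np.random.default_rng(0)
E = rng.normal(size=(N, d))                  # word embeddings
W = rng.normal(size=(d, d))
model = lambda X: X @ W                      # stand-in for the network

def rel_pos_encoding(pos, ref, d=5):
    """Toy relative encoding; depends only on the offset pos - ref."""
    return np.random.default_rng(pos - ref + 10_000).normal(size=d)

# N forward passes: one full (N, 5) encoding per reference word,
# i.e. an (N, N, 5) tensor overall, at N times the compute.
outputs = []
for ref in range(N):
    X_ref = E + np.stack([rel_pos_encoding(j, ref) for j in range(N)])
    outputs.append(model(X_ref)[ref])        # keep only the output at `ref`
Y = np.stack(outputs)                        # shape (N, 5)
assert Y.shape == (N, d)
```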
In an ideal world where computational resources are unlimited and latency is a non-issue, we could stop here. However, in the real world, we need to be more resourceful to efficiently use relative encodings.
In this section, we will discuss the method presented in the Transformer-XL paper. The authors, in the process of improving the ability of Transformer models to take into account long-range dependencies, propose an interesting and efficient way to incorporate relative position encoding into self-attention.
Before we dive into the details, let us begin by examining the original self-attention computation. We work with a sequence of vectors X, which we consider to be the full encodings or sum of word and (absolute) position embeddings. We denote the word and position embeddings using E and P, respectively.
The matrices W_q, W_k, and W_v project the input to create the Query, Key, and Value matrices.
Then, given a single query vector, Q_i, the attention computation is given by:
We can further dissect the Query-Key product as:
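Writing X_i = E_i + P_i as a row vector, so that Q_i = X_i W_q and K_j = X_j W_k, the product expands into four terms (content–content, content–position, position–content, position–position):

```latex
Q_i K_j^\top
  = (E_i + P_i)\, W_q W_k^\top (E_j + P_j)^\top
  = E_i W_q W_k^\top E_j^\top
  + E_i W_q W_k^\top P_j^\top
  + P_i W_q W_k^\top E_j^\top
  + P_i W_q W_k^\top P_j^\top
```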
To adapt the standard attention mechanism, the authors replace the position-dependent query vectors P_i with learned vectors u and v, and replace the absolute position matrix P^T with the relative position encoding matrix R_{j-i}, leading to:
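Applying those substitutions to the four-term expansion gives, per query–key pair (this follows the Transformer-XL formulation adapted to the single W_k used here; in the paper, u and v additionally absorb W_q, and separate key projections are used for content and position):

```latex
Q_i K_j^\top \;\longrightarrow\;
    E_i W_q W_k^\top E_j^\top
  + E_i W_q W_k^\top R_{j-i}^\top
  + u\, W_q W_k^\top E_j^\top
  + v\, W_q W_k^\top R_{j-i}^\top
```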
Here, the full relative position embedding matrix, R, is an extension of the absolute position embedding matrix, P, encompassing vectors corresponding to negative position indices as well:
The matrix R_{j-i} selects specific columns of the above matrix. For example, when i=0, we get:
For the case where i=1, R_{j-1} is defined as:
And so on. Thus, it’s easy to see how we can compute the Q_i K^T product for any i.
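This column selection is cheap to implement; here is a NumPy sketch, with `r(k)` a stand-in relative encoding for offset k:

```python
import numpy as np

N, d = 7, 5

def r(offset, d=5):
    """Stand-in relative position encoding for a (possibly negative) offset."""
    return np.random.default_rng(offset + 10_000).normal(size=d)

# Full matrix R: one column per offset, from -(N-1) up to N-1.
offsets = np.arange(-(N - 1), N)                  # [-6, ..., 6]
R = np.stack([r(k) for k in offsets], axis=1)     # shape (5, 2N-1)

def R_slice(i):
    """Columns of R for offsets j - i, with j = 0..N-1 (the matrix R_{j-i})."""
    start = (N - 1) - i
    return R[:, start:start + N]                  # shape (5, N)

# For i = 0 we select offsets 0..N-1; for i = 1, offsets -1..N-2; and so on.
assert np.allclose(R_slice(0)[:, 0], r(0))
assert np.allclose(R_slice(1)[:, 0], r(-1))
```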
The above enhancements to the attention mechanism give us an efficient way to incorporate relative position encodings into self-attention, preserving the critical relationships between words while avoiding the computational overhead of a brute-force approach. By utilizing learned vectors and the relative position encoding matrix, we effectively address the complexity introduced by relative encodings. Here is the complete description of relative attention:
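Putting the pieces together, a minimal single-head sketch of relative attention in NumPy (row-vector convention; u, v, the projections, and the encodings are random stand-ins for learned parameters):

```python
import numpy as np

N, d = 7, 5
rng = np.random.default_rng(0)

E = rng.normal(size=(N, d))                        # word embeddings
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
u, v = rng.normal(size=d), rng.normal(size=d)      # learned global vectors

# R[:, k + N - 1] is the relative encoding for offset k in [-(N-1), N-1].
R = rng.normal(size=(d, 2 * N - 1))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.empty((N, N))
for i in range(N):
    R_i = R[:, (N - 1 - i):(2 * N - 1 - i)]        # columns for offsets j - i
    scores[i] = (
        E[i] @ W_q @ W_k.T @ E.T                   # content-content
        + E[i] @ W_q @ W_k.T @ R_i                 # content-position
        + u @ W_q @ W_k.T @ E.T                    # position-content
        + v @ W_q @ W_k.T @ R_i                    # position-position
    )

A = softmax(scores / np.sqrt(d))                   # attention weights
Y = A @ (E @ W_v)                                  # output vectors, shape (N, 5)
assert Y.shape == (N, d)
```

Note that the loop over i only slices columns of a single precomputed R; no per-reference re-encoding of the whole sentence is needed, which is exactly the saving over the brute-force approach.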
It is also important to note that while we introduced relative position encoding and relative attention in the context of language modeling, these ideas have proven effective in other domains, such as Automatic Speech Recognition—as seen in the Conformer, for example.
In this blog, we've explored the intricacies of relative position encoding and attention, highlighting their significance in natural language processing. From the foundational aspects of word embeddings to the transformative power of relative attention, these concepts are central to the evolution of neural networks.
If the advancements in AI and neural network architectures pique your interest, check out ABR's revolutionary state space neural network, the Legendre Memory Unit, on our product page to learn about all they can offer for time series processing. We provide benchmark results in published papers to show state-of-the-art performance on language modeling tasks with significantly fewer parameters. If you don't have time for reading papers, but want to get an intuition for what all the fuss is about, check out our non-technical introduction to the LMU. For more insights, research, and breakthroughs in the realm of artificial intelligence, be sure to stay connected with the ABR blog. Your journey into the ever-evolving world of AI is just beginning.
The sinusoidal positional encodings proposed in the “Attention Is All You Need” paper have a unique property: any fixed offset in position can be represented as a linear function of the positional embeddings. In other words, given any two position encodings, there exists a linear transformation (represented by a matrix) that maps one to the other. Let us explore this encoding scheme in some detail.
The positional encoding at position n is given by a vector where the ith element (of the vector) is defined as follows:
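In the notation of the original paper, with d the embedding dimension, the definition is:

```latex
PE(n)_{2i}   = \sin\!\left( n \,/\, 10000^{2i/d} \right), \qquad
PE(n)_{2i+1} = \cos\!\left( n \,/\, 10000^{2i/d} \right)
```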
Given a fixed offset a, we can use the trigonometric identities to define the position encoding at position n + a. For even indices, we have:
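Writing ω_i = 1 / 10000^{2i/d} for the ith frequency, the angle-addition identity for sine gives:

```latex
PE(n+a)_{2i} = \sin\big(\omega_i (n+a)\big)
             = \sin(\omega_i n)\cos(\omega_i a) + \cos(\omega_i n)\sin(\omega_i a)
             = PE(n)_{2i}\cos(\omega_i a) + PE(n)_{2i+1}\sin(\omega_i a)
```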
For odd indices:
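With the same ω_i, the angle-addition identity for cosine gives:

```latex
PE(n+a)_{2i+1} = \cos\big(\omega_i (n+a)\big)
               = \cos(\omega_i n)\cos(\omega_i a) - \sin(\omega_i n)\sin(\omega_i a)
               = PE(n)_{2i+1}\cos(\omega_i a) - PE(n)_{2i}\sin(\omega_i a)
```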
The above equations define a linear transformation of the form:
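```latex
PE(n+a) = M_a\, PE(n), \qquad
M_a = \operatorname{diag}(B_0, \dots, B_{d/2-1}), \quad
B_i = \begin{pmatrix} \cos(\omega_i a) & \sin(\omega_i a) \\
                      -\sin(\omega_i a) & \cos(\omega_i a) \end{pmatrix}
```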
where M takes the form of a block diagonal matrix.
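We can verify this property numerically with a short NumPy sketch (d = 8 here is an arbitrary choice; the function and matrix names are ours):

```python
import numpy as np

def pe(n, d=8):
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""
    i = np.arange(d // 2)
    w = 1.0 / 10000 ** (2 * i / d)       # frequencies, one per sin/cos pair
    out = np.empty(d)
    out[0::2] = np.sin(w * n)            # even indices
    out[1::2] = np.cos(w * n)            # odd indices
    return out

def offset_matrix(a, d=8):
    """Block-diagonal M_a with one 2x2 rotation-like block per frequency."""
    i = np.arange(d // 2)
    w = 1.0 / 10000 ** (2 * i / d)
    M = np.zeros((d, d))
    for k, wk in enumerate(w):
        c, s = np.cos(wk * a), np.sin(wk * a)
        M[2 * k:2 * k + 2, 2 * k:2 * k + 2] = [[c, s], [-s, c]]
    return M

# PE(n + a) == M_a @ PE(n); crucially, M_a depends only on a, not on n.
n, a = 11, 4
assert np.allclose(pe(n + a), offset_matrix(a) @ pe(n))
```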
The existence of a linear transformation between different positional encodings implies that the model could, in theory, learn to shift its attention from one position to another in the sequence by applying an appropriate linear transformation to the positional encodings.