
Understanding Relative Position Encoding and Attention

August 16, 2023
Narsimha Chilkuri

In the field of natural language processing, it’s crucial to understand how words and sentences are encoded and processed within neural networks. This involves transforming sentences into vectorized representations using methods like Word2Vec, which is essential for training models like Transformers. In this blog post, we will explore key concepts such as word embeddings, both absolute and relative position encodings, and the idea of relative attention. Together, these elements help the network recognize words and the connections between them in various positions, leading to a more nuanced comprehension of language.

Word2Vec

Consider the following two sentences: ‘Bengio likes neural networks’ and ‘Hinton said that Bengio likes neural networks’. Before these sentences are inputted into a language model (such as a Transformer), they undergo a two-step preparation process.

Code snippet showing two sentence arrays: 'Bengio likes neural networks' and 'Hinton said that Bengio likes neural networks'.
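Since the snippet itself appears only as an image, here is a minimal sketch of what it likely contains (the variable names are assumptions):

```python
# Two example sentences, tokenized into lowercase words (names are illustrative)
sentence_1 = ["bengio", "likes", "neural", "networks"]
sentence_2 = ["hinton", "said", "that", "bengio", "likes", "neural", "networks"]
```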

Word Embeddings

The initial step involves mapping each word to its corresponding word embedding. Essentially, each word is assigned a d-dimensional vector (5-dimensional in this example). For the sake of this explanation, let’s assume that we have a simple function that receives a word as input and returns its corresponding 5-dimensional vector.

Code snippet illustrating the creation of word embeddings for sentences using a dictionary and a function to retrieve word embeddings
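A minimal sketch of this step, assuming random 5-dimensional vectors and a hypothetical `get_word_embedding` helper, continuing the snippet above:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5  # embedding dimension used throughout this post

# A toy vocabulary mapping each word to a random 5-dimensional vector
vocab = ["bengio", "likes", "neural", "networks", "hinton", "said", "that"]
word_embeddings = {word: rng.standard_normal(d) for word in vocab}

def get_word_embedding(word):
    """Return the 5-dimensional embedding for a word."""
    return word_embeddings[word]

# Stack the per-word vectors into (N, 5) arrays, one row per token
embeddings_1 = np.stack([get_word_embedding(w) for w in sentence_1])
embeddings_2 = np.stack([get_word_embedding(w) for w in sentence_2])
```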

We can visually represent these vectors using a color-coding system for the individual elements in the 5-dimensional vector. This approach allows us to create the following figures (Figure 1) for both sentences.

Figure 1 (top): Word embeddings for Sentence 1, shown as colored squares, with dimensions 0-4 on the left and the words 'bengio', 'likes', 'neural', 'networks' at the bottom.
Figure 1 (bottom): Word embeddings for Sentence 2, shown as colored squares, with dimensions 0-4 on the left and the words 'hinton', 'said', 'that', 'bengio', 'likes', 'neural', 'networks' at the bottom.

In the figure on the top, the token at position 0 (on the x-axis) corresponds to the word ‘Bengio’, the token at position 1 corresponds to ‘likes’, and so forth. The specifics of the mapping are less important than understanding that columns 3 to 6 in the bottom image perfectly replicate the top image. This mapping shows us—and importantly, the neural network—that the first sentence is embedded within the second sentence. In other words, the information from the top figure is contained within the bottom figure.

Position Encodings

The second step in preparing our data involves adding position encodings to the word embeddings. This step is crucial to counteract the permutation invariance inherent in the attention mechanism. For the context of this discussion, the specifics of the encoding scheme aren’t vital. So, let’s imagine a straightforward scheme as an example:

Code snippet illustrating position encodings generation and their addition to word embeddings for two sentences.
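One such naive scheme, sketched below, simply assigns a random 5-dimensional vector to each absolute position and adds it to the corresponding word embedding (a sketch only; the snippet in the original post may differ):

```python
# One random 5-dimensional vector per absolute position (a deliberately naive scheme)
max_len = 10
position_encodings = {pos: rng.standard_normal(d) for pos in range(max_len)}

def get_position_encoding(pos):
    """Return the encoding for an absolute position."""
    return position_encodings[pos]

# Full encodings: word embedding + position encoding, position by position
full_encodings_1 = np.stack(
    [get_word_embedding(w) + get_position_encoding(i) for i, w in enumerate(sentence_1)]
)
full_encodings_2 = np.stack(
    [get_word_embedding(w) + get_position_encoding(i) for i, w in enumerate(sentence_2)]
)
```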

We can obtain the complete encoding by adding the position encodings at each position to the corresponding word embedding. Once done, the figures from the previous section can be updated as follows:

Visualization of word plus position embeddings for Sentence 1 with colored squares. Labeled 'dimension' from 0-4 on the left and words 'bengio', 'likes', 'neural', and 'networks' at the bottom.
Visualization of word plus position embeddings for Sentence 2 with colored squares. Labeled 'dimension' from 0-4 on the left and words 'hinton', 'said', 'that', 'bengio', 'likes', 'neural', 'networks' at the bottom.

One critical aspect to observe here is that since the words ‘Bengio’, ‘likes’, ‘neural’, and ‘networks’ occupy different absolute positions in the two sentences—(0, 1, 2, 3) in the first and (3, 4, 5, 6) in the second—we add different position encodings to the word embeddings. As a result, the final sequence of vectors lacks the similarity they initially possessed. This transformation raises an interesting question—can we do something to simplify this process for our neural networks?

While carefully designed encoding schemes, such as the sinusoidal position encodings proposed in the “Attention Is All You Need” paper, represent a distinct approach to addressing this issue, our focus here is on relative position encodings. See the appendix for a brief overview of sinusoidal encodings.

Relative Position Encodings

To implement relative position encodings, we introduce a minor modification: our encoding function now takes two arguments, the absolute position of the word we are encoding and the absolute position of a reference word, and the resulting encoding depends only on the difference between the two:

Code snippet defining a function to calculate relative position encoding based on two input positions

For example, in the second sentence, if we are computing encodings relative to the word ‘Bengio’, then the word ‘Hinton’ would have a position index of ‘0 - 3 = -3’. Hence, we also need to account for negative indices when generating position encodings.

Code snippet creating a dictionary of relative position encodings with random 5-dimensional vectors for a range of -10 to 10.
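Putting the two pieces together, a sketch might look as follows; the dictionary covers offsets from -10 to 10, and the function signature (word position first, reference position second) is an assumption:

```python
# Relative position encodings: one random 5-dimensional vector per offset in [-10, 10]
relative_position_encodings = {
    offset: rng.standard_normal(d) for offset in range(-10, 11)
}

def get_relative_position_encoding(pos, reference_pos):
    """Encoding of the word at `pos`, relative to the word at `reference_pos`."""
    return relative_position_encodings[pos - reference_pos]
```

For the example above, with 'Bengio' at index 3 as the reference, `get_relative_position_encoding(0, 3)` returns the vector stored at offset -3, which is the encoding used for 'Hinton'.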

If we pick a reference word, compute the position encodings of all words relative to that reference, and then add these encodings to the word embeddings, the resulting figures closely resemble those in Figure 1. We can clearly identify the containment of the first sentence within the second.

Code snippet calculating position encodings relative to the word 'Bengio' and then adding these encodings to word embeddings for two sentences.
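A sketch of that computation, continuing the snippets above; 'Bengio' sits at index 0 in the first sentence and index 3 in the second:

```python
# Reference word: 'Bengio' (index 0 in sentence 1, index 3 in sentence 2)
ref_1, ref_2 = 0, 3

relative_encodings_1 = np.stack(
    [get_word_embedding(w) + get_relative_position_encoding(i, ref_1)
     for i, w in enumerate(sentence_1)]
)
relative_encodings_2 = np.stack(
    [get_word_embedding(w) + get_relative_position_encoding(i, ref_2)
     for i, w in enumerate(sentence_2)]
)
```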
Visualization of word embeddings plus position encodings relative to "Bengio" for Sentence 1, with colored squares. Labeled 'dimension' from 0-4 on the left and words 'bengio', 'likes', 'neural', and 'networks' at the bottom.
Visualization of word embeddings plus position encodings relative to "Bengio" for Sentence 2, with colored squares. Labeled 'dimension' from 0-4 on the left and words 'hinton', 'said', 'that', 'bengio', 'likes', 'neural', 'networks' at the bottom.

There is a catch, however. In the case of absolute position encodings, given a sentence of N words, there is only one way to compute the complete sentence embedding—by adding the word embeddings to the absolute position encodings, leading to a tensor of shape (N, 5). Given this encoded sequence, we can calculate the output vectors in a single step:

Code snippet passing 'full_encodings' to a model function to generate outputs.
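In code, this amounts to a single forward pass. The model below is just a stand-in (a fixed random linear map) so that the sketch runs end to end:

```python
# Stand-in for a real network: a fixed random linear map (illustration only)
W_model = rng.standard_normal((d, d))

def model(x):
    return x @ W_model

# With absolute encodings there is one (N, 5) tensor and one forward pass
outputs = model(full_encodings_1)   # shape: (N, 5)
```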

For relative position encodings, as we compute the position embeddings relative to all words, there exist N distinct sentence embeddings. As a result, we must handle a tensor of shape (N, N, 5). The way we process the sentence in the relative position case differs from the absolute case, necessitating a novel approach.

Brute-force Relative Attention

One method to compute the model outputs, albeit at the cost of increased computational resources—specifically, N times more—is as follows:

Pseudo code snippet iterating over relative indices, calculating position encodings, generating full encodings, and appending model outputs to a list.
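A sketch of that procedure, continuing the snippets above; each iteration rebuilds the full encodings relative to one reference position and runs the model once:

```python
outputs = []
for ref in range(len(sentence_2)):
    # Encodings of every word relative to the word at position `ref`
    full_encodings = np.stack(
        [get_word_embedding(w) + get_relative_position_encoding(i, ref)
         for i, w in enumerate(sentence_2)]
    )
    # One full forward pass just to read off the output at position `ref`
    outputs.append(model(full_encodings)[ref])
outputs = np.stack(outputs)   # shape: (N, 5), at N times the cost
```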

That is, we first compute the relative position encodings relative to the first word, use them to compute the full encodings, and then feed that to the model to compute the first output vector. Then, we repeat the process for the second word, and so on.

In an ideal world where computational resources are unlimited and latency is a non-issue, we could stop here. However, in the real world, we need to be more resourceful to efficiently use relative encodings.

Relative Attention

In this section, we will discuss the method presented in the Transformer-XL paper. The authors, in the process of improving the ability of Transformer models to take into account long-range dependencies, propose an interesting and efficient way to incorporate relative position encoding into self-attention.

Self-Attention

Before we dive into the details, let us begin by examining the original self-attention computation. We work with a sequence of vectors X, which we consider to be the full encodings or sum of word and (absolute) position embeddings. We denote the word and position embeddings using E and P, respectively.

Equations representing the projection of input to create Query, Key, and Value matrices using matrices W_q, W_k, and W_v.

The matrices W_q, W_k, and W_v project the input to create the Query, Key, and Value matrices.

Equations defining the matrices W_q, W_k, and W_v and their use in projecting input to create Query (Q), Key (K), and Value (V) matrices.
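A plausible rendering of these definitions, assuming a row-vector convention in which each row of X is one token's full encoding:

$$
X = E + P, \qquad Q = X W_q, \qquad K = X W_k, \qquad V = X W_v
$$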

Then, given a single query vector, Q_i, the attention computation is given by:

Equation representing the attention mechanism, calculating the attention of Q_i with respect to K, and multiplying with matrix V.
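In the standard scaled dot-product form (the division by $\sqrt{d}$ is the usual convention and is included here for completeness):

$$
\mathrm{Attention}(Q_i, K, V) = \mathrm{softmax}\!\left(\frac{Q_i K^\top}{\sqrt{d}}\right) V
$$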

We can further dissect the Query-Key product as:

Equation breaking down the calculation of attention weights using Q_i and K matrices, incorporating embeddings (E) and position encodings (P) with projection matrices W_q and W_k.
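Substituting X = E + P into Q_i and K, a plausible rendering of this decomposition is:

$$
Q_i K^\top = (E_i + P_i)\, W_q W_k^\top\, (E + P)^\top
= E_i W_q W_k^\top E^\top + E_i W_q W_k^\top P^\top + P_i W_q W_k^\top E^\top + P_i W_q W_k^\top P^\top
$$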

Modifying Self-Attention

To enhance the standard attention mechanism, the authors replace the P_i vectors with learned vectors u and v. The P^T matrix is replaced by the relative position encoding matrix R_{j-i}, leading to:

Equation for the calculation of attention weights using Q_i and K matrices, incorporating embeddings (E) and colored terms u, v, and R_{j-i} to highlight the relative position encoding.
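Taken literally, with u standing in for P_i in the third term, v standing in for P_i in the fourth, and R_{j-i} replacing P, one plausible rendering is the following (Transformer-XL additionally uses separate key projections for the content and position terms, a detail glossed over here):

$$
Q_i K^\top \;\longrightarrow\;
E_i W_q W_k^\top E^\top + E_i W_q W_k^\top R_{j-i}^\top + u\, W_q W_k^\top E^\top + v\, W_q W_k^\top R_{j-i}^\top
$$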

Here, the full relative position embedding matrix, R, is an extension of the absolute position embedding matrix, P, encompassing vectors corresponding to negative position indices as well:

Matrix representation of R^{\top} showing the relative position encodings from P_{-N} to P_{N-1}.
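Writing each absolute encoding P_k as a row of R, the transposed matrix can be laid out as:

$$
R^\top = \begin{bmatrix} P_{-N}^\top & P_{-N+1}^\top & \cdots & P_{-1}^\top & P_{0}^\top & \cdots & P_{N-1}^\top \end{bmatrix}
$$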

The matrix R_{j-i} selects specific columns of the above matrix. For example, when i=0, we get:

Matrix representation of R_{j-0}^{\top} showing the relative position encodings from P_{0} to P_{N-1}.

For the case where i=1, R_{j-1} is defined as:

Matrix representation of R_{j-1}^{\top} showing the relative position encodings from P_{-1} to P_{N-2}.

And so on. It is thus easy to see how we can compute the Q_i K^T product for any i.

Bringing it All Together

The above enhancements to the attention mechanism give us an efficient way to incorporate relative position encodings into self-attention, preserving the critical relationships between words while avoiding the computational overhead of a brute-force approach. By utilizing learned vectors and the relative position encoding matrix, we effectively address the complexity introduced by relative encodings. Here is the complete description of relative attention:

Set of equations detailing the attention mechanism process, from input embeddings to the final attention calculation, with emphasis on relative position encodings highlighted in red.
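As a hedged summary, assuming (as in Transformer-XL) that the values are computed from the word embeddings alone and keeping the simplified single key projection from above, the computation for output i can be written as:

$$
\begin{aligned}
V &= E\, W_v,\\
A_{i,j} &= E_i W_q W_k^\top E_j^\top + E_i W_q W_k^\top R_{j-i}^\top + u\, W_q W_k^\top E_j^\top + v\, W_q W_k^\top R_{j-i}^\top,\\
\mathrm{out}_i &= \mathrm{softmax}_j\!\left(\frac{A_{i,j}}{\sqrt{d}}\right) V.
\end{aligned}
$$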

It is also important to note that while we introduced relative position encoding and relative attention in the context of language modelling, these ideas have proven effective in other domains, such as Automatic Speech Recognition—as seen in the Conformer, for example.

Summary

In this blog, we've explored the intricacies of relative position encoding and attention, highlighting their significance in natural language processing. From the foundational aspects of word embeddings to the transformative power of relative attention, these concepts are central to the evolution of neural networks.

If the advancements in AI and neural network architectures pique your interest, check out ABR's revolutionary state space neural network, the Legendre Memory Unit, on our product page to learn about all it can offer for time series processing. We provide benchmark results in published papers to show state-of-the-art performance on language modeling tasks with significantly fewer parameters. If you don't have time to read the papers, but want to get an intuition for what all the fuss is about, check out our non-technical introduction to the LMU. For more insights, research, and breakthroughs in the realm of artificial intelligence, be sure to stay connected with the ABR blog. Your journey into the ever-evolving world of AI is just beginning.

Appendix - Sinusoidal Position Encodings

The sinusoidal positional encodings proposed in the “Attention Is All You Need” paper have a unique property: any fixed offset in position can be represented as a linear function of the positional embeddings. In other words, given any two position encodings, there exists a linear transformation (represented by a matrix) that maps one to the other. Let us explore this encoding scheme in some detail.

The positional encoding at position n is given by a vector where the ith element (of the vector) is defined as follows:

Equations for sinusoidal position encoding using sine for even indices and cosine for odd indices, with a scaling factor based on the model's dimension.
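Using the formulation from the original paper, with d the embedding dimension and a shorthand ω_i introduced for later use:

$$
P_n[2i] = \sin\!\left(\frac{n}{10000^{2i/d}}\right) = \sin(n\,\omega_i),
\qquad
P_n[2i+1] = \cos\!\left(\frac{n}{10000^{2i/d}}\right) = \cos(n\,\omega_i),
\qquad
\omega_i = \frac{1}{10000^{2i/d}}
$$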

Given a fixed offset a, we can use standard trigonometric identities to express the position encoding at position n + a in terms of the encoding at position n. For even indices, we have:

Equations showing the shifted sinusoidal position encoding, breaking down the relationship between original and shifted encodings using trigonometric identities.
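With the shorthand ω_i from above, the angle-sum identity for sine gives:

$$
P_{n+a}[2i] = \sin\big((n+a)\,\omega_i\big)
= \sin(n\omega_i)\cos(a\omega_i) + \cos(n\omega_i)\sin(a\omega_i)
= \cos(a\omega_i)\, P_n[2i] + \sin(a\omega_i)\, P_n[2i+1]
$$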

For odd indices:

Equations detailing the shifted cosine position encoding, illustrating the relationship between original and shifted encodings using trigonometric identities.
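Similarly, the angle-sum identity for cosine gives:

$$
P_{n+a}[2i+1] = \cos\big((n+a)\,\omega_i\big)
= \cos(n\omega_i)\cos(a\omega_i) - \sin(n\omega_i)\sin(a\omega_i)
= \cos(a\omega_i)\, P_n[2i+1] - \sin(a\omega_i)\, P_n[2i]
$$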

The above equations define a linear transformation of the form:

Equation showing the shifted position encoding P_{n+a} as a result of matrix multiplication M with the original position encoding P_n.

where M takes the form of a block diagonal matrix.
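Collecting the two identities above, each (2i, 2i+1) pair of coordinates is transformed by a 2x2 block that depends only on the offset a, which is why M is block diagonal (the exact layout depends on how even and odd entries are ordered; this is one consistent choice):

$$
\begin{bmatrix} P_{n+a}[2i] \\ P_{n+a}[2i+1] \end{bmatrix}
=
\begin{bmatrix} \cos(a\omega_i) & \sin(a\omega_i) \\ -\sin(a\omega_i) & \cos(a\omega_i) \end{bmatrix}
\begin{bmatrix} P_n[2i] \\ P_n[2i+1] \end{bmatrix}
$$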

The existence of a linear transformation between different positional encodings implies that the model could, in theory, learn to shift its attention from one position to another in the sequence by applying an appropriate linear transformation to the positional encodings.

Narsimha Chilkuri is a Machine Learning Engineer at Applied Brain Research (ABR) and an alumnus of the University of Waterloo. Transitioning from theoretical physics to AI, Narsimha's passion lies in leveraging artificial intelligence to solve complex challenges. His collaboration with industry experts and his involvement in projects at ABR underscore his commitment to advancing the AI frontier.
