Uploaded on 01/30/2022

zakaria-hamane

The NLP Cookbook: Modern Recipes for
Transformer based Deep Learning Architectures
SUSHANT SINGH AND AUSIF MAHMOOD
Department of Computer Science & Engineering, University of Bridgeport, Connecticut, CT 06604, USA
Corresponding author: Sushant Singh (susha[email protected])
ABSTRACT In recent years, Natural Language Processing (NLP) models have achieved phenomenal success in linguistic
and semantic tasks like text classification, machine translation, cognitive dialogue systems, information retrieval via Natural
Language Understanding (NLU), and Natural Language Generation (NLG). This feat is primarily attributed to the seminal
Transformer architecture, leading to designs such as BERT, GPT (I, II, III), etc. Although these large-size models have
achieved unprecedented performances, they come at high computational costs. Consequently, some of the recent NLP
architectures have utilized concepts of transfer learning, pruning, quantization, and knowledge distillation to achieve moderate
model sizes while keeping nearly similar performances as achieved by their predecessors. Additionally, to mitigate the data
size challenge raised by language models from a knowledge extraction perspective, Knowledge Retrievers have been built to
extricate explicit data documents from a large corpus of databases with greater efficiency and accuracy. Recent research has
also focused on superior inference by providing efficient attention to longer input sequences. In this paper, we summarize and
examine the current state-of-the-art (SOTA) NLP models that have been employed for numerous NLP tasks for optimal
performance and efficiency. We provide a detailed understanding and functioning of the different architectures, a taxonomy
of NLP designs, comparative evaluations, and future directions in NLP.
INDEX TERMS Deep Learning, Natural Language Processing (NLP), Natural Language Understanding (NLU), Natural
Language Generation (NLG), Information Retrieval (IR), Knowledge Distillation (KD), Pruning, Quantization
I. INTRODUCTION
Natural Language Processing (NLP) is a field of Machine
Learning dealing with linguistics that builds and develops
Language Models. Language Modeling (LM) determines the
likelihood of word sequences occurring in a sentence via
probabilistic and statistical techniques. Since human
languages involve sequences of words, the initial language
models were based on Recurrent Neural Networks (RNNs).
Because RNNs can lead to vanishing and exploding
gradients for long sequences, improved recurrent networks
like LSTMs and GRUs were utilized for improved
performance. Despite enhancements, LSTMs were found to
lack comprehension when relatively longer sequences were
involved. This is because the entire history,
known as the context, is handled by a single state vector.
However, greater compute resources led to an influx of
novel architectures causing a meteoric rise of Deep Learning
[1] based NLP models.
The breakthrough Transformer [2] architecture in 2017
overcame LSTM’s context limitation via the Attention
mechanism. Additionally, it provided greater throughput as
inputs are processed in parallel with no sequential
dependency. The subsequent launch of improved Transformer-based
models like GPT-I [3] and BERT [4] made 2018 a climacteric
year for the NLP world. These
architectures were trained on large datasets to create pre-
trained models. Thereafter transfer learning was used to fine-
tune these models for task-specific features resulting in
significant performance enhancement on several NLP tasks
[5],[6],[7],[8],[9],[10]. These tasks include but are not
limited to language modeling, sentiment analysis, question
answering, and natural language inference.
This accomplishment fell short of transfer learning’s primary
objective of achieving high model accuracy with minimal
fine-tuning samples. Also, model performance needs to be
generalized across several datasets and not be task or dataset-
specific [11],[12],[13]. However, the goal of high
generalization and transfer learning was being compromised
as an increasing amount of data was being used for both pre-
training and fine-tuning purposes. This clouded the decision
whether greater training data or an improved architecture
should be incorporated to build a better SOTA language
model. For instance, the subsequent XLNet [14] architecture
introduced a novel yet intricate language modeling approach that
provided only a marginal improvement over the simpler BERT
architecture, which was trained on a mere ~10% of XLNet’s
data (113GB). Thereafter, RoBERTa [15], a large BERT-based
model trained on significantly more data than BERT (160GB),
outperformed XLNet. Thus, an architecture that is more
generalizable and is trained on larger data sets new NLP
benchmarks.

The above-mentioned architectures are primarily language

understanding models, where a natural dialect is mapped to

a formal interpretation. Here the initial goal is the translation

of an input user utterance into a conventional phrase

representation. For Natural Language Understanding (NLU)

the intermediate representation for the above models’ end

goal is dictated by the downstream tasks.

Meanwhile, fine-tuning was proving to be progressively

challenging for task-specific roles in NLU models as it

required a greater sample size to learn a particular task,

which deprived such models of generalization [16]. This

triggered the advent of Natural Language Generation (NLG)

models that, contrary to NLU training, generated dialect

utterances learned from their corresponding masked or

corrupted input semantics. Such models operate differently

from a routine downstream approach of cursory language

comprehension and are optimal for sequence-to-sequence

generation tasks, such as language translation. Models like

T5 [17], BART [18], mBART [19], and T-NLG [20] were pre-

trained on a large corpus of corrupted text and generated the

corresponding cleaned text via a denoising objective [21]. This

transition was useful as the additional fine-tuning layer for

NLU tasks was not required for NLG purposes. This further

enhanced prediction ability via zero or few-shot learning

which enabled sequence generation with minimal or no fine-

tuning. For instance, if a model’s semantic embedding space

is pre-trained with animal identification of “cat”, “lion” and

“chimpanzee”, it could still correctly predict “dog” without

fine-tuning. Despite superior sequence generation

capabilities, NLG model sizes surged exponentially with the

subsequent release of GPT-III [22], which was the largest

model before the release of GShard [23].

Since NLU and NLG’s exceptionally large-sized models

required several GPUs to load, this turned out costly and

resource prohibitive in most practical situations. Further,

when trained for several days or weeks on GPU clusters,

these colossal models came at an exorbitant energy cost. To

mitigate such computational costs [24], Knowledge

Distillation (KD) [25] based models like DistilBERT [26],

TinyBERT [27], and MobileBERT [28] were introduced at

reduced inference cost and size. These smaller student

models capitalized on the inductive bias of larger teacher

models (BERT) to achieve faster training time. Similarly,

pruning and quantization [29] techniques became popular to

build economically sized models. Pruning can be classified

into 3 categories: weight pruning, layer pruning, and head

pruning where certain minimal contributing weights, layers,

and attention heads are removed from the model. Like

pruning, training-aware quantization is performed to achieve

less than 32-bit precision format thereby reducing model

size.
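As a rough illustration of the weight pruning and reduced-precision ideas above, the sketch below zeroes out the smallest-magnitude weights and then maps floats to 8-bit integers. It is a minimal standard-library example under stated assumptions: the function names (`magnitude_prune`, `quantize_int8`, `dequantize`) are hypothetical, weight magnitude is assumed as the pruning criterion, and the quantization shown is simple post-training symmetric quantization, not the training-aware variant mentioned in the text.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)  # number of weights to remove
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    pruned, removed = [], 0
    for w in weights:
        # The `removed < k` guard keeps ties from pruning more than k weights.
        if abs(w) <= threshold and removed < k:
            pruned.append(0.0)
            removed += 1
        else:
            pruned.append(w)
    return pruned


def quantize_int8(weights):
    """Symmetric linear quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the 8-bit integers back to approximate float values."""
    return [v * scale for v in q]


if __name__ == "__main__":
    w = [0.5, -0.02, 0.3, 0.01, -0.8, 0.04]
    print(magnitude_prune(w, 0.5))
    q, scale = quantize_int8(w)
    print(q, dequantize(q, scale))
```

Storing `q` plus a single `scale` takes roughly a quarter of the space of 32-bit floats, at the price of a rounding error of at most half a quantization step per weight.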

For higher performance, greater learning was required which

resulted in larger data storage and model size. Due to the

model’s enormity and implicit knowledge storage, its

learning ability had caveats in terms of efficient information

access. Current Knowledge Retrieval models like ORQA

[30], REALM [31], RAG [32], and DPR [33] attempt to alleviate

implicit storage concerns of language models by providing

external access to interpretable modular knowledge. This

was achieved by supplementing the language model’s pre-

training with a ‘knowledge retriever’ that facilitated the

model to effectively retrieve and attend over explicit target

documents from a large corpus like Wikipedia.

Further, the Transformer model’s inability to handle input

sequences beyond a fixed token span prevented it from

comprehending large textual bodies holistically. This was

particularly evident when related words were farther apart

than the input length. Hence, to enhance contextual

understanding, architectures like Transformer-XL [34],

Longformer [35], ETC [36], and Big Bird [37] were introduced

with modified attention mechanisms to process longer

sequences.

Also, due to the surge in demand for NLP models to be

economically viable and readily available on edge devices,

innovative compressed models were launched based on

generic techniques. These are apart from the Distillation,

Pruning, and Quantization techniques described earlier. Such

models deploy a wide range of computing optimization

procedures ranging from hashing [38], sparse attention [39],

factorized embedding parameterization [40], replaced token

detection [41], and inter-layer parameter sharing [42], to a

combination of the above.

II. RELATED REVIEWS/TAXONOMY

We propose a novel NLP based taxonomy providing a

unique classification of current NLP models from six

different perspectives:

NLU Models : NLU models excel in classification,

structured prediction, and/or query generation tasks. This

is accomplished through pre-training and fine-tuning

motivated by the downstream task.

NLG Models : Contrary to NLU models, these stand out

in sequence-to-sequence generation tasks. They generate

clean text via few and single-shot learning from

corresponding corrupted utterances.

Model Size Reduction : Use compression-based

techniques like KD, Pruning, and Quantization to make

large models economical and pragmatic. It's useful for the

real-time deployment of large language models to operate

on edge devices.

Information Retrieval (IR) : Contextual open domain

question answering (QA) is reliant on effective and

efficient document retrieval. Hence, IR systems via

superior lexical and semantical extraction of physical

This combined abstract representation of all the words is fed

to the decoder to compute the desired language-based task.

Like its preceding layers, the final layer’s corresponding

learnable parameters are $U_{t+1}$ and $V_{t+1}$ at input and output

respectively at the Encoder and $U'_{t+1}$, $V'_{t+1}$ at the Decoder.

Combining the weight matrices with hidden state and bias

can be expressed mathematically as follows:

Encoder:

$$h_{t+1} = \sigma\left(U_{t+1}\, x_{t+1} + W_{t+1}\, h_t + b_h\right)$$

$$y_{t+1} = V_{t+1}\, h_{t+1} + b_y$$

Decoder:

$$h'_{t+1} = \sigma\left(U'_{t+1}\, x'_{t+1} + W'_{t+1}\, h'_t + b'_h\right)$$

$$y'_{t+1} = V'_{t+1}\, h'_{t+1} + b'_y$$

Thereafter, the introduction of Attention [48],[49] in 2014-15

overcame the limitations of the RNN Encoder-Decoder, which suffered

from prior input dependencies that made it challenging to infer

longer sequences, and from vanishing and exploding

gradients [50]. The attention mechanism eliminated the RNN

dependency by no longer routing the entire input context through one

final Encoder node. It weighs all inputs individually that feed

the decoder to create the target sequence. This results in a

greater contextual understanding leading to superior

predictions in target sequence generation. First, the

alignment determines the extent of match between the $j$-th

input and the $t$-th output, which can be determined as

$$e_{tj} = \tanh(h_{t-1}, h_j)$$

More precisely, the alignment scores take as input all

encoder output states and the previous decoded hidden state

which is expressed as:

$$\text{Score}_{Align} = W_{comb} \cdot \tanh(W_{dec} \cdot H_{dec} + W_{enc} \cdot H_{enc})$$

The decoder's hidden state and encoder outputs are passed

via their respective linear layers along with their trainable

weights. The weight $\alpha_{tj}$ for each encoded hidden

representation $h_j$ is computed as:

$$\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T_x} \exp(e_{tk})}$$

The resulting context vector in this attention mechanism is

determined by:

$$c_t = \sum_{j=1}^{T_x} \alpha_{tj}\, h_j$$

The Attention mechanism is essentially the generation of the

context vector computed from the various alignment scores

at different positions as shown in figure 3.
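The alignment, softmax weighting, and context-vector steps described above can be sketched in a few lines. This is a minimal standard-library illustration, not the authors' implementation: the function name `additive_attention`, the helper `matvec`, and the argument names `W_dec`, `W_enc`, and `w_comb` are assumptions chosen to mirror the decoder, encoder, and combination weights mentioned in the text, and vectors are plain Python lists.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(dec_hidden, enc_states, W_dec, W_enc, w_comb):
    """Return (attention weights, context vector) for one decoder step."""
    proj_dec = matvec(W_dec, dec_hidden)  # previous decoder state, projected
    scores = []
    for h_j in enc_states:
        proj_enc = matvec(W_enc, h_j)     # encoder state j, projected
        hidden = [math.tanh(a + b) for a, b in zip(proj_dec, proj_enc)]
        # Alignment score for this encoder position.
        scores.append(sum(w * x for w, x in zip(w_comb, hidden)))
    alphas = softmax(scores)              # weights sum to one over the inputs
    dim = len(enc_states[0])
    # Context vector: attention-weighted sum of the encoder states.
    context = [sum(a * h[d] for a, h in zip(alphas, enc_states)) for d in range(dim)]
    return alphas, context


if __name__ == "__main__":
    enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    eye = [[1.0, 0.0], [0.0, 1.0]]
    print(additive_attention([0.5, -0.5], enc, eye, eye, [1.0, 1.0]))
```

Because every encoder state contributes to the context vector in proportion to its weight, the decoder no longer depends on a single final encoder state.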

Luong’s Attention mechanism differs from the above-

mentioned Bahdanau’s in terms of alignment score

computation. It uses both global and local attention, where

the global attention uses all encoder output states while the

local attention focuses on a small subset of words. This helps

to achieve superior translation for lengthier sequences. These

attention designs led to the development of modern

Transformer architectures which use an enhanced attention

mechanism as described in the next section.

FIGURE 3. Attention Mechanism on Encoder-Decoder Model

IV. NLU ARCHITECTURES

NLU’s approach of transferring pre-trained neural language

representations demonstrated that pre-trained embeddings

improve downstream task results when compared to

embeddings learned from scratch [51],[52]. Subsequent

research works enhanced learning to capture contextualized

word representations and transferred them to neural models

[53],[54]. Recent efforts, including but not limited to [55],[56],[57], have

further built on these ideas by adding end-to-end fine-tuning

of language models for downstream tasks in addition to

extraction of contextual word representations. This

engineering progression, coupled with large compute

availability has evolved NLU’s state of the art methodology

from transferring word embeddings to transferring entire

multi-billion parameter language models, achieving

unprecedented results across NLP tasks. Contemporary NLU

models leverage Transformers for modeling tasks and

exclusively use an Encoder or a Decoder-based approach as

per requirements. Such models are vividly explained in the

subsequent section.

IV-A TRANSFORMERS

IV-A.1. The Architecture

The original Transformer is a 6-layered Encoder-Decoder

model, that generates a target sequence via the Decoder from

the source sequence via the Encoder. The Encoder and

Decoder at a high level consist of a self-attention and a feed-

forward layer. In the Decoder an additional attention layer in

between enables it to map its relevant tokens to the Encoder

for translation purposes. Self Attention enables the look-up

of remaining input words at various positions to determine

the relevance of the currently processed word. This is

performed for all input words that help to achieve a superior

encoding and contextual understanding of all words.

Transformer architecture was built to induct parallelism in

RNN and LSTM’s sequential data where input tokens are fed

instantaneously and corresponding embeddings are

generated simultaneously via the Encoder. This embedding

maps a word (token) to a vector that can be learned on

the fly, or to conserve time a pre-trained embedding space

like GloVe is used. However, similar tokens in

different sequences might have different interpretations

which are resolved via a positional encoder that generates

context-based word information concerning its position.

Thereafter the enhanced contextual representation is fed to

the attention layer which furthers contextualization by

generating attention vectors, that determine the relevance of

the $i^{th}$ word in a sequence concerning other words. These

attention vectors are then fed to the feed-forward Neural

Network where they are transformed to a more digestible

form for the next ‘Encoder’ or Decoder’s ‘Encoder-Decoder

Attention’ block.

The latter is fed with Encoder output and Decoder input

embedding that performs attention between the two. This

determines the relevance of Transformer’s input tokens

concerning its target tokens as the decoder establishes actual

vector representation between the source and target

mapping. The decoder predicts the next word via softmax

which is executed over multiple time steps until the end of

the sentence token is generated. At each Transformer layer,

there are residual connections followed by a layer

normalization [58] step to speed up the training during

backpropagation. All of the transformer architectural details

are demonstrated in Figure 4.
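The positional encoder mentioned above is not spelled out in this section. As a hedged illustration, the sketch below uses the fixed sinusoidal scheme from the original Transformer paper, in which even dimensions use a sine and odd dimensions a cosine of the position scaled by a dimension-dependent wavelength. The function names `positional_encoding` and `add_position` are hypothetical, and vectors are plain Python lists.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal position table: one d_model-sized vector per position."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):  # i walks over the even dimensions
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


def add_position(embeddings):
    """Add the position vector to each token embedding (same dimensionality)."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(row, pe_row)]
            for row, pe_row in zip(embeddings, pe)]


if __name__ == "__main__":
    # The same token embedding at two positions ends up with different vectors.
    same_token = [[0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4]]
    for row in add_position(same_token):
        print(row)
```

Since the encoder processes all tokens in parallel, this added signal is what lets otherwise identical token embeddings be distinguished by position.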

IV-A. 2. Queries, Keys, and Values

The input to the Transformer’s Attention mechanism is

target token Query vector 𝑄, its corresponding source token

Key vector 𝐾, and Values 𝑉 which are embedding matrices.

Mapping of source and destination tokens in machine

translation can be quantified as to how similar each of their

tokens is in a sequence via inner dot product. Therefore, to

achieve accurate translation the key should match its

corresponding query, via a high dot product value between

the two. Assume $Q \in \mathbb{R}^{L_Q \times D}$ and $K \in \mathbb{R}^{L_K \times D}$ where $L_Q$, $L_K$

represent target and source lengths, while $D$ denotes the word

embedding dimensionality. Softmax is implemented to

achieve a probability distribution where all Query, Key

similarities add up to one and make attention more focused

on the best-matched keys.

$$W_{SM} = \text{softmax}(Q \cdot K^T) \quad \text{where } W_{SM} \in \mathbb{R}^{L_Q \times L_K}$$

Query assigns a probability to key for matching and often

values are similar to keys, therefore

$$Y_{Att} = \text{softmax}(Q \cdot K^T) \cdot V = W_{SM} \cdot V$$
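The Query, Key, and Value computation above can be sketched as follows. This is a minimal standard-library illustration with matrices as lists of rows; the function name `dot_product_attention` and the `scale` flag are assumptions. The flag is included because the original Transformer paper additionally divides the scores by the square root of the key dimensionality, a step the description above leaves out.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_attention(Q, K, V, scale=False):
    """Return (weights, outputs): weights is L_Q x L_K, outputs is L_Q x D_v."""
    d = len(K[0])
    weights, outputs = [], []
    for q in Q:
        # Similarity of this query with every key via the inner product.
        scores = [sum(a * b for a, b in zip(q, k)) for k in K]
        if scale:
            scores = [s / math.sqrt(d) for s in scores]
        w = softmax(scores)  # each row of weights sums to one
        weights.append(w)
        # Output: weighted sum of the value vectors.
        outputs.append([sum(w_j * v[c] for w_j, v in zip(w, V))
                        for c in range(len(V[0]))])
    return weights, outputs


if __name__ == "__main__":
    Q = [[1.0, 0.0], [0.0, 1.0]]
    K = [[10.0, 0.0], [0.0, 10.0]]
    V = [[1.0, 2.0], [3.0, 4.0]]
    weights, outputs = dot_product_attention(Q, K, V)
    print(weights)
    print(outputs)
```

In the usage example each query matches one key almost exclusively, so each output row is close to the value vector of the best-matched key.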

IV-A. 3. Multi-Headed Attention (MHA) and Masking

MHA enhances the model’s capacity to emphasize a

sequence’s different token positions by implementing

attention parallelly multiple times. The resulting individual

attention outputs or heads are concatenated and transformed

via a linear layer to the expected dimensions. Each of the

multiple heads enables attending the sequence parts from a

different perspective providing similar representational

forms for each token. This is performed as each token’s self-

attention vector might weigh the word it represents higher

than others due to the high resultant dot product. This is not

productive since the goal is to achieve similarly assessed

interaction with all tokens. Therefore self-attention is

computed 8 different times resulting in 8 separate attention

vectors for each token which are used to compute the final

attention vector via a weighted sum of all 8 vectors for each

token. The resultant multi-headed attention vectors are

computed in parallel which is fed to the feed-forward layer.

Each subsequent target token $T_{t+1}$ is generated using as

many source tokens in the encoder ($S_0, \ldots, S_{t+n}$). However,

in an autoregressive decoder only previous time stepped

target tokens are considered ($T_0, \ldots, T_t$), for future target

prediction purposes known as causal masking. This is

provided to enable maximal learning of the subsequently

translated target tokens. Therefore during parallelization via

matrix operations, it is ensured that the subsequent target

words are masked to zero, so the attention network cannot

see into the future. The Transformer described above

resulted in significant improvement in the NLP domain. This

led to a plethora of high-performance architectures that we

describe in the subsequent sections.

FIGURE 4. The Multi-headed Transformer Architecture
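A simplified sketch of multi-headed self-attention with causal masking follows. It is an illustration under stated assumptions, not the full layer: the function names are hypothetical, each head simply works on its own slice of the embedding, and the learned per-head projections and the final linear layer described above are omitted for brevity. Masking is implemented by setting future scores to negative infinity before the softmax, which makes the corresponding attention weights exactly zero.

```python
import math


def softmax(row):
    """Numerically stable softmax; -inf scores receive weight 0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_self_attention(X, causal=True):
    """Self-attention over X (list of vectors); position i sees positions <= i."""
    d = len(X[0])
    out = []
    for i, q in enumerate(X):
        scores = []
        for j, k in enumerate(X):
            if causal and j > i:
                scores.append(float("-inf"))  # future token is hidden
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
        w = softmax(scores)
        out.append([sum(w_j * v[c] for w_j, v in zip(w, X)) for c in range(d)])
    return out


def multi_head_self_attention(X, num_heads, causal=True):
    """Run attention once per head on a slice of each vector, then concatenate."""
    d_model = len(X[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        heads.append(masked_self_attention([x[sl] for x in X], causal))
    # Concatenate the head outputs position by position.
    return [sum((head[i] for head in heads), []) for i in range(len(X))]


if __name__ == "__main__":
    X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
    for row in multi_head_self_attention(X, num_heads=2):
        print(row)
```

With the mask enabled, the first position can only attend to itself, and changing a later token never alters the outputs of earlier positions.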

IV-B EMBEDDINGS FROM LANGUAGE MODELS: ELMo

The goal of ELMo [59] is to generate a deep contextualized

word representation that could model (i) intricate syntactical

and semantical characteristics of word (ii) polysemy or

lexical ambiguity, words with similar pronunciations could

have different meanings at different contexts or locations.

These enhancements gave rise to contextually rich word

embeddings which were unavailable in the previous SOTA

models like GloVe. Unlike its predecessors that used a

predetermined embedding, ELMo considers all $N$ token

occurrences ($t_1, t_2, \ldots, t_N$) for each token $t$ in the entire

sequence before creating embeddings. The authors

hypothesize that the model could extract abstract linguistic

attributes in its architecture’s top layers via a task-specific

bi-directional LSTM.

This is possible by combining a forward and a backward

language model. At timestep $k - 1$, the forward language

model predicts the next token $t_k$ given the input sequence’s

previous observed tokens via a joint probability distribution

GPT performs various tasks like classification, entailment,

similarity index, Multiple-Choice Questions (MCQ) as

shown in figure 6. The extraction phase distills features from

textual bodies before which the text is separated via the

‘Delimiter’ token during text pre-processing. This token is

not required for classification tasks since it does not need to

gauge the relationship between multiple sequences.

Moreover, Q&A or textual entailment tasks involve defined

inputs like ordered sentence pairs or triplets in a document.

For MCQ tasks, contextual alterations are required at input

to achieve the correct results. This is done via a Transformer

based Decoder training objective where input

transformations are fine-tuned for their respective answers.

IV-C BIDIRECTIONAL ENCODER REPRESENTATIONS

FROM TRANSFORMER: BERT

BERT is a stack of pre-trained Transformer Encoders that

overcomes prior models’ restrictive expressiveness i.e.,

GPT’s lack of bidirectional context and ELMo’s shallow

dual context’s concatenation. BERT’s deeper model

provides a token with several contexts with its multiple

layers and the bi-directional model provides a richer learning

environment. However, bi-directionality raises concerns that

tokens could implicitly foresee future tokens during pre-

training resulting in minimal learning and leading to trivial

predictions. To effectively train such a model, BERT

implements Masked Language Modeling (MLM) that masks

15% of all input tokens randomly in each input sequence.

This masked word prediction is the new requirement unlike

recreating the entire output sequence in a unidirectional LM.

BERT masks only during pre-training, so the [MASK] token

never appears during fine-tuning, creating a mismatch

between the two phases. To overcome this

disparity, subtle modeling modifications are performed

during the pre-training phase. If a token $T_i$ is chosen to be

masked, then 80% of the time it is replaced with the [MASK]

token, 10% of the time it is replaced with a random token, and for the

remaining 10%, it remains unchanged. Thereafter, $T_i$ is used

to predict the original token with cross-entropy loss; the unchanged

token step is employed to maintain a bias towards the correct

prediction. This methodology creates a state of randomness

and constant learning for the Transformer encoder which is

compelled to maintain a distributed contextual

representation of each token. Further, as random replacement

arises for a mere 1.5% of all tokens (10% of 15%), this does

not seem to impair the language model’s understanding

ability.
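The 80/10/10 corruption rule described above can be sketched as follows. This is a minimal illustration, not BERT's actual data pipeline: the function name `mlm_corrupt`, the `vocab` argument, and the use of `None` to mark unselected positions are assumptions, and each token is selected independently with probability 15%, which is a simplification of choosing 15% of the tokens in a sequence.

```python
import random

MASK = "[MASK]"


def mlm_corrupt(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted tokens, labels); labels are None where no loss applies."""
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                              # token not selected
        labels[i] = tok                           # model must recover the original
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK                   # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)      # 10%: replace with a random token
        # Remaining 10%: leave the token unchanged.
    return corrupted, labels


if __name__ == "__main__":
    sentence = "the quick brown fox jumps over the lazy dog".split()
    vocab = sorted(set(sentence))
    print(mlm_corrupt(sentence, vocab, rng=random.Random(0)))
```

The loss would be computed only at positions where the label is not `None`, so the model cannot tell from the input alone which of the unmasked tokens it will be asked to predict.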

Language modeling could not explicitly comprehend the

association between multiple sequences; therefore it was

deemed sub-optimal for inference and Q&A tasks. To

overcome this, BERT was pre-trained with a monolingual

corpus for a binarized Next Sentence Prediction (NSP) task.

As shown in Figure 7, sentences 𝑌 (He came [MASK] from

home) and 𝑍 (Earth [MASK] around Sun) do not form any

continuity or relationship. Since 𝑍 is not the actual next

sentence following 𝑌, the output classification label

[NotNext] gets activated, and [IsNext] activates when

sequences are coherent.

FIGURE 7. The architecture of BERT’s MLM and NSP functionality
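A hedged sketch of how training pairs for the Next Sentence Prediction task could be built is shown below. The function name `make_nsp_examples` is hypothetical, and two simplifications are assumed: the 50/50 split between positive and negative pairs follows the BERT paper rather than this section, and negatives are drawn from the same sentence list instead of from a different document.

```python
import random


def make_nsp_examples(sentences, rng=None):
    """Build (sentence_a, sentence_b, label) triples from consecutive sentences."""
    if len(sentences) < 3:
        raise ValueError("need at least 3 sentences to sample a negative pair")
    rng = rng or random.Random()
    examples = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive pair: the sentence that actually follows.
            examples.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            # Negative pair: any sentence except the current and the true next one.
            candidates = [j for j in range(len(sentences)) if j not in (i, i + 1)]
            j = rng.choice(candidates)
            examples.append((sentences[i], sentences[j], "NotNext"))
    return examples


if __name__ == "__main__":
    doc = ["He left early.", "He came back from home.",
           "Earth revolves around the Sun.", "It was raining."]
    for example in make_nsp_examples(doc, rng=random.Random(1)):
        print(example)
```

In BERT each pair is packed into one input sequence and the binary label is predicted from the output at the classification position.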

IV-D GENERALIZED AUTOREGRESSIVE PRETRAINING

FOR LANGUAGE UNDERSTANDING: XLNeT

XLNet captures the best of both worlds where it preserves

the benefits of Auto-Regressive (AR) modeling and

bidirectional contextual capture. To better comprehend why

XLNet outperforms BERT, consider the 5-token sequence

[San, Francisco, is, a, city]. The two tokens chosen for

prediction are [San, Francisco], hence BERT and XLNet

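maximize $\log p(\text{San Francisco} \mid \text{is a city})$ as follows:

$$\mathcal{J}_{BERT} = \log p(\text{San} \mid \text{is a city}) + \log p(\text{Francisco} \mid \text{is a city})$$

$$\mathcal{J}_{XLNet} = \log p(\text{San} \mid \text{is a city}) + \log p(\text{Francisco} \mid \text{San is a city})$$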

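The above can further be generalized for the target ($\mathcal{T}$) and

non-target token set ($\mathcal{N}$): BERT and XLNet both maximize

$\log p(\mathcal{T} \mid \mathcal{N})$ but factorize it differently:

$$\mathcal{J}_{BERT} = \sum_{x \in \mathcal{T}} \log p(x \mid \mathcal{N})$$

$$\mathcal{J}_{XLNet} = \sum_{x \in \mathcal{T}} \log p(x \mid \mathcal{N}\, \mathcal{T}_{<x})$$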

XLNet considers the target as well as the remaining tokens

for prediction, whereas BERT only considers the non-target

tokens. Hence, XLNet captures the inter-pair dependency

[San, Francisco] unlike BERT where either [San] or

[Francisco] leads to correct prediction. Further, via AR

XLNet performs factorized ordering on all possible token

permutations ($L! = 5!$) of sequence length $L$ in the set i.e.,

{[1, 2, 3, 4, 5], [1, 2, 5, 4, 3], ..., [5, 4, 3, 2, 1]} ≅ [is, San,

Francisco, a, city] etc.

$$\max_{\theta} \; \mathbb{E}_{z \sim \mathcal{Z}_T} \left[ \sum_{t=1}^{T} \log p_{\theta}\left(x_{z_t} \mid \mathbf{x}_{z_{<t}}\right) \right] \qquad (27)$$

where set $\mathcal{Z}_T$ contains all permutational sequences of length

$T$ $[1, 2, \ldots, T]$ and $x_{z_t}$ is the reference token. Hence the target

learns from numerous combinations attaining a richer

contextualized learning. Further for all permutable

factorization orders, the model parameters are shared to build

knowledge and bidirectional context from all factorizations

as demonstrated via equation 27.
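To make the factorization orders concrete, the sketch below enumerates all orderings of the 5-token example and shows which positions are visible when a given target is predicted. The names `factorization_orders` and `context_for` are hypothetical, positions are 1-based as in the text, and full enumeration is used only for illustration, since in practice orders are sampled rather than listed exhaustively.

```python
from itertools import permutations


def factorization_orders(seq_len):
    """All orderings of positions 1..seq_len (seq_len! of them)."""
    return list(permutations(range(1, seq_len + 1)))


def context_for(order, target):
    """Positions visible when predicting `target` under a factorization order."""
    idx = order.index(target)
    return sorted(order[:idx])  # everything that comes earlier in the order


if __name__ == "__main__":
    tokens = ["San", "Francisco", "is", "a", "city"]
    orders = factorization_orders(len(tokens))
    print(len(orders))  # number of factorization orders
    order = (3, 4, 5, 1, 2)  # is, a, city, San, Francisco
    for target in (1, 2):
        visible = [tokens[p - 1] for p in context_for(order, target)]
        print(tokens[target - 1], "is predicted from", visible)
```

Under the order (3, 4, 5, 1, 2), "San" is predicted from "is a city" and "Francisco" from "San is a city", which is the dependency between the two targets that the XLNet objective captures and the BERT objective does not.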

IV-D.1. Masking

There is a challenge to determine the word order in the

sequence as the token ($x_{z_t}$) determining the autoregression

is not considered. This word order is partially achieved via

positional encoding, however, for contextual understanding

XLNet employs masking. Consider a generated permutation

of [2, 1, 3] in a 3-token sequence where the first token i.e., 2

has no context hence all masking results in [0,0,0] in the 2nd

row of the 3 × 3 masking matrix. Similarly, the 2nd and 3rd

masks would result in [0,1,0] and [1,1,0] in the 1st and 3rd row

of the Query Stream (QS) masking matrix where the token

cannot see itself. QS matrix with an all-one diagonal

inclusion constitutes Content Stream (CS) masking matrix

where each token can see itself. This 3-token sequence

masking is demonstrated in figure 8 below.

FIGURE 8. Illustration of predicting x 2 in the 3-token sequence with

different factorization orders and its corresponding masking matrices

The first reference token ‘2’ has no in-sequence context; its context is gathered from the corresponding ‘mem block’, a Transformer-XL-based extended cached memory. Thereafter it receives context from token ‘3’, and from tokens ‘1’ and ‘3’, in the subsequent orderings.
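The masking rule described in this subsection can be sketched in a few lines. This is a minimal illustration, not XLNet's actual implementation; the function names are hypothetical and positions are assumed to be 1-based. Row i of each matrix marks the positions that token i may attend to.

```python
def query_stream_mask(order):
    """mask[i][j] = 1 if position i+1 may see position j+1, i.e. j+1 comes strictly earlier in the order."""
    n = len(order)
    rank = {pos: t for t, pos in enumerate(order)}  # place of each position in the factorization order
    return [[1 if rank[j] < rank[i] else 0 for j in range(1, n + 1)]
            for i in range(1, n + 1)]

def content_stream_mask(order):
    """Query-stream mask with the diagonal set to one, so each token also sees itself."""
    mask = query_stream_mask(order)
    for i in range(len(order)):
        mask[i][i] = 1
    return mask

for row in query_stream_mask([2, 1, 3]):
    print(row)
for row in content_stream_mask([2, 1, 3]):
    print(row)
```

For the order [2, 1, 3] the query-stream rows reproduce the masks quoted in the text: [0, 1, 0] for token 1, [0, 0, 0] for token 2, and [1, 1, 0] for token 3.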

IV-D.2. Model Architecture

Figure 9 demonstrates the model’s two-stream attention

framework that consists of a content and query stream

attention process to achieve greater understanding via

contextualization. This process is initiated via target-aware

representation, where the target position is baked into the

input for subsequent token generation purposes.

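(i) Target Aware Representation: A vanilla Transformer-based parametrization does not suffice for permutation-based language modeling. This is because the next-token distribution $p_{\theta}(X_{\mathcal{Z}_t} \mid \mathbf{x}_{z_{<t}})$ is independent of the target position, i.e., $\mathcal{Z}_t$. Consequently, a redundant distribution is generated that is identical for every target position and is unable to discover effective representations. Hence a target position-aware re-parametrization of the next-token distribution is proposed. The standard softmax parametrization

$$p_{\theta}\left(X_{\mathcal{Z}_t} = x \mid \mathbf{x}_{z_{<t}}\right) = \frac{\exp\left(e(x)^{T} h_{\theta}(\mathbf{x}_{z_{<t}})\right)}{\sum_{x'} \exp\left(e(x')^{T} h_{\theta}(\mathbf{x}_{z_{<t}})\right)}$$

is replaced by

$$p_{\theta}\left(X_{\mathcal{Z}_t} = x \mid \mathbf{x}_{z_{<t}}\right) = \frac{\exp\left(e(x)^{T} g_{\theta}(\mathbf{x}_{z_{<t}}, \mathcal{Z}_t)\right)}{\sum_{x'} \exp\left(e(x')^{T} g_{\theta}(\mathbf{x}_{z_{<t}}, \mathcal{Z}_t)\right)}$$

where $g_{\theta}(\mathbf{x}_{z_{<t}}, \mathcal{Z}_t)$ is a modified representation that additionally considers the target position $\mathcal{Z}_t$ as an input.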

(ii) Two Stream Self Attention: The formulation of $g_{\theta}$ remains a challenge despite the above resolution, as the goal is to rely on the target position $\mathcal{Z}_t$ to gather the contextual information $\mathbf{x}_{z_{<t}}$ via attention. Two requirements conflict: (1) to predict the token $x_{\mathcal{Z}_t}$, $g_{\theta}$ should use only the position $\mathcal{Z}_t$ and not the content $x_{\mathcal{Z}_t}$; (2) to predict other tokens $x_{\mathcal{Z}_j}$ where $j > t$, $g_{\theta}$ should encode the content $x_{\mathcal{Z}_t}$ to provide full contextual understanding.

To resolve this conflict, the authors propose two sets of hidden representations instead, as follows:

❖ The hidden content representation $h_{\theta}(\mathbf{x}_{z_{\le t}})$, abbreviated $h_{\mathcal{Z}_t}$, that encodes both the context and the content $x_{\mathcal{Z}_t}$.

❖ The query representation $g_{\theta}(\mathbf{x}_{z_{<t}}, \mathcal{Z}_t)$, abbreviated $g_{\mathcal{Z}_t}$, which solely accesses the contextual information $\mathbf{x}_{z_{<t}}$ and the position $\mathcal{Z}_t$, without the content $x_{\mathcal{Z}_t}$.

FIGURE 9. (Left): Standard Attention via Content Stream and Query

Stream Attention without access to the content. (Right): LM training

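The above two attention streams share their parameters and are updated for every self-attention layer $m$ as:

$$h_{\mathcal{Z}_t}^{(m)} \leftarrow Attention\left(Q = h_{\mathcal{Z}_t}^{(m-1)},\ KV = h_{\mathcal{Z}_{\le t}}^{(m-1)};\ \theta\right)$$

(Content Stream: utilize both $\mathcal{Z}_t$ and $x_{\mathcal{Z}_t}$)

$$g_{\mathcal{Z}_t}^{(m)} \leftarrow Attention\left(Q = g_{\mathcal{Z}_t}^{(m-1)},\ KV = h_{\mathcal{Z}_{<t}}^{(m-1)};\ \theta\right)$$

(Query Stream: use $\mathcal{Z}_t$ without seeing $x_{\mathcal{Z}_t}$)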

This dual attention is pictorially expressed in figure 9. For simplicity, consider the prediction of token $t_i$, which is not allowed to access its corresponding embedding from the preceding layer. However, to predict $t_{i+1}$, the token $t_i$ needs to access its embedding, and both operations must occur in a single pass.

Therefore, two hidden representations are implemented, where $h_{\mathcal{Z}_t}^{(m)}$ is initialized via token embeddings and $g_{\mathcal{Z}_t}^{(m)}$ through weighted transformations. From the above equations, $h_{\mathcal{Z}_t}^{(m)}$ can access the history including the current position, whereas $g_{\mathcal{Z}_t}^{(m)}$ can access only the previous positions of $h_{\mathcal{Z}_t}^{(m)}$. The token prediction happens in the final layer via $g_{\mathcal{Z}_t}^{(m)}$.

For greater sequence length processing the memory blocks

V-A LANGUAGE MODELS ARE UNSUPERVISED MULTI-

TASK LEARNERS: GPT-II

GPT-II [62] was possibly the first model to herald the rise of NLG models. It was trained in an unsupervised manner and is capable of learning complex tasks including machine translation, reading comprehension, and summarization without explicit fine-tuning. Task-specific training on a single dataset was identified as the core reason behind the generalization deficiency witnessed in current models. Hence robust models would likely require training and performance evaluation on a variety of task domains.

GPT-II incorporates a generic probabilistic model in which numerous tasks can be performed for the same input as 𝑝(𝑜𝑢𝑡𝑝𝑢𝑡|𝑖𝑛𝑝𝑢𝑡, 𝑡𝑎𝑠𝑘). Both training and test set performance improve as the model size is scaled up, which indicates that the model still underfits the huge WebText dataset. The 1.5 billion parameter GPT-2 outperformed its predecessors on most datasets in the previously mentioned tasks in a zero-shot setting. It is an extension of the GPT-I decoder-only architecture trained on significantly more data.

V-B BIDIRECTIONAL AND AUTOREGRESSIVE

TRANSFORMERS: BART

BART is a denoising autoencoder built as a sequence-to-sequence [63] model that incorporates two-stage pre-training: (1) corruption of the original text via a random noising function, and (2) reconstruction of the text by training the model. Noising flexibility is the major benefit of the model, as random transformations, including ones that alter the text length, can be applied to the original text. Two such noising variations stand out: random shuffling of the order of the original sentences, and an infilling scheme where text spans of any length are randomly replaced by a single mask token. BART supports all possible document corruption schemes, as shown below in figure 11; in the severest circumstance, all source information is lost and BART behaves like a language model.

FIGURE 11. Denoised BART Model and its Noising Schemes

This forces the model to reason over the overall sequence length and permits a wider range of input transformations, which results in better generalization than BERT. BART is pre-trained by optimizing a reconstruction loss on corrupted input documents, i.e., the cross-entropy between the decoder's output and the original document. For machine translation tasks, BART's encoder embedding layer is replaced with a randomly initialized encoder that is trained end-to-end with the pre-trained model, as shown in figure 12. This encoder maps its foreign vocabulary to an input that BART can denoise into the target language, English. The source encoder is trained in two stages, both of which backpropagate the cross-entropy loss from BART's output. First, most BART parameters are frozen, and only the randomly initialized encoder, BART's positional embeddings, and the self-attention input projection matrix of its encoder are updated. Second, all model parameters are jointly trained for a few iterations. BART achieves state-of-the-art performance on several text generation tasks, fueling further exploration of NLG models. It achieves results comparable to RoBERTa on discriminative tasks.

FIGURE 12. Denoised BART Model for fine-tuned MT tasks
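The two noising schemes highlighted in this section, sentence shuffling and span infilling, can be sketched as follows. This is a toy illustration under stated assumptions: the mask symbol `<mask>` and both function names are hypothetical, and the span position and length are passed in explicitly rather than sampled as they would be during pre-training.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, chosen for illustration

def shuffle_sentences(sentences, rng):
    """Sentence permutation: return the document's sentences in a random order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled

def infill_span(tokens, start, length):
    """Text infilling: replace tokens[start:start+length] with one mask token."""
    return tokens[:start] + [MASK] + tokens[start + length:]

rng = random.Random(0)
tokens = "the quick brown fox jumps over the lazy dog".split()
print(infill_span(tokens, 2, 3))   # a 3-token span collapses to a single mask
print(infill_span(tokens, 2, 0))   # a zero-length span inserts a mask
print(shuffle_sentences(["A .", "B .", "C ."], rng))
```

Because a span of any length maps to one mask token, the model must also infer how many tokens are missing, which is what distinguishes infilling from token-level masking.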

V-C MULTILINGUAL DENOISING PRE-TRAINING FOR

NEURAL MACHINE TRANSLATION: mBART

V-C.1. Supervised Machine Translation

mBART demonstrates that considerable performance gains are achieved over prior techniques [64], [65] by autoregressively pre-training BART with a sequence-reconstruction denoising objective across 25 languages from the Common Crawl (CC-25) corpus [66]. mBART's parameters can be fine-tuned in a supervised or unsupervised manner for any language pair without task-specific modification. For instance, fine-tuning on one language pair (e.g., German-English) enables the model to translate from any other language in the monolingual pre-training set (e.g., French-English) without further training. Since the number of tokens varies significantly across languages, the corpus is balanced via up/down-sampling of the text from each language $i$ with the ratio $\lambda_i$:

$$\lambda_i = \frac{1}{p_i} \cdot \frac{p_i^{\alpha}}{\sum_{j} p_j^{\alpha}}$$

where $p_i$ is each language's percentage in the dataset, with a smoothing parameter $\alpha = 0.7$. The training data encompasses $K$ languages: $\mathcal{C} = \{\mathcal{C}_1, \ldots, \mathcal{C}_K\}$, where each $\mathcal{C}_i$ is the $i^{th}$ language's monolingual document collection. Consider a text-corrupting noising function $g(X)$, where the model is trained to predict the original text $X$; hence $\mathcal{L}_{\theta}$ is maximized as:

$$\mathcal{L}_{\theta} = \sum_{\mathcal{C}_i \in \mathcal{C}} \sum_{X \in \mathcal{C}_i} \log P\left(X \mid g(X); \theta\right)$$

where language $i$ has an instance $X$ and the above distribution $P$ is defined via a sequence-to-sequence model.
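The effect of the smoothed sampling ratio is easiest to see numerically. The sketch below assumes three hypothetical language shares (the values are invented for illustration) and the function name `sampling_ratios` is not from the paper.

```python
def sampling_ratios(percentages, alpha=0.7):
    """lambda_i = (1 / p_i) * p_i**alpha / sum_j p_j**alpha."""
    norm = sum(p ** alpha for p in percentages)
    return [(p ** alpha / norm) / p for p in percentages]

p = [0.80, 0.15, 0.05]                       # hypothetical shares of three languages
lam = sampling_ratios(p)
resampled = [l * q for l, q in zip(lam, p)]  # share of each language after rebalancing

print([round(l, 3) for l in lam])
print([round(r, 3) for r in resampled])
```

A ratio below one down-samples a high-resource language and a ratio above one up-samples a low-resource one, while the rebalanced shares still sum to one.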

V-C.2. Unsupervised Machine Translation

mBART is evaluated on tasks where bi-text (parallel text pairs) for the target pair is not available, in the following 3 settings:

❖ No bi-text of any kind is available. Here back-translation (BT) [67], [68] is a familiar solution, and mBART offers a clean and effective initialization scheme for such techniques.

❖ No bi-text is available for the target pair; however, both of its languages appear in bi-text corpora for other language pairs.

❖ No bi-text is available for the target pair; however, bi-text is available for translation from a different language into the target language. This novel evaluation scheme demonstrates mBART's transfer learning capability despite the absence of the source language's bi-text.

mBART is pre-trained for all 25 languages and fine-tuned for the target language as shown in figure 13.

FIGURE 13. mBART Generative Model Pre-training & Fine-tuning

V-D EXPLORING THE LIMITS OF TRANSFER

LEARNING WITH A TEXT-TO-TEXT TRANSFORMER: T5

This model was built by surveying and applying the most effective transfer learning practices. Here all NLP tasks are reframed into a unified text-to-text setup, where text strings are both inputs and outputs, so that the same model and hyperparameters are used across tasks. A high-quality, diverse and vast dataset is required to measure the effect of scaling up pre-training in the 11 billion parameter T5. Therefore, the Colossal Clean Crawled Corpus (C4) was developed, roughly two orders of magnitude larger than Wikipedia.

The authors concluded that causal masking restricts the model to attending only up to the $i^{th}$ input entry of a sequence, which is detrimental. Hence T5 incorporates fully visible masking over the sequence's prefix section (prefix LM), whereas causal masking is used for training the target's prediction. The following conclusions were made after surveying the current transfer learning landscape.

❖ Model Configuration: Models with Encoder-Decoder architectures generally outperformed decoder-only language models.

❖ Pre-Training Goals: Denoising objectives of the fill-in-the-blank kind, where the model is pre-trained to recover missing input words, worked best at an acceptable computational cost.

❖ In-Domain Datasets: Training on in-domain data turns out to be effective; however, pre-training on small datasets generally leads to overfitting.

❖ Training Approaches: A pre-train, fine-tune methodology for multi-task learning could be effective; however, each task's training frequency needs to be monitored.

❖ Scaling Economically: To make efficient use of finite computing resources, a comparison among scaling the model size, the training time, and the number of ensembled models is performed.
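The difference between causal masking and the prefix-LM masking discussed earlier in this section can be sketched with two small mask builders. The function names are hypothetical, and entry [i][j] = 1 is assumed to mean that position i may attend to position j.

```python
def causal_mask(n):
    """Each position attends only to itself and to earlier positions."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def prefix_lm_mask(n, prefix_len):
    """Fully visible over the first prefix_len positions, causal over the rest."""
    return [[1 if (j < prefix_len or j <= i) else 0 for j in range(n)]
            for i in range(n)]

for row in prefix_lm_mask(4, 2):
    print(row)
```

With a prefix length of zero the prefix-LM mask reduces to the causal mask, which makes the relation between the two schemes explicit.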

V-E TURING NATURAL LANGUAGE GENERATION: T-

NLG

T-NLG is a 78-layer Transformer-based generative language model that outsizes T5 with its 17 billion trainable parameters. It trains faster than Nvidia's Megatron, which was based on interconnecting multiple machines via low-latency buses. T-NLG is a progressively larger model, pre-trained with a greater variety and quantity of data. It provides superior results on generalized downstream tasks with fewer fine-tuning samples. Hence, its authors conceptualized training one huge centralized multi-task model whose resources are shared across various tasks, rather than allocating a separate model to each task. Consequently, the model effectively performs question answering without prior context, leading to enhanced zero-shot learning. The Zero Redundancy Optimizer (ZeRO) achieves both model and data parallelism concurrently, which is perhaps the primary reason T-NLG could be trained with high throughput.

V-F LANGUAGE MODELS ARE FEW-SHOT LEARNERS:

GPT-III

The GPT family (I, II, and III) are autoregressive language models based on Transformer decoder blocks, unlike the denoising autoencoder-based BERT. GPT-3 has 175 billion parameters and is trained on a dataset of 300 billion tokens of text used for generating training examples for the model. GPT-3 is 10 times the size of any previous language model, and for all tasks it employs few-shot learning via a text interface, without gradient updates or fine-tuning; it thereby achieves task agnosticism. It employs unsupervised pre-training, where the language model acquires a wide

𝑠

𝑠

𝑖

log(𝑝

𝑖

𝑠𝑖

𝑖

FIGURE 15. Language Model’s Generalized Distilled Architecture

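The standard model of vanilla knowledge distillation integrates the distillation loss and the student loss as shown below,

$$\mathcal{L}(W) = \alpha \times \mathcal{L}_{D}(p_t, p_s) + \beta \times \mathcal{L}_{S}(y, p_s)$$

where $W$ denotes the student parameters, $p_t$ and $p_s$ are the teacher and student predictions, $y$ is the ground-truth label, and $\alpha$, $\beta$ are regulating parameters. In the original paper a weighted average was used for 𝛼 and 𝛽, i.e., 𝛽 = 1 − 𝛼, and the best results were observed for 𝛼 ≫ 𝛽.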
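A weighted combination of a distillation term and a student term can be sketched as below. This is a minimal illustration under stated assumptions rather than the exact loss of any cited paper: both terms are taken to be cross-entropies, the distillation term uses temperature-softened distributions, and the default values of `alpha` and `temperature` are arbitrary.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, predicted_probs):
    return -sum(t * math.log(p) for t, p in zip(target_probs, predicted_probs) if t > 0)

def distillation_loss(teacher_logits, student_logits, label, alpha=0.9, temperature=2.0):
    """alpha * L_D (soft teacher targets) + beta * L_S (hard label), with beta = 1 - alpha."""
    beta = 1.0 - alpha
    l_distill = cross_entropy(softmax(teacher_logits, temperature),
                              softmax(student_logits, temperature))
    one_hot = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    l_student = cross_entropy(one_hot, softmax(student_logits))
    return alpha * l_distill + beta * l_student

print(distillation_loss([1.0, 2.0, 3.0], [0.5, 1.5, 3.5], label=2))
```

Setting `alpha` close to one, in line with 𝛼 ≫ 𝛽, makes the student follow the teacher's softened distribution far more than the hard label.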

VI-A.1. DistilBERT

DistilBERT, the student version of the teacher BERT, retained 97% of BERT's language understanding performance while being lighter and faster at inference time and cheaper to train. Through KD, DistilBERT reduces BERT's size by 40% and is 60% faster, and the compressed model is small enough to be operated on edge devices. The layer depth of DistilBERT is half that of BERT, while both possess the same dimensionality and a generally equivalent architecture. The number of layers was reduced, rather than the hidden dimension, because layer normalization and linear layers are already highly optimized, so shrinking the hidden dimension yields a smaller computational benefit. To leverage the inductive bias of large pre-trained models, DistilBERT introduced a triple loss function which linearly combines the distillation loss ($\mathcal{L}_{D}$) with the supervised training loss, i.e., the masked language modeling loss ($\mathcal{L}_{mlm}$). It was observed that supplementing these with an embedding cosine loss ($\mathcal{L}_{cos}$) was beneficial, as it aligns the directions of the teacher's and student's hidden state vectors.

VI-A.2. TinyBERT

To overcome the distillation complexity of the pre-training-then-fine-tuning paradigm, TinyBERT introduced a lucid knowledge transfer process based on 3 loss functions computed from: (i) the embedding layer output, (ii) the attention matrices and hidden states of the Transformer layers, and (iii) the output logits. As a result, TinyBERT retains over 96% of BERT's performance at a drastically reduced size, while using a mere 28% of the parameters and 31% of the inference time of BERT-based distillation baselines. Further, it leverages the untapped extractable potential of BERT's learned attention weights [70]. Counting the embedding layer as layer 0 and the prediction layer as layer $M + 1$, the knowledge acquired is enhanced by minimizing:

$$\mathcal{L}_{model} = \sum_{m=0}^{M+1} \lambda_m\, \mathcal{L}_{layer}\left(S_m, T_{g(m)}\right)$$

where $\mathcal{L}_{layer}$ is the loss function of a Transformer or an Embedding layer, $S_m$ is the student's $m^{th}$ layer, $T_{g(m)}$ is the teacher layer it is mapped to, and the hyperparameter $\lambda_m$ signifies the importance of the $m^{th}$ layer's distillation. BERT's attention-based enhancement for language understanding can be incorporated in TinyBERT as:

$$\mathcal{L}_{attn} = \frac{1}{h} \sum_{i=1}^{h} MSE\left(A_i^{S}, A_i^{T}\right), \quad A_i \in \mathbb{R}^{l \times l}$$

where $h$ denotes the number of heads, $A_i$ is the attention matrix corresponding to the student's or teacher's $i^{th}$ head, $l$ denotes the input text length, and MSE is the mean squared error loss function. Further, TinyBERT distills knowledge from the Transformer output layer, which can be expressed as:

$$\mathcal{L}_{hidn} = MSE\left(H^{S} W_h, H^{T}\right), \quad H^{S} \in \mathbb{R}^{l \times d'},\ H^{T} \in \mathbb{R}^{l \times d}$$

where $H^{S}$ and $H^{T}$ are the hidden states of the student and teacher respectively, the hidden sizes of the teacher and student models are denoted by the scalars $d$ and $d'$, and $W_h \in \mathbb{R}^{d' \times d}$ is a learnable matrix that transforms the student network's hidden states into the teacher network's state space. Similarly, TinyBERT also performs distillation on the embedding layer:

$$\mathcal{L}_{embd} = MSE\left(E^{S} W_e, E^{T}\right)$$

where $E^{S}$ and $E^{T}$ are the embedding matrices of the student and teacher networks, respectively. Apart from mimicking the intermediate layer behavior, TinyBERT implements KD to fit the predictions of the teacher model via a cross-entropy loss between the logits of the student and the teacher.

$$\mathcal{L}_{pred} = -softmax\left(z^{T}\right) \cdot \log\left(softmax\left(z^{S} / t\right)\right)$$

Here $z^{T}$ and $z^{S}$ are the respective logits predicted by the teacher and student models, and $t$ is a temperature value.

VI-A.3. MobileBERT

Unlike previous distilled models, MobileBERT achieves task-agnostic compression of BERT, with training convergence attained via prediction and distillation losses. To train such a deep and thin model, a unique inverted-bottleneck teacher model that incorporates BERT (IB-BERT) is designed, from which knowledge is distilled into MobileBERT. It is 4.3× smaller and 5.5× faster than BERT, achieving a competitive score that is 0.6 units lower than BERT on GLUE-based inference tasks. Further, the low latency of 62 ms on a Pixel 4 phone can be attributed to the replacement of Layer Normalization with a simpler Hadamard product (∘) based linear transformation, and of the gelu activation with relu.

$$NoNorm(h) = \gamma \circ h + \beta$$

For knowledge transfer, the mean squared error between the feature maps of MobileBERT and IB-BERT is implemented as a transfer objective.

$$\mathcal{L}_{FMT}^{l} = \frac{1}{TN} \sum_{t=1}^{T} \sum_{n=1}^{N} \left(H_{t,l,n}^{tr} - H_{t,l,n}^{st}\right)^{2}$$

where $l$ is the layer index, $T$ is the sequence length, $N$ is the feature map size, and the superscripts $tr$ and $st$ denote the teacher and student. For MobileBERT to harness the attention capability of BERT, the KL-divergence between the per-head attention distributions of the two models is minimized, where $A$ denotes the number of attention heads.

$$\mathcal{L}_{AT}^{l} = \frac{1}{TA} \sum_{t=1}^{T} \sum_{a=1}^{A} D_{KL}\left(a_{t,l,a}^{tr} \parallel a_{t,l,a}^{st}\right)$$

Alternatively, a new KD loss can be implemented during MobileBERT's pre-training in a linear combination with BERT's MLM and NSP losses, where $\alpha$ is a hyperparameter in (0, 1).

$$\mathcal{L}_{PD} = \alpha \mathcal{L}_{MLM} + (1 - \alpha) \mathcal{L}_{KD} + \mathcal{L}_{NSP}$$

For the above-outlined objectives, 3 training strategies are proposed:

(i) Auxiliary Knowledge Transfer: Intermediate knowledge transfer serves as an auxiliary task, with a single loss that is a linear combination of the knowledge transfer losses of all layers and the pre-training distillation loss.

(ii) Joint Knowledge Transfer: For superior results, 2 separate losses are proposed, where MobileBERT is first trained jointly on the knowledge transfer losses of all layers and then further trained with pre-training distillation.

(iii) Progressive Knowledge Transfer: To minimize error transfer from lower to higher layers, it is proposed to divide knowledge transfer into $L$ stages for the $L$ layers, where each layer is trained progressively.

VI-B PRUNING

Pruning [71] is a methodology where certain weights, biases, layers, and activations are zeroed out and are no longer a part of the model's backpropagation. This introduces sparsity in such elements, which is also visible after a ReLU layer that converts negative values to zero ($ReLU(x) = \max(0, x)$). Iterative pruning learns the key weights, eliminates the least critical ones based on threshold values, and retrains the model, enabling it to recuperate from pruning by adapting to the remaining weights. NLP models like BERT, RoBERTa, and XLNet were pruned by 40% and retained 98% of their performance, which is comparable to DistilBERT.
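One pruning step of the threshold-based scheme described above can be sketched as follows. This is an illustrative magnitude-pruning helper with a hypothetical name; the retraining that would follow each step in iterative pruning is omitted, and weight magnitude is assumed as the criticality criterion.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)          # number of weights to remove
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    pruned, removed = [], 0
    for w in weights:
        if abs(w) <= threshold and removed < k:
            pruned.append(0.0)                # weight falls below the threshold
            removed += 1
        else:
            pruned.append(w)
    return pruned

print(magnitude_prune([0.5, -0.1, 0.3, -0.8, 0.05], sparsity=0.4))
```

In an iterative setting this step would alternate with retraining, with the sparsity raised gradually so that the remaining weights can adapt.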

VI-B.1 LAYER PRUNING

VI-B.1-A STRUCTURED DROPOUT

This architecture [72] randomly drops layers at training and test time, which enables sub-network selection of any desired depth, since the network has been trained to be robust to pruning. This is an upgrade over current techniques that require re-training a new model from scratch, as opposed to training one network from which multiple shallow models are extracted. This sub-network sampling, like Dropout [73] and DropConnect [74], builds an efficient pruning-robust network if well-chosen groups of weights are dropped simultaneously. Formally, pruning robustness in regularizing networks can be achieved by independently dropping each weight via a Bernoulli distribution, where the parameter $p > 0$ regulates the drop rate. This is equivalent to the pointwise product of the weight matrix $W$ with a randomly sampled $\{0, 1\}$ mask matrix $M$: $W_d = M \odot W$.

The most effective layer dropping strategy is to drop every other layer: for a pruning rate $p$, layers are dropped at depth $d$ such that $d \equiv 0 \pmod{\lfloor 1/p \rfloor}$. For $N$ groups with a fixed drop ratio $p$, the average number of groups utilized while training the network is $N(1 - p)$; hence, to prune to a size of $r$ groups, the ideal drop rate is $p^{*} = 1 - r/N$. This approach has been highly effective on numerous NLP tasks and has led to models of a size comparable to distilled versions of BERT that demonstrate better performance.
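The layer selection rule and the ideal drop rate can be checked with a short sketch. The function names are hypothetical, layer depths are assumed to be counted from 1, and the modulus ⌊1/p⌋ is the rule as stated in the structured dropout work.

```python
import math

def layers_to_drop(num_layers, p):
    """Depths d in 1..num_layers with d ≡ 0 (mod floor(1/p))."""
    step = math.floor(1 / p)
    return [d for d in range(1, num_layers + 1) if d % step == 0]

def ideal_drop_rate(num_groups, target_groups):
    """p* = 1 - r / N for pruning N groups down to r groups."""
    return 1 - target_groups / num_groups

print(layers_to_drop(12, 0.5))    # every other layer
print(layers_to_drop(12, 0.25))   # every fourth layer
print(ideal_drop_rate(12, 6))
```

For a 12-layer network pruned to 6 layers, the ideal training drop rate of 0.5 coincides with dropping every other layer at inference.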

VI-B.1-B POOR MAN’S BERT

Due to the over-parameterization of deep neural networks, not all parameters are required at inference time; hence a few layers are strategically dropped, resulting in competitive results for downstream tasks [75]. The odd-alternate dropping strategy produced better results than the top and even-alternate dropping strategies for span 𝐾 = 2 across all tasks. For instance, in a 12-layer network, dropping: top – {11, 12}; even-alternate – {10, 12}; odd-alternate – {9, 11}, led to the conclusions that (i) dropping the final two layers consecutively is more detrimental than eliminating alternate layers, and (ii) preserving the final layer has greater significance than preserving other top layers.

FIGURE 16. Layer Pruning Strategies deployed by Language Models
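The three dropping strategies compared above can be written down directly. The helper names are hypothetical and layers are assumed to be indexed from 1 at the bottom of the network.

```python
def top_drop(num_layers, k):
    """Drop the k topmost layers."""
    return list(range(num_layers - k + 1, num_layers + 1))

def alternate_drop(num_layers, k, parity):
    """Drop the k highest layers whose index has the given parity (0 = even, 1 = odd)."""
    candidates = [l for l in range(num_layers, 0, -1) if l % 2 == parity]
    return sorted(candidates[:k])

print(top_drop(12, 2))           # top strategy
print(alternate_drop(12, 2, 0))  # even-alternate strategy
print(alternate_drop(12, 2, 1))  # odd-alternate strategy
```

Only the odd-alternate strategy keeps the final layer of a 12-layer network, which is consistent with the observation that preserving the final layer matters most.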

At higher values of 𝐾, the alternate dropping approach

signifies a large drop in performance, hypothesized due to

the elimination of lower layers. The Symmetric approach

VI-B.3-B ARE 16 HEADS REALLY BETTER THAN ONE?

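In multi-headed attention (MHA), consider a sequence of $n$ $d$-dimensional vectors $x = x_1, \ldots, x_n \in \mathbb{R}^{d}$ and a query vector $q \in \mathbb{R}^{d}$. The MHA layer parameters are $W_q^{h}, W_k^{h}, W_v^{h} \in \mathbb{R}^{d_h \times d}$ and $W_o^{h} \in \mathbb{R}^{d \times d_h}$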

, when 𝑑

For masking attention heads, the original Transformer equation is modified as:

$$MHAttn(x, q) = \sum_{h=1}^{N_h} \xi_h\, Attn^{h}_{W_q, W_k, W_v, W_o}(x, q)$$

where $\xi_h$ are masking variables with values in $\{0, 1\}$ and $Attn_h(x)$ is the output of head $h$ for input $x$. The following experiments yielded the best results [83] on pruning different numbers of heads at test time:

(i) Pruning just one head: If the model’s performance

significantly degrades while masking head ℎ, then ℎ is a

key head else it is redundant given the rest of the model.

A mere 8 (out of 96) heads trigger a significant change

in performance when removed from the model, out of

which half result in a higher BLEU score.

(ii) Pruning all heads except one: A single head for most

layers was deemed sufficient at test time, even for

networks with 12 or 16 attention heads, resulting in a

drastic parametric reduction. However, multiple

attention heads are a requirement for specific layers i.e.,

the final layer of the encoder-decoder attention, where

performance degrades by a massive 13.5 BLEU points

on a single head.

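The expected sensitivity of the model to the masking variables $\xi_h$ is used as a proxy score for head significance.

$$I_h = \mathbb{E}_{x \sim X} \left| \frac{\partial \mathcal{L}(x)}{\partial \xi_h} \right| = \mathbb{E}_{x \sim X} \left| Attn_h(x)^{T} \frac{\partial \mathcal{L}(x)}{\partial Attn_h(x)} \right|$$

where $X$ is the data distribution and $\mathcal{L}(x)$ is the loss on sample $x$. If $I_h$ is high, then modifying $\xi_h$ will likely have a significant effect on the model; hence heads with low $I_h$ values are iteratively pruned out.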

VI-C QUANTIZATION

32-bit floating-point (FP32) has been the predominant numerical format for deep learning; however, the current push for reduced bandwidth and compute resources has propelled the adoption of lower-precision formats. It has been demonstrated that representing weights and activations via 8-bit integers (INT8) does not lead to an evident accuracy loss. For instance, BERT's quantization to a 16/8-bit weight format resulted in 4× model compression with minimal accuracy loss; consequently, a scaled-up BERT serves a billion CPU requests daily.

VI-C.1 LQ-NETS

This model [84] introduces a simple-to-train quantization mechanism for network weights and activations, learned jointly with the training of the deep neural network. It quantizes with variable bit precision, unlike fixed or manual schemes [85], [86]. Generally, a quantization function can represent floating-point weights $w$ and activations $a$ in a few bits as:

$$Q(x) = q_l, \quad \text{if } x \in (t_l, t_{l+1}]$$

Here $q_l$ and $(t_l, t_{l+1}]$ are quantization levels and intervals, respectively. To preserve quick inference times, quantization functions need to be compatible with bitwise operations, which is achieved via uniform quantization that maps floating-point numbers to their nearest fixed-point integers with a normalization factor. The learnable LQ quantization function can be expressed as:

$$Q_{LQ}(x, v) = v^{T} e_l, \quad \text{if } x \in (t_l, t_{l+1}] \qquad (57)$$

where $v \in \mathbb{R}^{K}$ is the learnable floating-point basis and $e_l \in \{-1, 1\}^{K}$ for $l = (1, \ldots, L)$ enumerates the $K$-bit binary encodings from $[-1, \ldots, -1]$ to $[1, \ldots, 1]$. The inner product of quantized weights and activations is computed by the following bitwise operations, with weight bit-width $K_w$ and activation bit-width $K_a$:

$$Q_{LQ}(w, v^{w})^{T}\, Q_{LQ}(a, v^{a}) = \sum_{i=1}^{K_w} \sum_{j=1}^{K_a} v_i^{w} v_j^{a} \left(b_i^{w} \odot b_j^{a}\right)$$

where $w, a \in \mathbb{R}^{N}$ are encoded by the vectors $b_i^{w}, b_j^{a} \in \{-1, 1\}^{N}$ with $i = 1, \ldots, K_w$ and $j = 1, \ldots, K_a$, $v^{w} \in \mathbb{R}^{K_w}$, $v^{a} \in \mathbb{R}^{K_a}$, and $\odot$ denotes the bitwise inner product computed with 𝑥𝑛𝑜𝑟 operations.
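The relation between a floating-point basis and its quantization levels can be illustrated with a small sketch. The function names are hypothetical, the basis values are invented, and the interval endpoints are assumed to lie midway between adjacent levels, so that quantizing amounts to picking the nearest level. Learning the basis itself is not shown.

```python
from itertools import product

def quantization_levels(basis):
    """All levels q_l = v^T e_l for e_l in {-1, 1}^K, in ascending order."""
    return sorted(sum(v * e for v, e in zip(basis, code))
                  for code in product((-1, 1), repeat=len(basis)))

def lq_quantize(x, basis):
    """Map x to its nearest quantization level."""
    return min(quantization_levels(basis), key=lambda q: abs(q - x))

basis = [1.0, 2.0]                  # hypothetical 2-bit basis v
print(quantization_levels(basis))   # 2**K levels
print(lq_quantize(0.4, basis), lq_quantize(-2.2, basis))
```

With the basis [1, 2] the four levels are evenly spaced, which recovers a uniform quantizer as a special case; other basis values give non-uniform levels.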

VI-C.2 QBERT

QBERT [87] deploys a two-way BERT quantization with input $x \in X$ and its corresponding label $y \in Y$, via a cross-entropy (CE) based loss function

$$\mathcal{L}(\theta) = \sum_{(x_i, y_i)} CE\left(softmax\left(W_c\left(W_n\left(\cdots W_1\left(W_e(x_i)\right)\right)\right)\right), y_i\right)$$

where $W_e$ is the embedding table, with encoder layers $(W_1, W_2, \ldots, W_n)$ and classifier $W_c$. Assigning the same bit-width to different encoder layers, which vary in sensitivity and attend to different structures [5], is sub-optimal, and it becomes intricate for small target sizes (2/4 bits) requiring ultra-low precision. Hence, via Hessian Aware Quantization (HAWQ), more bits are assigned to the more sensitive layers to retain performance. The Hessian spectrum is computed via a computationally economical matrix-free iteration technique which, for the first encoder layer with gradient $g_1$ and an arbitrary vector $v$, uses:

$$\frac{\partial g_1^{T} v}{\partial W_1} = \frac{\partial g_1^{T}}{\partial W_1} v + g_1^{T} \frac{\partial v}{\partial W_1} = \frac{\partial g_1^{T}}{\partial W_1} v = H_1 v$$

where $H_1$ is the Hessian matrix of the first encoder layer and $v$ is independent of $W_1$. This approach determines the top eigenvalues for different layers, and more aggressive quantization is deployed for layers with smaller eigenvalues. For further optimization via group-wise quantization, each dense matrix is treated as a group with its own quantization range and is partitioned following each continuous output neuron.
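The matrix-free idea, estimating the top Hessian eigenvalue from Hessian-vector products alone, can be sketched with power iteration. This is an illustration under stated assumptions: the Hessian-vector product is approximated by a finite difference of gradients instead of the automatic differentiation a real implementation would use, and the toy quadratic loss and all function names are hypothetical.

```python
import math

def hessian_vector_product(grad_fn, w, v, eps=1e-5):
    """Finite-difference estimate of H v, without ever forming H."""
    g0 = grad_fn(w)
    g1 = grad_fn([wi + eps * vi for wi, vi in zip(w, v)])
    return [(b - a) / eps for a, b in zip(g0, g1)]

def top_eigenvalue(grad_fn, w, iters=50):
    """Power iteration on H using only Hessian-vector products."""
    v = [1.0 / math.sqrt(len(w))] * len(w)
    eig = 0.0
    for _ in range(iters):
        hv = hessian_vector_product(grad_fn, w, v)
        eig = sum(a * b for a, b in zip(v, hv))      # Rayleigh quotient v^T H v
        norm = math.sqrt(sum(a * a for a in hv))
        v = [a / norm for a in hv]
    return eig

# Toy loss 0.5 * (3 * w0**2 + w1**2): gradient [3 * w0, w1], Hessian diag(3, 1).
def toy_grad(w):
    return [3.0 * w[0], 1.0 * w[1]]

print(top_eigenvalue(toy_grad, [0.2, -0.4]))
```

Under the rule described above, a layer with a larger top eigenvalue would be treated as more sensitive and assigned more bits.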

VI-C.3 Q8BERT

To quantize weights and activations to 8 bits, symmetric linear quantization is implemented [88], where $S^{x}$ is the quantization scaling factor for input $x$ and $M = 2^{b-1} - 1$ is the highest quantized value when quantizing to $b$ bits.

$$Quantize(x \mid S^{x}, M) := Clamp\left(\lfloor x \times S^{x} \rceil, -M, M\right)$$

$$Clamp(x, a, b) = \min(\max(x, a), b)$$

By combining fake quantization [89] with the Straight-Through Estimator (STE) [90], inference-time quantization is simulated during training while backpropagation runs in full precision, enabling the FP32 weights to overcome quantization errors. Here $\frac{\partial x^{q}}{\partial x} = \vec{1}$, where $x^{q}$ is the result of fake quantizing $x$.
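Symmetric linear quantization can be sketched in a few lines. The choice of scale as M divided by the largest absolute value is an assumption made for this example, and rounding is implemented as round-half-up; `symmetric_scale` is a hypothetical helper name.

```python
import math

def clamp(x, a, b):
    """Clamp(x, a, b) = min(max(x, a), b)."""
    return min(max(x, a), b)

def quantize(x, scale, m):
    """Quantize(x | S, M) = Clamp(round(x * S), -M, M), with round-half-up."""
    return clamp(math.floor(x * scale + 0.5), -m, m)

def symmetric_scale(values, bits=8):
    """Assumed scale: M / max|x|, with M = 2**(bits - 1) - 1."""
    m = 2 ** (bits - 1) - 1
    return m / max(abs(v) for v in values), m

weights = [-1.0, 0.0, 0.25, 1.0]
scale, m = symmetric_scale(weights)
print([quantize(w, scale, m) for w in weights])
print(quantize(2.0, scale, m))   # out-of-range input is clamped to M
```

Dividing a quantized value by the scale gives the fake-quantized value used in the forward pass, while the STE passes gradients through as if no rounding had occurred.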

VII. INFORMATION RETRIEVAL

Knowledge-intensive tasks that involve efficient updating and retrieval of data require a huge store of knowledge, which language models hold only implicitly. Standard language models are not adept at these tasks and do not match up with task-specific architectures, which can be crucial for open-domain Q&A. For instance, BERT can predict the missing word in the sentence, “The __ is the currency of the US” (answer: “dollar”). However, since this knowledge is stored implicitly in its parameters, the model size must increase substantially to store further data. This constraint raises the network latency and makes storing information prohibitively expensive, as storage space is limited by the size constraints of the network.

VII-A GOLDEN RETRIEVER

A conventional multi-hop open-domain QA task involves a question $q$ and a large corpus containing $S$ relevant contextual (gold) documents $d_1, \ldots, d_S$ that form a sequence of reasoning steps, linked via textual similarities, leading to a preferred answer $a$. GoldEn Retriever's [91] first hop, however, generates a search query $q_1$ that retrieves document $d_1$ for a given question $q$; thereafter, for the subsequent reasoning steps ($k = 2, \ldots, S$), a query $q_k$ is generated from the question ($q$) and the available context ($d_1, \ldots, d_{k-1}$). GoldEn retrieves more contextual documents iteratively while concatenating the retrieved context for its QA model to answer. It is independent of dataset-specific and task-specific IR models, for which indexing additional documents or question types leads to inefficiencies. A lightweight RNN model is adapted in which text spans are extracted from the contextual data, to reduce the potentially large query space. The goal is to generate a search query $q_k$ that helps retrieve $d_k$ for the following reasoning step; $q_k$ is selected as a textual span of the context $C_k$ by a trained document reader.

$$q_k = G_k(q, C_k)$$

$$C_{k+1} = C_k \bowtie IR_n(q_k)$$

where $G_k$ is the query generator, $IR_n(q_k)$ are the top $n$ documents retrieved via $q_k$, and $\bowtie$ denotes concatenation of the retrieved context.

VII-B ORQA

The reader and retriever components are trained jointly in an end-to-end fashion, with BERT used to parameterize their scoring functions. ORQA can retrieve any text from an open corpus and is not constrained to returning a fixed set of documents like a typical IR model. The retrieval score is the dense inner product of the question $q$ with the evidence block $b$.

$$h_q = W_q\, BERT_Q(q)[CLS] \qquad (64)$$

$$h_b = W_b\, BERT_B(b)[CLS]$$

$$S_{retr}(b, q) = h_q^{T} h_b$$

where the matrices $W_q$ and $W_b$ project the BERT output into 128-dimensional vectors. Similarly, the reader is a span-based variant of BERT's reading model.

$$h_{start} = BERT_R(q, b)[START(s)] \qquad (67)$$

$$h_{end} = BERT_R(q, b)[END(s)] \qquad (68)$$

$$S_{read}(b, s, q) = MLP\left([h_{start}; h_{end}]\right)$$

The retrieval model is pre-trained with an Inverse Cloze Task (ICT), where the context surrounding a sentence is semantically relevant and is used to infer the data missing from the sequence $q$.

$$P_{ICT}(b \mid q) = \frac{\exp\left(S_{retr}(b, q)\right)}{\sum_{b' \in BATCH} \exp\left(S_{retr}(b', q)\right)}$$

where $q$ is treated as a pseudo-question, $b$ is the text surrounding $q$, and $BATCH$ is a set of evidence blocks employed for sampling negatives. Apart from learning word matching features, it also learns abstract representations, as the pseudo-question might or might not be present in the evidence. After ICT, learning is defined via a distribution over answer derivations.

$$P_{learn}(b, s \mid q) = \frac{\exp\left(S(b, s, q)\right)}{\sum_{b' \in TOP(k)} \sum_{s' \in b'} \exp\left(S(b', s', q)\right)}$$

where $TOP(k)$ are the top retrieved blocks based on $S_{retr}$. In this framework, evidence retrieval from the complete Wikipedia is implemented as a latent variable, which is infeasible to train from scratch; hence the retriever is pre-trained with ICT.

VII-C REALM

This framework explicitly attends to a vast corpus like Wikipedia; however, its retriever is learned via backpropagation and performs Maximum Inner Product Search (MIPS) via inner-product similarity to choose the most appropriate documents. The retriever is designed to cache and asynchronously update each document to overcome the computational challenge of retrieving among millions of candidate documents.

In pre-training, the model needs to predict the randomly masked tokens via the knowledge retrieval relevance score $f(x, z)$, the inner product of the vector embeddings of $x$ and $z$ (MIPS). To implement a knowledge-based encoder, the combination of input $x$ and retrieved document $z$ from a corpus $\mathcal{Z}$ is fed as a sequence to fine-tune the Transformer $p(y \mid z, x)$, as shown in figure 17. This enables complete cross attention between $x$ and $z$, which in turn enables the prediction of the output $y$, where:

$$f(x, z) = Embed_{Input}(x)^{T}\, Embed_{doc}(z)$$

$$p(z \mid x) = \frac{\exp f(x, z)}{\sum_{z'} \exp f(x, z')}$$

$$p(y \mid x) = \sum_{z \in \mathcal{Z}} p(y \mid z, x)\, p(z \mid x)$$

Like ORQA, BERT is implemented for embedding:

$$join_{BERT}(x) = [CLS]\, x\, [SEP] \qquad (74)$$

$$join_{BERT}(x_1, x_2) = [CLS]\, x_1\, [SEP]\, x_2\, [SEP] \qquad (75)$$

In the pre-training of the BERT’s masked language modeling

task, each mask in token 𝑥 needs to be predicted as:

where each instance contains one query $q_i$, one positive (relevant) passage $p_i^{+}$, and $n$ negative (irrelevant) passages $p_{i,j}^{-}$. The loss function can be optimized as the negative log-likelihood of the positive passage.

$$L\left(q_i, p_i^{+}, p_{i,1}^{-}, \ldots, p_{i,n}^{-}\right) = -\log \frac{e^{sim(q_i, p_i^{+})}}{e^{sim(q_i, p_i^{+})} + \sum_{j=1}^{n} e^{sim(q_i, p_{i,j}^{-})}}$$
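The negative log-likelihood of the positive passage can be computed as below. This is a minimal sketch in which the similarity is assumed to be the inner product of query and passage vectors; the toy vectors and the function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def positive_passage_nll(query, positive, negatives):
    """-log( e^{sim(q,p+)} / (e^{sim(q,p+)} + sum_j e^{sim(q,p-_j)}) ), with sim = inner product."""
    scores = [dot(query, positive)] + [dot(query, neg) for neg in negatives]
    m = max(scores)                                   # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - scores[0]

q = [1.0, 0.0]
print(positive_passage_nll(q, [2.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]]))
print(positive_passage_nll(q, [0.0, 1.0], [[2.0, 0.0]]))  # a hard negative raises the loss
```

The loss falls toward zero as the positive passage scores well above every negative, which is why the choice of negatives shapes what the encoders learn.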

VIII. LONG SEQUENCE MODELS

Vanilla Transformers break input sequences into chunks if their length exceeds 512 tokens, which results in a loss of context when related words fall in different chunks. This lack of contextual information leads to inefficient prediction and compromised performance, and gave rise to long-sequence models.

VIII-A DEEPER SELF-ATTENTION

This 64-layer Transformer [93] was built on the discovery that it was better at character-level modeling of longer-range sequences. Information is transmitted swiftly over arbitrary distances, in contrast to an RNN's step-by-step progression. In addition, the following three auxiliary losses were added to the vanilla Transformer, which accelerated convergence and made it possible to train deeper networks.

FIGURE 19. Accelerated convergence via multiple target token

prediction across multiple positions through intermediate layers

(i) 𝑃𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛 𝑎𝑐𝑟𝑜𝑠𝑠 𝑀𝑢𝑙𝑡𝑖𝑝𝑙𝑒 𝑝𝑜𝑠𝑖𝑡𝑖𝑜𝑛𝑠: Causal prediction generally occurs at a single position in the final layer; in this case, however, all positions are used for prediction. These auxiliary losses compel the model to predict from smaller contexts and accelerate training, and their weights are not decayed during training.

(ii) 𝑃𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛𝑠 𝑜𝑛 𝐼𝑛𝑡𝑒𝑟𝑚𝑒𝑑𝑖𝑎𝑡𝑒 𝐿𝑎𝑦𝑒𝑟: Apart from the final layer, predictions from all intermediate layers are added for a given sequence. As training progresses, the weight of the lower layers is progressively reduced. For $n$ layers, the contribution of the $l^{th}$ intermediate layer ceases after $l/2n$ of the training has been completed.

(iii) 𝑀𝑢𝑙𝑡𝑖𝑝𝑙𝑒 𝑇𝑎𝑟𝑔𝑒𝑡 𝑃𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛: The model is modified to generate two or more predictions of future characters, with a separate classifier introduced for every new target. The losses of the extra targets are weighted by one half before being added to the corresponding layer loss.
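The intermediate-layer schedule in (ii) can be sketched as follows. This is a simplified illustration under stated assumptions: each intermediate loss is treated as either fully active or switched off (the progressive down-weighting is not modeled), active layers are weighted equally, and the function names are hypothetical.

```python
def active_intermediate_layers(progress, n_layers):
    """Return the intermediate layers whose auxiliary loss is still active.

    progress : fraction of training completed, in [0, 1]
    n_layers : total number of layers n; layers 1 .. n-1 are intermediate
    Layer l stops contributing once progress reaches l / (2 * n_layers).
    """
    return [l for l in range(1, n_layers) if progress < l / (2 * n_layers)]

def total_loss(final_loss, intermediate_losses, progress):
    """Add the still-active intermediate losses to the final-layer loss.

    intermediate_losses[l - 1] is the loss of intermediate layer l.
    Equal weighting of the active layers is an assumption of this sketch.
    """
    n_layers = len(intermediate_losses) + 1
    active = active_intermediate_layers(progress, n_layers)
    return final_loss + sum(intermediate_losses[l - 1] for l in active)

# With n = 4 layers the cut-off points are 1/8, 2/8 and 3/8 of training.
start = active_intermediate_layers(0.0, 4)   # all intermediate layers
early = active_intermediate_layers(0.2, 4)   # layer 1 has dropped out
half = active_intermediate_layers(0.5, 4)    # no intermediate losses left
```

Under this schedule no intermediate loss remains after half of the training, so the second half optimizes the final-layer objective only.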

The three techniques above are illustrated in figure 19. For sequence length $L$, the language model computes the joint probability over token sequences autoregressively:

$$P(t_{0:L}) = P(t_0) \prod_{i=1}^{L} P(t_i \mid t_{0:i-1})$$

VIII-B TRANSFORMER-XL

To mitigate context fragmentation in vanilla Transformers, Transformer-XL captures longer dependencies by caching and reusing the hidden states of the previous segment, through which information propagates via recurrence. Given a corpus of tokens $x = (x_1, x_2, \ldots, x_T)$, a language model computes the joint probability $P(x)$ autoregressively, where the context $x_{<t}$ is encoded into a fixed-size hidden state:

$$P(x) = \prod_{t} P(x_t \mid x_{<t})$$

(a) Attention Caching - I (b) Attention Caching II

FIGURE 20. Elongated context capture combining (a) and (b)

Assume two consecutive segments of length $L$, $s_{\tau} = [x_{\tau,1}, \ldots, x_{\tau,L}]$ and $s_{\tau+1} = [x_{\tau+1,1}, \ldots, x_{\tau+1,L}]$, and denote the $n^{th}$ layer hidden state sequence produced for the $\tau^{th}$ segment $s_{\tau}$ as $h_{\tau}^{n} \in \mathbb{R}^{L \times d}$, where $d$ is the hidden dimension. The $n^{th}$ layer hidden state for the segment $s_{\tau+1}$ is computed as follows:

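$$\tilde{h}_{\tau+1}^{n-1} = [SG(h_{\tau}^{n-1}) \circ h_{\tau+1}^{n-1}]$$

$$q_{\tau+1}^{n}, \; k_{\tau+1}^{n}, \; v_{\tau+1}^{n} = h_{\tau+1}^{n-1} W_{Q}^{T}, \; \tilde{h}_{\tau+1}^{n-1} W_{K}^{T}, \; \tilde{h}_{\tau+1}^{n-1} W_{V}^{T}$$

$$h_{\tau+1}^{n} = \text{Transformer-Layer}(q_{\tau+1}^{n}, k_{\tau+1}^{n}, v_{\tau+1}^{n})$$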

where $SG(\cdot)$ denotes stop-gradient, $[h_u \circ h_v]$ is the concatenation of two hidden sequences along the length dimension, and $W$ denotes the model parameters. The key distinction from the original Transformer is that the key $k_{\tau+1}^{n}$ and value $v_{\tau+1}^{n}$ are conditioned on the extended context $\tilde{h}_{\tau+1}^{n-1}$, and hence the preceding hidden states $h_{\tau}^{n-1}$ are cached. This is illustrated in figure 20 above, where the attention span of the prior segment is cached and reused by the next one, forming an elongated caching mechanism.
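A minimal single-head sketch of this segment-level recurrence is given below. It is an illustration under several assumptions: matrices are plain lists of rows, the weight matrices are passed already transposed (so that `h @ w` corresponds to $h W^{T}$), relative positional encoding is omitted, and stop-gradient appears only as a comment because plain Python has no autograd. The function names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def xl_attention(h_prev, h_cur, w_q, w_k, w_v):
    """One attention step with segment-level recurrence.

    h_prev : cached hidden states of the previous segment (M x d); in an
             autograd framework these would be wrapped in a stop-gradient
    h_cur  : hidden states of the current segment (L x d)
    """
    h_ext = h_prev + h_cur      # extended context: concatenate along length
    q = matmul(h_cur, w_q)      # queries come from the current segment only
    k = matmul(h_ext, w_k)      # keys are computed from the extended context
    v = matmul(h_ext, w_v)      # values are computed from the extended context
    mem_len, d = len(h_prev), len(q[0])
    out = []
    for i, q_i in enumerate(q):
        # Causal mask: position i sees the whole memory plus positions <= i.
        visible = mem_len + i + 1
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d)
                  for k_j in k[:visible]]
        weights = softmax(scores)
        out.append([sum(w * v_j[c] for w, v_j in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

identity = [[1.0, 0.0], [0.0, 1.0]]
# Without memory, a single token can only attend to itself.
no_memory = xl_attention([], [[1.0, 2.0]], identity, identity, identity)
# With one cached state, the same token also attends to the previous segment.
with_memory = xl_attention([[5.0, 0.0]], [[1.0, 2.0]], identity, identity, identity)
```

The output has one row per token of the current segment, while the keys and values span both segments, which is how the effective context grows without recomputing the earlier segment.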

Such recurrence is applied to every two consecutive segments, creating a segment-level recurrence through the hidden states. In the original Transformer, the attention score between a query $q_i$ and a key $k_j$ within the same segment is:

$$A_{i,j}^{abs} = E_{x_i}^{T} W_q^{T} W_k E_{x_j} + E_{x_i}^{T} W_q^{T} W_k U_j + U_i^{T} W_q^{T} W_k E_{x_j} + U_i^{T} W_q^{T} W_k U_j$$

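From the perspective of relative positional encoding, the above equation is remodeled in the following manner:

$$A_{i,j}^{rel} = E_{x_i}^{T} W_q^{T} W_{k,E} E_{x_j} + E_{x_i}^{T} W_q^{T} W_{k,R} R_{i-j} + u^{T} W_{k,E} E_{x_j} + v^{T} W_{k,R} R_{i-j}$$

Here $E_{x}$ is a token embedding, $U$ is an absolute positional encoding, $R_{i-j}$ is a relative positional encoding, and $u$ and $v$ are trainable vectors that replace the query-side position terms.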
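To make the four terms of the relative score concrete, the sketch below evaluates them for a single query-key pair. It assumes small dense vectors and matrices stored as Python lists; the function name `rel_attention_score` and the toy values are hypothetical and serve only to show how the terms combine.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, x) for row in m]

def rel_attention_score(e_i, e_j, r_ij, w_q, w_ke, w_kr, u, v):
    """Relative attention score between query position i and key position j.

    e_i, e_j : token embeddings E_{x_i} and E_{x_j}
    r_ij     : relative positional encoding R_{i-j}
    u, v     : trainable bias vectors
    """
    q = matvec(w_q, e_i)             # W_q E_{x_i}
    k_content = matvec(w_ke, e_j)    # W_{k,E} E_{x_j}
    k_position = matvec(w_kr, r_ij)  # W_{k,R} R_{i-j}
    content_content = dot(q, k_content)      # first term
    content_position = dot(q, k_position)    # second term
    global_content = dot(u, k_content)       # third term
    global_position = dot(v, k_position)     # fourth term
    return content_content + content_position + global_content + global_position

identity = [[1.0, 0.0], [0.0, 1.0]]
score = rel_attention_score(
    e_i=[1.0, 0.0], e_j=[1.0, 0.0], r_ij=[0.0, 1.0],
    w_q=identity, w_ke=identity, w_kr=identity,
    u=[1.0, 1.0], v=[1.0, 1.0],
)
```

Because the score depends on positions only through $i - j$, cached states from a previous segment can be reused without their position information becoming inconsistent.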

VIII-C LONGFORMER

This architecture sparsifies the full attention matrix by specifying which pairs of input locations attend to one another, and implements three attention configurations:

(i) 𝑆𝑙𝑖𝑑𝑖𝑛𝑔 𝑊𝑖𝑛𝑑𝑜𝑤: For a fixed window size $w$, each token attends to $w/2$ tokens on either side. For an input sequence of length $n$, this leads to a computational complexity of $O(n \times w)$, which scales linearly with the input sequence length; for efficiency, $w < n$. A stack of $l$ such layers has a receptive field of size $l \times w$ at the top layer when $w$ is the same across all layers. Different values of $w$ can be chosen to favor efficiency or performance.

(ii) 𝐷𝑖𝑙𝑎𝑡𝑒𝑑 𝑆𝑙𝑖𝑑𝑖𝑛𝑔 𝑊𝑖𝑛𝑑𝑜𝑤: To extend the receptive field without additional computation, gaps of size $d$ are introduced to dilate the window of size $w$, which gives a receptive field of $l \times d \times w$. Performance improves when a few heads are left without dilation (smaller window size) to attend to local context (lower layers), while the remaining heads are dilated (increased window size) to attend to longer context (higher layers).

(iii) 𝐺𝑙𝑜𝑏𝑎𝑙 𝐴𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛: The prior two configurations are not flexible enough for task-specific learning. Hence “𝑔𝑙𝑜𝑏𝑎𝑙 𝑎𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛” is applied to a few pre-designated input tokens ($s$), where each such token attends to all sequence tokens and all sequence tokens attend to it. Because $s$ is small, this keeps the combined local and global attention complexity at $O(n)$.
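The three configurations above can be sketched as a boolean attention mask. This is a didactic, dense $n \times n$ construction under stated assumptions (the function names are hypothetical, and an efficient implementation would never materialize the full matrix); it is meant only to show which token pairs are allowed to attend to one another.

```python
def longformer_mask(n, w, dilation=1, global_tokens=()):
    """Return an n x n mask; mask[i][j] is True if token i attends to token j.

    w             : window size; each token sees w // 2 positions per side
    dilation      : gap between attended positions (1 means no dilation)
    global_tokens : indices of tokens that receive global attention
    """
    half = w // 2
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for step in range(-half, half + 1):
            j = i + step * dilation
            if 0 <= j < n:
                mask[i][j] = True
    for g in global_tokens:
        for j in range(n):
            mask[g][j] = True  # the global token attends to every token
            mask[j][g] = True  # every token attends to the global token
    return mask

def attended_pairs(mask):
    """Count the (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)

sliding = longformer_mask(8, 2)                     # sliding window
dilated = longformer_mask(8, 2, dilation=2)         # dilated sliding window
with_global = longformer_mask(8, 2, global_tokens=[0])
```

Counting the `True` entries with `attended_pairs` shows how far each configuration stays below the $n^2$ pairs of full attention.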

Longformer's attention complexity is the sum of its local and global attention, in contrast to RoBERTa's quadratic complexity, as explained by the following mathematical expressions.

(a) Full Attention (b) Sliding Window (c) Dilated Sliding (d) Global Attention

FIGURE 21. Longformer’s different Sparse Attention configurations

$$local\ attention = (n \times w)$$

$$global\ attention = (2 \times n \times s)$$

$$Longformer\ attention = n \times (w + 2s) \times Number\ of\ Transformer\ Layers$$

Global attention enables chunk-less document processing; however, its space-time complexity will be greater than RoBERTa's if the sequence length exceeds the window size.

0

2

0

0

VIII-D EXTENDED TRANSFORMER CONSTRUCTION: ETC

ETC is an adaptation of the Longformer design that receives global ($n_g$) and long ($n_l$) inputs, where $n_g \ll n_l$. It computes four global-local attention variations, global-to-global ($g2g$), global-to-long ($g2l$), long-to-global ($l2g$), and long-to-long ($l2l$), to achieve long sequence processing. Global inputs and the other three variations ($g2g$, $g2l$, $l2g$) have unrestricted attention, which compensates for the fixed-radius span of $l2l$ and strikes a balance between performance and computational cost. Further, it replaces absolute position encodings with relative ones, which describe the positions of input tokens with respect to each other.

VIII-E BIG BIRD

Big Bird proves mathematically that randomly sparse attention can be Turing complete, and it behaves like a Longformer augmented with random attention. Its design consists of (i) a group of $g$ global tokens that attend to all parts of the sequence, (ii) a group of $r$ random keys that each query $q_i$ attends to, and (iii) a local neighbor window block of size $w$ that each local node attends to. Big Bird's global tokens are constructed in one of two ways: (i) Big Bird-ITC: Internal Transformer Construction (ITC), where a few existing tokens are made global and attend over the complete sequence. (ii) Big Bird-ETC: Extended Transformer Construction (ETC), where additional global tokens $g$, such as [𝐶𝐿𝑆], are included and attend to all existing tokens.

Its final attention mechanism has the following properties: each query attends to $r$ random keys and to $w/2$ tokens to the left and right of its location, and there are $g$ global tokens, which are either derived from existing tokens or added when needed.
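The combination of random, window, and global attention can be sketched as a token-level mask. This is a simplification under stated assumptions: the actual model operates on blocks of tokens rather than individual tokens, the global tokens here follow the ITC style (existing positions made global), and the function name `big_bird_mask` and the `seed` parameter are hypothetical.

```python
import random

def big_bird_mask(n, w, r, global_tokens, seed=0):
    """Return an n x n mask; mask[i][j] is True if query i attends to key j.

    w             : window size; each query sees w // 2 tokens on each side
    r             : number of random keys per query (requires r <= n)
    global_tokens : indices of existing tokens that are made global
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    half = w // 2
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        # Local window: w // 2 tokens to the left and right of position i.
        for j in range(max(0, i - half), min(n, i + half + 1)):
            mask[i][j] = True
        # Random attention: r keys sampled without replacement.
        for j in rng.sample(range(n), r):
            mask[i][j] = True
    # Global attention: global tokens see everything and are seen by everyone.
    for g in global_tokens:
        for j in range(n):
            mask[g][j] = True
            mask[j][g] = True
    return mask

mask = big_bird_mask(n=16, w=2, r=2, global_tokens=[0])
```

Each non-global row contains at most `(w + 1) + r + len(global_tokens)` attended keys, so the total cost grows linearly with the sequence length.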

IX. COMPUTATIONALLY EFFICIENT ARCHITECTURES

IX-A SPARSE TRANSFORMER

This model's efficiency stems from departing from the full self-attention procedure, which is instead factorized across several attention steps. Each output is computed from a fraction of the full input array of size $\sqrt{N}$, where $N$ is the sequence length, as expressed in Figure 1. This leads to a lower attention complexity of $O(N\sqrt{N})$ in