







































Department of Computer Science & Engineering, University of Bridgeport, Connecticut, CT 06604, USA
Corresponding author: Sushant Singh ([email protected])
ABSTRACT In recent years, Natural Language Processing (NLP) models have achieved phenomenal success in linguistic
and semantic tasks like text classification, machine translation, cognitive dialogue systems, information retrieval via Natural
Language Understanding (NLU), and Natural Language Generation (NLG). This feat is primarily attributed to the seminal
Transformer architecture, leading to designs such as BERT, GPT (I, II, III), etc. Although these large-size models have
achieved unprecedented performance, they come at high computational costs. Consequently, some of the recent NLP
architectures have utilized concepts of transfer learning, pruning, quantization, and knowledge distillation to achieve moderate
model sizes while keeping nearly similar performance to that achieved by their predecessors. Additionally, to mitigate the data
size challenge raised by language models from a knowledge extraction perspective, Knowledge Retrievers have been built to
extract explicit data documents from a large corpus of databases with greater efficiency and accuracy. Recent research has
also focused on superior inference by providing efficient attention to longer input sequences. In this paper, we summarize and
examine the current state-of-the-art (SOTA) NLP models that have been employed for numerous NLP tasks for optimal
performance and efficiency. We provide a detailed understanding and functioning of the different architectures, a taxonomy
of NLP designs, comparative evaluations, and future directions in NLP.
INDEX TERMS Deep Learning, Natural Language Processing (NLP), Natural Language Understanding (NLU), Natural
Language Generation (NLG), Information Retrieval (IR), Knowledge Distillation (KD), Pruning, Quantization
I. INTRODUCTION
Natural Language Processing (NLP) is a field of Machine
Learning dealing with linguistics that builds and develops
Language Models. Language Modeling (LM) determines the
likelihood of word sequences occurring in a sentence via
probabilistic and statistical techniques. Since human
languages involve sequences of words, the initial language
models were based on Recurrent Neural Networks (RNNs).
Because RNNs can lead to vanishing and exploding
gradients for long sequences, improved recurrent networks
like LSTMs and GRUs were utilized for improved
performance. Despite these enhancements, LSTMs were found to
lack comprehension when relatively longer sequences were
involved. This is because the entire history, known as the
context, is handled by a single state vector.
However, greater compute resources led to an influx of
novel architectures, causing a meteoric rise of Deep Learning
[1] based NLP models.
The breakthrough Transformer [2] architecture in 2017
overcame LSTM’s context limitation via the Attention
mechanism. Additionally, it provided greater throughput as
inputs are processed in parallel with no sequential
dependency. The subsequent launch of improved
Transformer-based models like GPT-I [3] and BERT [4] made
2018 a climacteric year for the NLP world. These
architectures were trained on large datasets to create pre-
trained models. Thereafter transfer learning was used to fine-
tune these models for task-specific features resulting in
significant performance enhancement on several NLP tasks
[5],[6],[7],[8],[9],[10]. These tasks include but are not
limited to language modeling, sentiment analysis, question
answering, and natural language inference.
This accomplishment fell short of transfer learning’s primary
objective of achieving high model accuracy with minimal
fine-tuning samples. Also, model performance needs to
generalize across several datasets and not be task- or dataset-
specific [11],[12],[13]. However, the goal of high
generalization and transfer learning was being compromised
as an increasing amount of data was being used for both pre-
training and fine-tuning purposes. This clouded the decision
whether more training data or an improved architecture
should be incorporated to build a better SOTA language
model. For instance, the subsequent XLNet [14] architecture
possessed novel yet intricate language modeling that
provided only a marginal improvement over a simplistic BERT
architecture that was trained on a mere ~10% of XLNet’s
data (113GB). Thereafter, RoBERTa
[15], a large BERT-based model trained on significantly
more data than BERT (160GB), outperformed XLNet. Thus,
an architecture that is more generalizable and, further, is
trained on larger data sets new NLP benchmarks.
The above-mentioned architectures are primarily language
understanding models, where a natural dialect is mapped to
a formal interpretation. Here the initial goal is the translation
of an input user utterance into a conventional phrase
representation. For Natural Language Understanding (NLU)
the intermediate representation for the above models’ end
goal is dictated by the downstream tasks.
Meanwhile, fine-tuning was becoming progressively
challenging for task-specific roles in NLU models as it
required a greater sample size to learn a particular task,
which deprived such models of generalization [16]. This
triggered the advent of Natural Language Generation (NLG)
models that, contrary to NLU training, generated dialect
utterances learned from their corresponding masked or
corrupted input semantics. Such models operate differently
from a routine downstream approach of cursory language
comprehension and are optimal for sequence-to-sequence
generation tasks, such as language translation. Models like
T5 [17], BART [18], mBART [19], T-NLG [20] were pre-
trained on a large corpus of corrupted text and generated its
corresponding cleaned text via a denoising objective [21]. This
transition was useful as the additional fine-tuning layer for
NLU tasks was not required for NLG purposes. This further
enhanced prediction ability via zero or few-shot learning
which enabled sequence generation with minimal or no fine-
tuning. For instance, if a model’s semantic embedding space
is pre-trained with animal identification of “cat”, “lion” and
“chimpanzee”, it could still correctly predict “dog” without
fine-tuning. Despite superior sequence generation
capabilities, NLG model sizes surged exponentially with the
subsequent release of GPT-III [22] which was the largest
model before the release of GShard [23].
Since NLU and NLG’s exceptionally large-sized models
required several GPUs to load, this turned out costly and
resource prohibitive in most practical situations. Further,
when trained for several days or weeks on GPU clusters,
these colossal models came at an exorbitant energy cost. To
mitigate such computational costs [24], Knowledge
Distillation (KD) [25] based models like DistilBERT [26],
TinyBERT [27], MobileBERT [28] were introduced at
reduced inference cost and size. These smaller student
models capitalized on the inductive bias of larger teacher
models (BERT) to achieve faster training time. Similarly,
pruning and quantization [29] techniques became popular to
build economically sized models. Pruning can be classified
into 3 categories: weight pruning, layer pruning, and head
pruning where certain minimal contributing weights, layers,
and attention heads are removed from the model. Like
pruning, training-aware quantization is performed to achieve
less than 32-bit precision format thereby reducing model
size.
For higher performance, greater learning was required which
resulted in larger data storage and model size. Due to the
model’s enormity and implicit knowledge storage, its
learning ability had caveats in terms of efficient information
access. Current Knowledge Retrieval models like ORQA
[30], REALM [31], RAG [32], DPR [33] attempt to alleviate
implicit storage concerns of language models by providing
external access to interpretable modular knowledge. This
was achieved by supplementing the language model’s pre-
training with a ‘knowledge retriever’ that enabled the
model to effectively retrieve and attend over explicit target
documents from a large corpus like Wikipedia.
Further, the Transformer model’s inability to handle input
sequences beyond a fixed token span prevented it from
comprehending large textual bodies holistically. This was
particularly evident when related words were farther apart
than the input length. Hence, to enhance contextual
understanding, architectures like Transformer-XL [34],
Longformer [35], ETC [36], Big Bird [37], were introduced
with modified attention mechanisms to process longer
sequences.
Also, due to the surge in demand for NLP models to be
economically viable and readily available on edge devices,
innovative compressed models were launched based on
generic techniques. These are apart from the Distillation,
Pruning, and Quantization techniques described earlier. Such
models deploy a wide range of computing optimization
procedures including hashing [38], sparse attention [39],
factorized embedding parameterization [40], replaced token
detection [41], inter-layer parameter sharing [42], or a
combination of the above.
II. RELATED REVIEWS/TAXONOMY
We propose a novel NLP based taxonomy providing a
unique classification of current NLP models from six
different perspectives:
➢ NLU Models : NLU models excel in classification,
structured prediction, and/or query generation tasks. This
is accomplished through pre-training and fine-tuning
motivated by the downstream task.
➢ NLG Models : Contrary to NLU models, these stand out
in sequence-to-sequence generation tasks. They generate
clean text via few and single-shot learning from
corresponding corrupted utterances.
➢ Model Size Reduction : Use compression-based
techniques like KD, Pruning, and Quantization to make
large models economical and pragmatic. It's useful for the
real-time deployment of large language models to operate
on edge devices.
➢ Information Retrieval (IR) : Contextual open domain
question answering (QA) is reliant on effective and
efficient document retrieval. Hence, IR systems via
superior lexical and semantical extraction of physical
This combined abstract representation of all the words is fed
to the decoder to compute the desired language-based task.
Like its preceding layers, the final layer’s corresponding
learnable parameters are $U_{t+1}$ and $V_{t+1}$ at input and output
respectively at the Encoder and $U'_{t+1}$, $V'_{t+1}$ at the Decoder.
Combining the weight matrices with hidden state and bias
can be expressed mathematically as follows:
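Encoder:
$h_{t+1} = \sigma_h(U_{t+1} x_{t+1} + W_{t+1} h_t + b_t)$
$y_{t+1} = \sigma_y(V_{t+1} h_{t+1} + b_t)$
Decoder:
$h'_{t+1} = \sigma_h(U'_{t+1} x_{t+1} + W'_{t+1} h'_t + b_{t+1})$
$y_{t+1} = \sigma_y(V'_{t+1} h'_{t+1} + b_{t+1})$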
Thereafter, the introduction of Attention [48],[49] in 2014-15
overcame the limitations of the RNN Encoder-Decoder, which suffered
from prior input dependencies that made it challenging to infer
longer sequences, and from vanishing and exploding
gradients [50]. The attention mechanism eliminated the RNN
dependency by no longer passing the entire input context through one
final Encoder node. It weighs all inputs individually that feed
the decoder to create the target sequence. This results in a
greater contextual understanding leading to superior
predictions in target sequence generation. First, the
alignment determines the extent of match between the $j$-th
input and the $t$-th output, which can be determined as
$e_{tj} = \tanh(h_{t-1}, h_j)$
More precisely, the alignment scores take as input all
encoder output states and the previous decoded hidden state
which is expressed as:
$\mathrm{Align} = W_{comb} \cdot \tanh(W_{dec} H_{dec} + W_{enc} H_{enc})$
The decoder's hidden state and encoder outputs are passed
via their respective linear layers along with their trainable
weights. The weight $\alpha_{tj}$ for each encoded hidden
representation $h_j$ is computed as:
$\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T_x} \exp(e_{tk})}$
The resulting context vector in this attention mechanism is
determined by:
$c_t = \sum_{j=1}^{T_x} \alpha_{tj} h_j$
The Attention mechanism is essentially the generation of the
context vector computed from the various alignment scores
at different positions as shown in figure 3.
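As an illustrative sketch (not the authors' implementation), the alignment, softmax weighting, and context-vector steps described above can be written in NumPy. The function name `additive_attention`, the parameter names `W_dec`, `W_enc`, `w_comb`, and all dimensions are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def additive_attention(dec_state, enc_outputs, W_dec, W_enc, w_comb):
    """Return (attention weights, context vector) for one decoder step.

    dec_state:   (d_dec,)      previous decoder hidden state
    enc_outputs: (T_x, d_enc)  encoder hidden states h_1..h_Tx
    W_dec: (d_att, d_dec), W_enc: (d_att, d_enc), w_comb: (d_att,)
    """
    # Alignment score for every encoder position j (hypothetical parameter names).
    scores = np.tanh(W_dec @ dec_state + enc_outputs @ W_enc.T) @ w_comb  # (T_x,)
    alpha = softmax(scores)        # weights sum to one over the T_x inputs
    context = alpha @ enc_outputs  # weighted sum of encoder states, (d_enc,)
    return alpha, context

rng = np.random.default_rng(0)
T_x, d_enc, d_dec, d_att = 5, 4, 3, 6
enc_outputs = rng.normal(size=(T_x, d_enc))
dec_state = rng.normal(size=d_dec)
W_dec = rng.normal(size=(d_att, d_dec))
W_enc = rng.normal(size=(d_att, d_enc))
w_comb = rng.normal(size=d_att)

alpha, context = additive_attention(dec_state, enc_outputs, W_dec, W_enc, w_comb)
print(alpha, context.shape)
```

Each decoder step recomputes the weights, so different target positions can focus on different source positions.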
Luong’s Attention mechanism differs from the above-
mentioned Bahdanau’s in terms of alignment score
computation. It uses both global and local attention, where
the global attention uses all encoder output states while the
local attention focuses on a small subset of words. This helps
to achieve superior translation for lengthier sequences. These
attention designs led to the development of modern
Transformer architectures which use an enhanced attention
mechanism as described in the next section.
FIGURE 3. Attention Mechanism on Encoder-Decoder Model
IV. NLU ARCHITECTURES
NLU’s approach of transferring pre-trained neural language
representations demonstrated that pre-trained embeddings
improve downstream task results when compared to
embeddings learned from scratch [51],[52]. Subsequent
research works enhanced learning to capture contextualized
word representations and transferred them to neural models
[53],[54]. Recent efforts not limited to [55],[56],[57] have
further built on these ideas by adding end-to-end fine-tuning
of language models for downstream tasks in addition to
extraction of contextual word representations. This
engineering progression, coupled with large compute
availability has evolved NLU’s state of the art methodology
from transferring word embeddings to transferring entire
multi-billion parameter language models, achieving
unprecedented results across NLP tasks. Contemporary NLU
models leverage Transformers for modeling tasks and
exclusively use an Encoder or a Decoder-based approach as
per requirements. Such models are vividly explained in the
subsequent section.
IV-A TRANSFORMERS
IV-A.1. The Architecture
The original Transformer is a 6-layered Encoder-Decoder
model, that generates a target sequence via the Decoder from
the source sequence via the Encoder. The Encoder and
Decoder at a high level consist of a self-attention and a feed-
forward layer. In the Decoder an additional attention layer in
between enables it to map its relevant tokens to the Encoder
for translation purposes. Self Attention enables the look-up
of remaining input words at various positions to determine
the relevance of the currently processed word. This is
performed for all input words that help to achieve a superior
encoding and contextual understanding of all words.
The Transformer architecture was built to introduce parallelism into
the sequential processing of RNNs and LSTMs: input tokens are fed
in at once and the corresponding embeddings are
generated simultaneously via the Encoder. This embedding
maps a word (token) to a vector that can be learned
during training or, to conserve time, taken from a pre-trained
embedding space like GloVe. However, similar tokens in
different sequences might have different interpretations
which are resolved via a positional encoder that generates
context-based word information concerning its position.
Thereafter the enhanced contextual representation is fed to
the attention layer which furthers contextualization by
generating attention vectors that determine the relevance of
the $i$-th word in a sequence concerning other words. These
attention vectors are then fed to the feed-forward Neural
Network where they are transformed to a more digestible
form for the next ‘Encoder’ or the Decoder’s ‘Encoder-Decoder
Attention’ block.
The latter is fed with Encoder output and Decoder input
embedding that performs attention between the two. This
determines the relevance of Transformer’s input tokens
concerning its target tokens as the decoder establishes actual
vector representation between the source and target
mapping. The decoder predicts the next word via softmax
which is executed over multiple time steps until the
end-of-sentence token is generated. At each Transformer layer,
there are residual connections followed by a layer
normalization [58] step to speed up the training during
backpropagation. All of the Transformer architectural details
are demonstrated in Figure 4.
IV-A. 2. Queries, Keys, and Values
The input to the Transformer’s Attention mechanism is the
target token Query vector $Q$, its corresponding source token
Key vector $K$, and Values $V$, which are embedding matrices.
The mapping of source and destination tokens in machine
translation can be quantified by how similar each of their
tokens is in a sequence via the dot product. Therefore, to
achieve accurate translation the key should match its
corresponding query, via a high dot product value between
the two. Assume $Q \in \mathbb{R}^{L_Q \times D}$ and $K \in \mathbb{R}^{L_K \times D}$, where $L_Q$ and $L_K$
represent target and source lengths, while $D$ denotes the word
embedding dimensionality. Softmax is implemented to
achieve a probability distribution where all Query, Key
similarities add up to one and make attention more focused
on the best-matched keys.
$W_{SM} = \mathrm{softmax}(Q K^T)$ where $W_{SM} \in \mathbb{R}^{L_Q \times L_K}$
Query assigns a probability to key for matching and often
values are similar to keys, therefore
$\mathrm{Att}(Q, K, V) = \mathrm{softmax}(Q K^T)\, V = W_{SM} V$
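The Query, Key, and Value computation described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the function name `dot_product_attention` and the toy shapes are hypothetical, the inputs are random, and the scores are left unscaled to follow the description in this section (the original Transformer additionally divides the scores by the square root of the key dimensionality).

```python
import numpy as np

def dot_product_attention(Q, K, V):
    """Unscaled dot-product attention: softmax(Q K^T) V.

    Q: (L_Q, D) target-side queries
    K: (L_K, D) source-side keys
    V: (L_K, D_v) values
    """
    scores = Q @ K.T                                      # (L_Q, L_K) similarities
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    W_sm = np.exp(scores)
    W_sm = W_sm / W_sm.sum(axis=-1, keepdims=True)        # each row sums to one
    return W_sm @ V, W_sm

rng = np.random.default_rng(0)
L_Q, L_K, D = 3, 5, 8
Q = rng.normal(size=(L_Q, D))
K = rng.normal(size=(L_K, D))
V = K.copy()  # assumption for the demo: values taken equal to keys

out, W_sm = dot_product_attention(Q, K, V)
print(W_sm.shape, out.shape)
```

Each row of `W_sm` is the probability distribution that one query assigns over all keys, and the output row is the correspondingly weighted mix of values.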
IV-A. 3. Multi-Headed Attention (MHA) and Masking
MHA enhances the model’s capacity to emphasize a
sequence’s different token positions by implementing
attention multiple times in parallel. The resulting individual
attention outputs or heads are concatenated and transformed
via a linear layer to the expected dimensions. Each of the
multiple heads enables attending the sequence parts from a
different perspective providing similar representational
forms for each token. This is performed as each token’s self-
attention vector might weigh the word it represents higher
than others due to the high resultant dot product. This is not
productive since the goal is to achieve similarly assessed
interaction with all tokens. Therefore self-attention is
computed 8 different times resulting in 8 separate attention
vectors for each token which are used to compute the final
attention vector via a weighted sum of all 8 vectors for each
token. The resultant multi-headed attention vectors are
computed in parallel which is fed to the feed-forward layer.
Each subsequent target token $T_{t+1}$ is generated using as
many source tokens in the encoder ($S_0, \ldots, S_{t+n}$). However,
in an autoregressive decoder only the target tokens from previous
time steps are considered ($T_0, \ldots, T_t$) for future target
prediction, which is known as causal masking. This is
provided to enable maximal learning of the subsequently
translated target tokens. Therefore during parallelization via
matrix operations, it is ensured that the subsequent target
words are masked to zero, so the attention network cannot
see into the future. The Transformer described above
resulted in significant improvement in the NLP domain. This
led to a plethora of high-performance architectures that we
describe in the subsequent sections.
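A minimal NumPy sketch of multi-headed self-attention with causal masking follows. It is illustrative only: the function names, the random weights, and the toy dimensions are assumptions, and the scores use the scaled dot product of the original Transformer. Masking is implemented here by setting future scores to negative infinity before the softmax, which makes their attention weight exactly zero.

```python
import numpy as np

def causal_mask(length):
    """Boolean mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((length, length), dtype=bool))

def masked_self_attention(X, Wq, Wk, Wv, mask):
    """Single-head self-attention with masked positions excluded."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Masked scores become -inf, so their softmax weight is exactly zero.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

def multi_head_attention(X, heads, Wo, mask):
    """Run every head, concatenate the outputs, and apply a linear layer."""
    outputs = [masked_self_attention(X, Wq, Wk, Wv, mask)[0]
               for (Wq, Wk, Wv) in heads]
    return np.concatenate(outputs, axis=-1) @ Wo

rng = np.random.default_rng(0)
length, d_model, n_heads = 4, 16, 8   # 8 heads, as in the text
d_head = d_model // n_heads
X = rng.normal(size=(length, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
Wo = rng.normal(size=(n_heads * d_head, d_model))
mask = causal_mask(length)

out = multi_head_attention(X, heads, Wo, mask)
_, weights = masked_self_attention(X, *heads[0], mask)
print(out.shape)
print(np.round(weights, 2))  # upper triangle is zero: no look-ahead
```

Because the mask is applied inside a single matrix operation, all target positions are still processed in parallel during training.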
FIGURE 4. The Multi-headed Transformer Architecture
IV-B EMBEDDINGS FROM LANGUAGE MODELS: ELMo
The goal of ELMo [59] is to generate a deep contextualized
word representation that could model (i) intricate syntactic
and semantic characteristics of words and (ii) polysemy or
lexical ambiguity, i.e., words with similar pronunciations could
have different meanings in different contexts or locations.
These enhancements gave rise to contextually rich word
embeddings which were unavailable in the previous SOTA
models like GloVe. Unlike its predecessors that used a
predetermined embedding, ELMo considers all $N$ token
occurrences $(t_1, t_2, \ldots, t_N)$ for each token $t$ in the entire
sequence before creating embeddings. The authors
hypothesize that the model could extract abstract linguistic
attributes in its architecture’s top layers via a task-specific
bi-directional LSTM.
This is possible by combining a forward and a backward
language model. At timestep $k-1$, the forward language
model predicts the next token $t_k$ given the input sequence’s
previous observed tokens via a joint probability distribution
GPT performs various tasks like classification, entailment,
similarity index, Multiple-Choice Questions (MCQ) as
shown in figure 6. The extraction phase distills features from
textual bodies before which the text is separated via the
‘Delimiter’ token during text pre-processing. This token is
not required for classification tasks since it does not need to
gauge the relationship between multiple sequences.
Moreover, Q&A or textual entailment tasks involve defined
inputs like ordered sentence pairs or triplets in a document.
For MCQ tasks, contextual alterations are required at input
to achieve the correct results. This is done via a Transformer
based Decoder training objective where input
transformations are fine-tuned for their respective answers.
IV-C BIDIRECTIONAL ENCODER REPRESENTATIONS
FROM TRANSFORMER: BERT
BERT is a stack of pre-trained Transformer Encoders that
overcomes prior models’ restrictive expressiveness i.e.,
GPT’s lack of bidirectional context and ELMo’s shallow
dual context’s concatenation. BERT’s deeper model
provides a token with several contexts with its multiple
layers and the bi-directional model provides a richer learning
environment. However, bi-directionality raises concerns that
tokens could implicitly foresee future tokens during pre-
training resulting in minimal learning and leading to trivial
predictions. To effectively train such a model, BERT
implements Masked Language Modeling (MLM) that masks
15% of all input tokens randomly in each input sequence.
This masked word prediction is the new requirement unlike
recreating the entire output sequence in a unidirectional LM.
BERT masks tokens only during pre-training, so the [MASK] token
does not appear during fine-tuning, creating a mismatch between
the two phases. To overcome this
disparity, subtle modeling modifications are performed
during the pre-training phase. If a token $T_i$ is chosen to be
masked, then 80% of the time it is replaced with the [MASK]
token, 10% of the time it is replaced with a random token, and for the
remaining 10% it remains unchanged. Thereafter $T_i$ is used to
predict the original token with cross-entropy loss; the unchanged
token step is employed to maintain a bias towards the correct
prediction. This methodology creates a state of randomness
and constant learning for the Transformer encoder which is
compelled to maintain a distributed contextual
representation of each token. Further, as random replacement
arises for a mere 1.5% of all tokens (10% of 15%), this does
not seem to impair the language model’s understanding
ability.
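The 15% selection and the 80/10/10 replacement rule described above can be sketched as follows. This is a simplified illustration, not BERT's actual preprocessing code: the function name `mask_tokens`, the toy vocabulary, and the per-token independent selection (rather than selecting an exact 15% of positions) are assumptions.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted tokens, labels).

    labels[i] holds the original token where a prediction is required
    and None elsewhere.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                       # predict the original token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: keep unchanged
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels

tokens = "he came back from home".split() * 200
vocab = ["earth", "moves", "around", "sun"]  # hypothetical toy vocabulary
corrupted, labels = mask_tokens(tokens, vocab)
selected = sum(label is not None for label in labels)
print(selected / len(tokens))  # close to 0.15 for long inputs
```

The loss is computed only at positions where `labels` is not `None`, which includes the positions left unchanged.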
Language modeling could not explicitly comprehend the
association between multiple sequences; therefore it was
deemed sub-optimal for inference and Q&A tasks. To
overcome this, BERT was pre-trained with a monolingual
corpus for a binarized Next Sentence Prediction (NSP) task.
As shown in Figure 7, sentences 𝑌 (He came [MASK] from
home) and 𝑍 (Earth [MASK] around Sun) do not form any
continuity or relationship. Since 𝑍 is not the actual next
sentence following 𝑌, the output classification label
[NotNext] gets activated, and [IsNext] activates when
sequences are coherent.
FIGURE 7. The architecture of BERT’s MLM and NSP functionality
IV-D GENERALIZED AUTOREGRESSIVE PRETRAINING
FOR LANGUAGE UNDERSTANDING: XLNet
XLNet captures the best of both worlds where it preserves
the benefits of Auto-Regressive (AR) modeling and
bidirectional contextual capture. To better comprehend why
XLNet outperforms BERT, consider the 5-token sequence
[San, Francisco, is, a, city]. The two tokens chosen for
prediction are [San, Francisco], hence BERT and XLNet
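maximize $\log p(\text{San Francisco} \mid \text{is a city})$ as follows:
$\mathcal{J}_{BERT} = \log p(\text{San} \mid \text{is a city}) + \log p(\text{Francisco} \mid \text{is a city})$
$\mathcal{J}_{XLNet} = \log p(\text{San} \mid \text{is a city}) + \log p(\text{Francisco} \mid \text{San is a city})$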
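The above can further be generalized for the target ($\mathcal{T}$) and
non-target token set ($\mathcal{N}$): BERT and XLNet will maximize
$\log p(\mathcal{T} \mid \mathcal{N})$ with the following different interpretations:
$\mathcal{J}_{BERT} = \sum_{x \in \mathcal{T}} \log p(x \mid \mathcal{N})$
$\mathcal{J}_{XLNet} = \sum_{x \in \mathcal{T}} \log p(x \mid \mathcal{N}\, \mathcal{T}_{<x})$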
XLNet considers the target as well as the remaining tokens
for prediction, whereas BERT only considers the non-target
tokens. Hence, XLNet captures the inter-pair dependency
[San, Francisco] unlike BERT where either [San] or
[Francisco] leads to correct prediction. Further, via AR
XLNet performs factorized ordering on all possible token
permutations ($L! = 5!$) of sequence length $L$ in the set i.e.,
{[1, 2, 3, 4, 5], [1, 2, 5, 4, 3], ..., [5, 4, 3, 2, 1]} ≅ [is, San,
Francisco, a, city] etc.
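$\max_{\theta} \; \mathbb{E}_{z \sim \mathcal{Z}_T} \left[ \sum_{t=1}^{T} \log p_{\theta}(x_{z_t} \mid \mathbf{x}_{z_{<t}}) \right]$
where set $\mathcal{Z}_T$ contains all permutational sequences of length
$T$ $[1, 2, \ldots, T]$ and $x_{z_t}$ is the reference token. Hence the target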
learns from numerous combinations attaining a richer
contextualized learning. Further for all permutable
factorization orders, the model parameters are shared to build
knowledge and bidirectional context from all factorizations
as demonstrated via equation 27.
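To make the effect of permuted factorization orders concrete, the following sketch enumerates every order for a short sequence and records which positions are visible when a chosen target is predicted. The function name `contexts_for` and the 3-token example are assumptions made for illustration; a real implementation samples orders rather than enumerating all of them.

```python
from itertools import permutations

def contexts_for(target, length):
    """Map each factorization order to the positions visible when
    predicting `target` (positions are 1-based)."""
    result = {}
    for order in permutations(range(1, length + 1)):
        step = order.index(target)
        result[order] = set(order[:step])  # tokens earlier in this order
    return result

ctx = contexts_for(target=3, length=3)
for order, visible in sorted(ctx.items()):
    print(order, sorted(visible))
# (1, 2, 3) [1, 2]
# (1, 3, 2) [1]
# (2, 1, 3) [1, 2]
# (2, 3, 1) [2]
# (3, 1, 2) []
# (3, 2, 1) []
```

Across the orders, the same target is conditioned on different subsets of the remaining positions, so with shared parameters it learns from context on both sides while each individual order remains autoregressive.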
IV-D.1. Masking
There is a challenge to determine the word order in the
sequence as the token ($x_{z_t}$) determining the autoregression
is not considered. This word order is partially achieved via
positional encoding; however, for contextual understanding
XLNet employs masking. Consider a generated permutation
of [2, 1, 3] in a 3-token sequence where the first token i.e., 2
has no context, hence its mask is [0,0,0] in the 2nd
row of the 3 × 3 masking matrix. Similarly, the 2nd and 3rd
masks would result in [0,1,0] and [1,1,0] in the 1st and 3rd row
of the Query Stream (QS) masking matrix where the token
cannot see itself. The QS matrix with an all-one diagonal
inclusion constitutes the Content Stream (CS) masking matrix
where each token can see itself. This 3-token sequence
masking is demonstrated in figure 8 below.
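The masking matrices for the [2, 1, 3] example can be reproduced with a short sketch. The function name `permutation_masks` and the list-of-lists representation are assumptions for illustration; an entry of 1 at row i, column j means token i may attend to token j.

```python
def permutation_masks(order):
    """Build Query Stream and Content Stream masks for a factorization order.

    `order` lists 1-based token positions in the order they are predicted.
    """
    n = len(order)
    query = [[0] * n for _ in range(n)]
    for step, target in enumerate(order):
        for seen in order[:step]:          # tokens earlier in the order
            query[target - 1][seen - 1] = 1
    content = [row[:] for row in query]
    for i in range(n):
        content[i][i] = 1                  # content stream also sees itself
    return query, content

query, content = permutation_masks([2, 1, 3])
print(query)    # [[0, 1, 0], [0, 0, 0], [1, 1, 0]]
print(content)  # [[1, 1, 0], [0, 1, 0], [1, 1, 1]]
```

Row 2 of the Query Stream mask is all zeros because token 2 is predicted first, while rows 1 and 3 match the [0,1,0] and [1,1,0] masks given in the text.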
FIGURE 8. Illustration of predicting x 2 in the 3-token sequence with
different factorization orders and its corresponding masking matrices
The first reference ‘2’ has no context which is gathered from
its corresponding ‘mem block’, a Transformer-XL-based
extended cached memory access. Thereafter it receives
context from token ‘3’ and ‘1’,’3’ for subsequent orderings.
IV-D. 2. Model Architecture
Figure 9 demonstrates the model’s two-stream attention
framework that consists of a content and query stream
attention process to achieve greater understanding via
contextualization. This process is initiated via target-aware
representation, where the target position is baked into the
input for subsequent token generation purposes.
(i) Target Aware Representation: A vanilla implementation
of Transformer based parametrization does not suffice for
complex permutation-based language modeling. This is
because the next token distribution $p_\theta(X_{\mathcal{Z}_t} \mid \mathbf{x}_{\mathbf{z}<t})$ is
independent of the target position i.e., $\mathcal{Z}_t$. Subsequently,
a redundant distribution is generated, which is unable to
discover effective representations, hence a target
position-aware re-parametrization for the next-token distribution is
proposed as follows:
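$p_\theta(X_{\mathcal{Z}_t} = x \mid \mathbf{x}_{\mathbf{z}<t}) = \frac{\exp\left(e(x)^T h_\theta(\mathbf{x}_{\mathbf{z}<t})\right)}{\sum_{x'} \exp\left(e(x')^T h_\theta(\mathbf{x}_{\mathbf{z}<t})\right)}$
$p_\theta(X_{\mathcal{Z}_t} = x \mid \mathbf{x}_{\mathbf{z}<t}) = \frac{\exp\left(e(x)^T g_\theta(\mathbf{x}_{\mathbf{z}<t}, \mathcal{Z}_t)\right)}{\sum_{x'} \exp\left(e(x')^T g_\theta(\mathbf{x}_{\mathbf{z}<t}, \mathcal{Z}_t)\right)}$
where $g_\theta(\mathbf{x}_{\mathbf{z}<t}, \mathcal{Z}_t)$ is a modified representation that
additionally considers the target position $\mathcal{Z}_t$ as an input.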
(ii) Two Stream Self Attention: The formulation of $g_\theta$
remains a challenge despite the above resolution as the goal
is to rely on the target position $\mathcal{Z}_t$ to gather contextual
information $\mathbf{x}_{\mathbf{z}<t}$ via attention, hence: (1) For $g_\theta$ to predict
$x_{\mathcal{Z}_t}$, it should utilize the position of $\mathcal{Z}_t$ only to incorporate
greater learning, not the content $x_{\mathcal{Z}_t}$. (2) To predict other
tokens $x_{\mathcal{Z}_j}$ where $j > t$, $g_\theta$ should encode the content $x_{\mathcal{Z}_t}$
to provide full contextual understanding.
To further resolve the above conflict, the authors propose
two sets of hidden representation instead as follows:
❖ The hidden content representation $h_\theta(\mathbf{x}_{\mathbf{z} \le t})$, abbreviated $h_{\mathcal{Z}_t}$, that
encodes both context and content $x_{\mathcal{Z}_t}$
❖ The query representation $g_\theta(\mathbf{x}_{\mathbf{z}<t}, \mathcal{Z}_t)$, abbreviated $g_{\mathcal{Z}_t}$, which
solely accesses the contextual information $\mathbf{x}_{\mathbf{z}<t}$ and
position $\mathcal{Z}_t$ without the content $x_{\mathcal{Z}_t}$
FIGURE 9. (Left): Standard Attention via Content Stream and Query
Stream Attention without access to the content. (Right): LM training
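The above two attention courses are parametrically shared
and updated for every self-attention layer $m$ as:
$h_{\mathcal{Z}_t}^{(m)} \leftarrow \mathrm{Attention}(Q = h_{\mathcal{Z}_t}^{(m-1)},\; KV = h_{\mathcal{Z}_{\le t}}^{(m-1)};\; \theta)$
(Content Stream: utilize both $\mathcal{Z}_t$ and $x_{\mathcal{Z}_t}$)
$g_{\mathcal{Z}_t}^{(m)} \leftarrow \mathrm{Attention}(Q = g_{\mathcal{Z}_t}^{(m-1)},\; KV = h_{\mathcal{Z}_{<t}}^{(m-1)};\; \theta)$
(Query Stream: use $\mathcal{Z}_t$ without seeing $x_{\mathcal{Z}_t}$)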
This dual attention is pictorially expressed in figure 9. For
simplicity purposes, consider the prediction of token $t_i$ that
is not allowed to access its corresponding embedding from
the preceding layer. However, to predict $t_{i+1}$ the token $t_i$
needs to access its embedding and both operations must
occur in a single pass.
Therefore, two hidden representations are implemented
where $h_{\mathcal{Z}_t}^{(m)}$ is initialized via token embeddings and $g_{\mathcal{Z}_t}^{(m)}$
through weighted transformations. From the above equations
$h_{\mathcal{Z}_t}^{(m)}$ can access the history including the current position
whereas $g_{\mathcal{Z}_t}^{(m)}$ can access only previous $h_{\mathcal{Z}_t}^{(m)}$ positions.
The token prediction happens in the final layer via $g_{\mathcal{Z}_t}^{(m)}$.
For greater sequence length processing the memory blocks
V-A LANGUAGE MODELS ARE UNSUPERVISED MULTI-
TASK LEARNERS: GPT-II
GPT-II [62] was possibly the first model that heralded the
rise of NLG models. It was trained in an unsupervised
manner and is capable of learning complex tasks including
Machine Translation, reading comprehension, and
summarization without explicit fine-tuning. Task-specific
training corresponding to its dataset was the core reason
behind the generalization deficiency witnessed in current
models. Hence robust models would likely require training
and performance gauges on a variety of task domains.
GPT-II incorporates a generic probabilistic model where
numerous tasks can be performed for the same input as
$p(output \mid input, task)$. The training and test set
performance improves as model size is scaled up, yet the
model still underfits the huge WebText dataset. The 1.5
billion parameter GPT-II outperformed its predecessors on
most datasets in the previously mentioned tasks in a zero-
shot environment. It is an extension of the GPT-I decoder-
only architecture trained on significantly greater data.
V-B BIDIRECTIONAL AND AUTOREGRESSIVE
TRANSFORMERS: BART
BART, a denoising autoencoder, is a sequence-to-sequence
[63] model that incorporates two-stage pre-training: (1)
Corruption of original text via a random noising function,
and (2) Recreation of the text via training the model. Noising
flexibility is the major benefit of the model where random
transformations not limited to length alterations are applied
to the original text. Two such noising variations that stand
out are random order shuffling of the original sentences and a
filling scheme where text spans of any length are
randomly replaced by a single masked token. BART deploys
all possible document corruption schemes as shown below in
figure 11, where, in the most severe circumstance, all source
information is lost and BART behaves like a language
model.
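The two noising variations highlighted above can be sketched as simple list transformations. This is an illustration under stated assumptions: the function names, the `<mask>` string, and the explicit `start`/`length` arguments are hypothetical, whereas the actual pre-training procedure samples the spans and their lengths randomly.

```python
import random

MASK = "<mask>"

def text_infilling(tokens, start, length):
    """Replace the span tokens[start:start+length] with a single mask token.

    A span of length 0 simply inserts a mask token at `start`.
    """
    return tokens[:start] + [MASK] + tokens[start + length:]

def sentence_permutation(sentences, seed=0):
    """Return the sentences in a shuffled order (seeded for repeatability)."""
    shuffled = sentences[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled

tokens = "a b c d e".split()
print(text_infilling(tokens, start=1, length=3))  # ['a', '<mask>', 'e']
print(text_infilling(tokens, start=1, length=0))  # ['a', '<mask>', 'b', 'c', 'd', 'e']

sentences = ["He came back from home.", "Earth moves around the Sun.", "It rained."]
print(sentence_permutation(sentences))
```

Since one mask token can stand for any number of missing tokens, the model must also infer how much text to regenerate, which is what distinguishes this scheme from single-token masking.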
FIGURE 11. Denoised BART Model and its Noising Schemes
This forces the model to reason over the overall sequence
length and allows a wider range of input transformations,
which results in better generalization than BERT. BART is
pre-trained by optimizing a reconstruction loss on corrupted
input documents, i.e., the cross-entropy between the
decoder’s output and the original document. For machine
translation tasks, BART’s encoder embedding layer is
replaced with a randomly initialized encoder that is trained
end-to-end with the pre-trained model as shown in Figure 12.
This encoder maps its foreign vocabulary to an input that
BART can denoise into the target language, English. The
source encoder is trained in two stages, both of which
backpropagate the cross-entropy loss from BART’s output.
First, most BART parameters are frozen, and only the
randomly initialized encoder, BART’s positional embeddings,
and the self-attention input projection matrix of its
encoder are updated. Second, all model parameters are
jointly trained for a few iterations. BART achieves
state-of-the-art performance on several text generation
tasks, fueling further exploration of NLG models. It
achieves results comparable to RoBERTa on discriminative
tasks.
FIGURE 12. Denoised BART Model for fine-tuned MT tasks
V-C MULTILINGUAL DENOISING PRE-TRAINING FOR
NEURAL MACHINE TRANSLATION: mBART
V-C. 1. Supervised Machine Translation
mBART demonstrates that considerable performance gains
are achieved over prior techniques [64], [65] by
pre-training an autoregressive BART model with a
sequence-reconstruction denoising objective across 25
languages from the Common Crawl (CC-25) corpus [66].
mBART’s parametric fine-tuning can be supervised or
unsupervised, for any language pair, without task-specific
revision. For instance, fine-tuning on one language pair
(e.g., German-English) enables the model to translate from
any other language in the monolingual pre-training set
(e.g., French-English) without further training. Since the
number of tokens varies significantly across languages, the
corpus is balanced via textual up/downsampling from each
language $i$ with the ratio

$$\lambda_i = \frac{1}{p_i} \cdot \frac{p_i^{\alpha}}{\sum_{j} p_j^{\alpha}}$$

where $p_i$ is each language’s percentage in the dataset,
with a smoothing parameter $\alpha = 0.7$. The training data
encompasses $K$ languages: $\mathcal{C} = \{\mathcal{C}_1, \dots, \mathcal{C}_K\}$,
where each $\mathcal{C}_i$ is the $i^{th}$ language’s
monolingual document collection. Consider a text-corrupting
noising function $g(X)$ where the model is trained to
predict the original text $X$; hence $\mathcal{L}_\theta$ is
maximized as:

$$\mathcal{L}_\theta = \sum_{\mathcal{C}_i \in \mathcal{C}} \sum_{X \in \mathcal{C}_i} \log P(X \mid g(X); \theta)$$

where language $i$ has an instance $X$ and the above
distribution $P$ is defined via a sequence-to-sequence model.
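The effect of the balancing ratio can be checked numerically. The sketch below assumes the ratio has the form λ_i = (1/p_i) · p_i^α / Σ_j p_j^α with α = 0.7; the function name `sampling_ratios` and the three language shares are hypothetical values chosen only for illustration.

```python
def sampling_ratios(shares, alpha=0.7):
    """Return lambda_i = (1 / p_i) * p_i**alpha / sum_j p_j**alpha."""
    normalizer = sum(p ** alpha for p in shares)
    return [(p ** alpha / normalizer) / p for p in shares]


shares = [0.8, 0.15, 0.05]  # hypothetical language percentages p_i
ratios = sampling_ratios(shares)

# Share of each language after up/downsampling by lambda_i.
rebalanced = [p * lam for p, lam in zip(shares, ratios)]
```

A ratio below 1 downsamples a high-resource language and a ratio above 1 upsamples a low-resource one, while the rebalanced shares still sum to one.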
V-C. 2. Unsupervised Machine Translation
mBART is evaluated on tasks where target bi-text (text
pairs) is not available, in the following 3 settings.
❖ No bi-text of any kind is available; here back-
translation (BT) [67],[68] is a familiar solution. mBART
offers a clean and effective initialization scheme for
such techniques.
❖ No bi-text is available for the target pair, however,
both languages appear in bi-text corpora for other
language pairs.
❖ No bi-text is available for the target pair, however,
bi-text is available for translation from a different
language into the target language. This novel evaluation
scheme demonstrates mBART’s transfer learning capability
despite the absence of the source language’s bi-text.
mBART is pre-trained for all 25 languages and fine-tuned
for the target language as shown in Figure 13.
FIGURE 13. mBART Generative Model Pre-training & Fine-tuning
V-D EXPLORING THE LIMITS OF TRANSFER
LEARNING WITH A TEXT-TO-TEXT TRANSFORMER: T5
This model was built by surveying and applying the most
effective transfer learning practices. Here all NLP tasks are
reframed into a unified text-to-text setup, where text
strings are both inputs and outputs, so that the same model
and hyperparameters are used for every task. A high-quality,
diverse and vast dataset is required to measure the
scaled-up effect of pre-training in the 11 billion parameter
T5. Therefore, the Colossal Clean Crawled Corpus (C4) was
developed, about two orders of magnitude larger than
Wikipedia.
The authors concluded that causal masking restricts the
model to attend only up to the $i^{th}$ input entry of a
sequence, which turns out to be detrimental. Hence T5
incorporates fully visible masking over the sequence’s
prefix section (prefix LM), whereas causal masking is used
for training the target’s prediction. The following
conclusions were made after surveying the current transfer
learning landscape.
❖ Model Configuration: Models with encoder-decoder
architectures normally outperformed decoder-only
language models.
❖ Pre-Training Goals: Denoising, a fill-in-the-blank
objective where the model is pre-trained to recover the
missing words of the input, worked best at an acceptable
computational cost.
❖ In-Domain Datasets: Training on in-domain data turns out
to be effective, however, pre-training on small datasets
generally leads to overfitting.
❖ Training Approaches: A pre-train, fine-tune methodology
for multi-task learning could be effective, however, each
task’s training frequency needs to be monitored.
❖ Scaling Economically: To efficiently utilize finite
computing resources, model size scaling, training time,
and the number of ensembled models are compared against
each other.
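The difference between causal masking and the prefix-LM masking discussed earlier in this subsection can be made concrete with a small sketch. The helper names `causal_mask` and `prefix_lm_mask` are assumptions for illustration; entry `[i][j] == 1` means that position `i` may attend to position `j`.

```python
def causal_mask(n):
    """Position i attends only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def prefix_lm_mask(n, prefix_len):
    """Fully visible attention over the prefix, causal attention afterwards."""
    return [
        [1 if (j < prefix_len or j <= i) else 0 for j in range(n)]
        for i in range(n)
    ]


mask = prefix_lm_mask(4, prefix_len=2)
```

With a prefix of length 2, the first two rows see the whole prefix, whereas the remaining target positions keep the triangular causal pattern.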
V-E TURING NATURAL LANGUAGE GENERATION: T-
NLG
T-NLG is a 78-layer Transformer-based generative language
model that outsizes T5 with its 17 billion trainable
parameters. It achieves a greater speedup than Nvidia’s
Megatron, which was based on interconnecting multiple
machines via low-latency buses. T-NLG is a progressively
larger model, pre-trained with a greater variety and
quantity of data. It provides superior results in
generalized downstream tasks with fewer fine-tuning
samples. Hence, its authors conceptualized training one huge
centralized multi-task model whose resources are shared
across various tasks, rather than allocating a separate
model to each task. Consequently, the model effectively
performs question answering without prior context, leading
to enhanced zero-shot learning. The Zero Redundancy
Optimizer (ZeRO) achieves both model and data parallelism
concurrently, which is perhaps the primary reason T-NLG
could be trained with high throughput.
V-F LANGUAGE MODELS ARE FEW-SHOT LEARNERS:
GPT-III
The GPT family (I, II, and III) are autoregressive language
models based on transformer decoder blocks, unlike the
denoising autoencoder-based BERT. GPT-3 has 175 billion
parameters and is trained on a dataset of 300 billion tokens
of text used for generating training examples for the model.
GPT-3 is 10 times the size of any previous language model,
and for all tasks it employs few-shot learning via a text
interface; without gradient updates or fine-tuning, it
achieves task agnosticism. It employs unsupervised
pre-training, where the language model acquires a wide
𝑠
𝑠
𝑖
log(𝑝
𝑖
𝑠𝑖
𝑖
FIGURE 15. Language Model’s Generalized Distilled Architecture
The standard model of vanilla knowledge distillation
integrates the distilled and the student loss as shown below,

$$\mathcal{L}(x; W) = \alpha\, \mathcal{L}_D(z_t, z_s) + \beta\, \mathcal{L}_s(y, z_s)$$

where $W$ denotes the student parameters and $\alpha, \beta$
are regulating parameters. In the original paper, a weighted
average was used for $\alpha$ and $\beta$, i.e.,
$\beta = 1 - \alpha$, and for best results it was observed
that $\alpha \gg \beta$.
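A minimal sketch of this weighted combination is given below. It assumes that the distilled loss is the cross-entropy between temperature-softened teacher and student distributions, and that the student loss is the cross-entropy against the ground-truth label; the function names, the temperature of 2.0 and the value α = 0.9 are illustrative assumptions rather than values taken from the surveyed papers.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives softer outputs."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs):
    return -sum(t * math.log(p)
                for t, p in zip(target_probs, predicted_probs) if t > 0)


def kd_loss(teacher_logits, student_logits, label, alpha=0.9, temperature=2.0):
    """alpha * distilled loss + beta * student loss, with beta = 1 - alpha."""
    beta = 1.0 - alpha
    distilled = cross_entropy(softmax(teacher_logits, temperature),
                              softmax(student_logits, temperature))
    one_hot = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    student = cross_entropy(one_hot, softmax(student_logits))
    return alpha * distilled + beta * student


teacher = [2.0, 1.0, 0.0]
good = kd_loss(teacher, [2.0, 1.0, 0.0], label=0)  # student mimics the teacher
bad = kd_loss(teacher, [0.0, 1.0, 2.0], label=0)   # student disagrees
```

A student that reproduces the teacher's logits obtains a lower combined loss than one that ranks the classes in the opposite order.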
VI-A.1. DistilBERT
DistilBERT, the student version of the teacher BERT,
retained 97% of BERT’s language understanding performance
while being lighter and faster at inference time and
requiring a lower training cost. Through KD, DistilBERT
reduces BERT’s size by 40% and is 60% faster, and the
compressed model is small enough to be operated on edge
devices. The layer depth of DistilBERT is halved compared
with BERT, while both possess the same dimensionality and a
generally equivalent architecture. The number of layers,
rather than the hidden size, was reduced because layer
normalization and linear layers are already highly
optimized, so shrinking the hidden dimension has a smaller
effect on computational efficiency than removing layers. To
leverage the inductive biases learned by large pre-trained
models, DistilBERT introduced a triple loss function which
linearly combined the distillation loss ($\mathcal{L}_D$)
with the supervised training loss, i.e., the masked language
modeling loss ($\mathcal{L}_{mlm}$). It was observed that
supplementing these losses with an embedding cosine loss
($\mathcal{L}_{cos}$) was beneficial, as it directionally
aligned the teacher’s and student’s hidden state vectors.
VI-A. 2. TinyBERT
To overcome the distillation complexity of the
pre-training-then-fine-tuning paradigm, TinyBERT introduced
a lucid knowledge transfer process built on 3 loss functions
computed from: (i) the embedding layer output, (ii) the
attention matrices and hidden states of the Transformer
layers, and (iii) the output logits. This led TinyBERT to
retain over 96% of BERT’s performance at a drastically
reduced size, while using a meager 28% of the parameters and
31% of the inference time of BERT-based distillation
baselines. Further, it leveraged the untapped extractable
potential from BERT’s learned attention weights [70].
Across the student’s layers, indexed up to the
$(M+1)^{th}$ layer, the knowledge acquired is enhanced by
minimizing:

$$\mathcal{L}_{model} = \sum_{m=0}^{M+1} \lambda_m\, \mathcal{L}_{layer}\big(S_m, T_{g(m)}\big)$$
where $\mathcal{L}_{layer}$ is the loss function of a
Transformer or an Embedding layer and the hyperparameter
$\lambda_m$ signifies the importance of the $m^{th}$ layer’s
distillation. BERT’s attention-based enhancement for
language understanding can be incorporated in TinyBERT as:

$$\mathcal{L}_{attn} = \frac{1}{h} \sum_{i=1}^{h} \mathrm{MSE}\big(A_i^{S}, A_i^{T}\big), \quad A_i \in \mathbb{R}^{l \times l}$$
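where $h$ denotes the number of heads, $A_i$ is the
attention matrix corresponding to the student’s or teacher’s
$i^{th}$ head, $l$ denotes the input text length, and MSE is
the mean squared error loss function. Further, TinyBERT
distills knowledge from the Transformer output layer, which
can be expressed as:

$$\mathcal{L}_{hidn} = \mathrm{MSE}\big(H^{S} W_h, H^{T}\big), \quad W_h \in \mathbb{R}^{d' \times d},\; H^{S} \in \mathbb{R}^{l \times d'},\; H^{T} \in \mathbb{R}^{l \times d}$$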
where $H^{S}$ and $H^{T}$ are the hidden states of the
student and teacher respectively, the hidden sizes of the
teacher and student models are denoted via the scalar values
$d$ and $d'$, and $W_h$ is a learnable matrix that
transforms the student network’s hidden states into the
teacher network’s state space. Similarly, TinyBERT also
performs distillation on the embedding layer:

$$\mathcal{L}_{embd} = \mathrm{MSE}\big(E^{S} W_e, E^{T}\big)$$
where $E^{S}$ and $E^{T}$ are the embedding matrices of the
student and teacher networks, respectively. Apart from
mimicking the intermediate layer behavior, TinyBERT
implements KD to fit the predictions of the teacher model
via a cross-entropy loss between the logits of the student
and the teacher.

$$\mathcal{L}_{pred} = -\,\mathrm{softmax}\big(z^{T}\big) \cdot \log\big(\mathrm{softmax}\big(z^{S} / t\big)\big)$$

Here $z^{T}$ and $z^{S}$ are the respective logits predicted
by the teacher and student models, and $t$ is a temperature
value.
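The attention-based and hidden-state losses can be sketched with plain nested lists. This is a minimal illustration under the assumption that both losses are mean squared errors and that a learnable matrix projects the student's hidden size onto the teacher's; the helper names and the toy matrices are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two matrices given as nested lists."""
    flat_a = [x for row in a for x in row]
    flat_b = [x for row in b for x in row]
    return sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)) / len(flat_a)


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def attention_loss(student_heads, teacher_heads):
    """Average over heads of MSE(A_i^S, A_i^T)."""
    h = len(student_heads)
    return sum(mse(s, t) for s, t in zip(student_heads, teacher_heads)) / h


def hidden_loss(h_student, w_h, h_teacher):
    """MSE(H^S W_h, H^T): W_h maps the student width to the teacher width."""
    return mse(matmul(h_student, w_h), h_teacher)


# Toy example: one token, student width 2, teacher width 3.
w_h = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
loss = hidden_loss([[1.0, 2.0]], w_h, [[1.0, 2.0, 3.0]])
```

In training, the projection matrix would be learned jointly with the student; here it is fixed only to keep the arithmetic easy to follow.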
VI-A.3. MobileBERT
Unlike previous distilled models, MobileBERT achieves
task-agnostic compression of BERT, with training convergence
driven by prediction and distillation losses. To train such
a deep and thin model, a unique inverted-bottleneck BERT
(IB-BERT) teacher model is designed, from which knowledge is
transferred to MobileBERT. It is 4.3× smaller and 5.5×
faster than BERT, achieving a competitive score that is 0.6
units lower than BERT on GLUE-based inference tasks.
Further, the low latency of 62 ms on a Pixel 4 phone can be
attributed to the replacement of Layer Normalization with a
simpler Hadamard product (∘) based linear transformation,
and of the gelu activation with relu.

$$\mathrm{NoNorm}(h) = \gamma \circ h + \beta, \quad \gamma, \beta \in \mathbb{R}^{n}$$
For knowledge transfer, the mean squared error between the
feature maps of MobileBERT and IB-BERT is implemented as a
transfer objective.

$$\mathcal{L}_{FMT}^{l} = \frac{1}{TN} \sum_{t=1}^{T} \sum_{n=1}^{N} \big(H_{t,l,n}^{tr} - H_{t,l,n}^{st}\big)^{2}$$
where $l$ is the layer index, $T$ is the sequence length,
and $N$ is the feature map size. For MobileBERT to harness
the attention capability of its teacher, the KL-divergence
between the per-head attention distributions of the two
models is minimized, where $A$ denotes the number of
attention heads.

$$\mathcal{L}_{AT}^{l} = \frac{1}{TA} \sum_{t=1}^{T} \sum_{a=1}^{A} D_{KL}\big(a_{t,l,a}^{tr} \,\|\, a_{t,l,a}^{st}\big)$$
Alternatively, a new KD loss can be implemented during
MobileBERT’s pre-training as a linear combination with
BERT’s MLM and NSP losses, where $\alpha$ is a
hyperparameter in (0,1).

$$\mathcal{L}_{PD} = \alpha\, \mathcal{L}_{MLM} + (1 - \alpha)\, \mathcal{L}_{KD} + \mathcal{L}_{NSP}$$
For the above-outlined objectives, 3 training strategies are
proposed:
(i) Auxiliary Knowledge Transfer: Intermediate knowledge
transfer serves as an auxiliary task, via a linear
combination of all layer-wise transfer losses and the
pre-training distillation loss.
(ii) Joint Knowledge Transfer: For superior results, the 2
losses are separated: MobileBERT is first trained jointly
with all layer-wise transfer losses and is then further
trained with pre-training distillation.
(iii) Progressive Knowledge Transfer: To minimize error
transfer from lower to higher layers, knowledge transfer is
divided into $L$ stages for $L$ layers, where each layer is
trained progressively.
VI-B PRUNING
Pruning [71] is a methodology where certain weights, biases,
layers, and activations are zeroed out so that they no
longer take part in the model’s backpropagation. This
introduces sparsity in such elements, which is visible after
a ReLU layer that converts negative values to zero
($ReLU(x) = \max(0, x)$). Iterative pruning learns the key
weights, eliminates the least critical ones based on
threshold values, and retrains the model, enabling it to
recuperate from pruning by adapting to the remaining
weights. NLP models like BERT, RoBERTa, and XLNet were
pruned by 40% while retaining 98% of their performance,
which is comparable to DistilBERT.
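One pruning step of this kind can be sketched as a magnitude threshold over a flat list of weights. This is a simplified illustration, not the procedure of any specific cited work: the function name `magnitude_prune` is hypothetical, and the retraining that would follow each step in iterative pruning is only indicated by a comment.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)  # number of weights to remove
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    pruned, removed = [], 0
    for w in weights:
        if abs(w) <= threshold and removed < k:
            pruned.append(0.0)
            removed += 1
        else:
            pruned.append(w)
    return pruned


weights = [0.5, -0.1, 0.3, -0.9, 0.05]
pruned = magnitude_prune(weights, sparsity=0.4)
# In iterative pruning, the surviving weights would now be retrained
# before the next, more aggressive pruning step.
```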
VI-B.1 LAYER PRUNING
VI-B.1-A STRUCTURED DROPOUT
This architecture [72] randomly drops layers during
training, which enables the selection of a sub-network of
any desired depth at test time, since the network has been
trained to be robust to pruning. This is an upgrade from
current techniques that require re-training a new model from
scratch, as opposed to training one network from which
multiple shallow models are extracted. Like Dropout [73] and
DropConnect [74], this sub-network sampling builds an
efficient pruning-robust network if smartly chosen groups of
weights are dropped simultaneously. Formally, pruning
robustness in regularizing networks can be achieved by
independently dropping each weight via a Bernoulli
distribution, where the parameter $p > 0$ regulates the drop
rate. This is comparable to the pointwise product of the
weight matrix $W$ with a randomly sampled $\{0, 1\}$ mask
matrix.
The most effective layer dropping strategy is to drop every
other layer: for pruning rate $p$, layers are dropped at
depth $d$ such that $d \equiv 0 \ (\mathrm{mod}\ \lfloor 1/p \rfloor)$.
For $N$ groups with a fixed drop ratio $p$, the average
number of groups utilized during training is $N(1 - p)$;
hence, to prune down to $r$ groups, the ideal drop rate is
$p^{*} = 1 - r/N$. This approach has been highly effective
on numerous NLP tasks and has led to models that are
comparable in size to distilled versions of BERT while
demonstrating better performance.
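The two rules of thumb above are easy to verify numerically. The sketch assumes that layer depths are counted from 1 and that the modulus of the dropping rule is ⌊1/p⌋; both function names are hypothetical.

```python
import math


def ideal_drop_rate(r, n):
    """Drop rate p* = 1 - r/N that keeps r of N groups on average."""
    return 1.0 - r / n


def layers_to_drop(num_layers, p):
    """Depths d (counted from 1, an assumption) with d = 0 mod floor(1/p)."""
    step = math.floor(1.0 / p)
    return [d for d in range(1, num_layers + 1) if d % step == 0]


p_star = ideal_drop_rate(r=6, n=12)    # target: 6 of 12 layers survive
dropped = layers_to_drop(12, p_star)   # every other layer
kept = [d for d in range(1, 13) if d not in dropped]
```

For a 12-layer network pruned to 6 layers, the ideal rate is 0.5 and the rule removes every second layer.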
VI-B.1-B POOR MAN’S BERT
Due to the over-parameterization of deep neural networks,
not all parameters are required at inference time; hence a
few layers are strategically dropped, resulting in
competitive results for downstream tasks [75]. The
odd-alternate dropping strategy gave better results than the
top and even-alternate dropping strategies for span $K = 2$
across all tasks. For instance, in a 12-layer network, the
strategies drop the following layers: top: {11, 12};
even-alternate: {10, 12}; odd-alternate: {9, 11}. This led
to the conclusions that (i) dropping the final two layers
consecutively is more detrimental than eliminating alternate
layers, and (ii) preserving the final layer has greater
significance than preserving other top layers.
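The three strategies of the 12-layer example can be generated programmatically. The sketch below assumes 1-based layer numbering, an even number of layers, and that $K$ counts the layers removed by each strategy; the function names are hypothetical.

```python
def top_drop(n, k):
    """Drop the last k layers."""
    return list(range(n - k + 1, n + 1))


def even_alternate_drop(n, k):
    """Drop k even-numbered layers, starting from the top (n assumed even)."""
    return sorted(n - 2 * i for i in range(k))


def odd_alternate_drop(n, k):
    """Drop k odd-numbered layers, starting below the top (n assumed even)."""
    return sorted(n - 1 - 2 * i for i in range(k))


strategies = {
    "top": top_drop(12, 2),
    "even-alternate": even_alternate_drop(12, 2),
    "odd-alternate": odd_alternate_drop(12, 2),
}
```

Only the odd-alternate strategy keeps the final layer, which is in line with conclusion (ii) above.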
FIGURE 16. Layer Pruning Strategies deployed by Language Models
At higher values of $K$, the alternate dropping approach
shows a large drop in performance, hypothesized to be due to
the elimination of lower layers. The Symmetric approach
VI-B.3-B ARE 16 HEADS REALLY BETTER THAN ONE?
In multi-headed attention (MHA), consider a sequence of $n$
$d$-dimensional vectors $x = x_1, \dots, x_n \in \mathbb{R}^{d}$,
and a query vector $q \in \mathbb{R}^{d}$. The MHA layer
parameters are $W_q^h, W_k^h, W_v^h \in \mathbb{R}^{d_h \times d}$
and $W_o^h \in \mathbb{R}^{d \times d_h}$, where $d_h$ is
the dimension of each head. For masking attention heads, the
original transformer equation is modified as:

$$\mathrm{MHAttn}(x, q) = \sum_{h=1}^{N_h} \xi_h\, \mathrm{Attn}_{W_q^h, W_k^h, W_v^h, W_o^h}(x, q)$$

where $\xi_h$ are masking variables with values in
$\{0, 1\}$ and $\mathrm{Attn}_h(x)$ is the output of head
$h$ for input $x$. The following experiments on pruning
different numbers of heads at test time yielded the best
results [83]:
(i) Pruning just one head: If the model’s performance
significantly degrades when head ℎ is masked, then ℎ is a
key head; otherwise it is redundant given the rest of the
model. A mere 8 (out of 96) heads trigger a significant
change in performance when removed from the model, out of
which half result in a higher BLEU score.
(ii) Pruning all heads except one: A single head per layer
was deemed sufficient for most layers at test time, even
for networks with 12 or 16 attention heads, resulting in a
drastic parametric reduction. However, multiple attention
heads are a requirement for specific layers, e.g., the
final layer of the encoder-decoder attention, where
performance degrades by a massive 13.5 BLEU points with a
single head.
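The expected sensitivity of the model to the masking
variable $\xi_h$ is used as a proxy score for head
significance.

$$I_h = \mathbb{E}_{x \sim X} \left| \frac{\partial \mathcal{L}(x)}{\partial \xi_h} \right| = \mathbb{E}_{x \sim X} \left| \mathrm{Attn}_h(x)^{T} \frac{\partial \mathcal{L}(x)}{\partial \mathrm{Attn}_h(x)} \right|$$

where $X$ is the data distribution and $\mathcal{L}(x)$ is
the loss on sample $x$. If $I_h$ is high, then modifying
$\xi_h$ will likely have a significant effect on the model;
hence heads with low $I_h$ values are iteratively pruned
out.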
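Both the head mask and the importance proxy can be illustrated on toy values. The sketch below assumes that the per-head outputs and the gradients of the loss with respect to them are already available as plain lists (in practice they would come from an autodiff framework); the names `masked_mha` and `head_importance` are hypothetical.

```python
def masked_mha(head_outputs, xi):
    """Sum of head outputs, each gated by its mask variable xi_h in {0, 1}."""
    dim = len(head_outputs[0])
    return [sum(xi[h] * head_outputs[h][j] for h in range(len(head_outputs)))
            for j in range(dim)]


def head_importance(samples):
    """Estimate I_h as the mean over samples of |Attn_h(x)^T dL/dAttn_h(x)|.

    samples: list of (head_outputs, head_grads) pairs, one pair per input x.
    """
    n_heads = len(samples[0][0])
    scores = [0.0] * n_heads
    for outputs, grads in samples:
        for h in range(n_heads):
            scores[h] += abs(sum(o * g for o, g in zip(outputs[h], grads[h])))
    return [s / len(samples) for s in scores]


outputs = [[1.0, 2.0], [3.0, 4.0]]             # two heads, output width 2
only_first = masked_mha(outputs, xi=[1, 0])    # second head masked out
scores = head_importance([([[1.0, 0.0], [0.0, 1.0]],
                           [[2.0, 0.0], [0.0, 0.0]])])
```

Heads would then be removed in increasing order of their score, with the score re-estimated between pruning rounds.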
VI-C QUANTIZATION
32-bit floating-point (FP32) has been the predominant
numerical format for deep learning, however, the current
push for reduced bandwidth and compute resources has
propelled the implementation of lower-precision formats. It
has been demonstrated that representing weights and
activations via 8-bit integers (INT8) does not lead to an
evident accuracy loss. For instance, BERT’s quantization to
a 16/8-bit weight format resulted in 4× model compression
with minimal accuracy loss; consequently, a scaled-up
BERT serves a billion CPU requests daily.
VI-C.1 LQ-NETS
This model [84] introduces a simple-to-train quantization
mechanism for network weights and activations, whose
quantizers are trained jointly with the deep neural network.
It quantizes with variable bit-precision capabilities,
unlike fixed or manual schemes [85],[86]. Generally, a
quantization function can represent floating-point weights
$w$ and activations $a$ in a few bits as:

$$Q(x) = q_l, \quad \text{if } x \in (t_l, t_{l+1}]$$

Here $q_l$ and $(t_l, t_{l+1}]$ are quantization levels and
intervals, respectively. To preserve quick inference times,
quantization functions need to be compatible with bitwise
operations, which is achieved via uniform quantization that
maps floating-point numbers to their nearest fixed-point
integers with a normalization factor. The LQ learnable
quantization function can be expressed as:

$$Q_{LQ}(x, v) = v^{T} e_l, \quad \text{if } x \in (t_l, t_{l+1}]$$
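where $v \in \mathbb{R}^{K}$ is the learnable floating-point
basis and $e_l \in \{-1, 1\}^{K}$ for $l = (1, \dots, L)$
enumerates the $K$-bit binary encodings from
$[-1, \dots, -1]$ to $[1, \dots, 1]$. The inner product of
quantized weights and activations is computed by the
following bitwise operations, with weight bit-width $K_w$
and activation bit-width $K_a$:

$$Q_{LQ}(w, v^{w})^{T}\, Q_{LQ}(a, v^{a}) = \sum_{i=1}^{K_w} \sum_{j=1}^{K_a} v_i^{w} v_j^{a} \big(b_i^{w} \odot b_j^{a}\big)$$

where $w, a \in \mathbb{R}^{N}$ are encoded by the vectors
$b_i^{w}, b_j^{a} \in \{-1, 1\}^{N}$ with
$i = 1, \dots, K_w$ and $j = 1, \dots, K_a$,
$v^{w} \in \mathbb{R}^{K_w}$, $v^{a} \in \mathbb{R}^{K_a}$,
and $\odot$ denotes the bitwise inner product $xnor$
operation.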
VI-C.2 QBERT
QBERT [87] deploys a two-way BERT quantization with input
$x \in X$ and its corresponding label $y \in Y$, via the
cross-entropy (CE) based loss function

$$\mathcal{L}(\theta) = \sum_{(x_i, y_i)} \mathrm{CE}\Big(\mathrm{softmax}\big(W_c(W_n(\dots W_1(W_e(x_i))))\big),\, y_i\Big)$$

where $W_e$ is the embedding table, with encoder layers
$W_1, W_2, \dots, W_n$ and classifier $W_c$. Assigning the
same bit-size representation to different encoder layers,
which have varying sensitivity and attend to different
structures [5], is sub-optimal, and it gets intricate for
small target sizes (2/4 bits) requiring ultra-low precision.
Hence, via Hessian AWare Quantization (HAWQ), more bits are
assigned to more sensitive layers to retain performance. The
Hessian spectrum is computed via a computationally
economical matrix-free iteration technique, which for the
first encoder layer with gradient $g_1$ and an arbitrary
vector $v$ reads:

$$\frac{\partial (g_1^{T} v)}{\partial W_1} = \frac{\partial g_1^{T}}{\partial W_1} v + g_1^{T} \frac{\partial v}{\partial W_1} = \frac{\partial g_1^{T}}{\partial W_1} v = H_1 v$$
where $H_1$ is the Hessian matrix of the first encoder layer
and $v$ is independent of $W_1$. This approach determines
the top eigenvalues for different layers, and more
aggressive quantization is deployed for layers with smaller
eigenvalues. For further optimization via group-wise
quantization, each dense matrix is treated as a group with
its own quantization range and is partitioned following each
continuous output neuron.
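The matrix-free idea, estimating the top Hessian eigenvalue from Hessian-vector products alone, can be sketched with power iteration. Note the substitution: instead of differentiating the product of gradient and vector with autodiff, this sketch approximates the Hessian-vector product by a finite difference of gradients. The toy quadratic loss, the step size `eps` and all function names are assumptions for illustration.

```python
import math


def grad(w):
    """Gradient of the toy loss L(w) = 0.5 * (3*w0^2 + w1^2); Hessian = diag(3, 1)."""
    return [3.0 * w[0], 1.0 * w[1]]


def hessian_vector_product(grad_fn, w, v, eps=1e-4):
    """Approximate H v as (g(w + eps*v) - g(w)) / eps, without forming H."""
    g0 = grad_fn(w)
    g1 = grad_fn([wi + eps * vi for wi, vi in zip(w, v)])
    return [(a - b) / eps for a, b in zip(g1, g0)]


def top_eigenvalue(grad_fn, w, iters=50):
    """Power iteration on H using only Hessian-vector products."""
    v = [1.0] * len(w)
    lam = 0.0
    for _ in range(iters):
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        hv = hessian_vector_product(grad_fn, w, v)
        lam = sum(a * b for a, b in zip(v, hv))  # Rayleigh quotient v^T H v
        v = hv
    return lam


lam = top_eigenvalue(grad, [0.5, -0.2])
```

Layers whose estimate is small would be candidates for more aggressive quantization, following the rule stated above.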
VI-C.3 Q8BERT
To quantize weights and activations to 8 bits, symmetric
linear quantization is implemented [88], where $S^{x}$ is
the quantization scaling factor for input $x$ and
$M = 2^{b-1} - 1$ is the highest quantized value when
quantizing to $b$ bits.

$$\mathrm{Quantize}(x \mid S^{x}, M) := \mathrm{Clamp}\big(\lfloor x \times S^{x} \rceil, -M, M\big)$$
$$\mathrm{Clamp}(x, a, b) = \min(\max(x, a), b)$$

By combining fake quantization [89] with the
Straight-Through Estimator (STE) [90], inference-time
quantization is simulated during training, while
full-precision backpropagation enables the FP32 weights to
overcome quantization errors. Here
$\frac{\partial x^{q}}{\partial x} = \vec{1}$, where $x^{q}$
is the result of fake quantizing $x$.
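Symmetric linear quantization and its "fake" variant can be sketched as follows. The choice of scaling factor, M divided by the largest absolute value, is an assumption made for this illustration, as are the function names; Python's built-in `round` resolves exact ties to the nearest even integer, which may differ from other rounding conventions.

```python
def clamp(x, a, b):
    return min(max(x, a), b)


def scale_for(values, bits=8):
    """Scaling factor mapping the largest magnitude onto M (an assumption)."""
    m = 2 ** (bits - 1) - 1
    return m / max(abs(v) for v in values)


def quantize(x, scale, bits=8):
    """Clamp(round(x * scale), -M, M) with M = 2**(bits-1) - 1."""
    m = 2 ** (bits - 1) - 1
    return clamp(round(x * scale), -m, m)


def fake_quantize(x, scale, bits=8):
    """Quantize, then map back to floating point, as done during training."""
    return quantize(x, scale, bits) / scale


weights = [0.5, -1.0, 0.25]
s = scale_for(weights)                       # 127 / 1.0
ints = [quantize(w, s) for w in weights]
restored = [fake_quantize(w, s) for w in weights]
```

Under the straight-through estimator, the backward pass would treat `fake_quantize` as the identity, so gradients reach the full-precision weights unchanged.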
VII. INFORMATION RETRIEVAL
Knowledge-intensive tasks, like efficient data updating and
retrieval, require the storage of a huge amount of implicit
knowledge. Standard language models are not adept at these
tasks and do not match up with task-specific architectures,
which can be crucial for open-domain Q&A. For instance, BERT
can predict the missing word in the sentence, “The __ is the
currency of the US” (answer: “dollar”). However, since this
knowledge is stored implicitly in its parameters, the model
size must increase substantially to store further data. This
constraint raises the network latency and makes it
prohibitively expensive to store information, as storage
space is limited by the size constraints of the network.
VII-A GOLDEN RETRIEVER
A conventional multi-hop open-domain QA task involves a
question $q$ and a large corpus containing $S$ relevant
contextual (gold) documents $d_1, \dots, d_S$ that form a
sequence of reasoning via textual similarities leading to a
preferred answer $a$. GoldEn Retriever’s [91] first hop,
however, generates a search query $q_1$ that retrieves
document $d_1$ for a given question $q$; thereafter, for the
subsequent reasoning steps ($k = 2, \dots, S$), a query
$q_k$ is generated from the question ($q$) and the available
context ($d_1, \dots, d_{k-1}$). GoldEn iteratively
retrieves more contextual documents while concatenating the
retrieved context for its QA model to answer. It is
independent of dataset- and task-specific IR models, where
indexing of additional documents or question types leads to
inefficiencies. A lightweight RNN model is adapted, where
text spans are extracted from the contextual data to
potentially reduce the large query space. The goal is to
generate a search query $q_k$ that helps retrieve $d_k$ for
the following reasoning step; $q_k$ is a textual span of the
context $C_k$ selected by a trained document reader.

$$q_k = G_k(q, C_k), \qquad C_{k+1} = C_k \bowtie IR_n(q_k)$$

where $G_k$ is the query generator, $IR_n(q_k)$ are the top
$n$ documents retrieved via $q_k$, and $\bowtie$ denotes
concatenation.
VII-B ORQA
The reader and retriever components are trained jointly in
an end-to-end fashion, where BERT is used to compute their
scores. It can retrieve any text from an open corpus and is
not constrained to returning a fixed set of documents like a
typical IR model. The retrieval score is computed as the
dense inner product of the question $q$ with the evidence
block $b$.

$$h_q = W_q\, \mathrm{BERT}_Q(q)[\mathrm{CLS}], \qquad h_b = W_b\, \mathrm{BERT}_B(b)[\mathrm{CLS}]$$
$$S_{retr}(b, q) = h_q^{T} h_b$$
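where the matrices $W_q$ and $W_b$ project the BERT output
into 128-dimensional vectors. Similarly, the reader is a
span-based variant of BERT’s reading model.

$$h_{start} = \mathrm{BERT}_R(q, b)[\mathrm{START}(s)], \qquad h_{end} = \mathrm{BERT}_R(q, b)[\mathrm{END}(s)]$$
$$S_{read}(b, s, q) = \mathrm{MLP}([h_{start}; h_{end}])$$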
The retrieval model is pre-trained with an Inverse Cloze
Task (ICT), where the context surrounding a sentence is
semantically relevant and is used to infer the data missing
from the sequence $q$.

$$P_{ICT}(b \mid q) = \frac{\exp(S_{retr}(b, q))}{\sum_{b' \in BATCH} \exp(S_{retr}(b', q))}$$
where $q$ is treated as a pseudo-question, $b$ is the text
surrounding $q$, and $BATCH$ is the set of evidence blocks
employed for sampling negatives. Apart from learning
word-matching features, it also learns abstract
representations, as the pseudo-question might or might not
be present in the evidence. After ICT, learning is defined
as a distribution over answer derivations.

$$P_{learn}(b, s \mid q) = \frac{\exp(S(b, s, q))}{\sum_{b' \in TOP(k)} \sum_{s' \in b'} \exp(S(b', s', q))}$$

where $TOP(k)$ are the top retrieved blocks based on
$S_{retr}$. In this framework, evidence retrieval from the
complete Wikipedia is implemented as a latent variable,
which is infeasible to train from scratch; hence the
retriever is pre-trained with ICT.
VII-C REALM
This framework explicitly attends to a vast corpus like
Wikipedia; its retriever learns via backpropagation and
performs Maximum Inner Product Search (MIPS) to choose
appropriate documents. The retriever is designed to cache
and asynchronously update each document to overcome the
computational challenge of retrieving from millions of
candidate documents.
In pre-training, the model needs to predict the randomly
masked tokens with the help of the knowledge retrieval
relevance score $f(x, z)$, the inner product of the vector
embeddings of $x$ and $z$ (MIPS). To implement a
knowledge-augmented encoder, the combination of the input
$x$ and a retrieved document $z$ from a corpus
$\mathcal{Z}$ is fed as a single sequence to the Transformer
that models $p(y \mid z, x)$, as shown in Figure 17. This
enables complete cross-attention between $x$ and $z$ to
predict the output $y$, where:

$$f(x, z) = \mathrm{Embed}_{input}(x)^{T}\, \mathrm{Embed}_{doc}(z)$$
$$p(z \mid x) = \frac{\exp f(x, z)}{\sum_{z'} \exp f(x, z')}$$
$$p(y \mid x) = \sum_{z \in \mathcal{Z}} p(y \mid z, x)\, p(z \mid x)$$
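Like ORQA, BERT is implemented for embedding:

$$\mathrm{join}_{BERT}(x) = [\mathrm{CLS}]\, x\, [\mathrm{SEP}]$$
$$\mathrm{join}_{BERT}(x_1, x_2) = [\mathrm{CLS}]\, x_1\, [\mathrm{SEP}]\, x_2\, [\mathrm{SEP}]$$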
In the pre-training of the BERT’s masked language modeling
task, each mask in token 𝑥 needs to be predicted as:
where each instance contains one query $q_i$, one positive
(relevant) passage $p_i^{+}$, and $n$ negative (irrelevant)
passages $p_{i,j}^{-}$. The loss function can be optimized
as the negative log-likelihood of the positive passage.

$$L\big(q_i, p_i^{+}, p_{i,1}^{-}, \dots, p_{i,n}^{-}\big) = -\log \frac{e^{sim(q_i, p_i^{+})}}{e^{sim(q_i, p_i^{+})} + \sum_{j=1}^{n} e^{sim(q_i, p_{i,j}^{-})}}$$
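The negative log-likelihood of the positive passage can be evaluated directly from similarity scores. The sketch below takes the scores as given numbers (how they are produced by the encoders is outside its scope) and uses the log-sum-exp trick for numerical stability; the function name `passage_nll` is hypothetical.

```python
import math


def passage_nll(sim_pos, sim_negs):
    """-log( e^{sim_pos} / (e^{sim_pos} + sum_j e^{sim_neg_j}) )."""
    scores = [sim_pos] + list(sim_negs)
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - sim_pos


easy = passage_nll(3.0, [0.0, 0.0])   # positive clearly ranked first
hard = passage_nll(1.0, [0.0, 0.0])   # positive only slightly ahead
tied = passage_nll(1.0, [1.0])        # positive tied with one negative
```

The loss shrinks as the positive passage's similarity grows relative to the negatives, and equals log 2 when the positive is tied with a single negative.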
VIII. LONG SEQUENCE MODELS
Vanilla Transformers break input sequences into chunks if
their length exceeds 512 tokens, which results in loss of
context when related words exist in different chunks. This
lack of contextual information leads to inefficient
prediction and compromised performance, and gave rise to
long sequence models.
VIII-A DEEPER SELF-ATTENTION
This 64-layer Transformer [93] was built based on the
discovery that deeper self-attention networks are better at
character-level modeling of longer-range sequences.
Information is transmitted swiftly over arbitrary distances,
as compared to an RNN’s unitary step progression. In
addition, the following three auxiliary losses were added to
the vanilla Transformer, which accelerated convergence and
provided the ability to train deeper networks.
FIGURE 19. Accelerated convergence via multiple target token
prediction across multiple positions through intermediate layers
(i) Prediction across Multiple Positions: Generally,
causal prediction occurs at a single position in the final
layer, however, in this case all positions are used for
prediction. These auxiliary losses compel the model to
predict on smaller contexts and accelerate training;
their weights are not decayed during training.
(ii) Predictions on Intermediate Layers: Apart from the
final layer, predictions from all intermediate layers
are added for a given sequence; as training progresses,
the weightage of lower layers is progressively reduced.
For $n$ layers, the contribution of the $l^{th}$
intermediate layer ceases to exist after completing
$l/2n$ of the training.
(iii) Multiple Target Prediction: The model is modified
to generate two or more predictions of future
characters, where a separate classifier is introduced for
every new target. The extra target losses are weighted
by half before being added to the corresponding layer
loss.
The above 3 implementations are expressed in Figure 19. For
sequence length $L$, the language model computes the joint
probability distribution over token sequences
autoregressively.

$$P(t_{0:L}) = P(t_0) \prod_{i=1}^{L} P(t_i \mid t_{0:i-1})$$
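The schedule for the intermediate-layer losses in item (ii) can be stated compactly in code. The sketch assumes 1-based layer indices and a training progress value between 0 and 1; the function name is hypothetical.

```python
def active_intermediate_layers(n_layers, progress):
    """Intermediate layers that still contribute a loss.

    Layer l (1-based, l < n_layers) contributes only while less than
    l / (2 * n_layers) of the training has been completed.
    """
    return [l for l in range(1, n_layers)
            if progress < l / (2 * n_layers)]


start = active_intermediate_layers(4, progress=0.0)
early = active_intermediate_layers(4, progress=0.2)
halfway = active_intermediate_layers(4, progress=0.5)
```

Lower layers drop out first, and once half of the training is complete no intermediate layer contributes any longer.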
VIII-B TRANSFORMER-XL
To mitigate context fragmentation in vanilla Transformers,
XL captures longer dependencies by caching and reusing the
prior hidden states, through which information is propagated
via recurrence. Given a corpus of tokens
$x = (x_1, x_2, \dots, x_T)$, a language model computes the
joint probability $P(x)$ autoregressively, where the context
$x_{<t}$ is encoded into a fixed-size hidden state.

$$P(x) = \prod_{t} P(x_t \mid x_{<t})$$
(a) Attention Caching - I (b) Attention Caching – II
FIGURE 20. Elongated context capture combining (a) and (b)
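Assume two consecutive segments of length $L$,
$s_\tau = [x_{\tau,1}, \dots, x_{\tau,L}]$ and
$s_{\tau+1} = [x_{\tau+1,1}, \dots, x_{\tau+1,L}]$, where
the $n^{th}$ layer hidden state sequence produced for the
$\tau^{th}$ segment $s_\tau$ is
$h_\tau^{n} \in \mathbb{R}^{L \times d}$, and $d$ is the
hidden dimension. The $n^{th}$ layer hidden state for the
segment $s_{\tau+1}$ is computed as follows:

$$\tilde{h}_{\tau+1}^{n-1} = \big[ SG(h_\tau^{n-1}) \circ h_{\tau+1}^{n-1} \big]$$
$$q_{\tau+1}^{n},\ k_{\tau+1}^{n},\ v_{\tau+1}^{n} = h_{\tau+1}^{n-1} W_Q^{T},\ \tilde{h}_{\tau+1}^{n-1} W_K^{T},\ \tilde{h}_{\tau+1}^{n-1} W_V^{T}$$
$$h_{\tau+1}^{n} = \text{Transformer-Layer}\big(q_{\tau+1}^{n}, k_{\tau+1}^{n}, v_{\tau+1}^{n}\big)$$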
where $SG(\cdot)$ represents the stop-gradient,
$[h_u \circ h_v]$ is the concatenation of two hidden
sequences, and $W$ denotes the model parameters. The key
distinction from the original Transformer lies in computing
the key $k_{\tau+1}^{n}$ and value $v_{\tau+1}^{n}$ from the
extended context $\tilde{h}_{\tau+1}^{n-1}$, for which the
preceding $h_\tau^{n-1}$ are cached. This can be seen in
Figure 20 above, where the prior attention span is cached by
the latter, forming an elongated caching mechanism.
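The caching mechanism can be sketched for a single attention head. To stay short, the sketch assumes identity query, key and value projections and omits positional encodings, so it shows only how the cached segment extends the keys and values while queries come from the current segment; the function names are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def xl_attention(memory, current):
    """Attend from the current segment over [SG(memory) ∘ current].

    memory, current: lists of d-dimensional hidden states. Plain Python
    lists carry no gradient, so the stop-gradient SG(.) is implicit here.
    """
    extended = memory + current      # extended context for keys and values
    d = len(current[0])
    outputs = []
    for i, q in enumerate(current):  # queries come from the current segment
        # All cached states plus the causal part of the current segment.
        visible = extended[: len(memory) + i + 1]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in visible]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, visible))
                        for j in range(d)])
    return outputs


no_memory = xl_attention([], [[1.0, 2.0]])
with_memory = xl_attention([[0.0, 0.0]], [[1.0, 2.0]])
```

Without a cache the single token can only attend to itself, whereas with a cached state part of its attention mass moves to the previous segment.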
Such recurrence is applied to every two consecutive
segments to create a segment-level recurrence via hidden
states. In the original Transformer, the attention score
within the same segment between the query ($q_i$) and key
($k_j$) vectors is:

$$A_{i,j}^{abs} = E_{x_i}^{T} W_q^{T} W_k E_{x_j} + E_{x_i}^{T} W_q^{T} W_k U_j + U_i^{T} W_q^{T} W_k E_{x_j} + U_i^{T} W_q^{T} W_k U_j$$
From the perspective of relative positional encoding, the
above equation is remodeled in the following manner:

$$A_{i,j}^{rel} = E_{x_i}^{T} W_q^{T} W_{k,E} E_{x_j} + E_{x_i}^{T} W_q^{T} W_{k,R} R_{i-j} + u^{T} W_{k,E} E_{x_j} + v^{T} W_{k,R} R_{i-j}$$
VIII-C LONGFORMER
This architecture sparsifies the full attention matrix by
specifying the pairs of input locations that attend to one
another, and implements three attention configurations:
(i) Sliding Window: For a fixed window size $w$, each
token attends to $w/2$ tokens on either side. For a
sequence of length $n$, this leads to a computational
complexity of $O(n \times w)$ that scales linearly with
the input sequence length, where for efficiency purposes
$w < n$. A stacked $l$-layer transformer has a receptive
field of size $l \times w$ at the top layer when $w$ is
fixed across all layers. Different $w$ values can be
chosen per layer for efficiency or performance.
(ii) Dilated Sliding Window: To conserve computation, the
receptive field size is extended to $l \times d \times w$
by introducing gaps of size $d$ (dilation) in the window
of size $w$. Enhanced performance is achieved by keeping
a few dilation-free heads (smaller window size) for
attention on the local context (lower layers), while the
remaining dilated heads (increased window size) attend
to the longer context (higher layers).
(iii) Global Attention: The prior two implementations do
not possess enough flexibility for task-specific learning.
Hence “global attention” is implemented on a few
pre-designated input tokens, where such a token attends
to all sequence tokens and all sequence tokens attend to
it. This preserves the local and global attention
complexity to
Its attention complexity is the sum of local and global
attention versus RoBERTa’s quadratic complexity which is
explained by the following mathematical expressions.
(a) Full Attention (b) Sliding Window (c) Dilated Sliding (d) Global Attention
FIGURE 21. Longformer’s different Sparse Attention configurations
0
0
0
0
0
Global attention enables chunk-less document processing,
however, its space-time complexity will be greater than
RoBERTa, if sequence length exceeds the window size.
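The three attention configurations can be visualized as 0/1 masks over a short sequence. The sketch below is an illustration under simplifying assumptions (a dense n × n list instead of a banded implementation, and a dilation expressed as a step between attended positions); the function names are hypothetical.

```python
def sliding_window_mask(n, w, dilation=1):
    """Each token attends to w/2 positions on either side, `dilation` apart."""
    half = w // 2
    return [
        [1 if abs(i - j) <= half * dilation and (i - j) % dilation == 0 else 0
         for j in range(n)]
        for i in range(n)
    ]


def add_global(mask, global_positions):
    """Global tokens attend to every token and every token attends to them."""
    n = len(mask)
    out = [row[:] for row in mask]
    for g in global_positions:
        for j in range(n):
            out[g][j] = 1
            out[j][g] = 1
    return out


local = sliding_window_mask(5, w=2)
dilated = sliding_window_mask(5, w=2, dilation=2)
combined = add_global(local, global_positions=[0])
```

The number of non-zero entries per row stays bounded by the window size plus the number of global tokens, which is what keeps the cost linear in the sequence length for ordinary tokens.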
0
2
0
0
VIII-D EXTENDED TRANSFORMER CONSTRUCTION:
ETC
ETC is an adaptation of the Longformer design which
receives global ($n_g$) and long ($n_l$) inputs, where
$n_g \ll n_l$. It computes four global-local attention
variations: global-to-global ($g2g$), global-to-long
($g2l$), long-to-global ($l2g$), and long-to-long ($l2l$) to
achieve long sequence processing. Global inputs and the
other three variations possess unrestricted attention to
compensate for the fixed-radius span of $l2l$, achieving a
balance between performance and computational cost. Further,
it replaces absolute position encodings with relative ones,
which provide information about input tokens with respect to
each other.
VIII-E BIG BIRD
Big Bird proves mathematically that randomly sparse
attention can be Turing complete, and it behaves like a
Longformer aided with random attention. It is designed such
that (i) a group of $g$ global tokens attends to all parts
of the sequence, (ii) each query $q_i$ attends to a group of
$r$ random keys, and (iii) each local node attends to a
block of $w$ local neighbors. Big Bird’s global tokens are
constructed using one of two approaches: (i) Big Bird-ITC:
Implementing Internal Transformer Construction (ITC), where
a few existing tokens are made global and attend over the
complete sequence. (ii) Big Bird-ETC: Implementing Extended
Transformer Construction (ETC), where additional global
tokens $g$, such as [CLS], are included and attend to all
existing tokens.
Its final attention process has the following properties:
each query attends to $r$ random keys, each query attends to
$w/2$ tokens to the left and right of its location, and
there are $g$ global tokens which are derived from existing
tokens or can be added when needed.
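The three ingredients (window, random and global attention) can be combined into a single 0/1 mask. This is a token-level sketch under simplifying assumptions: random keys are drawn per query with a seeded generator, existing positions are promoted to global tokens as in the ITC variant, and no blocking of tokens is performed. The function name and arguments are hypothetical.

```python
import random


def big_bird_mask(n, w, r, global_positions, seed=0):
    """Union of window, random and global attention patterns."""
    rng = random.Random(seed)
    half = w // 2
    # (iii) local window: w/2 neighbors to the left and to the right
    mask = [[1 if abs(i - j) <= half else 0 for j in range(n)]
            for i in range(n)]
    # (ii) r random keys for every query
    for i in range(n):
        for j in rng.sample(range(n), r):
            mask[i][j] = 1
    # (i) global tokens attend everywhere and are attended to by everyone
    for g in global_positions:
        for j in range(n):
            mask[g][j] = 1
            mask[j][g] = 1
    return mask


mask = big_bird_mask(8, w=2, r=2, global_positions=[0])
```

Every non-global query attends to at most (w + 1) + r + g positions, independent of the sequence length.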
IX. COMPUTATIONALLY EFFICIENT ARCHITECTURES
IX-A SPARSE TRANSFORMER
This model’s economical performance is due to its departure
from the full self-attention procedure, which is instead
factorized across several attention steps. The model’s
output results are derived from a fraction of the full input
array, i.e., ($\sqrt{N}$) where $N$ is the sequence length,
as expressed in Figure