The NLP Cookbook: Modern Recipes for Transformer based Deep Learning Architectures

SUSHANT SINGH AND AUSIF MAHMOOD

Department of Computer Science & Engineering, University of Bridgeport, Connecticut, CT 06604, USA

Corresponding author: Sushant Singh (susha[email protected])

ABSTRACT In recent years, Natural Language Processing (NLP) models have achieved phenomenal success in linguistic and semantic tasks such as text classification, machine translation, cognitive dialogue systems, information retrieval via Natural Language Understanding (NLU), and Natural Language Generation (NLG). This feat is primarily attributed to the seminal Transformer architecture, which led to designs such as BERT and GPT (I, II, III). Although these large models have achieved unprecedented performance, they come at a high computational cost. Consequently, some recent NLP architectures have used transfer learning, pruning, quantization, and knowledge distillation to achieve moderate model sizes while keeping performance close to that of their predecessors. Additionally, to mitigate the data size challenge raised by language models from a knowledge extraction perspective, Knowledge Retrievers have been built to extract explicit documents from a large corpus of databases with greater efficiency and accuracy. Recent research has also focused on superior inference by providing efficient attention over longer input sequences. In this paper, we summarize and examine the current state-of-the-art (SOTA) NLP models that have been employed across numerous NLP tasks for optimal performance and efficiency. We provide a detailed account of how the different architectures function, a taxonomy of NLP designs, comparative evaluations, and future directions in NLP.

INDEX TERMS Deep Learning, Natural Language Processing (NLP), Natural Language Understanding (NLU), Natural Language Generation (NLG), Information Retrieval (IR), Knowledge Distillation (KD), Pruning, Quantization

I. INTRODUCTION

Natural Language Processing (NLP) is a field of Machine Learning dealing with linguistics that builds and develops Language Models. Language Modeling (LM) determines the likelihood of word sequences occurring in a sentence via probabilistic and statistical techniques. Since human languages involve sequences of words, the initial language models were based on Recurrent Neural Networks (RNNs). Because RNNs can suffer from vanishing and exploding gradients on long sequences, improved recurrent networks such as LSTMs and GRUs were adopted for better performance. Despite these enhancements, LSTMs were found to lack comprehension when relatively longer sequences were involved. This is because the entire history, known as the context, is handled by a single state vector. However, greater compute resources led to an influx of novel architectures, causing a meteoric rise of Deep Learning [1] based NLP models.
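The idea of assigning a likelihood to a word sequence can be illustrated with a count-based bigram model. This is a minimal sketch that is not taken from the paper: the toy corpus, the `<s>` and `</s>` boundary markers, and the function names are all assumptions made for illustration, and no smoothing is applied.

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Count contexts and bigrams, padding each sentence with <s> and </s>."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(tokens[:-1])                  # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # adjacent (previous, current) pairs
    return contexts, bigrams

def sequence_probability(sentence, contexts, bigrams):
    """Approximate P(w_1..w_n) as the product of P(w_i | w_{i-1})."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        if contexts[prev] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        prob *= bigrams[(prev, cur)] / contexts[prev]
    return prob

# Hypothetical toy corpus, for illustration only.
corpus = ["the cat sat", "the dog sat", "the cat ran"]
contexts, bigrams = train_bigram_counts(corpus)
p_seen = sequence_probability("the cat sat", contexts, bigrams)
p_unseen = sequence_probability("sat the cat", contexts, bigrams)
```

A sequence whose word order matches the corpus receives a non-zero probability, while a reordered sequence containing an unseen bigram receives zero. Neural language models replace these counts with learned, context-dependent estimates.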

The breakthrough Transformer [2] architecture in 2017 overcame the LSTM's context limitation via the Attention mechanism. Additionally, it provided greater throughput, as inputs are processed in parallel with no sequential dependency. The subsequent launch of improved Transformer-based models such as GPT-I [3] and BERT [4] made 2018 a climacteric year for the NLP world. These architectures were trained on large datasets to create pre-trained models. Thereafter, transfer learning was used to fine-tune these models for task-specific features, resulting in significant performance enhancement on several NLP tasks [5],[6],[7],[8],[9],[10]. These tasks include, but are not limited to, language modeling, sentiment analysis, question answering, and natural language inference.
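The way attention lets every position draw directly on every other position, with no recurrent state in between, can be sketched with scaled dot-product attention, the form used in the Transformer. The code below is a simplified illustration under stated assumptions: it uses plain Python lists, omits the learned query/key/value projections, multiple heads, and positional encodings, and the function names and toy input are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) where each output is a weighted sum of values."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:  # each query is independent, so this loop could run in parallel
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy self-attention: three positions with 2-dimensional vectors (illustrative values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(x, x, x)
```

Each row of `weights` sums to one and spans all positions, so the context is not compressed into a single state vector as it is in an RNN.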

This accomplishment fell short of transfer learning's primary objective of achieving high model accuracy with minimal fine-tuning samples. Also, model performance needs to generalize across several datasets and not be task- or dataset-specific [11],[12],[13]. However, the goal of high generalization and transfer learning was being compromised as an increasing amount of data was used for both pre-training and fine-tuning. This clouded the decision of whether more training data or an improved architecture should be used to build a better SOTA language model. For instance, the subsequent XLNet [14] architecture introduced a novel yet intricate language modeling approach that provided only a marginal improvement over a simplistic BERT architecture trained on a mere ~10% of XLNet's data (113GB). Thereafter, RoBERTa [15], a large BERT-based model trained on significantly more data than BERT (160GB), outperformed XLNet. Thus, an architecture that is more generalizable and is also trained on more data yields better results on NLP benchmarks.

… ] (27)

… = α × ℒ … ) + β × … ] (57)

… , M) ≔ Clamp(⌊x × S …

… (q)[CLS] (64)

… (q, b)[START(s)], (67)

… (q, b)[END(s)], (68)

… (x) = [CLS] x [SEP] (74)

… ) = [CLS] x… [SEP] x… [SEP] (75)

local attention = (n × w)

global attention = (2 × n × s)

… + 2s) × Number of Transformer Layers