Computational Linguistics, Outlines and Concept Maps in Linguistics

Theoretical Expansion: Each chapter includes historical insights (e.g., the Chomsky vs. Corpus Linguistics debate), detailed mathematical explanations for statistics and deep learning, and critical analyses of models. Examples and Case Studies: I have included numerous practical examples of linguistic analysis, ambiguity scenarios, and real-world applications to make the concepts more intuitive. Glossary and Appendices: At the end of the document, you will find a comprehensive glossary of terms and a series of reflection exercises based on the practical sessions of the course. Professional Structure: The file follows a logical order, from the basics of data collection to the frontiers of Large Language Models and computational pipelines.

Type: Outlines and concept maps

2024/2025

On sale since 05/01/2026

Benedetta1-

COMPUTATIONAL LINGUISTICS:

Extended Course Notes

Author: Manus AI

Source: University Course Slides (2024‒2025)

Language: English

Chapter 1: Introduction to Computational Linguistics (CL)

Computational Linguistics (CL) stands as a pivotal interdisciplinary field, residing at the confluence of computer science, artificial intelligence, and theoretical linguistics. Its fundamental objective is the computational modeling of human language, encompassing both its written and spoken forms, to enable computers to process, understand, and generate natural language. This chapter provides a comprehensive introduction to CL, tracing its historical trajectory, clarifying its relationship with the closely allied field of Natural Language Processing (NLP), and undertaking a deep theoretical dive into the inherent linguistic challenges that continue to drive research in the discipline.

1.1 Defining Computational Linguistics and its Scope

Computational Linguistics is, at its core, a scientific discipline. It seeks to develop formal, computational models of linguistic phenomena, ranging from phonetics and morphology to syntax, semantics, and pragmatics. The models developed in CL are not merely tools for practical application but are often designed to test and validate linguistic theories. By attempting to formalize the complex, often ambiguous, rules of human language into algorithms, computational linguists gain profound insights into the structure and function of language itself [16].

The scope of CL is vast, covering all levels of linguistic analysis:

  1. Phonology and Phonetics: Modeling the sound structure of language, crucial for speech recognition and synthesis.
  2. Morphology: Analyzing word structure, including inflection (e.g., run $\rightarrow$ running) and derivation (e.g., happy $\rightarrow$ unhappy).
  3. Syntax: Developing grammars and parsers to analyze the structural relationships between words in a sentence.
  4. Semantics: Modeling the meaning of words, phrases, and sentences, including lexical semantics (word meaning) and compositional semantics (how word meanings combine).
  5. Pragmatics and Discourse: Analyzing language use in context, including understanding intent, coherence in multi-sentence texts, and dialogue management.

The interdisciplinary nature of CL means it draws heavily on formalisms from computer science (e.g., finite-state automata, context-free grammars, machine learning), mathematics (e.g., probability theory, linear algebra), and cognitive science (e.g., theories of language acquisition and processing).

1.2 A Historical Trajectory of Computational Linguistics

The history of Computational Linguistics is a fascinating narrative of ambition, disappointment, and revolutionary technological shifts. It can be broadly segmented into four major eras, each defined by the dominant methodology and the prevailing technological landscape.

1.2.1 The Rule-Based Era (1940s‒Mid-1960s): The Genesis in Machine Translation

The birth of CL is inextricably linked to the Cold War and the urgent need for automatic Machine Translation (MT). The field’s formal beginning is often traced to the Georgetown-IBM Experiment in 1954, which demonstrated the feasibility of translating a few dozen Russian sentences into English using a small set of hand-coded rules [16].

This early period was characterized by rationalist or rule-based approaches. The underlying assumption, heavily influenced by early linguistic theories, was that language could be fully described by a finite set of explicit, deterministic rules. Systems relied on extensive, manually crafted dictionaries and grammatical rules to analyze (parse) and generate language.

A key theoretical development was the work of Noam Chomsky, whose concept of Generative Grammar provided a formal, mathematical framework for describing the syntax of natural language. While Chomsky himself was skeptical of the computational tractability of his models, his work profoundly influenced the formalization efforts within CL.

1.2.2 The AI and Theoretical Linguistics Era (Mid-1960s‒Mid-1980s): Disillusionment and Divergence

This period was marked by a significant setback: the ALPAC Report (Automatic Language Processing Advisory Committee) of 1966. Commissioned by the U.S. government, the report

word embeddings (e.g., Word2Vec, GloVe), which represented words as dense vectors in a continuous space, capturing semantic and syntactic relationships.

Subsequent breakthroughs involved more complex architectures:

  • Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM): Capable of processing sequential data, improving performance on tasks like sequence labeling and machine translation.
  • The Transformer Architecture (2017): Built entirely around the attention mechanism, which allows models to weigh the importance of different words in a sequence, regardless of their distance. This architecture became the foundation for large-scale pre-trained models.
  • Large Language Models (LLMs): Models like BERT, GPT, and T5 are pre-trained on massive amounts of text data and can then be fine-tuned for a wide variety of downstream tasks, often with state-of-the-art performance.
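
As a rough illustration of how attention weighs every word against every other word regardless of distance, the following is a minimal pure-Python sketch of scaled dot-product attention. The function names and the toy two-dimensional vectors are assumptions made for this example; real Transformer models use learned projection matrices, multiple heads, and tensor libraries.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs


# Toy "word vectors" (made-up numbers, not learned embeddings).
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
contextual = attention(vectors, vectors, vectors)
```

In this toy input the first and third positions hold identical vectors, so they receive identical outputs even though they are not adjacent: the weights depend on vector similarity, not on distance in the sequence.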

This era has seen the most significant practical advances, leading to the widespread deployment of sophisticated language technologies.

1.3 Computational Linguistics vs. Natural Language Processing (NLP)

The terms Computational Linguistics (CL) and Natural Language Processing (NLP) are often used interchangeably, particularly in industry and popular media. However, in academic and research contexts, a subtle but important distinction persists, primarily concerning their goals and methodologies.

Computational Linguistics (CL) is fundamentally a scientific discipline. Its primary goal is to use computational methods to understand the underlying principles and mechanisms of human language. CL is driven by linguistic questions: How can we formally model ambiguity? What is the most efficient way to represent semantic meaning? The computational models serve as a means to test linguistic hypotheses.

Natural Language Processing (NLP) is primarily an engineering discipline. Its primary goal is to build practical, robust systems that can perform useful tasks involving language. NLP is driven by application questions: How can we build a better machine translator? How can we automatically summarize a document? The focus is on achieving the highest possible performance on a specific, real-world task, often prioritizing engineering efficiency over deep linguistic insight.

While the two fields share techniques (especially in the modern era, where deep learning models dominate both), their ultimate objectives differ. A CL researcher might develop a new parsing algorithm to prove a theory about syntactic structure, while an NLP engineer might use an existing, off-the-shelf parser to improve a chatbot's performance.

The relationship is best described as a symbiotic hierarchy: CL provides the theoretical foundations and formal models, while NLP takes these models and applies them to create practical technologies.

In the current era of Large Language Models (LLMs), the distinction has become increasingly blurred. LLMs, which are products of NLP engineering, are now so powerful that they are being used as tools by CL researchers to test new linguistic hypotheses, effectively closing the loop between the two disciplines.

| Feature | Computational Linguistics (CL) | Natural Language Processing (NLP) |
| --- | --- | --- |
| Primary Goal | Scientific understanding of language and linguistic phenomena. | Engineering robust systems for practical language tasks. |
| Focus | Theoretical models, formalisms, and linguistic validity. | Performance, efficiency, and real-world application. |
| Methodology | Often involves testing linguistic theories using computational tools. | Often involves applying and optimizing machine learning algorithms. |
| Output | Formal grammars, computational models of linguistic processes, scientific papers. | Software, applications (e.g., chatbots, MT systems, sentiment analyzers). |
| Discipline Leaning | Linguistics and Cognitive Science. | Computer Science and Artificial Intelligence. |

1.4 Fundamental Linguistic Challenges in Computational Linguistics

Despite the monumental progress achieved by statistical and neural models, human language presents a set of inherent challenges that remain central to CL research. These challenges stem from the very nature of language: its complexity, its reliance on context, and its grounding in human cognition and world knowledge.

1.4.1 The Challenge of Ambiguity

Ambiguity is arguably the single greatest hurdle in computational language understanding. Human speakers resolve ambiguity effortlessly using context and world knowledge, but for

| Type of Ambiguity | Example Sentence | Interpretation 1 | Interpretation 2 |
| --- | --- | --- | --- |
| Lexical | "I saw a bat." | A flying mammal. | A piece of sports equipment. |
| Syntactic | "He announced the plan to the staff on Monday." | The announcement happened on Monday. | The plan was about the staff on Monday. |
| Referential | "John told Bill he was wrong." | He refers to John. | He refers to Bill. |

1.4.2 The Challenge of Context and Coreference

Language is inherently contextual. The meaning of a sentence often depends on the preceding sentences, the speaker's identity, the time, and the location of the utterance.

Coreference Resolution is the task of finding all expressions that refer to the same entity in a text. This is crucial for understanding narrative flow and coherence.

"Dr. Smith presented her findings. She then took questions from the audience. The paper itself was groundbreaking."

A CL system must correctly identify that "her" and "She" refer to "Dr. Smith", and that "itself" refers to "The paper". This is complicated by factors like gender, number, and the distance between the referring expressions.

1.4.3 The Challenge of World Knowledge and Inference

A significant portion of human language understanding relies on common sense and world knowledge: information that is not explicitly stated in the text but is assumed to be known by the reader.

"The trophy didn't fit into the suitcase because it was too big."

To resolve the pronoun "it", a system must know that trophies and suitcases have sizes, and that if something is "too big," it is the larger object that is the problem. In this case, "it" refers to the trophy.

"The trophy didn't fit into the suitcase because it was too small."

In this case, "it" refers to the suitcase.

Sentence pairs of this kind are known as Winograd schemas and form the basis of the Winograd Schema Challenge, which highlights the limitation of purely statistical models that lack a deep, symbolic representation of the world. While LLMs have shown remarkable progress in implicitly encoding vast amounts of world knowledge, they still struggle with complex, non-obvious inferences.

1.4.4 The Challenge of Variability and Low-Resource Languages

Human language is characterized by immense variability:

  • Dialects and Sociolects: Regional and social variations in language use.
  • Style and Register: Differences between formal academic writing, casual conversation, and social media posts.
  • Non-Standard Language: Misspellings, grammatical errors, and creative language use (e.g., in text messages).

A robust CL system must be able to handle this variability. Furthermore, the vast majority of CL/NLP research has focused on high-resource languages (primarily English, followed by a few others) for which massive amounts of annotated data are available.

Low-Resource Languages (LRLs), that is, languages with limited digital text, few speakers, or scarce linguistic expertise, present a major challenge. Developing effective language technologies for LRLs requires innovative techniques like transfer learning (applying models trained on high-resource languages) and unsupervised learning (learning from unannotated text), which are active areas of CL research.

1.5 The Interplay of Theory and Practice

The evolution of CL demonstrates a continuous, cyclical interplay between theoretical linguistic insight and practical computational implementation.

The early rule-based systems were a direct attempt to implement formal linguistic theories, but they failed due to the sheer complexity of language. The statistical revolution succeeded by largely ignoring deep linguistic theory in favor of empirical performance. The current deep learning era represents a synthesis: while the models are statistical, they implicitly learn and encode complex linguistic hierarchies (e.g., syntax and semantics) that were once the exclusive domain of theoretical linguistics.

The future of Computational Linguistics lies in bridging the remaining gaps, particularly in areas like:

  • Explainability (XAI): Making the decisions of complex neural models transparent and interpretable in linguistic terms.
  • Robustness: Creating models that are not easily fooled by adversarial examples or minor variations in input.
  • Cognitive Plausibility: Developing models that not only perform well but also process

generate human language. The quality, quantity, and structure of this data directly determine the capabilities and limitations of any resulting NLP system.

2. Types of Linguistic Data: Raw, Annotated, and Structured

Linguistic data, the raw material of CL, can be broadly classified based on its level of processing and structure. The most fundamental distinction is between raw, unanalyzed language and language that has been enriched with explicit linguistic information.

2.1. Raw Data: The Foundation

Raw data consists of language in its natural, unanalyzed form. For text, this is typically a sequence of characters, words, or sentences (e.g., a collection of web pages, books, or social media posts). For speech, it is the raw audio signal.

While raw data is the most abundant and easiest to acquire, its use in CL presents significant challenges:

  1. Noise and Variability: Real-world language is messy. Raw data contains misspellings, grammatical errors, non-standard abbreviations, dialectal variations, and formatting inconsistencies. Models trained on raw data must be robust enough to handle this inherent noise.
  2. Lack of Explicit Structure: The linguistic structure (syntax, semantics, discourse) is implicit. Extracting this structure requires unsupervised learning techniques, which are often less accurate than supervised methods.
  3. Ethical and Privacy Concerns: Large-scale raw data collection, especially from the web or social media, raises serious ethical questions regarding privacy, bias, and intellectual property.

Despite these challenges, raw data is crucial for unsupervised learning and, most notably, for training modern large language models (LLMs), which rely on very large volumes of raw text (terabytes to petabytes) to learn statistical patterns of language use.

2.2. Annotated Data: Adding Linguistic Knowledge

Annotated data is raw data that has been manually or semi-automatically enriched with explicit linguistic tags or labels. This process, known as annotation or data labeling, is the primary method for creating resources for supervised machine learning tasks. The annotations transform the implicit structure of language into an explicit, machine-readable format.

The creation of high-quality annotated data is a complex, expensive, and time-consuming process, often requiring trained linguists. The process is governed by an annotation scheme, a set of formal guidelines and rules that define the categories, tags, and conventions to be used.

A critical metric for the quality of annotated data is Inter-Annotator Agreement (IAA). IAA measures the degree of consensus among different human annotators applying the same scheme to the same data. Low IAA suggests that the annotation scheme is ambiguous, the task is inherently difficult, or the annotators are insufficiently trained. Common chance-corrected metrics for IAA include Cohen's Kappa ($\kappa$), defined for two annotators, and Fleiss' Kappa, which extends the idea to more than two annotators. High IAA is a prerequisite for creating reliable "gold standard" datasets.
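
To make the idea of chance-corrected agreement concrete, here is a minimal sketch of Cohen's Kappa, computed as $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance. The function name and the two toy annotation lists are assumptions made for this illustration, not data from the course.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical POS tags from two annotators for the same six tokens.
annotator_1 = ["NN", "VB", "NN", "JJ", "NN", "VB"]
annotator_2 = ["NN", "VB", "JJ", "JJ", "NN", "NN"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 4 of 6 tokens, but because some of that agreement would be expected by chance, the kappa value comes out well below the raw agreement rate.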

2.3. Levels of Annotation: From Sound to Discourse

Annotation can target any level of linguistic analysis, providing the specific features needed for different NLP tasks.

2.3.1. Morphological and Lexical Annotation

This level focuses on the word and its internal structure.

  • Part-of-Speech (POS) Tagging: Assigning a grammatical category to each token. The Penn Treebank tagset is a widely used standard. Example: "The/DT quick/JJ brown/JJ fox/NN jumps/VBZ over/IN the/DT lazy/JJ dog/NN ./."
  • Lemmatization and Stemming: Reducing inflected forms to a base form (lemma) or root (stem). This is crucial for reducing data sparsity and improving information retrieval.
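
The difference between a stem and a lemma can be sketched with a deliberately naive example. The suffix list, the lemma table, and both function names below are hypothetical choices for illustration only; practical systems rely on much richer rules or dictionaries.

```python
# Hypothetical suffix list and lemma table, chosen only for this illustration.
SUFFIXES = ("ing", "ed", "es", "s")
LEMMAS = {"ran": "run", "better": "good", "mice": "mouse"}


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def naive_lemmatize(word):
    """Look the word up in a lemma table, falling back to the stemmer."""
    return LEMMAS.get(word, naive_stem(word))
```

With these rules, naive_stem("running") yields "runn", which is not a dictionary word, while naive_lemmatize("ran") yields "run": a stem is only a truncated form, whereas a lemma is the dictionary entry.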

| Feature | Raw Data | Annotated Data | Trade-off in CL |
| --- | --- | --- | --- |
| Quantity | Very High (Terabytes/Petabytes) | Low to Medium (Megabytes/Gigabytes) | Quantity vs. Quality |
| Cost | Low (Acquisition) | High (Human Labor) | Scalability vs. Specificity |
| Use Case | Unsupervised Learning, Pre-training LLMs | Supervised Learning, Fine-tuning, Evaluation | Generality vs. Performance |
| Structure | Implicit, Statistical | Explicit, Rule-based/Categorical | Discovery vs. Definition |

meticulous process, with the goal of creating a resource that is both useful for specific research questions and generalizable to the language as a whole.

3.1. The Challenge of Representativeness

The concept of representativeness is arguably the most critical and complex issue in corpus design. A corpus is representative if its contents accurately reflect the properties of the language variety or domain it is intended to model [16].

The theoretical difficulty lies in defining the population of a language. Unlike a finite population (e.g., all registered voters), the set of all possible utterances in a language is infinite and constantly evolving. Therefore, a corpus can only ever be a sample of this infinite population.

To achieve representativeness, corpus designers must consider two primary dimensions:

  1. Domain/Genre Coverage: The corpus must include texts from all relevant genres (e.g., fiction, news, academic, conversational, legal) in proportions that reflect their frequency in the target language use.
  2. Author/Speaker Variation: The corpus should capture variation across different authors, speakers, regions, ages, and social groups to avoid bias towards a single demographic.

A corpus that is not representative can lead to misleading or biased conclusions. For instance, a model trained exclusively on a corpus of formal academic papers will perform poorly on informal social media text, a phenomenon known as domain mismatch.

3.2. Corpus Balance and Stratified Sampling

Corpus balance is the practical strategy used to achieve representativeness. It involves ensuring that the different components (or strata) of the language population are included in the corpus in appropriate proportions.

The most common technique for achieving balance is Stratified Sampling. This involves:

  1. Defining Strata: Identifying key variables that define the language population (e.g., medium: written/spoken; genre: news/fiction/academic; time period: 1990s/2000s).
  2. Determining Proportions: Estimating the relative frequency of each stratum in the target population. This is often based on external data (e.g., publishing statistics, media consumption surveys).
  3. Sampling: Selecting texts from each stratum according to the determined proportions.

For example, if a designer estimates that 60% of the target language use is written news and 40% is spoken conversation, a balanced corpus of 100 million words would aim for 60 million words of news text and 40 million words of transcribed speech.
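
The allocation arithmetic in this example, together with the sampling step, can be sketched as follows. The function names, the stratum labels, and the stop-at-budget sampling strategy are assumptions made for illustration; real corpus projects apply more careful selection criteria.

```python
import random


def allocate_words(total_words, proportions):
    """Split a corpus word budget across strata by their estimated shares."""
    if abs(sum(proportions.values()) - 1.0) > 1e-9:
        raise ValueError("stratum proportions must sum to 1")
    return {stratum: round(total_words * share)
            for stratum, share in proportions.items()}


def sample_stratum(texts, word_budget, seed=0):
    """Randomly pick texts from one stratum until the word budget is reached."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    shuffled = texts[:]
    rng.shuffle(shuffled)
    chosen, used = [], 0
    for text in shuffled:
        if used >= word_budget:
            break
        chosen.append(text)
        used += len(text.split())  # whitespace word count, a simplification
    return chosen


targets = allocate_words(
    100_000_000, {"written_news": 0.60, "spoken_conversation": 0.40}
)
```

For the 60/40 estimate above, targets assigns 60 million words to written news and 40 million to spoken conversation; sample_stratum would then be called once per stratum with the corresponding budget.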

Example of Sampling Bias:

The original Brown Corpus, a foundational resource, was criticized for its bias towards American English and formal, written prose from the year 1961. While revolutionary for its time, its lack of spoken data and diachronic depth limited its representativeness of the broader English language. Modern corpora, such as the British National Corpus (BNC), explicitly aim for balance by including a 90% written and 10% spoken component, further stratified by genre and context.

4. Classification and Typology of Corpora

Corpora can be classified based on their content, structure, and purpose. Understanding these classifications is essential for selecting the right resource for a given CL task.

4.1. Based on Language and Alignment

| Type | Description | Primary Use Case | Example |
| --- | --- | --- | --- |
| Monolingual | Texts in a single language. | General language research, language modeling. | British National Corpus (BNC) |
| Parallel | Texts in one language aligned sentence-by-sentence with their human translations in one or more other languages. | Machine Translation (MT), contrastive linguistics. | Europarl Corpus, OpenSubtitles Corpus |
| Comparable | Texts in multiple languages that are similar in genre, topic, and time, but are not translations of each other. | Cross-linguistic studies, terminology extraction. | Collections of news articles on the same event in different languages. |

Parallel Corpora are the bedrock of statistical and neural Machine Translation. The alignment process, which links a source sentence to its target translation, is often complex and requires sophisticated algorithms or manual verification. The quality of this alignment is paramount for training effective MT systems.

4.2. Based on Time and Medium

  • Lexical Databases (e.g., WordNet): WordNet is a large lexical database of English. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by conceptual-semantic and lexical relations (e.g., hyponymy (is-a-kind-of), meronymy (is-a-part-of)). This structure allows NLP systems to reason about word meaning and semantic similarity.
  • Frame Semantics Resources (e.g., FrameNet): FrameNet is a resource based on Frame Semantics, which views meaning as being organized around conceptual structures called frames. A frame evokes a set of Frame Elements (semantic roles). For example, the Commerce_buy frame involves a Buyer, a Seller, Goods, and Money. FrameNet provides annotated sentences illustrating these frames and their elements.
  • Ontologies and Knowledge Graphs (e.g., BabelNet): These resources formally represent concepts and their relationships in a graph structure. BabelNet, for instance, is a multilingual semantic network that links concepts from Wikipedia and WordNet, providing a massive, cross-lingual knowledge base.
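
To give a feel for how is-a-kind-of links support reasoning about word meaning, here is a toy sketch. The five-entry hierarchy and both function names are hypothetical and are not taken from WordNet itself, whose synsets and relations are far richer.

```python
# Hypothetical mini-hierarchy: each concept maps to its direct hypernym.
HYPERNYM = {
    "dog": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
}


def hypernym_chain(concept):
    """Follow is-a-kind-of links upward until the root is reached."""
    chain = [concept]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def lowest_common_hypernym(a, b):
    """Return the first concept on a's chain that also lies on b's chain."""
    ancestors_b = set(hypernym_chain(b))
    for concept in hypernym_chain(a):
        if concept in ancestors_b:
            return concept
    return None
```

In this toy graph "dog" and "cat" first meet at "mammal"; the shorter the path to such a shared ancestor, the more similar two concepts are taken to be in path-based similarity measures.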

5.2. Benchmark Datasets and Evaluation Campaigns

Benchmark datasets are specialized, often highly-annotated, datasets used exclusively for the purpose of evaluating and comparing the performance of different NLP models on a specific task. They are the standardized yardstick of the field.

A typical benchmark setup involves:

  1. Training Set: Used to train the model.
  2. Development (Dev) Set: Used for hyperparameter tuning and iterative model refinement.
  3. Test Set (Held-out Data): Used only once, at the very end, to provide an unbiased estimate of the model's generalization performance.
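
The three-way setup above can be sketched as a single shuffled split. The 80/10/10 proportions, the seed, and the function name are assumptions for illustration; published benchmarks usually ship with fixed, predefined splits that should be used as given.

```python
import random


def split_dataset(examples, dev_fraction=0.1, test_fraction=0.1, seed=13):
    """Shuffle once, then cut into disjoint train, dev, and test portions."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_test = int(len(items) * test_fraction)
    n_dev = int(len(items) * dev_fraction)
    test = items[:n_test]
    dev = items[n_test:n_test + n_dev]
    train = items[n_test + n_dev:]
    return train, dev, test


train_set, dev_set, test_set = split_dataset(range(100))
```

Because the three portions are cut from one shuffled list, no example can appear in more than one of them, which is what keeps the test-set estimate unbiased.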

The use of common benchmarks is crucial for scientific progress in CL, as it ensures reproducibility and allows researchers to directly compare the efficacy of different algorithms.

Evaluation Campaigns (e.g., SemEval, CoNLL Shared Tasks) are organized events centered around a specific benchmark task. They provide a common dataset, a clear evaluation metric, and a timeline for submission, fostering a competitive environment that rapidly advances the state-of-the-art for that particular problem. The distinction between a general corpus and a benchmark is that the benchmark is narrowly focused on a single phenomenon (e.g., sentiment analysis, question answering) and is designed to be a challenging test, not necessarily a representative sample of the entire language.

6. Conclusion: The Future of Linguistic Data

Linguistic data is the indispensable resource for modern Computational Linguistics. The journey from raw text to highly structured, annotated corpora and specialized benchmarks reflects the increasing sophistication of NLP models and the growing demand for systems that can handle the nuances of human language.

The field continues to grapple with fundamental challenges related to data:

  • Data Scarcity for Low-Resource Languages: The vast majority of linguistic data is available for a handful of high-resource languages (e.g., English, Chinese, Spanish). Developing methods to build robust NLP systems for the world's thousands of low-resource languages remains a critical area of research.
  • Ethical Data Sourcing and Bias: Ensuring that corpora are collected ethically, respect privacy, and are free from harmful societal biases (e.g., gender, racial, or political) is paramount. A biased corpus will inevitably lead to a biased model, perpetuating and amplifying societal inequalities.
  • Multimodality and Grounding: As NLP moves towards systems that interact with the real world, the need for multimodal corpora (linking language to vision, action, and physical context) will only increase.

In summary, the design, collection, annotation, and maintenance of linguistic data are not merely technical prerequisites but central scientific endeavors in Computational Linguistics. The careful consideration of data types, the rigorous pursuit of representativeness, and the systematic classification of corpora are the pillars upon which the next generation of language technologies will be built.

References

Sinclair, J. (1996). The Empty Lexicon. International Journal of Corpus Linguistics, 1(1), 99-104.

Biber, D. (1993). Representativeness in Corpus Design. Literary and Linguistic Computing, 8(4), 243-257.

Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313-330.

Ide, N., & Pustejovsky, J. (2017). The Handbook of Linguistic Annotation. Springer.

Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998). The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (Vol. 1, pp. 86-90).

Resnik, P., & Yarowsky, D. (1999). A survey of monolingual and parallel corpora for word sense disambiguation. In Proceedings of the ACL Workshop on Word Sense Disambiguation (pp. 1-8).

loud. The dog's bark was louder.", the total number of tokens would be 14, but the number of types would be less than 14 due to repetitions like "The", "'s", "was", and "loud".

The primary purpose of tokenization is to transform unstructured text into a structured sequence that can be indexed, counted, and analyzed by computational algorithms. It is the first and most crucial step in the NLP pipeline, influencing all subsequent tasks, including part-of-speech tagging, parsing, and machine translation [16].
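
The token/type distinction can be checked by counting. The sketch below uses a hypothetical sentence of its own and a deliberately simple regular-expression tokenizer in which words, the clitic 's, and individual punctuation marks each count as one token; both are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical example sentence, not the one from the course notes.
text = "The cat sat. The cat's tail was long."

# Alternatives are tried left to right: clitic 's, then words, then punctuation.
tokens = re.findall(r"'s|\w+|[^\w\s]", text)

# A Counter keyed by token gives one entry per type.
types = Counter(tokens)

token_count = len(tokens)
type_count = len(types)
type_token_ratio = type_count / token_count
```

Under this tokenizer the sentence has 11 tokens but only 8 types, because "The", "cat", and "." each occur twice.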

3.2.2 Challenges in Tokenization

While simple tokenization might involve splitting text by whitespace, real-world language presents numerous challenges that require sophisticated rules or statistical models [16]:

| Challenge | Example | Standard Tokenization Issue | Computational Solution |
| --- | --- | --- | --- |
| Punctuation | U.S.A., end. | Should the period be part of the token or a separate token? | Context-sensitive rules (e.g., period after an abbreviation is part of the token; otherwise, it's a separator). |
| Contractions | don't, I'm | Should don't be one token or two (do and n't)? | Splitting into base word and clitic (do + n't) for morphological analysis. |
| Hyphenation | state-of-the-art, co-worker | Should hyphenated words be one token or multiple? | Domain-specific rules; often kept as one token unless the hyphen is a line-break artifact. |
| Multi-word Expressions (MWEs) | New York, take off | Should MWEs be treated as a single semantic unit? | Named Entity Recognition (NER) or statistical methods to identify and treat MWEs as single tokens. |
| Case Sensitivity | Apple (company) vs. apple (fruit) | Should tokens be case-normalized (lowercased) or kept as is? | Depends on the task; typically lowercased for frequency analysis, but preserved for NER. |
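
Several of the rule-based solutions in this table can be sketched with one ordered regular expression. The pattern and the function name below are assumptions for illustration, covering only abbreviations, clitics, and hyphenated words; multi-word expressions and case handling are left out.

```python
import re

# Order matters: earlier alternatives win, so special cases come first.
TOKEN_PATTERN = re.compile(
    r"""
    (?:[A-Za-z]\.){2,}        # abbreviations such as U.S.A.
    | \w+(?=n't)              # word part before a negative clitic (do in don't)
    | n't                     # the negative clitic itself
    | '(?:s|m|re|ve|ll|d)\b   # other clitics: 's, 'm, 're, 've, 'll, 'd
    | \w+(?:-\w+)*            # words, including hyphenated compounds
    | [^\w\s]                 # any remaining punctuation mark
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the tokens matched by TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("I'm in the U.S.A. and don't like state-of-the-art hype.")
```

One known limitation of this sketch: when an abbreviation ends a sentence, its final period is absorbed into the abbreviation token, so no separate sentence-final period is produced. Resolving that case needs the kind of context-sensitive rule the table describes.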

3.2.3 Types of Tokenization

Tokenization can occur at various levels of granularity:

  1. Word Tokenization: The most common form, aiming to isolate words. As discussed, this is complex due to the challenges above.
  2. Sentence Tokenization: The process of dividing a text into sentences. This is essential for tasks that operate on a sentence level, such as parsing or summarization. The primary challenge is distinguishing between sentence-ending punctuation (period, question mark, exclamation mark) and internal punctuation (e.g., in abbreviations or decimal numbers).
  3. Subword Tokenization: A modern approach, particularly prevalent in deep learning models (like BERT or GPT), where words are broken down into smaller units (subwords) that occur frequently in the training text. Techniques like Byte Pair Encoding (BPE) or WordPiece are used. This addresses the Out-of-Vocabulary (OOV) problem by allowing the model to compose unknown words from known subword units, and it also keeps the vocabulary size manageable.
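
As an illustration of how BPE builds subword units, the sketch below learns merges by repeatedly fusing the most frequent adjacent symbol pair. The function name, the toy word counts, and the tie-breaking rule (first pair seen wins) are assumptions for this example; production tokenizers add end-of-word markers, byte-level handling, and other details.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn BPE merges from a {word: frequency} dictionary."""
    # Start with every word as a tuple of single characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pairs = Counter()
        for symbols, count in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair seen wins
        merges.append(best)
        # Rewrite every word with the best pair fused into one symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab


# Hypothetical word frequencies for a tiny corpus.
merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 2)
```

On this toy input the first two merges build the subword "est" out of "e" + "s" and then "es" + "t", so a word such as "lowest", absent from the training counts, could still be segmented into known pieces.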

3.3 Frequency Analysis and Zipf's Laws

Once a corpus has been tokenized, the next logical step is to count the frequency of each type. This frequency distribution is a fundamental characteristic of any natural language and is governed by a remarkable statistical regularity known as Zipf's Law [16].

3.3.1 Term Frequency and Rank

The term frequency ($f$) of a type is simply the number of times that type appears in the corpus. When all types in a corpus are sorted in descending order of their frequency, they are assigned a rank ($r$). The most frequent word has rank $r=1$, the second most frequent has $r=2$, and so on.
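
Counting frequencies and assigning ranks takes only a few lines. The toy token list below is a hypothetical example, far too small to show any statistical regularity; it only illustrates how $f$ and $r$ are obtained.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized and lowercased.
tokens = "the cat saw the dog and the dog saw the bird".split()

# Term frequency f of each type.
frequencies = Counter(tokens)

# most_common() sorts by descending frequency; enumerate assigns ranks from 1.
ranked = [
    (rank, word, freq)
    for rank, (word, freq) in enumerate(frequencies.most_common(), start=1)
]
```

Each entry of ranked is a (rank, type, frequency) triple, with the most frequent type at rank 1. Types with equal frequency still receive distinct consecutive ranks, in the order they first appeared.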

3.3.2 Zipf's First Law (Frequency vs. Rank)

Zipf's Law, named after the linguist George Kingsley Zipf, describes the inverse relationship between the frequency of a word and its rank in a large corpus. The law states that the frequency of any word is inversely proportional to its rank [16].

Mathematically, this relationship can be expressed as:

$$f \propto \frac{1}{r}$$

Or, more precisely, as an equality involving a constant $C$: