














































































Theoretical Expansion: Each chapter includes historical insights (e.g., the Chomsky vs. Corpus Linguistics debate), detailed mathematical explanations for statistics and deep learning, and critical analyses of models.
Examples and Case Studies: I have included numerous practical examples of linguistic analysis, ambiguity scenarios, and real-world applications to make the concepts more intuitive.
Glossary and Appendices: At the end of the document, you will find a comprehensive glossary of terms and a series of reflection exercises based on the practical sessions of the course.
Professional Structure: The file follows a logical order, from the basics of data collection to the frontiers of Large Language Models and computational pipelines.
Type: Outlines and concept maps















































































Author: Manus AI
Source: University Course Slides (2024-2025)
Language: English
Computational Linguistics (CL) stands as a pivotal interdisciplinary field, residing at the confluence of computer science, artificial intelligence, and theoretical linguistics. Its fundamental objective is the computational modeling of human language, encompassing both its written and spoken forms, to enable computers to process, understand, and generate natural language. This chapter provides a comprehensive introduction to CL, tracing its historical trajectory, clarifying its relationship with the closely allied field of Natural Language Processing (NLP), and undertaking a deep theoretical dive into the inherent linguistic challenges that continue to drive research in the discipline.
Computational Linguistics is, at its core, a scientific discipline. It seeks to develop formal, computational models of linguistic phenomena, ranging from phonetics and morphology to syntax, semantics, and pragmatics. The models developed in CL are not merely tools for practical application but are often designed to test and validate linguistic theories. By attempting to formalize the complex, often ambiguous, rules of human language into algorithms, computational linguists gain profound insights into the structure and function of language itself 16.
The scope of CL is vast, covering all levels of linguistic analysis:
The interdisciplinary nature of CL means it draws heavily on formalisms from computer science (e.g., finite-state automata, context-free grammars, machine learning), mathematics (e.g., probability theory, linear algebra), and cognitive science (e.g., theories of language acquisition and processing).
The history of Computational Linguistics is a fascinating narrative of ambition, disappointment, and revolutionary technological shifts. It can be broadly segmented into four major eras, each defined by the dominant methodology and the prevailing technological landscape.
The birth of CL is inextricably linked to the Cold War and the urgent need for automatic Machine Translation (MT). The field’s formal beginning is often traced to the Georgetown-IBM Experiment in 1954, which demonstrated the feasibility of translating more than sixty Russian sentences into English using a small set of hand-coded rules 16.
This early period was characterized by rationalist or rule-based approaches. The underlying assumption, heavily influenced by early linguistic theories, was that language could be fully described by a finite set of explicit, deterministic rules. Systems relied on extensive, manually crafted dictionaries and grammatical rules to analyze (parse) and generate language.
A key theoretical development was the work of Noam Chomsky, whose concept of Generative Grammar provided a formal, mathematical framework for describing the syntax of natural language. While Chomsky himself was skeptical of the computational tractability of his models, his work profoundly influenced the formalization efforts within CL.
This period was marked by a significant setback: the ALPAC Report (Automatic Language Processing Advisory Committee) of 1966. Commissioned by the U.S. government, the report
word embeddings (e.g., Word2Vec, GloVe), which represented words as dense vectors in a continuous space, capturing semantic and syntactic relationships.
Subsequent breakthroughs involved more complex architectures:
This era has seen the most significant practical advances, leading to the widespread deployment of sophisticated language technologies.
The terms Computational Linguistics (CL) and Natural Language Processing (NLP) are often used interchangeably, particularly in industry and popular media. However, in academic and research contexts, a subtle but important distinction persists, primarily concerning their goals and methodologies.
Computational Linguistics (CL) is fundamentally a scientific discipline. Its primary goal is to use computational methods to understand the underlying principles and mechanisms of human language. CL is driven by linguistic questions: How can we formally model ambiguity? What is the most efficient way to represent semantic meaning? The computational models serve as a means to test linguistic hypotheses.
Natural Language Processing (NLP) is primarily an engineering discipline. Its primary goal is to build practical, robust systems that can perform useful tasks involving language. NLP is driven by application questions: How can we build a better machine translator? How can we automatically summarize a document? The focus is on achieving the highest possible performance on a specific, real-world task, often prioritizing engineering efficiency over deep linguistic insight.
While the two fields share techniques (especially in the modern era, where deep learning models dominate both), their ultimate objectives differ. A CL researcher might develop a new parsing algorithm to prove a theory about syntactic structure, while an NLP engineer might use an existing, off-the-shelf parser to improve a chatbot's performance.
The relationship is best described as a symbiotic hierarchy: CL provides the theoretical foundations and formal models, while NLP takes these models and applies them to create practical technologies.
In the current era of Large Language Models (LLMs), the distinction has become increasingly blurred. LLMs, which are products of NLP engineering, are now so powerful that they are being used as tools by CL researchers to test new linguistic hypotheses, effectively closing the loop between the two disciplines.
Despite the monumental progress achieved by statistical and neural models, human language presents a set of inherent challenges that remain central to CL research. These challenges stem from the very nature of language: its complexity, its reliance on context, and its grounding in human cognition and world knowledge.
Ambiguity is arguably the single greatest hurdle in computational language understanding. Human speakers resolve ambiguity effortlessly using context and world knowledge, but for
| Feature | Computational Linguistics (CL) | Natural Language Processing (NLP) |
| --- | --- | --- |
| Primary Goal | Scientific understanding of language and linguistic phenomena. | Engineering robust systems for practical language tasks. |
| Focus | Theoretical models, formalisms, and linguistic validity. | Performance, efficiency, and real-world application. |
| Methodology | Often involves testing linguistic theories using computational tools. | Often involves applying and optimizing machine learning algorithms. |
| Output | Formal grammars, computational models of linguistic processes, scientific papers. | Software, applications (e.g., chatbots, MT systems, sentiment analyzers). |
| Discipline Leaning | Linguistics and Cognitive Science. | Computer Science and Artificial Intelligence. |
| Type of Ambiguity | Example Sentence | Ambiguous Interpretation 1 | Ambiguous Interpretation 2 |
| --- | --- | --- | --- |
| Lexical | "I saw a bat." | A flying mammal. | A piece of sports equipment. |
| Syntactic | "He announced the plan to the staff on Monday." | The announcement happened on Monday. | The plan was about the staff on Monday. |
| Referential | "John told Bill he was wrong." | "He" refers to John. | "He" refers to Bill. |

Language is inherently contextual. The meaning of a sentence often depends on the preceding sentences, the speaker's identity, the time, and the location of the utterance.
Coreference Resolution is the task of finding all expressions that refer to the same entity in a text. This is crucial for understanding narrative flow and coherence.
"Dr. Smith presented her findings. She then took questions from the audience. The paper itself was groundbreaking."
A CL system must correctly identify that "her" and "She" refer to "Dr. Smith", and that "itself" refers to "The paper". This is complicated by factors like gender, number, and the distance between the referring expressions.
A significant portion of human language understanding relies on common sense and world knowledge: information that is not explicitly stated in the text but is assumed to be known by the reader.
"The trophy didn't fit into the suitcase because it was too big."
To resolve the pronoun "it", a system must know that trophies and suitcases have sizes, and that if something is "too big," it is the larger object that is the problem. In this case, "it" refers to the trophy.
"The trophy didn't fit into the suitcase because it was too small."
In this case, "it" refers to the suitcase.
This is known as the Winograd Schema Challenge, and it highlights the limitation of purely statistical models that lack a deep, symbolic representation of the world. While LLMs have shown remarkable progress in implicitly encoding vast amounts of world knowledge, they still struggle with complex, non-obvious inferences.
Human language is characterized by immense variability:
A robust CL system must be able to handle this variability. Furthermore, the vast majority of CL/NLP research has focused on high-resource languages (primarily English, followed by a few others) for which massive amounts of annotated data are available.
Low-Resource Languages (LRLs), meaning languages with limited digital text, few speakers, or scarce linguistic expertise, present a major challenge. Developing effective language technologies for LRLs requires innovative techniques like transfer learning (applying models trained on high-resource languages) and unsupervised learning (learning from unannotated text), which are active areas of CL research.
The evolution of CL demonstrates a continuous, cyclical interplay between theoretical linguistic insight and practical computational implementation.
The early rule-based systems were a direct attempt to implement formal linguistic theories, but they failed due to the sheer complexity of language. The statistical revolution succeeded by largely ignoring deep linguistic theory in favor of empirical performance. The current deep learning era represents a synthesis: while the models are statistical, they implicitly learn and encode complex linguistic hierarchies (e.g., syntax and semantics) that were once the exclusive domain of theoretical linguistics.
The future of Computational Linguistics lies in bridging the remaining gaps, particularly in areas like:
generate human language. The quality, quantity, and structure of this data directly determine the capabilities and limitations of any resulting NLP system.
Linguistic data, the raw material of CL, can be broadly classified based on its level of processing and structure. The most fundamental distinction is between raw, unanalyzed language and language that has been enriched with explicit linguistic information.
Raw data consists of language in its natural, unanalyzed form. For text, this is typically a sequence of characters, words, or sentences (e.g., a collection of web pages, books, or social media posts). For speech, it is the raw audio signal.
While raw data is the most abundant and easiest to acquire, its use in CL presents significant challenges:
Despite these challenges, raw data is crucial for unsupervised learning and, most notably, for training modern large language models (LLMs), which rely on terabytes to petabytes of raw text to learn statistical patterns of language use.
Annotated data is raw data that has been manually or semi-automatically enriched with explicit linguistic tags or labels. This process, known as annotation or data labeling, is the primary method for creating resources for supervised machine learning tasks. The annotations transform the implicit structure of language into an explicit, machine-readable format.
The creation of high-quality annotated data is a complex, expensive, and time-consuming process, often requiring trained linguists. The process is governed by an annotation scheme, a set of formal guidelines and rules that define the categories, tags, and conventions to be used.
A critical metric for the quality of annotated data is Inter-Annotator Agreement (IAA). IAA measures the degree of consensus among different human annotators applying the same scheme to the same data. Low IAA suggests that the annotation scheme is ambiguous, the task is inherently difficult, or the annotators are insufficiently trained. Common metrics for IAA include Cohen's Kappa ($\kappa$) and Fleiss' Kappa. High IAA is a prerequisite for creating reliable "gold standard" datasets.
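As a rough illustration of how agreement between two annotators can be quantified, the sketch below computes Cohen's Kappa as $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance from each annotator's label distribution. The function name `cohens_kappa` and the two toy part-of-speech label sequences are assumptions made for this example; they do not come from the course material.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list.")
    n = len(labels_a)

    # Observed agreement: share of items that received identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label at random,
    # given each annotator's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    p_e = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if p_e == 1.0:
        # Both annotators used one and the same label everywhere.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical annotations of ten tokens by two annotators.
annotator_1 = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB", "NOUN", "ADJ", "VERB", "NOUN"]
annotator_2 = ["NOUN", "VERB", "NOUN", "NOUN", "NOUN", "VERB", "ADJ", "ADJ", "VERB", "NOUN"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

In this toy case the annotators agree on 8 of 10 items ($p_o = 0.8$) while chance agreement is $p_e = 0.38$, so kappa comes out lower than the raw agreement rate. Correcting for chance in this way is why kappa is usually preferred over simple percent agreement.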
Annotation can target any level of linguistic analysis, providing the specific features needed for different NLP tasks.
This level focuses on the word and its internal structure.
| Feature | Raw Data | Annotated Data | Trade-off in CL |
| --- | --- | --- | --- |
| Quantity | Very High (Terabytes/Petabytes) | Low to Medium (Megabytes/Gigabytes) | Quantity vs. Quality |
| Cost | Low (Acquisition) | High (Human Labor) | Scalability vs. Specificity |
| Use Case | Unsupervised Learning, Pre-training LLMs | Supervised Learning, Fine-tuning, Evaluation | Generality vs. Performance |
| Structure | Implicit, Statistical | Explicit, Rule-based/Categorical | Discovery vs. Definition |
meticulous process, with the goal of creating a resource that is both useful for specific research questions and generalizable to the language as a whole.
The concept of representativeness is arguably the most critical and complex issue in corpus design. A corpus is representative if its contents accurately reflect the properties of the language variety or domain it is intended to model 16.
The theoretical difficulty lies in defining the population of a language. Unlike a finite population (e.g., all registered voters), the set of all possible utterances in a language is infinite and constantly evolving. Therefore, a corpus can only ever be a sample of this infinite population.
To achieve representativeness, corpus designers must consider two primary dimensions:
A corpus that is not representative can lead to misleading or biased conclusions. For instance, a model trained exclusively on a corpus of formal academic papers will perform poorly on informal social media text, a phenomenon known as domain mismatch.
Corpus balance is the practical strategy used to achieve representativeness. It involves ensuring that the different components (or strata) of the language population are included in the corpus in appropriate proportions.
The most common technique for achieving balance is Stratified Sampling. This involves:
For example, if a designer estimates that 60% of the target language use is written news and 40% is spoken conversation, a balanced corpus of 100 million words would aim for 60 million words of news text and 40 million words of transcribed speech.
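The arithmetic in this example can be written down as a small allocation routine. The following is only a minimal sketch: the function name `allocate_strata` and the stratum labels are hypothetical, and it assumes the designer already has an estimated proportion for each stratum.

```python
def allocate_strata(total_words, proportions):
    """Split a corpus word budget across strata according to estimated proportions."""
    if abs(sum(proportions.values()) - 1.0) > 1e-9:
        raise ValueError("Stratum proportions must sum to 1.")
    # Round to whole words; each stratum gets its proportional share of the budget.
    return {name: round(total_words * share) for name, share in proportions.items()}


# Proportions taken from the example in the text: 60% written news, 40% spoken conversation.
targets = allocate_strata(
    100_000_000,
    {"written_news": 0.60, "spoken_conversation": 0.40},
)
print(targets)
```

With more strata (for example, several written genres), the same function applies unchanged. The hard part in practice is estimating the proportions, not computing the targets.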
Example of Sampling Bias:
The original Brown Corpus, a foundational resource, was criticized for its bias towards American English and formal, written prose from the year 1961. While revolutionary for its time, its lack of spoken data and diachronic depth limited its representativeness of the broader English language. Modern corpora, such as the British National Corpus (BNC), explicitly aim for balance by including a 90% written and 10% spoken component, further stratified by genre and context.
Corpora can be classified based on their content, structure, and purpose. Understanding these classifications is essential for selecting the right resource for a given CL task.
Parallel Corpora are the bedrock of statistical and neural Machine Translation. The alignment process, which links a source sentence to its target translation, is often complex and requires sophisticated algorithms or manual verification. The quality of this alignment is paramount for training effective MT systems.
| Type | Description | Primary Use Case | Example |
| --- | --- | --- | --- |
| Monolingual | Texts in a single language. | General language research, language modeling. | British National Corpus (BNC) |
| Parallel | Texts in one language aligned sentence-by-sentence with their human translations in one or more other languages. | Machine Translation (MT), contrastive linguistics. | Europarl Corpus, OpenSubtitles Corpus |
| Comparable | Texts in multiple languages that are similar in genre, topic, and time, but are not translations of each other. | Cross-linguistic studies, terminology extraction. | Collections of news articles on the same event in different languages. |
Benchmark datasets are specialized, often highly-annotated, datasets used exclusively for the purpose of evaluating and comparing the performance of different NLP models on a specific task. They are the standardized yardstick of the field.
A typical benchmark setup involves:
The use of common benchmarks is crucial for scientific progress in CL, as it ensures reproducibility and allows researchers to directly compare the efficacy of different algorithms.
Evaluation Campaigns (e.g., SemEval, CoNLL Shared Tasks) are organized events centered around a specific benchmark task. They provide a common dataset, a clear evaluation metric, and a timeline for submission, fostering a competitive environment that rapidly advances the state-of-the-art for that particular problem. The distinction between a general corpus and a benchmark is that the benchmark is narrowly focused on a single phenomenon (e.g., sentiment analysis, question answering) and is designed to be a challenging test, not necessarily a representative sample of the entire language.
Linguistic data is the indispensable resource for modern Computational Linguistics. The journey from raw text to highly structured, annotated corpora and specialized benchmarks reflects the increasing sophistication of NLP models and the growing demand for systems that can handle the nuances of human language.
The field continues to grapple with fundamental challenges related to data:
In summary, the design, collection, annotation, and maintenance of linguistic data are not merely technical prerequisites but central scientific endeavors in Computational Linguistics. The careful consideration of data types, the rigorous pursuit of representativeness, and the systematic classification of corpora are the pillars upon which the next generation of language technologies will be built.
[8] Sinclair, J. (1996). The Empty Lexicon. International Journal of Corpus Linguistics, 1(1), 99-104.
[8] Biber, D. (1993). Representativeness in Corpus Design. Literary and Linguistic Computing, 8(4), 243-257.
[8] Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313-330.
[8] Ide, N., & Pustejovsky, J. (2010). The Handbook of Linguistic Annotation. Springer.
[8] Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998). The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (Vol. 1, pp. 86-90).
[8] Resnik, P., & Yarowsky, D. (1999). A survey of monolingual and parallel corpora for word sense disambiguation. In Proceedings of the ACL Workshop on Word Sense Disambiguation (pp. 1-8).
loud. The dog's bark was louder.", the total number of tokens would be 14, but the number of types would be less than 14 due to repetitions like "The", "'s", "was", and "loud".
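The token/type distinction is easy to check with a few lines of code. The sketch below uses a separate, hypothetical toy sentence that is already split on whitespace (it is not the example sentence from the text): every item in the sequence is a token, and every distinct item is a type.

```python
from collections import Counter

# Hypothetical pre-tokenized toy text; punctuation marks count as tokens too.
tokens = "the cat sat on the mat . the cat slept .".split()

n_tokens = len(tokens)         # every occurrence counts
type_counts = Counter(tokens)  # maps each distinct type to its frequency
n_types = len(type_counts)     # only distinct forms count

print(n_tokens, n_types)            # 11 tokens, 7 types
print(type_counts.most_common(3))   # the most frequently repeated types
```

Here "the" occurs three times, so it contributes three tokens but only one type.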
The primary purpose of tokenization is to transform unstructured text into a structured sequence that can be indexed, counted, and analyzed by computational algorithms. It is the first and most crucial step in the NLP pipeline, influencing all subsequent tasks, including part-of-speech tagging, parsing, and machine translation 16.
While simple tokenization might involve splitting text by whitespace, real-world language presents numerous challenges that require sophisticated rules or statistical models 16 :
| Challenge | Example | Standard Tokenization Issue | Computational Solution |
| --- | --- | --- | --- |
| Punctuation | U.S.A., end. | Should the period be part of the token or a separate token? | Context-sensitive rules (e.g., period after an abbreviation is part of the token; otherwise, it's a separator). |
| Contractions | don't, I'm | Should don't be one token or two (do and n't)? | Splitting into base word and clitic (do + n't) for morphological analysis. |
| Hyphenation | state-of-the-art, co-worker | Should hyphenated words be one token or multiple? | Domain-specific rules; often kept as one token unless the hyphen is a line-break artifact. |
| Multi-word Expressions (MWEs) | New York, take off | Should MWEs be treated as a single semantic unit? | Named Entity Recognition (NER) or statistical methods to identify and treat MWEs as single tokens. |
| Case Sensitivity | Apple (company) vs. apple (fruit) | Should tokens be case-normalized (lowercased) or kept as is? | Depends on the task; typically lowercased for frequency analysis, but preserved for NER. |
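Several of the rules in the table can be approximated with a single regular expression. The following is a minimal sketch, not a production tokenizer: the pattern, the `tokenize` function, and the sample sentence are all assumptions made for illustration. It keeps abbreviations with internal periods together, splits contractions into base and clitic, keeps hyphenated words whole, and offers optional lowercasing. It only handles straight apostrophes and makes no attempt at multi-word expressions, which need NER or statistical methods as noted in the table.

```python
import re

# Alternatives are tried left to right, so the more specific patterns come first.
TOKEN_PATTERN = re.compile(
    r"""
    (?:[A-Za-z]\.){2,}   # abbreviation with internal periods, kept as one token
  | \w+(?=n't)           # base of a negative contraction (the part before n't)
  | n't                  # the negative clitic itself
  | '\w+                 # other clitics introduced by an apostrophe
  | \w+(?:-\w+)*         # plain or hyphenated word, kept as one token
  | [^\w\s]              # any remaining punctuation mark as its own token
    """,
    re.VERBOSE,
)


def tokenize(text, lowercase=False):
    """Split text into tokens using the rule-based pattern above."""
    tokens = TOKEN_PATTERN.findall(text)
    # Case normalization is task-dependent, so it is left as an option.
    return [t.lower() for t in tokens] if lowercase else tokens


sample = "They don't sell state-of-the-art tools in the U.S.A. I'm sure."
print(tokenize(sample))
# ['They', 'do', "n't", 'sell', 'state-of-the-art', 'tools', 'in', 'the',
#  'U.S.A.', 'I', "'m", 'sure', '.']
```

Note that the abbreviation rule is purely pattern-based: it cannot tell whether the final period of "U.S.A." also ends the sentence, which is exactly the kind of case where context-sensitive rules or statistical models are needed.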
Tokenization can occur at various levels of granularity:
Once a corpus has been tokenized, the next logical step is to count the frequency of each type. This frequency distribution is a fundamental characteristic of any natural language and is governed by a remarkable statistical regularity known as Zipf's Law 16.
The term frequency ($f$) of a type is simply the number of times that type appears in the corpus. When all types in a corpus are sorted in descending order of their frequency, they are assigned a rank ($r$). The most frequent word has rank $r=1$, the second most frequent has $r=2$, and so on.
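The frequency and rank definitions above translate directly into a rank-frequency table. The sketch below is illustrative only: the function name `rank_frequency_table` and the toy token list are hypothetical, and breaking frequency ties alphabetically is an assumption made here so that the output is deterministic.

```python
from collections import Counter


def rank_frequency_table(tokens):
    """Return (rank, type, frequency) triples sorted by descending frequency."""
    counts = Counter(tokens)
    # Sort by frequency (highest first); ties are broken alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]


# Hypothetical toy corpus; a real study would use millions of tokens.
tokens = "the cat saw the dog and the dog saw the bird".split()
table = rank_frequency_table(tokens)

for rank, word, freq in table:
    print(rank, word, freq)
```

On a corpus this small the frequencies carry little statistical meaning. The same table built from a large corpus is what is normally plotted (often on log-log axes) to inspect the rank-frequency relationship.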
Zipf's Law, named after the linguist George Kingsley Zipf, describes the inverse relationship between the frequency of a word and its rank in a large corpus. The law states that the frequency of any word is inversely proportional to its rank 16.
Mathematically, this relationship can be expressed as:
$$f \propto \frac{1}{r}$$
Or, more precisely, as an equality involving a constant $C$: