












































































Complete notes for the course Computational Linguistics 2 - Passarotti
Type: Notes













































































Word sense disambiguation (WSD) is one of the hardest problems for computational linguists: every word can have anywhere from 1 to n senses, so each word has to be handled from scratch. Human language is not artificial, so it’s ambiguous: many words can be interpreted in multiple ways depending on the context in which they occur. This makes machine translation (MT) difficult, because we cannot translate simply by picking, for each word in language A, the corresponding word in language B; the senses of the two words rarely line up one to one.
words in a text). "Computationally" means that a machine performing some computation chooses the right sense. WSD is therefore a token-based task: we understand the meaning of a word from its context, so what we always need is context.
WHY IS WSD DIFFICULT FOR A MACHINE?
There are different formalizations of meaning, which correspond to some fundamental questions about semantics, such as:
of a finite set of senses to rule-based generation of new senses): we can represent the meaning in very different ways
grained definitions
very general or specific; in domain-oriented texts technical words tend to be used in their technical meaning
target words or on all the words of the text
To be performed, WSD relies heavily on knowledge. Today, LLMs encode such knowledge. The skeletal procedure of any WSD system is knowledge-based and can be summarized as follows: given a set of words (e.g. a sentence), a technique is applied that makes use of one or more sources of knowledge to associate the most appropriate senses with the words in context. Today the most important sources of knowledge available are LLMs. Sources of knowledge vary, from corpora to more structured resources such as machine-readable dictionaries and semantic networks (like WordNet). The manual creation of knowledge resources is expensive and time-consuming. This is a fundamental problem
knowledge pass through it. For years WSD was too difficult, but unsupervised learning now makes it possible to work on it.
WSD - IMPACT
Text disambiguation can potentially provide a major breakthrough in the treatment of large amounts of data, thus constituting a fundamental contribution to the
which information is given well-defined meaning, better enabling computers and people to work in cooperation” (see Linguistic Linked Open Data: the skeleton of the Semantic Web). Machine translation is the main application affected by WSD, but so is the ability to process large amounts of data semantically in order to extract knowledge from them.
WSD - THE TASK
WSD can be viewed as a classification task (assigning a class to something). In NLP there are 2 big categories of models and tasks:
the same for all words
depending on the word to be classified: each word has its own senses, which is much more difficult
lexicon. Each token in a text is a new classification task.
WSD - 2 VARIANTS OF THE TASK
associate senses with words. They’re therefore resources, in particular semantic lexical resources, meaning that they provide information about the sense (the meaning) of words.
STRUCTURED RESOURCES (= for each lexical entry, in different ways, we have the senses related to it):
words, like synonymy and antonymy. In English the most used thesaurus in the field
boats. On one side there is the context where a word occurs, and on the other side the lexical features
LLMs are able to pay more attention to some words than to others in order to detect the right meaning.
where words are close to each other if they share some semantics, i.e. something in common from the point of view of meaning
knowledge for NLP since the 1980s, when the first dictionaries were made available in electronic format
usually including a taxonomy and a set of semantic relations. They’re a kind of subset of thesauri: in thesauri we have relations between words, while in ontologies we have concepts, which can be lexicalized as in WordNet, where a concept is lexicalized in the sense that there are words that share the same concept. In this respect, WordNet can be considered an ontology.
Ontology = shared vocabulary
encyclopedic terms obtained by semi-automatically mapping various resources, such as WordNet, multilingual versions of WordNet and Wikipedia, among others. BabelNet is structured as a semantic network where nodes are multilingual synsets, i.e. groups of synonyms lexicalized in several languages, and edges are semantic relations between them. BabelNet is much bigger than WordNet.
UNSTRUCTURED RESOURCES:
subset of the Brown Corpus whose content words have been manually annotated with PoS tags, lemmas, and word senses from the WordNet inventory. It’s the largest and most used sense-tagged corpus.
fine-tune that model to perform one specific task. Now we don’t use annotated corpora to train a model from scratch anymore, but just to fine-tune a foundation model that was pre-trained on fundamental knowledge, to make that model able to perform WSD. GPT: G stands for generative, P for pre-trained and T for transformer
others. Collocations are keywords in context. Words tend to occur with other words, and words that share some semantic features, i.e. that have something in common from the point of view of meaning, can appear together more frequently
The third element of WSD is the representation of the context. We have a word and we want to use the context to disambiguate its sense. A text is an unstructured source of information, since it’s a sequence of tokens. To make it a suitable input for an automatic method, it’s usually transformed into a structured format. To this end, the input text is usually pre-processed, which typically (but not necessarily) includes tokenization, lemmatization, PoS tagging, chunking (= syntactic parsing limited to phrases) and syntactic parsing. As a result of pre-processing a portion of text, each token can be represented as a vector of features of different kinds, or in more structured ways, for instance as a tree or graph of the relations between words. Vectors are lists of things; features are properties. We assign a number of features to each word, and we use these feature vectors to group together words that share some features.
SET OF FEATURES TO REPRESENT THE CONTEXT:
a small number of words surrounding the target word, including PoS tags, word forms and positions with respect to the target word. They’re the features of the words to the left and right of the target word: to disambiguate the sense of a word in context, we look to the left and to the right, and we pay more attention to some words than to others. A machine is able to do this using attention
representing more general contexts, usually as bags of words. They don’t focus on single target words but on a larger context, for example a single chapter or section of a text. All the words are put together in a bag of words, an unstructured collection of data. We assign topical features to these larger contexts, i.e. keywords about the topic (for the first chapter of Promessi Sposi it would be something like wedding).
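The two kinds of features described above can be sketched roughly as follows. This is only an illustration: the function names, the window size, the example sentence, its PoS tags and the stopword list are all assumptions made for the sketch, not part of the course material.

```python
def local_features(tokens, pos_tags, target_index, window=2):
    """Word forms and PoS tags of the words around the target, keyed by position."""
    features = {}
    for offset in range(-window, window + 1):
        i = target_index + offset
        if offset == 0 or i < 0 or i >= len(tokens):
            continue  # skip the target itself and positions outside the text
        features[f"word[{offset:+d}]"] = tokens[i].lower()
        features[f"pos[{offset:+d}]"] = pos_tags[i]
    return features


def topical_features(tokens, stopwords):
    """Bag of words: the unordered set of content words of a larger context."""
    return {t.lower() for t in tokens if t.lower() not in stopwords}


# Hypothetical example sentence; the target word is "bank" (index 4).
tokens = ["He", "sat", "on", "the", "bank", "of", "the", "river"]
pos_tags = ["PRON", "VERB", "ADP", "DET", "NOUN", "ADP", "DET", "NOUN"]

local = local_features(tokens, pos_tags, target_index=4, window=2)
topic = topical_features(tokens, stopwords={"he", "on", "the", "of"})
```

The local features keep position information (`word[-1]`, `pos[+1]`), while the bag of words deliberately throws order away.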
and unlabeled data (raw data) are employed in different proportions to learn a classifier
There is a difference between token-based and type-based approaches:
meaning, a sense, to each occurrence of a word depending on the context in which it appears
consensually referred with the same sense within a single text; so here we assume that a word tends to keep the same sense in a text.
the analysis of the entire text and possibly assign it to each occurrence within the text. It doesn’t focus on the single occurrences of a word in a text; it tries to establish the predominant sense of that word and assigns that same sense to all its occurrences. Computationally and algorithmically speaking, the type-based approach is the easier of the two.
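A minimal sketch of the type-based idea, assuming that a token-based pass has already proposed a sense for every occurrence. The function name, the sense labels and the tiny input are hypothetical.

```python
from collections import Counter


def type_based(token_senses):
    """token_senses: list of (word, sense) pairs, one per occurrence.

    Returns the same occurrences, each relabelled with the predominant
    (most frequent) sense of its word in the text.
    """
    counts = {}
    for word, sense in token_senses:
        counts.setdefault(word, Counter())[sense] += 1
    predominant = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return [(word, predominant[word]) for word, _ in token_senses]


# Hypothetical token-based output: "bank" is tagged "river" once.
tagged = [("bank", "finance"), ("bank", "finance"),
          ("bank", "river"), ("plant", "factory")]
result = type_based(tagged)
```

After this step every occurrence of "bank" carries the same sense, which is exactly the assumption that a word tends to keep one sense within a text.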
(Valid for all NLP) Supervised WSD uses machine-learning techniques to induce a classifier (a trained model) from manually sense-annotated datasets. Usually, the classifier (often called
order to assign the appropriate sense to each instance of that word. Supervised approaches and models do what we tell them to do. Supervised learning is a process based on labeled data, such as a treebank.
SUPERVISED DISAMBIGUATION - DECISION LISTS
A decision list (Rivest 1987) is an ordered set of rules for categorizing test instances (in the case of WSD, for assigning the appropriate sense to each target word both in
The difference from the typical rule-based approach, where rules are written by humans, is that here the rules are induced by a machine-learning process. As a result, we have rules of the form (feature-value, sense, score). Feature-value means that a specific feature is assigned a value. Sense means that, given that feature value, this is the sense for the word. A large part of how LLMs work also deals with feature vectors, because words are transformed into feature vectors: we assign properties to tokens. The ordering of these rules by decreasing score constitutes the decision list.
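A decision list of this kind could be represented and applied roughly as below. The rules, their scores and the default sense are invented for illustration; in a real system they would be induced from a sense-annotated training set.

```python
# Hypothetical rules for the target word "bank":
# (feature, value, sense, score)
RULES = [
    ("word[-1]", "river", "bank/river", 4.2),
    ("word[+1]", "account", "bank/finance", 5.1),
    ("word[+2]", "of", "bank/river", 0.7),
]

# The decision list is the set of rules ordered by decreasing score.
DECISION_LIST = sorted(RULES, key=lambda rule: rule[3], reverse=True)


def classify(features, decision_list, default="bank/finance"):
    """Return the sense of the highest-scoring rule that matches the features.

    `default` is an assumption of this sketch: the sense used when no rule fires.
    """
    for feature, value, sense, _score in decision_list:
        if features.get(feature) == value:
            return sense
    return default
```

For example, `classify({"word[+1]": "account", "word[+2]": "of"}, DECISION_LIST)` matches two rules, and the one with the higher score decides.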
is checked, and the feature with the highest score that matches the input vector selects the word sense to be assigned. Example:
the lowest, one after the other
finance. The machine-learning process learns the rules, the regularities. So, it’s pretty regular
sense is bank/finance. Here the machine is using the context: it has learned that, given some context, we can predict a sense for that word. This is what happens in LLMs and AI. Here the context is very small, just two to four words on the left and on the right; LLMs make use of thousands of words of context. An LLM is not based on a labeled training set but on big data.
SUPERVISED DISAMBIGUATION - DECISION TREES
The PoS tagger based on decision trees is TreeTagger. A decision tree is a predictive model that represents classification rules with a tree structure that recursively partitions the training dataset. Each internal node of a decision tree checks a feature value: it’s a binary test with only 2 possible answers, yes or no. Each branch represents an outcome of the test, either yes or no.
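The structure of such a tree can be sketched as nested yes/no tests. The tree below is written by hand purely as an illustration (the tested words and senses are assumptions); a real decision tree would be induced automatically from the labeled training set.

```python
# Internal node: a dict with a binary test and two branches; leaf: a sense label.
TREE = {
    "test": "account",           # does "account" occur in the context?
    "yes": "bank/finance",
    "no": {
        "test": "river",         # does "river" occur in the context?
        "yes": "bank/river",
        "no": "bank/finance",
    },
}


def predict(node, context_words):
    """Follow yes/no branches until a leaf (a sense label) is reached."""
    while isinstance(node, dict):
        branch = "yes" if node["test"] in context_words else "no"
        node = node[branch]
    return node
```

Calling `predict(TREE, {"walk", "river"})` answers "no" to the first test and "yes" to the second before reaching a leaf.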
The decision tree decides from the training set, since the machine-learning approach here understands that there is a regularity by which in most cases when there is the
The machine decides that the sense is finance because the data are labeled: we’re in supervised disambiguation. A prediction is made when a terminal node, i.e. a leaf of the tree, is reached. All these questions and answers follow from the fact that, given some context (i.e. some feature values), the machine-learning approach is able to predict the senses that are assigned to the target word in the training set. It’s exactly the same as PoS tagging: here we assign word senses, in PoS tagging we assign PoS tags.
SUPERVISED DISAMBIGUATION - NAIVE BAYES
A Naive Bayes classifier is a simple probabilistic classifier based on the application of Bayes’ theorem. It relies on conditional probability, i.e. the probability that something happens under some conditions, which here are the contexts: given some context, we assign a probability to an event. The sense is predicted from a number of features.
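As a rough sketch, a Naive Bayes sense classifier picks the sense s that maximizes P(s) multiplied by the product of P(f | s) over the context features f. The version below uses context words as features and add-one smoothing; the smoothing, the function names and the three training examples are assumptions of this sketch, not something stated in the notes.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (context_words, sense) pairs from a labeled corpus."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, sense in examples:
        sense_counts[sense] += 1
        for w in words:
            word_counts[sense][w] += 1
            vocab.add(w)
    return sense_counts, word_counts, vocab


def predict(model, words):
    """Return the sense with the highest log P(sense) + sum of log P(word | sense)."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, n in sense_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[sense].values()) + len(vocab)  # add-one smoothing
        for w in words:
            score += math.log((word_counts[sense][w] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Hypothetical sense-annotated contexts of "bank".
model = train([
    (["money", "account", "loan"], "finance"),
    (["deposit", "money"], "finance"),
    (["river", "water", "shore"], "river"),
])
```

Working with logarithms avoids multiplying many small probabilities together.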
is greater than the output of any other unit for every training example, so that given the possible senses, one has a greater weight than the others. Usually in simple networks, one layer is enough
accumulation of evidence in favor of or against a sense choice
not probably the river sense, not the supply sense. So there are also negative weights
There are 2 kinds of neural networks:
1) Feedforward fully connected
one input and one output layer, with just one hidden layer in the middle. Here we’re in the supervised approach, which means that the senses we select are those assigned by labels in the training set, while in the unsupervised approach the senses are induced from the data. We don’t know what happens in the hidden layer; we’re not able to fully predict what happens there. This is the challenge: it’s the place where things that aren’t fully deterministic happen. Here the weights are assigned to the process that leads from a collection of features to a sense.
SUPERVISED DISAMBIGUATION - EXEMPLAR-BASED OR INSTANCE-BASED LEARNING
Exemplar-based (or instance-based, or memory-based) learning is a supervised algorithm in which the classification model is built from examples. The model retains examples in memory as points in the feature space and, as new examples are classified, they’re progressively added to the model. In the feature space, points that share the same features in the training set are placed close together.
k-Nearest Neighbor (kNN) algorithm: one of the highest-performing methods in WSD. In kNN classification we have a number of examples that are grouped by their features, and a new example arrives. We place this example close to the examples that share the most features with it and assign it their sense. We put similar things together.
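A minimal kNN sketch, assuming that each stored example is a set of context words and that similarity is simply the number of shared features (a real system would typically use a distance over feature vectors). The training examples and the value of k are hypothetical.

```python
from collections import Counter


def knn_sense(new_features, training, k=3):
    """training: list of (feature_set, sense) pairs kept in memory.

    The k stored examples sharing the most features with the new instance
    vote, and the majority sense is returned.
    """
    ranked = sorted(training,
                    key=lambda example: len(new_features & example[0]),
                    reverse=True)
    votes = Counter(sense for _, sense in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical stored examples for the target word "bank".
TRAINING = [
    ({"money", "account"}, "finance"),
    ({"loan", "money"}, "finance"),
    ({"river", "water"}, "river"),
    ({"shore", "water"}, "river"),
    ({"deposit", "account"}, "finance"),
]
```

There is no separate training step: the "model" is just the stored examples, which is why this family is also called memory-based learning.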
Example of how a new instance relates to its k nearest neighbors:
the same sense are enclosed in polygons, black dots are the k nearest neighbors of the new instance, and all other instances are drawn in gray. The new instance is assigned to the bottom class with five black dots, because the new instance arrives with its features and is classified with those examples that share the highest number of features with it. (Embeddings are like this: they’re collections of words represented as features in a vector space, and words that share features are put close to each other)
SUPERVISED DISAMBIGUATION - ENSEMBLE METHODS
Ensemble methods combine learning algorithms of different kinds, and therefore with significantly different characteristics. Different learning methods give us different views of the training data. Combining them typically yields an accuracy higher than that reached by a single method. Ensemble methods are becoming more and more popular because they make it possible to overcome the weaknesses of single supervised approaches.
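One simple way to combine classifiers is majority voting, sketched below. Voting is only one possible combination strategy, and the three toy classifiers are hypothetical stand-ins for models of different kinds (e.g. a decision list, a Naive Bayes model and a kNN model).

```python
from collections import Counter


def majority_vote(classifiers, context):
    """Each classifier proposes a sense; the most voted sense wins."""
    votes = Counter(classify(context) for classify in classifiers)
    return votes.most_common(1)[0][0]


# Toy classifiers, each looking at the context in a different way.
def clf_money(context):
    return "finance" if "money" in context else "river"


def clf_water(context):
    return "river" if "water" in context else "finance"


def clf_account(context):
    return "finance" if "account" in context else "river"


ENSEMBLE = [clf_money, clf_water, clf_account]
```

With the context `{"money", "loan"}` the third classifier votes "river" but is outvoted by the other two, which illustrates how an ensemble can compensate for the weakness of a single model.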
The boundary between supervised and unsupervised disambiguation is not always clear-cut. Minimally supervised or semisupervised methods make use at the same time of parts of fully supervised processes (annotated data) and parts of unsupervised processes (unannotated data).
MINIMALLY AND SEMISUPERVISED WSD - BOOTSTRAPPING
The typical process in minimally supervised and semisupervised WSD is called bootstrapping. Starting from a set of annotated data, its aim is to train a sense classifier (a trained model that performs WSD automatically) using only a little training data, and thus to overcome the 2 main problems of supervision: the lack of annotated data and the data sparsity problem (present whenever the annotated datasets are limited and not sufficiently representative for a specific NLP task). We start from a small amount of annotated data (each word is assigned a sense label; this is the supervised part) and train a classifier on it, obtaining our trained model. We apply the trained model to the unannotated data (raw data). The previously unannotated corpus is now annotated, since it was annotated by the trained model. Then we have to evaluate the output, because we must know its quality. If the quality is around 80% or above, that is acceptable: we keep these annotated data and retrain another model on the enlarged annotated set, which has lower quality but larger size, and the process goes on iteratively (= one round after the other) for as long as the quality threshold is met. At the end we have a trained model whose quality is not very high but is good enough to let us go on with the process.
MINIMALLY AND SEMISUPERVISED WSD - YAROWSKY’S BOOTSTRAPPING
Yarowsky’s bootstrapping method relies on 2 heuristics:
So clusters are collections of data that share something, meaning that words that share a similar distribution in the training set have something in common. Words that tend to cooccur, or to occur with similar words, tend to be clustered together. Clusters are not rigid: they change, since languages change and reflect the world around us. !!! The unsupervised method doesn’t label the clusters, i.e. it doesn’t give them a name. And it doesn’t make use of any annotated data; it just clusters objects.
means something to us. Unsupervised methods don’t rely on labeled training text and, in the poorest version, they don’t make use of any machine-readable resources like dictionaries, thesauri or ontologies. They just make use of the context. The main disadvantage of fully unsupervised systems is that, as they don’t exploit any dictionary, they cannot rely on a shared reference inventory of senses. So there are processes that are able to label the clusters, to name them. This is a disadvantage but also an added value of the method, because dictionaries are not always up to date: when a word enters a dictionary, it’s new for the dictionary but not for people, who already use it, since dictionaries reflect the use of words. Sometimes the clusters that a machine finds automatically for the different senses of a word in an unsupervised approach differ in number and quality from the senses listed for that word in a sense inventory. Unsupervised WSD approaches have the aim of identifying sense clusters without
Consequently, these methods may not discover clusters equivalent to the traditional senses in a dictionary sense inventory. For this reason, their evaluation is usually more difficult: to assess the quality of a sense cluster we should ask a human to look at the words in the different clusters and determine the nature of the relations that they share. Usually what we cluster are not occurrences of just one word, but occurrences of many words (embeddings). So in a cluster we will have different words and different kinds of relationships between words, such as antonymy, hypernymy, hyponymy, toponymy. Alternatively, to evaluate the quality of the clusters we can employ them in end-to-end applications, thus measuring the quality of the former based on the performance of the application.
translation system; if the machine-translation system works better using the clusters, the quality of the clusters is probably good. Extrinsic means that we’re not evaluating the resource, the cluster itself, directly; we’re evaluating its quality by using it inside something else, for instance a tool for machine learning.
Feature vector = list of features assigned to an object
Here each occurrence of a target word is represented by a context vector, which is a collection of context features. The vectors are then clustered into groups, each identifying a sense of the target word.
dimensions are words. We want to put these vectors in a space where words that are close to each other are those whose vectors are similar. We assign a point in the space to each word, and this point is given by coordinates, which are the values of its context feature vector. Synonyms have very similar context vectors because they tend to occur with the same words in the training corpus, so the coordinates they are assigned are very similar. The dimensions of the space are words. We know the meaning of a word by the way it behaves contextually (distribution).
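Closeness between two such vectors is often measured with cosine similarity. The notes do not say which measure is used, so cosine is an assumption here. The sketch uses the coordinates that these notes give for restaurant and money; the vector for "cafe" is invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Two-dimensional cooccurrence counts; "cafe" is a hypothetical extra word.
VECTORS = {
    "restaurant": [210, 80],
    "money": [100, 250],
    "cafe": [190, 60],
}

sim_restaurant_cafe = cosine(VECTORS["restaurant"], VECTORS["cafe"])
sim_restaurant_money = cosine(VECTORS["restaurant"], VECTORS["money"])
```

Words with similar distributions point in nearly the same direction, so their cosine is close to 1; here restaurant comes out much closer to cafe than to money.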
vectors built by using just 2 pieces of information:
Restaurant’s coordinates are 210, 80, while money’s coordinates are 100, 250, where the first dimension (210 and 100) represents the count of cooccurrences of the 2
The edges are the word vectors. The aim is to cluster context vectors, so in the end we’re not clustering words but the context vectors of words. We’re transforming our words into vectors of features, into another kind of representation which is basically
mathematical representations of words which allow the machine to understand that a word is more similar to one word than to another). Now we have context vectors that represent the context of specific occurrences of a word. Training such an unsupervised model requires a lot of data and computational power; without them we couldn’t do it, because the vectors are made of thousands and thousands of numbers. A context vector is built as the centroid (i.e. the normalized average) of the vectors of the words occurring in the target context, which can be seen as an approximation of its semantic context.
It’s calculated as the centroid, or the sum, of the vectors of the words occurring in the same context; it’s an approximation of its semantic context. The centroid is the point in the space whose coordinates are the average of the coordinates of the other points (3 in the example), so it lies in the middle of them.
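The centroid computation can be sketched as a coordinate-wise average. The vectors for restaurant and money are the ones quoted in these notes; the third word and its vector are hypothetical, added only to have three points.

```python
def centroid(vectors):
    """Coordinate-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


WORD_VECTORS = {
    "restaurant": [210, 80],
    "money": [100, 250],
    "waiter": [170, 30],   # hypothetical vector
}

# Context vector of one occurrence whose context contains these three words.
context = ["restaurant", "money", "waiter"]
context_vector = centroid([WORD_VECTORS[w] for w in context])
```

Each coordinate of the result is the mean of the corresponding coordinates of the three word vectors: (210 + 100 + 170) / 3 = 160 and (80 + 250 + 30) / 3 = 120.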
unsupervised approach, which is the sense inventory. The task of the unsupervised approach is to put together words that share something, a common context. Sense labels are used in supervised, semisupervised and knowledge-based WSD. Knowledge-based WSD methods usually have lower performance/precision than their supervised alternatives, because they aren’t trained as in the supervised approach. However, they have the advantage of wider coverage, thanks to the use of large-scale knowledge resources, which cover the entire lexicon.
K-BASED WSD - OVERLAP OF SENSE DEFINITIONS
This is a very simple method that makes use of knowledge. Here we have just the knowledge source, which tells us something about the meaning of the words, and then we have the texts. Our task is to assign to each occurrence of a word in a text one of the senses given by the knowledge resource.
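The overlap idea can be sketched as follows. This is a simplified variant that compares each sense definition with the words of the context, rather than comparing the definitions of two target words with each other; the two definitions and the stopword list are invented for illustration and do not come from any real dictionary.

```python
STOPWORDS = frozenset({"a", "an", "the", "of", "and", "or", "that"})

# Hypothetical sense definitions for "bank".
GLOSSES = {
    "bank/finance": "an institution that accepts deposits of money and makes loans",
    "bank/river": "the sloping land beside a river or other body of water",
}


def overlap_sense(context_words, glosses, stopwords=STOPWORDS):
    """Pick the sense whose definition shares the most words with the context."""
    context = {w.lower() for w in context_words} - stopwords
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses.items():
        gloss_words = set(gloss.lower().split()) - stopwords
        overlap = len(context & gloss_words)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense
```

The comparison is on exact word forms, so "loan" in the context would not match "loans" in the definition. This shows how sensitive the approach is to the precise wording of the definitions.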
after its author (Lesk 1986), it’s very rigid and it calculates the word overlap between the sense definitions of 2 or more target words. It’s very old-fashioned, but sometimes it works. Given a two-word context (w1, w2), the senses of the target words whose definitions have the highest overlap (i.e. words in common with the context words) are assumed to be the correct ones. This method achieves 50-70% accuracy. It’s very sensitive to the exact wording of definitions, so the absence of a certain word can radically change the results. This method is very rigid, whereas the NLP tools that work better are those that are dynamic.
K-BASED WSD - SELECTIONAL PREFERENCES
(They’re related to VerbNet) Selectional preferences or restrictions are constraints on the semantic type that a word sense imposes on the words with which it combines in sentences. Words tend to select some other words, which they prefer, and to exclude from the selection other words that they don’t like:
requirements
The easiest way to learn selectional preferences is to determine the semantic appropriateness of the association provided by a word-to-word relation. So we want to know which words are preferentially selected by another word. To do this we use frequency counts: if we have a big corpus, we take all the occurrences of the target word and we see that some words tend to cooccur with the target word more than others; those words tend to be preferred by the target word in the selectional process.
WORD SENSE DOMINANCE
Given a word, the frequency distribution of its senses can be highly skewed in texts, which affects the performance of WSD. Methods for determining word sense dominance perform type-based disambiguation based on this observation. Among the senses, there’s one that dominates the others. This is an unsupervised method for automatically ranking the senses of ambiguous words from raw texts: we take a word and its neighbors, and the neighbors correspond to different senses of the word. Our aim is to rank the senses of the word; an algorithm distinguishes its different senses.
EVALUATION - BASELINES
A baseline is the result of applying the most basic method, so its result is expected to be the lowest; it’s a standard method against which the performance of different approaches is compared.
text made of a sequence of words, consists in assigning random senses from those available for each word. The problem is that many words are polysemous.
a type-based WSD approach, which means that we assign a sense to a type and not to a token. It consists in choosing, for each word in a corpus, the first sense according to such a ranking, independently of its context. This is applied by using the frequencies of the senses acquired from a training set.
EVALUATION - LOWER AND UPPER BOUNDS
Lower and upper bounds are performance figures that indicate the range within which the performance of any system should fall.
method and which any system should be able to exceed. A typical lower bound is the random baseline
a typical upper bound is the inter-annotator agreement or inter-tagger agreement (ITA), that is, the percentage of words tagged with the same sense by 2 or more human annotators
choice among those available
methods
the systems output the correct sense
maximum hypothetical performance of any combination method aiming at improving the results of the single systems
output anything because it always knows the correct answer
EVALUATION - THE SENSEVAL/SEMEVAL COMPETITION
Senseval, now renamed SemEval, is an international WSD competition, first held in 1998 and originally run every 3 years, whose objective is to perform a comparative evaluation of WSD systems in several kinds of tasks, including all-words and lexical sample WSD for different languages. This competition usually has 2 runs: one is called open and the other one is called closed:
one provided by the organizers
The systems submitted for evaluation to these competitions usually integrate different techniques and often combine supervised and knowledge-based methods.
To move from explicit (words) to implicit representations and to build an LLM, we have to represent linguistic knowledge statistically, which is made possible by 3 conditions:
understand anything; they’re the first way to represent content in a form readable by the machine. Whenever we press a key on the laptop keyboard, behind it there’s an encoding of the letter. Extended ASCII needs 8 bits to store each character. A 4-character word is represented as a sequence of 32 0s and 1s: we have 4 letters, each letter is a byte, and each byte corresponds to 8 bits, so 8 x 4 = 32. A 5-character word gets a 40-bit representation. In this case the size of the representation depends on the length of the word: the longer the word, the longer the string of bits that represents it.
REPRESENTING WORDS - ONE-HOT REPRESENTATION
To solve this issue, the so-called one-hot representation was introduced. It’s a way to represent words directly rather than characters. We’re moving from an explicit representation of a word, which depends on the length of the word, to an implicit representation, which always has the same length regardless of the number of characters in the word. This representation maps words to distinct fixed-size patterns of 0s and 1s. So we have a representation with as many positions as there are distinct words in our text. All the positions are set to 0 except one, set to 1, which is the position corresponding to the index of the word in the text. So the first word in the text will be represented by 1 0 0 0 0 0 0 0 …, the second word by 0 1 0 0 0 0 0 0 … The problem is that this is not really economical. However, it’s important to know it, since it’s a way to represent different words with the same length, a fixed-size representation, which allows us to compare words. Such representations make it possible to represent words as vectors, which are sequences of properties. This representation is important because:
The compatibility of vector-based representations with conventional and modern machine learning and deep learning has significantly helped the model to prevail. Today LLMs and AI are based on vector-based representations of words. Words are positions in a vector space that depend on their distribution in the training corpus. Word2vec is among the first widely used tools introduced to build embeddings. Since then, the term “embedding” has almost replaced “representation” and has dominated the field of lexical semantics.
REPRESENTING WORDS - QUESTIONS
We have thousands and thousands of words and we want to build embeddings for all of them. We do it using machine learning, with unsupervised methods that exploit large text corpora. How can word cooccurrences denote semantic similarity? The foundation of automatically constructing a word vector space model (VSM) is the distributional hypothesis (Firth, 1957), according to which words that appear in similar contexts tend to have similar meanings.
Embedding visualizations:
embeddings, in which each word is assigned a point in a 2D or 3D space, and words are close to each other if they are distributed similarly in the training set
to each other, the machines know it just by looking at data in an unsupervised fashion
it’s, the closer the words are and the smaller the distance between the words and the target word
WSD immediately exposes the limitation of word embeddings: polysemous words should appear in different places in the vector space, i.e. be assigned more than one point, because they keep different company according to their senses.
REPRESENTING WORDS - UN/SELFSUPERVISED AND PREDICTIVE
Word representation learning is usually framed as an unsupervised or self-supervised procedure, in that it doesn’t require any manual annotation of the training data. The most advanced deep learning methods that work in an unsupervised fashion are able to supervise themselves and to check whether they’re doing right or wrong. It works like this: given a sequence of words, the task of the system is to generate another sequence of words; the system learns in an unsupervised fashion, tries to replicate the sequences of words seen in the training phase, and checks whether it’s able to predict exactly the same sequences of words found in the training set; basically, the system is questioning itself.
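The self-supervised setup can be illustrated by showing how raw text supplies its own "labels": each word becomes the answer to be predicted from the words before it. The function names, the context size and the count-based predictor below are assumptions of this sketch; real systems learn a neural model rather than a lookup table.

```python
from collections import Counter, defaultdict


def next_word_pairs(tokens, context_size=2):
    """Turn raw text into (context, next word) training pairs, with no manual annotation."""
    return [(tuple(tokens[i - context_size:i]), tokens[i])
            for i in range(context_size, len(tokens))]


def train(pairs):
    """Count which word follows each context (a toy stand-in for a real model)."""
    table = defaultdict(Counter)
    for context, target in pairs:
        table[context][target] += 1
    return table


def self_check(table, pairs):
    """Share of pairs for which the model reproduces the word found in the text."""
    correct = sum(1 for context, target in pairs
                  if table[context].most_common(1)[0][0] == target)
    return correct / len(pairs)


tokens = "the cat sat on the mat".split()
pairs = next_word_pairs(tokens)
table = train(pairs)
```

The check in `self_check` compares the model's predictions with the text itself, which is the sense in which the system supervises itself.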