Computational Linguistics 2 - Passarotti, English Language Notes

Complete notes for the course Computational Linguistics 2 - Passarotti

Type: Notes

2023/2024

Lesson 1 - 24/02
WORD SENSE DISAMBIGUATION (WSD)
WSD is one of the biggest enemies of computational linguists: it’s difficult because every word can have from 1 to n senses, so every time we have to start from scratch.
The problem is that human language is not artificial, so it’s ambiguous, meaning that many words can be interpreted in multiple ways depending on the context in which they occur.
This makes the task of machine translation (MT) difficult, because it’s not possible to translate from one language to another by just choosing the corresponding word from language A to language B, since there are multiple differences in meaning.
WSD is the ability to computationally determine which sense of a word is activated by its use in a particular context (in its single tokens, i.e. the occurrences of words in a text).
Computationally means making it possible for a machine doing some computation to choose the right sense.
So, WSD is a token-based task because we understand the meaning of a word using the context, therefore what we always need is context.
WHY IS WSD DIFFICULT FOR A MACHINE?
There are different formalizations of meaning, which stand for some fundamental questions about semantics like:
• the approach to the representation of a word sense (ranging from the enumeration of a finite set of senses to rule-based generation of new senses): we can represent the meaning in very different ways
• the granularity of sense inventories, which can go from very fine-grained to coarse-grained definitions
• the domain-oriented VS unrestricted nature of texts: the text we deal with can be very general or specific; in domain-oriented texts technical words tend to be used in their technical meaning
• the set of target words to disambiguate: WSD is focused either on a selected set of target words or on all the words of the text
In order to be performed, WSD relies a lot on knowledge. Today LLMs in AI have knowledge.
The skeletal procedure of any WSD system is based on knowledge and can be summarized like this: given a set of words (e.g. a sentence), a technique is applied that makes use of one or more sources of knowledge to associate the most appropriate senses with words in context.
Today the most important sources of knowledge available are LLMs. Sources of knowledge can vary: from corpora to more structured resources, such as machine-readable dictionaries and semantic networks (like WordNet). The manual creation of knowledge resources is expensive and time consuming. This is a fundamental problem which pervades the field of WSD and is called the knowledge acquisition bottleneck: there’s a very narrow bottleneck, and we need to make knowledge pass through it.
WSD was too difficult for years, but now, thanks to unsupervised learning, we have the possibility to work on it.
WSD - IMPACT
Text disambiguation can potentially provide a major breakthrough in the treatment
of large-scale amounts of data, thus constituting a fundamental contribution to the
realization of the so-called Semantic Web, “an extension of the current Web, in which information is given well-defined meaning, better enabling computers and people to work in cooperation” (see Linguistic Linked Open Data: the skeleton of the Semantic Web).

Machine translation is the main area of impact of WSD, but so is the ability to semantically process large amounts of data to extract knowledge from them.

WSD - THE TASK

WSD can be viewed as a classification task (assign a class to something). In NLP there are 2 big categories of models and tasks:

  1. On one side we have the classification/discrimination task: it assigns a class to the objects that must be processed. These models are also called discriminative because the task is to discriminate among the labels for the different classes
  2. On the other side we have the generative task: it generates new instances

Word senses are the classes, and an automatic classification/discrimination method is used to assign each token to one or more classes based on the evidence from the context and from external knowledge sources.

PoS tagging and named entity recognition (= the classification of target textual items into predefined categories: classifying a proper noun as the name of a person, of a country, …) are also classification tasks. An important difference between these tasks and WSD is that:

- the former use a single predefined set of classes (PoS, categories, …) that remains the same for all words

- whereas in WSD the set of classes, which are the senses, typically changes depending on the word to be classified: each word has its own senses, which is much more difficult

In this respect, WSD comprises n distinct classification tasks, where n is the size of the lexicon. Each token in a text is a new classification task.

WSD - 2 VARIANTS OF THE TASK

  1. Lexical sample (or targeted WSD), where a system is required to disambiguate a restricted set of target words, usually occurring one per sentence. So, there’s a set of sentences, each sentence includes one occurrence of a target word, and the task of WSD is to assign the correct sense to that target word in each of its occurrences. Supervised systems are typically employed in this setting: the training data is a set of disambiguated occurrences of the target word(s)
  2. All-words WSD, where systems are expected to disambiguate all open-class words (content words) in a text. This task requires wide-coverage systems; consequently, purely supervised systems can potentially suffer from the problem of data sparseness, as it’s unlikely that a training set of adequate size is available which covers the full lexicon of the language of interest

THE 4 ELEMENTS OF WSD

  1. Selection of word senses
  2. External knowledge sources
  3. Representation of context
  4. Choice of a classification method

• A word is monosemous when it can convey only one meaning

• A word is polysemous if it can convey more meanings

Lesson 2 - 26/

2) EXTERNAL KNOWLEDGE SOURCES

The second element of WSD is represented by knowledge sources, basically sense inventories, which are collections of linguistic data that provide essential information to associate senses with words. So they’re resources, in particular semantic lexical resources, meaning that they provide information about the sense, the meaning of the words.

STRUCTURED RESOURCES (= for each lexical entry, in different ways, we have senses related to it):

• Thesauri: lexical resources that provide information about relationships between words, like synonymy and antonymy. In English the most used thesaurus in the field of WSD is Roget’s International Thesaurus (1911).

The word “bank” is polysemous: the bank of a river / the bank as financial institution. In a thesaurus we have 2 entries for “bank”: one is connected to its hypernyms, so “bank” as financial institution will be linked to something like institution, while the other entry for “bank” will be linked to something like river or boats.

On one side there is the context where a word occurs, and on the other side the lexical features that we use in order to detect the sense. In order to detect the sense of “bank”, we need the context and some words to which “bank” is connected. Technically speaking this is called attention, which means that, given a sequence of words, LLMs are able to pay more attention to some words than to others in order to detect the right meaning.

This idea of being close is present in embeddings, which are lexical resources where words are close to each other if they share some semantics, something from the point of view of meaning

• Machine-readable dictionaries: they have been a popular source of knowledge for NLP since the 1980s, when the first dictionaries were made available in electronic format

• Ontologies: specifications of conceptualizations of specific domains of interest, usually including a taxonomy and a set of semantic relations. They’re a kind of subset of thesauri: in thesauri we have relations between words, while in ontologies we have concepts, which can be lexicalized like in WordNet, where a concept is lexicalized in the sense that there are some words that share the same concept. In this respect, WordNet can be considered an ontology. Ontology = shared vocabulary

• BabelNet: a multilingual dictionary with coverage of both lexicographic and encyclopedic terms, obtained by semi-automatically mapping various resources, such as WordNet, multilingual versions of WordNet and Wikipedia, among others. BabelNet is structured as a semantic network where nodes are multilingual synsets, i.e. groups of synonyms lexicalized in several languages, and edges are semantic relations between them. BabelNet is much bigger than WordNet

UNSTRUCTURED RESOURCES:

• Corpora:

- raw corpora (unlabeled, meaning not annotated with any WSD label), e.g. the WSJ corpus, Brown

- sense-annotated corpora: SemCor is the main sense-annotated corpus. It is a subset of the Brown corpus whose content words have been manually annotated with PoS tags, lemmas, and word senses from the WordNet inventory. It’s the largest and most used sense-tagged corpus.

Fine-tuning LLMs: we have an LLM trained on fundamental knowledge and then we fine-tune that model to perform one specific task. Now we don’t use annotated corpora to train a model from scratch anymore, but just to fine-tune a foundational model that was pre-trained on fundamental knowledge, to make that model able to perform WSD. GPT: G stands for generative, P for pre-trained and T for transformer

• Collocation resources: they register the tendency for words to occur regularly with others. Collocations are keywords in context. Words tend to occur with other words, and words that share some semantic features, that have something in common from the point of view of meaning, can appear together more frequently

• Sense embeddings

3) REPRESENTATION OF CONTEXT

The third element of WSD is the representation of the context. We have a word and we want to use the context in order to disambiguate the sense of that word.

A text is an unstructured source of information, since it’s a sequence of tokens. To make it a suitable input to an automatic method, it’s usually transformed into a structured format. To this end, a pre-processing of the input text is usually performed, which typically (but not necessarily) includes tokenization, lemmatization, PoS tagging, chunking (= syntactic parsing limited to phrases) and syntactic parsing.

As a result of the pre-processing phase of a portion of text, each token can be represented as a vector of features of different kinds or in more structured ways, for instance as a tree or graph of the relations between words. Vectors are lists of things, features are properties. To each word we assign a number of features. We want to use these vectors of features in order to put words that share some features in a common group.

SET OF FEATURES TO REPRESENT THE CONTEXT:

- local features: they represent the local context of a word usage, that is, features of a small number of words surrounding the target word, including PoS tags, word forms and positions with respect to the target word. They’re the features of the words that surround the target word on the left and on the right; in order to disambiguate the sense of a word in context, we look to the left and to the right, and we pay more attention to some words than to others. A machine is able to do this using attention

- topical features: they define the general topic of a text or discourse, thus representing more general contexts, usually as bags of words. They’re not focused on single target words but on a larger context, for example a single chapter or section of a text. They put all words together in a bag of words, an unstructured collection of data. We assign topical features to this kind of context, which means keywords about the topic (for the first chapter of I Promessi Sposi it would be something like wedding).
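The two kinds of features above can be sketched in a few lines of Python. This is only a minimal illustration: the sentence, the PoS tags, the window size and the stopword list are all assumptions made up for the example, not part of the course material.

```python
from collections import Counter

def local_features(tokens, pos_tags, target_index, window=2):
    """Word forms and PoS tags of the words within `window` positions of the target."""
    features = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # skip the target word itself
        i = target_index + offset
        if 0 <= i < len(tokens):
            features[f"w{offset:+d}"] = tokens[i].lower()
            features[f"pos{offset:+d}"] = pos_tags[i]
    return features

# Hypothetical stopword list, only for this example.
STOPWORDS = frozenset({"i", "an", "a", "the", "with", "of"})

def topical_features(tokens):
    """Bag of words: unordered counts of the content words of a larger context."""
    return Counter(t.lower() for t in tokens if t.lower() not in STOPWORDS)

tokens = ["I", "opened", "an", "account", "with", "the", "bank", "yesterday"]
pos_tags = ["PRON", "VERB", "DET", "NOUN", "ADP", "DET", "NOUN", "ADV"]

local = local_features(tokens, pos_tags, target_index=6)  # target word: "bank"
topical = topical_features(tokens)
print(local)
print(topical)
```

The local features keep the position of each neighbor (for example `w-2` is the word two positions to the left of “bank”), while the bag of words throws the order away and only keeps counts.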

• Semisupervised and weakly (or minimally) supervised: if both sense-labeled and unlabeled data (raw data) are employed in different proportions to learn a classifier

• Fully unsupervised: if only unlabeled data (raw data) are employed

There’s a difference between token-based and type-based approaches:

• Token-based approaches: they’re based on tokens; they associate a specific meaning (sense) with each occurrence of a word depending on the context in which it appears

• Type-based disambiguation: it’s based on the assumption that a word is consensually referred to with the same sense within a single text; so here we assume that a word tends to keep the same sense in a text.

This method tends to infer a sense, called the predominant sense, for a word from the analysis of the entire text and possibly assign it to each occurrence within the text. It doesn’t focus on the single occurrences of a word in a text, but tries to understand what the predominant sense of that word is and assigns the same sense to all the occurrences of that word. Computationally and algorithmically speaking, the type-based approach is the easiest.
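The type-based idea can be sketched as follows. The sketch assumes that some token-level guesses are already available for the occurrences of “bank” in one text (the list below is invented), and it takes the most frequent guess as the predominant sense; this is one simple way to pick it, not necessarily the one used in the course.

```python
from collections import Counter

def predominant_sense(token_level_senses):
    """Return the most frequent sense among the occurrences of a word in one text."""
    return Counter(token_level_senses).most_common(1)[0][0]

# Hypothetical token-level guesses for four occurrences of "bank" in the same text.
occurrences = ["finance", "finance", "river", "finance"]

sense = predominant_sense(occurrences)
# Type-based disambiguation: every occurrence receives the predominant sense.
type_based = [sense] * len(occurrences)
print(type_based)
```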

Lesson 3 - 27/

FULLY (OR STRONGLY) SUPERVISED WSD

(Valid for all NLP) Supervised WSD uses machine-learning techniques to induce a classifier (a trained model) from manually sense-annotated datasets. Usually, the classifier (often called word expert) is concerned with a single word and performs a classification task in order to assign the appropriate sense to each instance of that word. Supervised approaches and models do what we tell them to do. Supervised learning is a process based on labeled data, like a treebank.

SUPERVISED DISAMBIGUATION - DECISION LISTS

A decision list (Rivest 1987) is an ordered set of rules for categorizing test instances (in the case of WSD, for assigning the appropriate sense to each target word, both in the token-based and the type-based approach). It can be seen as a list of weighted “if-then-else” rules.

The difference from the typical rule-based approach, where rules are written by humans, is that here rules are induced in a machine-learning process. As a result, we have rules of the kind feature-value, sense and score. Feature-value means that for specific features we assign a value. Sense means that, given a feature, this is the sense for that word. A big part of LLMs deals with feature vectors, because we transform words into feature vectors. We assign properties to tokens. The ordering of these rules, based on their decreasing score, constitutes the decision list.

Given a word occurrence w and its representation as a feature vector, the decision list is checked, and the feature with the highest score that matches the input vector selects the word sense to be assigned. Example:

• We have feature-value, prediction (or sense) and score

• The rules are ordered by score from the highest to the lowest, one after the other

• These are the features for the target word “bank”
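A decision list of this kind can be sketched in Python as follows. The rules and scores are invented for illustration (in a real system they would be induced from the sense-annotated training data), and the feature names `left_context` and `in_window` are hypothetical.

```python
# Hypothetical rules for the target word "bank"; the scores are invented.
# Each rule is (feature, value, sense, score).
RULES = [
    ("in_window", "river", "river", 3.5),
    ("left_context", "account with", "finance", 4.8),
    ("in_window", "money", "finance", 2.1),
]

# The decision list is the rule set ordered by decreasing score.
DECISION_LIST = sorted(RULES, key=lambda rule: rule[3], reverse=True)

def classify(features, decision_list=DECISION_LIST, default="unknown"):
    """Return the sense of the first (highest-scoring) rule that matches.

    `features` is a set of (feature, value) pairs describing one occurrence.
    """
    for feature, value, sense, _score in decision_list:
        if (feature, value) in features:
            return sense
    return default

occurrence = {("left_context", "account with"), ("in_window", "river")}
# Both a finance rule and a river rule match; the higher-scoring one wins.
print(classify(occurrence))
```

Only the first matching rule decides: the lower-scoring rules are never consulted once a rule has fired, which is what makes the list behave like a chain of “if-then-else” statements.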

First rule in the example: if an occurrence of “bank” enters the system, the system checks among the features whether “bank” is preceded on the left by “account with”; if so, it assigns the sense finance.

The machine-learning process learns the rules, the regularities. So, it’s pretty regular that every time there’s an occurrence of “bank” preceded by “account with”, the sense is bank/finance. Here, the machine is using the context, since it understands that given some context we can predict some sense for that word. This is what happens in LLMs and AI. Here it’s a very small context, just two to four words on the left and on the right; LLMs make use of thousands of words on the left and on the right. An LLM is not based on a labeled training set, but on big data.

SUPERVISED DISAMBIGUATION - DECISION TREES

A PoS tagger based on decision trees is TreeTagger. A decision tree is a predictive model used to represent classification rules with a tree structure that recursively partitions the training dataset. Each internal node of a decision tree checks the feature values. It’s a binary test about the feature values, which results in only 2 possible answers: either yes or no. Each branch represents an outcome of the test, either yes or no.

“Bank account?” means: is “bank” surrounded by a context in which the word “account” appears?

- If yes, just assign finance

- If no, go on

The decision tree decides on the basis of the training set, since the machine-learning approach here understands that there’s a regularity by which, in most cases, when there’s the word “account” in a contextual window close to “bank”, the sense of “bank” is finance.

The machine decides that the sense is finance because the data is labeled: we’re in supervised disambiguation. A prediction is made when a terminal node, i.e. a leaf of the tree, is reached. All these questions and answers come from the fact that, given some context (so some feature-values), the machine-learning approach is able to predict the senses that are assigned to the target word in the training set. It’s exactly the same as PoS tagging: here we assign word senses, in PoS tagging we assign PoS tags.

SUPERVISED DISAMBIGUATION - NAIVE BAYES

A Naive Bayes classifier is a simple probabilistic classifier based on the application of Bayes’ theorem. It uses conditional probability, i.e. the probability that something happens under some conditions, which here are the contexts: given some context, we assign a probability to an event occurring. The sense is produced by a number of features.
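A minimal Naive Bayes sketch for the target word “bank” is shown below. It relies on assumptions that go beyond the notes: the features are just the context words, they are treated as independent given the sense (the “naive” part), add-one smoothing is used to avoid zero probabilities, and the three training examples are invented.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (context_words, sense) pairs from a sense-annotated set."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, sense in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocabulary.update(words)
    return sense_counts, word_counts, vocabulary

def classify(words, model):
    """Pick the sense maximizing log P(sense) + sum of log P(word | sense)."""
    sense_counts, word_counts, vocabulary = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, count in sense_counts.items():
        score = math.log(count / total)  # prior probability of the sense
        denominator = sum(word_counts[sense].values()) + len(vocabulary)
        for word in words:
            # Add-one smoothing: unseen words get a small non-zero probability.
            score += math.log((word_counts[sense][word] + 1) / denominator)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Invented toy training set for "bank".
training = [
    (["account", "money", "deposit"], "finance"),
    (["loan", "money", "interest"], "finance"),
    (["river", "water", "boat"], "river"),
]
model = train(training)
print(classify(["money", "account"], model))
print(classify(["boat", "water"], model))
```

Working with sums of logarithms instead of products of probabilities is a common design choice, since multiplying many small numbers quickly underflows.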

• Neural networks are trained until the output corresponding to the desired response is greater than the output of any other unit for every training example, so that, given the possible senses, one has a greater weight than the others. Usually in simple networks, one layer is enough

• Weights in the network can be either positive or negative, thus enabling the accumulation of evidence in favor of or against a sense choice

• If “bank” is preceded by “account with”, it doesn’t only mean that that occurrence of “bank” has the finance sense, but also that it probably doesn’t have the river sense or the supply sense. So there are also negative weights

There are 2 kinds of neural networks:

  1. Feedforward fully connected
  2. Convolutional networks, where you don’t only go left to right, but you can also come back

This is a very simple network, called a feedforward fully connected network. It’s called like this because feedforward means that we just move from left to right, from one layer to the next, and we never go back; and fully connected means that each node of a layer is connected to all the nodes of the next layer. This is a multilayer perceptron neural network fed with the values of 4 features (w-1, w+1, subj-verb, obj-verb), which give a weight to a number of neurons in the hidden layer. In the output layer, the corresponding values, which is to say the scores, of three senses of the target word in context are assigned.

Perceptron is the simplest kind of feedforward neural network, a network that has one input and one output layer, with just one hidden layer in the middle. Here we’re in the supervised approach, which means that the senses we select here are those that are assigned by the labels in the training set, while in the unsupervised approach the senses are induced from the data. We don’t know what happens in the hidden layer; we’re not able to fully predict what happens there. This is the challenge, the place where things that aren’t fully deterministic happen. Here the weights are assigned to the process that leads from a collection of features to a sense.

SUPERVISED DISAMBIGUATION - EXEMPLAR-BASED OR INSTANCE-BASED LEARNING

Exemplar-based (or instance-based, or memory-based) learning is a supervised algorithm in which the classification model is built from examples. The model retains examples in memory as points in the feature space and, as new examples are subjected to classification, they’re progressively added to the model. In the feature space we put close together the points that share the same features in the training set.

k-Nearest Neighbor (kNN) algorithm: one of the highest-performing methods in WSD. In kNN classification we have a number of examples that are grouped using some features, and a new example arrives. We put this example close to those examples that share the same features with it. We put similar things together.
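A kNN classifier for “bank” can be sketched as follows. The sketch assumes that each occurrence is already represented as a numeric feature vector; the two-dimensional vectors, the Euclidean distance and the value k = 3 are assumptions chosen only to keep the example small.

```python
import math
from collections import Counter

def knn_classify(new_vector, labeled_vectors, k=3):
    """Assign the majority sense among the k stored examples closest to `new_vector`."""
    by_distance = sorted(
        labeled_vectors,
        key=lambda item: math.dist(new_vector, item[0]),  # Euclidean distance
    )
    votes = Counter(sense for _vector, sense in by_distance[:k])
    return votes.most_common(1)[0][0]

# Invented sense-labeled instances of "bank" as points in a 2-dimensional feature space.
memory = [
    ((1.0, 0.0), "finance"),
    ((0.9, 0.2), "finance"),
    ((0.8, 0.1), "finance"),
    ((0.1, 1.0), "river"),
    ((0.0, 0.9), "river"),
]

print(knn_classify((0.7, 0.3), memory))
print(knn_classify((0.2, 0.8), memory))
```

There is no separate training step here: the model is simply the stored examples, which matches the idea that the examples are retained in memory as points in the feature space.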

Example of how a new instance relates to its k nearest neighbors:

The points are instances of “bank”. Instances assigned to the same sense are enclosed in polygons, black dots are the k nearest neighbors of the new instance, and all other instances are drawn in gray. The new instance is assigned to the bottom class with five black dots, because the new instance arrives with its features and is classified close to those examples that share the highest number of features with it. (Embeddings are like this: they’re collections of words represented as features in a vector space, and words that share features are put close to each other.)

SUPERVISED DISAMBIGUATION - ENSEMBLE METHODS

Ensemble methods put together learning algorithms of different nature, and so with significantly different characteristics. If we use different learning methods on the training data, we get different views of it, different ways of looking at the training data. Putting them together results in an accuracy that is probably higher than the one reached with a single method. Ensemble methods are becoming more and more popular, as they make it possible to overcome the weaknesses of single supervised approaches.
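One simple way to combine classifiers is majority voting, sketched below. Majority voting is an assumption here (the notes don’t say which combination strategy is meant), and the three toy classifiers are invented stand-ins for trained models of different nature.

```python
from collections import Counter

def ensemble_classify(context, classifiers):
    """Ask every classifier for a sense and return the most voted one."""
    votes = Counter(classifier(context) for classifier in classifiers)
    return votes.most_common(1)[0][0]

# Three hypothetical classifiers for "bank", each looking at the context differently.
def account_rule(context):
    return "finance" if "account" in context else "river"

def water_rule(context):
    return "river" if "water" in context else "finance"

def money_rule(context):
    return "finance" if "money" in context else "river"

classifiers = [account_rule, water_rule, money_rule]
context = {"account", "money", "water"}
print(ensemble_classify(context, classifiers))
```

In this example one classifier is misled by the word “water”, but the other two outvote it, which illustrates how the ensemble can compensate for the weakness of a single method.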

Lesson 4 - 03/

MINIMALLY AND SEMISUPERVISED WSD

The boundary between supervised and unsupervised disambiguation is not always clear-cut. Minimally or semisupervised methods make use at the same time of parts of totally supervised processes (annotated data) and parts of unsupervised processes (unannotated data).

MINIMALLY AND SEMISUPERVISED WSD - BOOTSTRAPPING

The typical process in minimally and semisupervised WSD is called bootstrapping. Starting from a set of annotated data, its aim is to train a sense classifier (a trained model that performs WSD automatically) using just a little training data, and thus overcome the 2 main problems of supervision: the lack of annotated data and the data sparsity problem (present whenever we have limited annotated datasets that are not sufficiently representative for a specific NLP task).

We start from a small amount of annotated data (each word is assigned a sense label, and this is the supervised part), we train a classifier on it and we have our trained model. We apply our trained model to the unannotated data (raw data). Now the unannotated corpus has become annotated, since it was annotated by the trained model. Then we have to evaluate the output, since we must know what its quality is. Down to 80% quality it’s okay, which means that we keep these annotated data and retrain another model on this larger annotated dataset, with lower quality but bigger size, and it goes on iteratively (= one round after the other) as long as the threshold is maintained. At the end we have a trained model of not so high quality that permits us to go on with the process.

MINIMALLY AND SEMISUPERVISED WSD - YAROWSKY’S BOOTSTRAPPING

Yarowsky’s bootstrapping method relies on 2 heuristics:

So clusters are collections of data that share something, meaning that words that share a similar distribution in the training set have something in common. Words that tend to cooccur, or to occur with similar words, tend to be clustered together. Clusters are not rigid: they change, since languages change and reflect the world around us. !!! The unsupervised method doesn’t label the clusters, so it doesn’t give a name to the clusters. And it doesn’t make use of any annotated data, it just clusters objects.

Example parapalla:

“Yesterday I cooked a very good pasta just by using some parapalla” -> to understand the meaning of parapalla we use the context and the neighbors, like pasta and cooked, which occur in the same sentence; thanks to them parapalla, a word we have never heard before, means something to us.

Unsupervised methods don’t rely on labeled training text and, in the poorest version, they don’t make use of any machine-readable resources like dictionaries, thesauri, ontologies, … They just make use of the context.

The main disadvantage of fully unsupervised systems is that, as they don’t exploit any dictionary, they cannot rely on a shared reference inventory of senses. So there are processes that are able to label the clusters, to name the clusters. This is a disadvantage but also an added value of this method, because dictionaries are not always up to date: when a word enters a dictionary, this word is new for the dictionary but not for people, who already use it, since dictionaries reflect the use of a word. Sometimes the clusters that are automatically found by a machine for the different senses of a word in an unsupervised approach differ in number and quality from the senses of that word in a sense inventory.

Unsupervised WSD approaches have the aim of identifying sense clusters without assigning any label. And so unsupervised WSD performs word sense discrimination, that is, it aims to divide “the occurrences of a word into a number of classes by determining for any two occurrences whether they belong to the same sense or not” (Schütze).

Consequently, these methods may not discover clusters equivalent to the traditional senses in a dictionary sense inventory. For this reason, their evaluation is usually more difficult: to define the quality of a sense cluster we should ask a human to look at the words in the different clusters and determine the nature of the relations that they all share. Usually what we cluster are not occurrences of just one word, but occurrences of many words (embeddings). So in a cluster we will have different words and different kinds of relationships between words, like antonymy, hypernymy, hyponymy, toponymy.

Alternatively, to evaluate the quality of clusters we can employ them in end-to-end applications, thus measuring the quality of the clusters based on the performance of the application. This is called extrinsic evaluation: for example, we use the clusters in a machine-translation system; if the machine-translation system works better using the clusters, the quality of the clusters is probably good. Extrinsic means that we’re not evaluating the resource, the cluster itself, but we’re evaluating the quality of the resource by using it in another resource, for instance in a tool for machine-learning.

Lesson 5 - 05/

UNSUPERVISED WSD - CONTEXT CLUSTERING

Feature vector = list of features assigned to an object

Here each occurrence of a target word is represented by a context vector, which is a collection of context features. The vectors are then clustered into groups, each identifying a sense of the target word.

Idea of word space (Schütze 1992): a vector space whose dimensions are words. We want to put these vectors in a space where words that are close to each other are those whose vectors are similar to each other. We assign a point in the space to each word, and this point is given by the coordinates represented by its context feature vector. Synonyms have very similar context vectors because they tend to occur with the same words in the training corpus, so the coordinates they are assigned are very similar. In the space the dimensions are words. We know the meaning of a word by the way it behaves contextually (distribution).

The target words here are “money” and “restaurant”, for which we have very simple vectors built by using just 2 pieces of information:

1. the number of times the words “money” and “restaurant” cooccur with the word “food”

2. the number of times the words “money” and “restaurant” cooccur with the word “bank”

Restaurant’s coordinates are 210, 80, while money’s coordinates are 100, 250, where the first dimension (210 and 100) counts the cooccurrences of the 2 target words with the word “food” and the second (80 and 250) counts the cooccurrences with the word “bank”.
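The two coordinates above can be compared numerically. The sketch below is only an illustration: it takes the first dimension as “food” and the second as “bank”, as in the coordinates given above, and it assumes cosine similarity as the measure of closeness, which these notes don’t name explicitly. The unit vectors `food_axis` and `bank_axis` are a hypothetical way to stand for the two dimensions.

```python
import math

# Cooccurrence counts from the example; dimensions are ("food", "bank").
vectors = {
    "restaurant": (210, 80),
    "money": (100, 250),
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Unit vectors along each dimension (assumption made for this sketch).
food_axis = (1, 0)
bank_axis = (0, 1)

sim_restaurant_food = cosine(vectors["restaurant"], food_axis)
sim_restaurant_bank = cosine(vectors["restaurant"], bank_axis)
sim_money_food = cosine(vectors["money"], food_axis)
sim_money_bank = cosine(vectors["money"], bank_axis)
sim_restaurant_money = cosine(vectors["restaurant"], vectors["money"])

print(f"restaurant~food: {sim_restaurant_food:.2f}")
print(f"restaurant~bank: {sim_restaurant_bank:.2f}")
print(f"money~food:      {sim_money_food:.2f}")
print(f"money~bank:      {sim_money_bank:.2f}")
print(f"restaurant~money:{sim_restaurant_money:.2f}")
```

With these counts “restaurant” points mostly towards the “food” dimension and “money” mostly towards the “bank” dimension, which is what the figure in the lesson shows geometrically.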

The edges are the word vectors. The aim is to cluster context vectors, so in the end we’re not clustering words, but the context vectors of words. We’re transforming our words into vectors of features, i.e. into another kind of representation which is basically numbers (this is the encoding part of the transformer: a transformer transforms words into mathematical representations which allow the machine to understand that a word is more similar to one word than to another). Now we have context vectors that represent the context of specific occurrences of a word. Training such an unsupervised model requires a lot of data and computational power; without them we couldn’t do it, because the vectors are made of thousands and thousands of numbers. A context vector is built as the centroid (i.e. the normalized average) of the vectors of the words occurring in the target context, which can be seen as an approximation of its semantic context.

This is an example of a context vector for the word “stock”, which cooccurs with “deposit”, “money” and “account”. These 3 words cooccur more with “bank” than with “food”.

It’s calculated as the centroid, i.e. the normalized sum of the vectors of the words occurring in the same context; it’s an approximation of the semantic context. The centroid is the point in the space whose coordinates are the average of the coordinates of the other 3 points.
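The centroid computation can be sketched as follows. Only the vector for “money” comes from the earlier example; the counts for “deposit” and “account” are invented for illustration. The helper names `centroid` and `normalize` are hypothetical. The sketch first averages the vectors and then rescales the result to unit length, which is one possible reading of “normalized average”.

```python
import math

# Cooccurrence vectors, dimensions ("food", "bank").
# "money" is taken from the earlier example; the other counts are invented.
word_vectors = {
    "deposit": (20, 180),
    "money": (100, 250),
    "account": (30, 200),
}

def centroid(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    dims = len(vectors[0])
    return tuple(sum(v[i] for v in vectors) / n for i in range(dims))

def normalize(v):
    """Rescale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return tuple(x / length for x in v)

# Context of one occurrence of "stock".
context = ["deposit", "money", "account"]
context_vector = centroid([word_vectors[w] for w in context])
unit_context_vector = normalize(context_vector)

print(context_vector)
print(unit_context_vector)
```

Since all 3 context words lean towards the “bank” dimension, the context vector of this occurrence of “stock” does too.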

unsupervised approach, which is the sense inventory. The task of the unsupervised approach is to put together words that share something, namely a common context. Sense labels are used in supervised, semi-supervised and knowledge-based WSD. Knowledge-based WSD systems usually have lower performance/precision than their supervised alternatives, because they aren’t trained like in the supervised approach. However, they have the advantage of a wider coverage, thanks to the use of large-scale knowledge resources that cover the entire lexicon.

K-BASED WSD - OVERLAP OF SENSE DEFINITIONS

This is a very simple method that makes use of knowledge. Here we have just the knowledge source, which tells us something about the meaning of the words, and the texts. Our task is to assign to each occurrence of a word in a text one sense out of those given by the knowledge resource.
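A minimal sketch of this task, based on the overlap of sense definitions: for a pair of words, it picks the pair of senses whose definitions share the most words. The sense inventory `SENSES` (with abridged toy definitions), the stopword list and the function names are all invented for illustration; a real system would read the definitions from a dictionary or another knowledge resource.

```python
from itertools import product

# Toy sense inventory: word -> {sense label: definition}. Invented for illustration.
SENSES = {
    "pine": {
        "pine#1": "kind of evergreen tree with needle-shaped leaves",
        "pine#2": "waste away through sorrow or illness",
    },
    "cone": {
        "cone#1": "solid body which narrows to a point",
        "cone#2": "something of this shape whether solid or hollow",
        "cone#3": "fruit of certain evergreen trees",
    },
}

# Hypothetical list of function words ignored when counting the overlap.
STOPWORDS = {"a", "of", "or", "the", "to", "this", "which", "with", "through"}

def content_words(definition):
    """Set of lowercased words in a definition, without stopwords."""
    return {w for w in definition.lower().split() if w not in STOPWORDS}

def definition_overlap(w1, w2, senses=SENSES):
    """Return the pair of senses of w1 and w2 whose definitions overlap most."""
    best_pair, best_score = None, -1
    for (s1, d1), (s2, d2) in product(senses[w1].items(), senses[w2].items()):
        score = len(content_words(d1) & content_words(d2))
        if score > best_score:
            best_pair, best_score = (s1, s2), score
    return best_pair, best_score

pair, score = definition_overlap("pine", "cone")
print(pair, score)
```

Here the only shared word is “evergreen”: “tree” and “trees” don’t match because the comparison is done on exact word forms, so a small change in the wording of a definition can change the result.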

This approach is named gloss overlap of sense definitions, or the Lesk algorithm after its author (Lesk 1986). It’s very rigid and it calculates the word overlap between the sense definitions of 2 or more target words. It’s very old-fashioned, but sometimes it works. Given a two-word context (w1, w2), the senses of the target words whose definitions have the highest overlap (i.e. words in common) are assumed to be the correct ones. This method achieves 50-70% accuracy. It’s very sensitive to the exact wording of the definitions, so the absence of a certain word can radically change the results. Since this method is so rigid, the NLP tools that work better are those that are dynamic.

K-BASED WSD - SELECTIONAL PREFERENCES

(They’re related to VerbNet)

Selectional preferences or restrictions are constraints on the semantic type that a word sense imposes on the words with which it combines in sentences. Words tend to select some words, which they prefer, and to exclude from the selection other words that they don’t like:

- selectional preferences: tend to select those senses which better satisfy the requirements

- selectional restrictions: rule out senses that violate the constraint
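The difference between the two can be sketched with a toy example. Everything here is invented for illustration: the (verb, object) pairs stand in for a large corpus, and the semantic types in `WORD_TYPES` and `SENSE_TYPES` are hypothetical labels. A preference ranks the senses by how well they fit what the verb usually takes, while a restriction discards the senses that never fit.

```python
from collections import Counter

# Invented (verb, object) pairs standing in for a large corpus.
PAIRS = [
    ("eat", "pizza"), ("eat", "bread"), ("eat", "pizza"), ("eat", "apple"),
    ("drink", "water"), ("drink", "wine"), ("drink", "water"), ("eat", "soup"),
]

# Hypothetical semantic type of each object seen in the corpus.
WORD_TYPES = {"pizza": "food", "bread": "food", "apple": "food", "soup": "food",
              "water": "liquid", "wine": "liquid"}

# Hypothetical semantic types for two senses of the ambiguous noun "bass".
SENSE_TYPES = {"bass#fish": "food", "bass#instrument": "artifact"}

def type_preferences(verb):
    """Count how often each semantic type occurs as object of the verb."""
    return Counter(WORD_TYPES[obj] for v, obj in PAIRS if v == verb)

def prefer_sense(verb, senses=SENSE_TYPES):
    """Selectional preference: pick the sense whose type the verb takes most often."""
    prefs = type_preferences(verb)
    return max(senses, key=lambda s: prefs[senses[s]])

def restrict_senses(verb, senses=SENSE_TYPES):
    """Selectional restriction: drop the senses whose type the verb never takes."""
    prefs = type_preferences(verb)
    return [s for s in senses if prefs[senses[s]] > 0]

print(type_preferences("eat"))
print(prefer_sense("eat"))
print(restrict_senses("eat"))
```

In “eat bass” the fish sense is kept because “eat” is only seen with objects of type food in this toy corpus.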

The easiest way to learn selectional preferences is to determine the semantic appropriateness of the association provided by a word-to-word relation. So, we want to know which words are selected by preference by another word. To do it we use frequency counts: if we have a big corpus, we take all the occurrences of the target word and we’ll see that some words tend to cooccur with the target word more than others; those words tend to be preferred by the target word in the selectional process.

WORD SENSE DOMINANCE

Given a word, the frequency distribution of its senses can be highly skewed in texts, thus affecting the performance of WSD. Methods for the determination of word sense dominance perform type-based disambiguation based on this observation. Among the senses, there’s one that dominates the others. This is an unsupervised method for automatically ranking the senses of ambiguous words from raw texts: we take a word and its neighbors, and the neighbors correspond to different senses of the word. Our aim is to rank the senses of the word, and an algorithm distinguishes its different senses.

EVALUATION - BASELINES

A baseline is the result of the application of the most basic method, so its result must be the lowest one; it’s a standard method to which the performance of different approaches is compared.

For WSD there’s a baseline called random baseline, which, given a dictionary and a text made of a sequence of words, consists in assigning to each word a random sense among those available. The problem is that many words are polysemous.

Another baseline is the first sense baseline (or most frequent sense baseline). It’s a type-based WSD approach, which means that we assign a sense to a type and not to a token. The senses of each word are ranked by their frequency, which is acquired from a training set, and the baseline consists in choosing the first sense in this ranking for each word in a corpus, independently of its context.

EVALUATION - LOWER AND UPPER BOUNDS

Lower and upper bounds are performance figures that indicate the range within which the performance of any system should fall.

• A lower bound usually measures a performance obtained with an extremely simple method and which any system should be able to exceed. A typical lower bound is the random baseline

• An upper bound specifies the highest performance reasonably attainable. In WSD, a typical upper bound is the inter-annotator agreement or inter-tagger agreement (ITA), that is, the percentage of words tagged with the same sense by 2 or more human annotators
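The random baseline (the typical lower bound) and the first sense baseline can be sketched as follows. The tiny training and test sets, the sense labels and the function names are all invented for illustration; in practice the sense frequencies would come from a large sense-annotated corpus.

```python
import random
from collections import Counter

# Invented sense-annotated training set: (word, sense) pairs.
TRAIN = [("bank", "bank#finance"), ("bank", "bank#finance"), ("bank", "bank#river"),
         ("stock", "stock#share"), ("stock", "stock#share"), ("stock", "stock#broth")]

# Invented test set with the correct (gold) senses.
TEST = [("bank", "bank#finance"), ("bank", "bank#river"), ("stock", "stock#share")]

# Dictionary: the senses available for each word.
INVENTORY = {}
for word, sense in TRAIN:
    INVENTORY.setdefault(word, [])
    if sense not in INVENTORY[word]:
        INVENTORY[word].append(sense)

def random_baseline(word, rng):
    """Assign a random sense among those available for the word."""
    return rng.choice(INVENTORY[word])

def most_frequent_sense(word):
    """Assign the sense seen most often in training, ignoring the context."""
    counts = Counter(s for w, s in TRAIN if w == word)
    return counts.most_common(1)[0][0]

def accuracy(predict):
    """Share of test words for which the predicted sense is the gold one."""
    correct = sum(predict(w) == gold for w, gold in TEST)
    return correct / len(TEST)

rng = random.Random(0)  # fixed seed so that the run is repeatable
random_acc = accuracy(lambda w: random_baseline(w, rng))
mfs_acc = accuracy(most_frequent_sense)

print(f"random baseline accuracy:      {random_acc:.2f}")
print(f"first sense baseline accuracy: {mfs_acc:.2f}")
```

The first sense baseline gets the river occurrence of “bank” wrong, because it assigns the same sense to every token of the type “bank”.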

Another upper bound, more efficient, is the oracle:

- it’s a hypothetical system which is always supposed to know the appropriate sense choice among those available

- it constitutes a good upper bound to compare the performance of ensemble methods

- its accuracy is determined by the number of word instances for which at least one of the systems outputs the correct sense

- as a result, given the output of WSD systems, the oracle performance provides the maximum hypothetical performance of any combination method aiming at improving the results of the single systems

- sometimes none of the systems outputs the correct sense; in that case the oracle has nothing correct to choose, because it only knows which of the available outputs is the right one

EVALUATION - THE SENSEVAL/SEMEVAL COMPETITION

Senseval, now renamed Semeval, is an international WSD competition, held every 3 years since 1998, whose objective is to perform a comparative evaluation of WSD systems in several kinds of tasks, including all-words and lexical sample WSD for different languages. This competition usually has 2 runs: one is called open and the other one is called closed:

- in the closed one, the competition allows participants to use as training set only the one provided by the organizers

- in the open run, there’re no restrictions: only the results of the system count

The systems submitted for evaluation to these competitions usually integrate different techniques and often combine supervised and knowledge-based methods.
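Since systems often combine different techniques, the oracle upper bound described above tells us how far any combination could go. A minimal sketch, with invented gold senses and invented outputs for two hypothetical systems:

```python
# Invented gold senses for 5 word instances.
GOLD = ["s1", "s2", "s1", "s3", "s2"]

# Invented outputs of two hypothetical WSD systems on the same instances.
SYSTEM_OUTPUTS = {
    "system_a": ["s1", "s1", "s1", "s2", "s2"],
    "system_b": ["s2", "s2", "s1", "s1", "s1"],
}

def accuracy(predictions, gold=GOLD):
    """Share of instances for which a single system outputs the gold sense."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def oracle_accuracy(outputs=SYSTEM_OUTPUTS, gold=GOLD):
    """Share of instances for which at least one system outputs the gold sense."""
    hits = 0
    for i, g in enumerate(gold):
        if any(preds[i] == g for preds in outputs.values()):
            hits += 1
    return hits / len(gold)

acc_a = accuracy(SYSTEM_OUTPUTS["system_a"])
acc_b = accuracy(SYSTEM_OUTPUTS["system_b"])
acc_oracle = oracle_accuracy()

print(acc_a, acc_b, acc_oracle)
```

On the fourth instance neither system outputs the gold sense, so the oracle also misses it: no way of combining these two systems can score higher than the oracle.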

In order to move from explicit (words) to implicit and to build an LLM, we have to represent linguistic knowledge statistically, which is made possible by 3 conditions:

  1. Huge amounts of data
  2. Advanced statistical algorithms
  3. High computational power

These 3 conditions are not available to everybody. AI is not for everybody: building an LLM under these 3 conditions is something that is in the hands of just a few companies in the world. This is the reason why today OpenAI, Meta and X are so important: they’re among the few that have the data.

REPRESENTING WORDS - THE ISSUE

We have to represent words, which are made of letters, which are symbols. Symbolic NLP (explicit, supervised) vs neural NLP (implicit, unsupervised): neural NLP works with numbers.

Punch cards: just sheets of paper; they’re implicit, since we humans can’t read anything from them; they were the first way to represent content in a form readable by the machine. Whenever we press a key on the laptop keyboard, behind it there’s an encoding of the letter. Extended ASCII needs 8 bits to store each character. A 4-character word is represented as a sequence of 32 0s and 1s, because we have 4 letters, each letter is a byte and each byte corresponds to 8 bits, so 8x4 = 32. A 5-character word gets a 40-bit representation. In this case the size of the representation depends on the length of the word: the longer the word, the longer the string of bits that represents it.

REPRESENTING WORDS - ONE-HOT REPRESENTATION

In order to solve this issue, the so-called one-hot representation was introduced. It’s a way to directly represent words rather than characters. We’re moving from an explicit representation of a word, which depends on the length of the word, to an implicit representation of the word, which always has the same length regardless of the number of characters the word is formed by. This representation maps words to distinct fixed-size patterns of 0s and 1s. So, we have a representation which has as many positions as there are words in our text. All the positions are set to 0 apart from one, which is the position that corresponds to the index of the word in the text. So the first word in the text will be represented by 1 0 0 0 0 0 0 0 …, the second word in the text will be 0 1 0 0 0 0 0 0 … The problem here is that it’s not really economical. However, it’s important to know it since it’s a way to represent different words using the same length, the fixed-size representation, which will allow us to compare words. Such representations allow us to represent words as vectors, which are sequences of properties. This representation is important because:

- we must compare things that are the same

- we represent our elements by something that is implicit

- each word is represented by a vector
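The contrast between the character-based encoding and the one-hot representation can be sketched as follows. Two assumptions are made for the sketch: Latin-1 is used as an example of an 8-bit extended ASCII encoding, and the positions of the one-hot vectors correspond to the distinct words of a toy text, in order of first occurrence. The function names are hypothetical.

```python
def to_bits(word):
    """Encode each character as one byte (8 bits), using Latin-1 as an
    example of an 8-bit extended ASCII encoding."""
    return "".join(format(b, "08b") for b in word.encode("latin-1"))

def one_hot(vocabulary):
    """Map each word to a vector with a single 1 at the index of the word."""
    size = len(vocabulary)
    return {w: [1 if i == j else 0 for j in range(size)]
            for i, w in enumerate(vocabulary)}

# Character-based encoding: the length depends on the length of the word.
print(len(to_bits("bank")))   # 4 characters x 8 bits
print(len(to_bits("stock")))  # 5 characters x 8 bits

# One-hot encoding: every word gets a vector of the same length.
text = "the cat sat on the mat".split()
vocabulary = list(dict.fromkeys(text))  # distinct words, in order of first occurrence
vectors = one_hot(vocabulary)

for word in vocabulary:
    print(word, vectors[word])
```

Every one-hot vector has as many positions as there are distinct words, so the vectors grow with the vocabulary; this is why the representation isn’t economical.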

REPRESENTING WORDS - VECTOR SPACE AND NLP

The compatibility of vector-based representations with conventional and modern machine learning and deep learning has significantly helped the model to prevail. Today LLMs and AI are based on vector-based representations of words. Words are positions in a vector space depending on their distribution in the training corpus. Word2vec is one of the first widely used tools introduced to build embeddings. Since then, the term “embedding” has almost replaced “representation” and has dominated the field of lexical semantics.

REPRESENTING WORDS - QUESTIONS

We have thousands and thousands of words and we want to build the embeddings for all of them. We do it using machine learning, with unsupervised methods, exploiting large text corpora. How can word cooccurrences denote semantic similarity? The foundation of the automatic construction of word vector space models (VSM) is the distributional hypothesis (Firth, 1957), according to which words that appear in similar contexts tend to have similar meanings. Embedding visualizations:

- each of the points is a word: these are called word embeddings, in which each word is assigned a point in a 2D or 3D space, and words are close to each other if they distribute similarly in the training set

- !!! nobody told the machines that those words are similar to each other: the machines know it just by looking at the data in unsupervised fashion

- the values express the degree of attraction between words as a distance: the lower the value, the closer the word is to the target word

From the point of view of WSD, the limitation of word embeddings is immediately clear: polysemous words should appear in different places in the vector space and be assigned more than one point, because they keep different company according to their senses.

REPRESENTING WORDS - UN/SELFSUPERVISED AND PREDICTIVE

Word representation learning is usually framed as an unsupervised or self-supervised procedure, which doesn’t require any manual annotation of the training data. The most advanced methods of deep learning in unsupervised fashion are able to supervise themselves and to check if they’re doing right or wrong. It works like this: given a sequence of words, the task of the system is to generate another sequence of words; the system learns something in unsupervised fashion, tries to replicate the sequences of words seen in the training phase and checks whether it’s able to predict exactly the same sequences of words found in the training set; basically the system is testing itself.
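The self-checking idea can be sketched with a very small stand-in for a neural model: a counter of which word follows which. The toy corpus and the function name are invented for illustration, and a real system would learn numeric representations rather than simple counts. The point is only that the “labels” come for free, because the next word in the raw text is itself the answer to check against.

```python
from collections import Counter, defaultdict

# Toy raw text: no manual annotation of any kind.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

# "Training": count which word follows which in the raw text.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def predict_next(word):
    """Predict the word seen most often after the given word."""
    return following[word].most_common(1)[0][0]

# Self-check: compare each prediction with the word that really follows.
checks = [(w, nxt, predict_next(w)) for w, nxt in zip(corpus, corpus[1:])]
accuracy = sum(nxt == pred for _, nxt, pred in checks) / len(checks)

print(predict_next("the"))
print(f"share of next words predicted correctly: {accuracy:.2f}")
```

The system gets some predictions wrong (after “the” it always answers “cat”), and it can measure this by itself, without any human telling it the right answer.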