Summary text analytics, Summaries of Data Mining

Summary text analytics from EUR

Typology: Summaries

2024/2025

Uploaded on 12/08/2025

lucas-vercauteren

summary text analytics

lucas vercauteren

April 2025

1 Week 2

Word Clouds

Word clouds can be used to summarize text data. A basic word cloud simply counts the number of times each word appears in the text. Words that occur more frequently are shown larger and are typically placed toward the center of the cloud.

While a histogram is the traditional approach to visualize word frequencies, it can only show a limited number of words clearly. In contrast, word clouds can display a much larger set of words at once, providing a broader overview of the text content.

Word clouds are therefore useful to quickly identify prominent words and get a general sense of what the text is about. However, they do not reveal much about the underlying relationships or structure of the text.

To gain deeper insight, we can compare word clouds from different subsets of text. For example, comparing word usage between positive and negative customer reviews helps us explore which words are more typical for each group. However, since both happy and unhappy customers often talk about similar things, their most frequent words tend to overlap. This makes it difficult to identify what distinguishes the two groups.

To address this, we can use a commonality cloud, which shows the words shared between both groups. Alternatively, a comparison cloud highlights the words that appear more frequently in one group than the other. Still, both approaches focus on frequent words rather than the most discriminative ones.

The main issue with these visualizations is that they are dominated by very frequent words, which are not necessarily the most informative. To control for this, we can use TF-IDF (Term Frequency – Inverse Document Frequency).

  • Term Frequency (TF) counts how often a word appears across all documents.
  • Document Frequency (DF) counts how many documents contain the word.

Some words are common across nearly all documents (e.g., "the", "and"), while others appear frequently in only a few. The latter are more unique and thus more descriptive of specific content. TF-IDF gives higher scores to words that occur frequently in a specific subset but not across all documents.

For example, to identify words that best describe positive reviews, we first calculate term frequencies in this subset, then adjust using the inverse document frequency. This highlights words that are both common in positive reviews and rare in the full set.

Alternative weighting schemes also exist, such as multiplying TF by the log of the IDF instead of using IDF directly.

TF shows how many times a word appears in all reviews, while DF shows in how many individual reviews the word appears. For instance, if the word "clean" appears in more reviews than "night" or "nice", but those other words have higher TFs, it suggests that "clean" is used less frequently per review. This reveals interesting patterns in word usage: people may mention "nice" multiple times when expressing satisfaction, whereas they only mention "clean" once.

Looking at the histogram of TF-IDF values, we find that words like smoking or dog stand out in negative reviews, indicating annoyance, and words related to location appear frequently too. By definition, TF-IDF values are never negative.

To create even stronger contrasts between groups (e.g., 1-star vs 5-star reviews), we can use the relative frequency of words across the groups. For example, we can calculate:

\[ \text{Relative Frequency} = \frac{\mathrm{TF}(\text{Group A})}{\mathrm{TF}(\text{Group B})} \]

Words that appear much more in Group A than Group B will have high values and are therefore more characteristic of Group A. Words used equally in both groups will receive a score around 1. This method helps identify the most diagnostic words for each group.
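The measures above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy reviews are made up, the TF-IDF variant shown is the log-weighted one (TF times the log of N/DF), and the add-one smoothing in the ratio is an assumption added here so that words missing from Group B do not cause a division by zero.

```python
import math
from collections import Counter

# Made-up toy reviews, already lower-cased; tokens are split on whitespace.
positive = ["nice clean room nice staff", "clean room great location"]
negative = ["smoking smell in room", "dog barking all night room dirty"]

def term_frequency(docs):
    """TF: total number of times each word occurs across the documents."""
    return Counter(word for doc in docs for word in doc.split())

def document_frequency(docs):
    """DF: number of documents that contain each word at least once."""
    return Counter(word for doc in docs for word in set(doc.split()))

all_docs = positive + negative
df_all = document_frequency(all_docs)
n_docs = len(all_docs)

def tf_idf(subset):
    """TF within a subset, weighted by log(N / DF) computed over all documents."""
    tf = term_frequency(subset)
    return {w: tf[w] * math.log(n_docs / df_all[w]) for w in tf}

def relative_frequency(group_a, group_b, smoothing=1.0):
    """TF(Group A) / TF(Group B); the smoothing term is an assumption of this sketch."""
    tf_a, tf_b = term_frequency(group_a), term_frequency(group_b)
    vocab = set(tf_a) | set(tf_b)
    return {w: (tf_a[w] + smoothing) / (tf_b[w] + smoothing) for w in vocab}

scores_neg = tf_idf(negative)                  # "room" is in every review, so it scores 0
ratio = relative_frequency(positive, negative) # values near 1 mean equal use in both groups
```

In this toy data "room" occurs in every review and therefore gets a TF-IDF of zero, while "smoking" and "dog" keep a positive score in the negative subset, which mirrors the pattern described above.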

Summary of the Video

  • A comparison cloud shows the most frequent words in each group but may miss the most predictive or informative words.
  • To find more diagnostic words, we can use relative frequencies or TF-IDF, which highlight less common but more informative terms.
  • These techniques help uncover words that truly distinguish between different types of reviews.

Multidimensional Scaling (MDS)

Text visualizations often rely on four features:

  • Size
  • Location (position and distance)
  • Color
  • Connections between words

MDS focuses on location and distance. It’s often compared to a map, where distances between objects represent their relationships. In this case, we aim to map words based on how "close" they are to each other in text. To build such a map, we must first define distances or similarities between words. Common similarity definitions include:

  • Words that occur together in the same review
  • Words that appear in the same sentence
  • Words that are near each other within a fixed window (e.g., 5-word span)

We can measure these relationships by counting how often words co-occur within a given context. For example, we begin by counting co-occurrences within full reviews:

         locat   night   nice    clean   bed     service
locat    22842   14234   13457   12553   12055   10459
night    14234   21337   14026   11916   14760   10914
nice     13457   14026   19596   11763   12708    9789
clean    12553   11916   11763   18120   10987    7501

Table 1: Co-occurrence matrix of frequently used review words

Words that co-occur frequently, such as night and bed, are expected to be positioned close to each other in the MDS map. The goal of MDS is to find positions for each word in a 2D space such that

\[ \sum_{i,j} \left( \text{Distance}_{\text{map}}(i,j) - \text{Distance}_{\text{data}}(i,j) \right)^2 \]

is minimized. However, co-occurrence values are a measure of similarity, not distance, so they must first be transformed. This involves:

  • Normalizing the similarity matrix by row and column sums (to control for word frequency)
  • Converting the normalized similarities into distance values
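The two steps above and the minimization can be sketched as follows, using the four words from Table 1 for which both the row and the column are available. The notes do not give the exact formulas, so two choices here are assumptions: similarities are divided by the square root of (row sum times column sum), and distance is taken as 1 minus the normalized similarity. Plain gradient descent stands in for whatever optimizer the lecture software uses.

```python
import math
import random

words = ["locat", "night", "nice", "clean"]
cooc = [  # symmetric 4 x 4 block of Table 1
    [22842, 14234, 13457, 12553],
    [14234, 21337, 14026, 11916],
    [13457, 14026, 19596, 11763],
    [12553, 11916, 11763, 18120],
]
n = len(words)
row_sums = [sum(row) for row in cooc]
col_sums = [sum(cooc[i][j] for i in range(n)) for j in range(n)]

# Assumed normalization: divide by the geometric mean of row and column sums.
sim = [[cooc[i][j] / math.sqrt(row_sums[i] * col_sums[j]) for j in range(n)]
       for i in range(n)]
# Assumed conversion: distance = 1 - normalized similarity, 0 on the diagonal.
dist = [[0.0 if i == j else 1.0 - sim[i][j] for j in range(n)] for i in range(n)]

def stress(pos):
    """Sum over word pairs of (map distance - data distance)^2."""
    return sum((math.dist(pos[i], pos[j]) - dist[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def fit_mds(steps=3000, lr=0.02, seed=0):
    """Find 2D positions by gradient descent on the stress function."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    initial = stress(pos)
    for _ in range(steps):
        grad = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d_map = math.dist(pos[i], pos[j]) or 1e-12  # guard against 0
                coef = 2 * (d_map - dist[i][j]) / d_map
                for k in range(2):
                    grad[i][k] += coef * (pos[i][k] - pos[j][k])
        for i in range(n):
            for k in range(2):
                pos[i][k] -= lr * grad[i][k]
    return pos, initial, stress(pos)

positions, initial_stress, final_stress = fit_mds()
```

With only four words whose distances are nearly equal, a 2D map cannot reproduce the distances exactly, so the final stress stays above zero; the point of the sketch is the pipeline, not the resulting picture.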

best predict the data. For text, we use the word vector format—typically a binary format with 0s and 1s to indicate whether a word appears in a review or not. PCA then tries to predict the presence or absence of words in the reviews based on these components. A key decision when applying PCA is how many components to keep. For text data, this decision is more difficult, since text is very unstructured. Adding more components will result in a more detailed representation, but at some point, the additional components provide little new insight and might even make interpretation harder. Also, some interesting relationships could become harder to see when too many components are included. If we use variance explained as a criterion for selecting components, we must keep in mind that variance explained tends to be low for text data. This is because it’s hard to predict whether someone will write “great”, “excellent”, or “good” to describe the same thing—there is a lot of variability in word choice. Common tools to decide on the number of components include:

  • The scree plot
  • The percentage of variance explained
  • The interpretability of the components

In practice, the scree plot and variance explained often don’t lead to clear decisions. The number of components depends on the goal of the analysis. If the focus is on accuracy, including many components might help (up to the point of overfitting). If the goal is interpretation, using more than 10 components might be too much and could confuse the audience, especially when visualizing.

In the example shown by the professor, the top 5 components are used. For each component, the 10 most important words are shown. Many words appear across multiple components, and there’s no clear one-to-one link between specific words and components. Instead, the components are often related to broad groups of words.

The idea of PCA is to find the directions in high-dimensional space that capture the most information. If we use 10 components, we try to summarize the content of a review using 10 numbers. This helps with prediction, but the interpretation is not always straightforward.

Since the principal components exist in a high-dimensional space, we can change the coordinate system and find new directions that are easier to interpret. One such technique is varimax rotation. After applying this rotation, the interpretation improves: for example, one component might point toward words like desk, front, check, and time, which relate to the front desk and check-in process. Another component might point toward words like park, walk, and nice, suggesting a more location-related or general satisfaction theme.

With these rotated dimensions, we can also plot individual reviews. The location of each review in the plot shows how strongly it is related to each component. For example, a review placed far to the left on the x-axis may focus more on front desk service. In the professor’s example, two reviews are discussed. One is located further to the left and downward, while another is closer to the x-axis. Even though review 38933 is closer to the x-axis, the factor score of review 24561 is larger in absolute value, meaning it is more strongly related to that component.

We can also visualize other components, for example dimensions 3 and 4. In this plot, only a few words are strongly related to these components. On the left (x-axis / dimension 3), we see words like time, walk, and location. On the vertical (y-axis / dimension 4), we see words like nice, service, and night.

A more advanced plot, called a biplot, shows both the reviews and the words that define each component. This helps interpret what each dimension represents and how individual reviews relate to those dimensions.
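To make the idea of components, loadings, and review scores concrete, here is a minimal sketch that extracts only the first principal component from a tiny binary word-vector matrix. The data are made up, the use of power iteration on the covariance matrix is a choice made for this sketch (the notes do not say how the components are computed), and no rotation is applied.

```python
import math

terms = ["desk", "check", "park", "walk"]
# Rows are reviews, columns are terms: 1 if the term occurs in the review (made-up data).
X = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
]
n_docs, n_terms = len(X), len(terms)

# Center each term column and compute the term-by-term covariance matrix.
means = [sum(row[j] for row in X) / n_docs for j in range(n_terms)]
centered = [[row[j] - means[j] for j in range(n_terms)] for row in X]
cov = [[sum(r[a] * r[b] for r in centered) / (n_docs - 1) for b in range(n_terms)]
       for a in range(n_terms)]

def first_component(matrix, iterations=200):
    """Power iteration: repeatedly multiply by the matrix and renormalize."""
    v = [1.0 / (j + 1) for j in range(len(matrix))]
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(matrix[a][b] * v[b] for b in range(len(v)))
                     for a in range(len(v)))
    return v, eigenvalue

loadings, eigenvalue = first_component(cov)      # one loading per term
variance_share = eigenvalue / sum(cov[j][j] for j in range(n_terms))
scores = [sum(c * l for c, l in zip(row, loadings)) for row in centered]  # one score per review
```

In this toy data the front-desk words and the location words load with opposite signs on the first component, and each review's score says on which side of that contrast it falls.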

Summary

To interpret PCA components, we look at the words that are most strongly associated with each dimension. Each review can be described based on its position in this latent space, and the component scores can be used as input in further analyses, for example to study whether certain components are more related to positive or negative reviews. Rotation improves interpretability in follow-up analyses. PCA works with word vector input and therefore only captures information at the global review level. It does not capture the local structure or the distances between words in the original text.

MDS | Principal Components Analysis
Summarizes structure from similarity data | Summarizes structure underlying word vector data
Visualization in 2 dimensions | Results in many dimensions, one per component
Single picture showing it all | (Too) many pictures needed to give overall picture
Groups of words represent common themes in reviews | Directions in the graph represent (presence of) themes
Reviews cannot be positioned or summarized in "MDS" space | Reviews can be summarized with scores on the various components
 | A review is the sum of its parts
 | Provides features that can be used in, for example, prediction tasks

Table 3: Comparison between MDS and Principal Components Analysis (PCA)

1.1 Exam question

1.1.1 Question 1

An analyst at Amazon wants to better understand the success of Zalando, as they are more successful in selling clothes and shoes than Amazon. To gain insight, the analyst has collected review data from customers of both stores. The analyst wants to understand what is driving the success (measured by having a positive review) for Zalando, relative to Amazon. From a competitive perspective, it is also interesting to know what the major weaknesses are of Zalando, as Amazon can then try to avoid those and take over part of the market by focusing on the weaknesses of its competitor. Explain what analysis you would do and how you would efficiently summarize the findings relevant for the two questions raised above. Note that your summary has to be efficient, but that you are not restricted to a single “summary measure”

1.1.2 Answer Question 1

The analyst wants to understand what is driving the success (measured by having a positive review) for Zalando, relative to Amazon. The relative strength of Zalando versus Amazon can be obtained by counting words for positive reviews or for positive sentences. This shows what people are enthusiastic about for both companies. To identify the relative strengths, the ratio of words for Amazon vs Zalando has to be shown, as otherwise the common themes will be shown. The strengths of Zalando can then be visualized using the relative frequency for the size of the words. From a competitive perspective, it is also interesting to know what the major weaknesses are of Zalando, as Amazon can then try to avoid those and take over part of the market by focusing on the weaknesses of its competitor. Similarly, the weaknesses can be shown. Note: Studying words in positive sentences could also work, but success is measured with having a positive review according to the question

1.1.3 Question 2

Principal components analysis allows a researcher to efficiently summarize data with very many variables. The following graph depicts results that were generated using reviews on games. Provide interpretation of this graph in terms of

  • a. The meaning of the axes
  • b. The meaning of the arrows
  • c. The word bui (derived from buy) is clearly important in the visualization above. Indicate the position of a review that does NOT mention “bui” and the words related to it. To answer this question, you need to indicate the position in terms of horizontal (left/middle/right) and vertical (top/middle/bottom) position.

1.1.4 Answer Question 2

  • The axes represent the latent dimensions. In this case the horizontal axis captures features corresponding to “fun” and “graphic”, while the vertical axis is negatively related to “bui” and “monei”.

Improved Sentiment Scoring Procedure

The professor introduces a more nuanced scoring approach:

  1. Identify valence words using the sentiment dictionary.
  2. Look at the 4 words before and 2 words after each valence word.
  3. Assign the base polarity: +1 for positive, -1 for negative.
  4. If a negation (like not) appears nearby, flip the sign.
  5. If an amplifier (like very) appears nearby, adjust the polarity by adding or subtracting 0.8.
  6. Example: "not very good" starts at +1, becomes -1 due to not, then adjusted to -0.2 due to very.
  7. Add up all adjusted polarity scores.
  8. Normalize by dividing the total by the square root of the number of words in the review.
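The procedure above can be sketched as follows. The word lists are tiny illustrative stand-ins (assumptions, not the lecture's dictionaries), and the amplifier rule is one reading of the "not very good" example: the amplifier shifts the score by 0.8 in the direction of the word's base polarity, so -1 becomes -0.2 for a negated positive word. Tokenization is a plain whitespace split, so punctuation is not handled.

```python
import math

# Illustrative word lists (assumptions of this sketch).
VALENCE = {"good": 1.0, "great": 1.0, "clean": 1.0, "bad": -1.0, "dirty": -1.0}
NEGATIONS = {"not", "never", "no"}
AMPLIFIERS = {"very", "really", "extremely"}

def adjusted_polarities(words, before=4, after=2, amplifier_weight=0.8):
    """Return one context-adjusted polarity per valence word in the token list."""
    scores = []
    for i, word in enumerate(words):
        if word not in VALENCE:
            continue
        base = VALENCE[word]                      # +1 or -1
        context = words[max(0, i - before):i] + words[i + 1:i + 1 + after]
        score = base
        if any(w in NEGATIONS for w in context):  # negation flips the sign
            score = -score
        if any(w in AMPLIFIERS for w in context):
            # Assumed rule: shift by 0.8 in the direction of the base polarity.
            score += amplifier_weight * base
        scores.append(score)
    return scores

def sentiment_score(text):
    """Sum of adjusted polarities divided by sqrt(number of words)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(adjusted_polarities(words)) / math.sqrt(len(words))

example = adjusted_polarities("not very good".split())  # one value, close to -0.2
```

Under this reading, "very good" scores 1.8 before normalization and "not very bad" scores 0.2, which keeps the ordering one would expect; other amplifier rules are possible and would still reproduce the single example given in the notes.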

Summary

Sentiment analysis often relies on sentiment dictionaries: lists of words with predefined valence or emotion scores. A basic analysis counts how many positive and negative words are used in a review. To improve the analysis, we can correct for the context around each sentiment word, such as negations and amplifiers. The resulting sentiment score provides a summary of the review’s tone and can often be linked to the review’s star rating. By going beyond simple word counts and including context, we get a more accurate measure of sentiment.

Zooming In on Sentences and Words

To get a more detailed and nuanced view of sentiment, we can zoom in on the sentence and word level within reviews. Why do this? Sentiment at the full review level often doesn’t give us much more than a simple classification—positive or negative. It doesn’t tell us what people liked or disliked specifically. To gain deeper insights, we can look at:

  • Which words appear in positive vs. negative reviews
  • Which words appear in positive vs. negative sentences within those reviews

When we look at the word frequencies of negative reviews, it’s often hard to extract meaningful insights. In contrast, analyzing the word frequencies of just the negative sentences provides a clearer picture of the specific things people complain about. However, even sentence-level analysis can surface generic or uninformative words, like hotel or bad, which don’t really tell us much. These words appear frequently and therefore dominate the analysis.

Relative Frequencies for More Contrast

To reduce the impact of these very common words, we can go beyond absolute frequency and instead focus on relative frequency. Since most reviews have more positive sentences than negative ones, absolute counts will naturally be biased toward the positive side. A better approach is to look at the difference or ratio of word usage between positive and negative sentences. For example:

Relative frequency (positive) − Relative frequency (negative)

This helps identify words that are specifically used in one type of sentence more than the other. It highlights which words are strongly associated with either positivity or negativity. Still, while this approach surfaces typical sentiment-related words—like bad, issue, or worst—these don’t always give us a clear idea of what exactly people are referring to.
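A minimal sketch of this difference measure, assuming the sentences have already been labeled as positive or negative (the example sentences are made up). Each word's count is divided by the total number of words in its group before subtracting, which is what removes the bias from having more positive than negative sentences.

```python
from collections import Counter

# Made-up sentences, already classified by sentence-level sentiment.
positive_sentences = [
    "the location was great",
    "great breakfast and friendly staff",
    "the room was clean",
]
negative_sentences = ["the room was noisy", "the bathroom was dirty"]

def relative_frequencies(sentences):
    """Share of all word occurrences in the group that belongs to each word."""
    counts = Counter(word for s in sentences for word in s.split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

rel_pos = relative_frequencies(positive_sentences)
rel_neg = relative_frequencies(negative_sentences)

# Relative frequency (positive) - relative frequency (negative) for every word.
vocab = set(rel_pos) | set(rel_neg)
contrast = {w: rel_pos.get(w, 0.0) - rel_neg.get(w, 0.0) for w in vocab}

most_positive = sorted(contrast, key=contrast.get, reverse=True)[:3]
most_negative = sorted(contrast, key=contrast.get)[:3]
```

Words used at a similar rate in both groups end up near zero, so they drop out of both ends of the ranking.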

Zooming In on Nouns Using POS Tagging

To get more specific, we can look at the types of words people use when expressing sentiment. The most informative words are often nouns: they represent the actual things people are talking about or evaluating. To detect nouns in sentences, we use Part-of-Speech (POS) tagging, which is a natural language processing technique that assigns a grammatical role (e.g., noun, verb, adjective) to each word. POS tagging helps us focus our analysis on the things people mention in their feedback. By combining sentence-level sentiment analysis with POS tagging, we can isolate the nouns that are frequently mentioned in negative or positive sentences. This allows us to understand exactly what aspects of a service or product people are complaining about or praising. We can visualize this using word clouds, one for nouns in negative sentences and one for nouns in positive sentences. We can also do a similar analysis by tagging emotional content in sentences and examining which nouns are associated with each emotion.

Summary

Sentiment analysis at the full review level is limited in detail. By identifying sentiment at the sentence level and focusing on nouns, we gain more specific insights into what consumers are actually evaluating. Nouns represent the aspects of the service or product being discussed, and combining them with sentence-level sentiment or emotion gives a clearer view of the customer’s perspective.

LDA Application

This section demonstrates an application of Latent Dirichlet Allocation (LDA) using hotel review data. Before applying LDA, the reviews are preprocessed by removing stopwords and performing stemming. For computational speed, we only use the first 10,000 reviews. After preprocessing, a document-term matrix of dimensions 10,000 × 903 (documents by words) is created, keeping even infrequent words in the dataset.

Estimating LDA: Parameters and Settings

To estimate the LDA model, we need to specify several parameters:

  • Document-Term Matrix (DTM): The input data.
  • Number of topics (K): Determines how many latent topics we identify.
  • Gibbs Sampler: A simulation-based algorithm that approximates the model parameters by iteratively sampling from conditional distributions.
  • Burn-in Period: The initial steps of the algorithm, which are discarded before measurements begin to ensure the samples represent stable parameter estimates.
  • Iterations: Total number of sampling steps (in the example: 500 steps after the burn-in), providing the parameter samples used for final estimation.
  • Alpha and Beta: Hyperparameters that influence topic distribution across documents (alpha) and word distribution within topics (beta).
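To show how these settings fit together, here is a compact collapsed Gibbs sampler for LDA in plain Python. It is a teaching sketch, not the routine used in the lecture: the function name `lda_gibbs`, the default values, and the toy documents are all assumptions, and the input is a list of token lists rather than a document-term matrix.

```python
import random

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, burn_in=100, iterations=200, seed=0):
    """Return averaged document-topic probabilities (theta) and the vocabulary."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # words per topic in each doc
    topic_word = [[0] * n_vocab for _ in range(n_topics)]  # term counts per topic
    topic_total = [0] * n_topics
    assignments = []
    for d, doc in enumerate(docs):                          # random initial topics
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    theta_sum = [[0.0] * n_topics for _ in docs]
    for step in range(burn_in + iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                # Remove the word's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
        if step >= burn_in:                                 # keep only post-burn-in samples
            for d, doc in enumerate(docs):
                denom = len(doc) + n_topics * alpha
                for t in range(n_topics):
                    theta_sum[d][t] += (doc_topic[d][t] + alpha) / denom
    theta = [[s / iterations for s in row] for row in theta_sum]
    return theta, vocab

toy_docs = [["walk", "park", "city", "walk"], ["breakfast", "coffee", "breakfast"],
            ["park", "city", "breakfast"]]
theta, vocab = lda_gibbs(toy_docs, n_topics=2)
```

The burn-in samples are discarded, and the document-topic probabilities are averaged over the remaining iterations, matching the roles of the burn-in period and iterations described in the list above.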

Determining the Number of Topics

We select the number of topics using measures such as perplexity and coherence. Perplexity assesses the fit of the model; a higher perplexity score indicates poorer predictive power, so lower values are better. However, in practice, perplexity tends to favor very large topic numbers, which can reduce interpretability. For the hotel review data, a highly complex model (e.g., 50 topics) may not be practical. Alternatively, the coherence measure, which evaluates how well the identified topics align with actual word co-occurrences in the text, often provides more interpretable results. In the example, the coherence plot suggests using around 30 topics as it balances interpretability and complexity. It is also worth noting that the training data typically has higher coherence than the validation data, indicating some degree of overfitting.
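For reference, the usual definition of perplexity (not spelled out in these notes) can be written as follows. The symbols are assumptions of this sketch: \(w_{n,i}\) is the i-th word of document n, \(N_n\) is the number of words in document n, \(\theta_n\) is the document's topic distribution, and \(\beta_k\) is the word distribution of topic k.

\[ \text{perplexity} = \exp\left( - \frac{\sum_{n} \sum_{i=1}^{N_n} \log p(w_{n,i})}{\sum_{n} N_n} \right), \qquad p(w_{n,i}) = \sum_{k=1}^{K} \theta_{n,k} \, \beta_{k, w_{n,i}} \]

A model that assigns higher probability to the observed words gets a lower perplexity, which is why lower values indicate better fit.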

it using a smaller number of latent dimensions. These resulting matrices are often referred to as factors and loadings. However, PCA has a limitation in the context of text: the factors and loadings can include negative values. While this can make sense in some domains (e.g., scoring below average on a dimension), in text data it becomes less interpretable. For example, a negative value for the word service is hard to understand—while it’s natural to say a word is absent, it’s harder to say a word is “negatively present.”

NMF: Enforcing Non-Negativity

NMF solves this problem by enforcing non-negativity—all values in the resulting factor matrices are 0 or positive. Like PCA, the input is still a matrix V (such as a document-term matrix), where:

  • V is N × M, with N terms (words) and M documents
  • W is N × K and connects terms to K latent dimensions (basis matrix)
  • H is K × M and connects documents to the dimensions (coefficient matrix)
  • All entries in W and H are ≥ 0
  • V ≈ W H

The idea is similar to PCA, but with the key restriction that all factor values must be non-negative, making interpretation in text analysis much more intuitive.

Optimization and Estimation

Formally, the goal is to minimize the Euclidean distance (sum of squared errors) between the original matrix V and the product W H. We can also add regularization to:

  • Encourage sparsity in W (terms should relate to only a few dimensions)
  • Encourage sparsity in H (documents should activate only a few dimensions)

In R, one common estimation approach is alternating constrained least squares. This method iteratively:

  1. Estimates W given a current guess for H (using least squares)
  2. Updates H using the new W
  3. Repeats the process until convergence

The method treats V = W H + error as a regression model without an intercept. If we have a good estimate of W, we can estimate each column of H using linear regression. Then, using the updated H, we can re-estimate W.
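The alternating idea (update H with W fixed, then W with H fixed) can be illustrated with a short pure-Python sketch. Note the swap: instead of alternating constrained least squares, this sketch uses the well-known multiplicative update rules for the squared-error objective, because they keep all entries non-negative without a separate clipping step and need no linear solver. The toy matrix and the function name `nmf` are assumptions, and no sparsity regularization is included.

```python
import random

def matmul(A, B):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def squared_error(V, W, H):
    """Sum of squared differences between V and the product W H."""
    WH = matmul(W, H)
    return sum((v - p) ** 2 for rv, rp in zip(V, WH) for v, p in zip(rv, rp))

def nmf(V, k, iterations=200, seed=0, eps=1e-9):
    """Factorize V (terms x documents) into non-negative W (N x K) and H (K x M)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    errors = [squared_error(V, W, H)]
    for _ in range(iterations):
        # Update H with W fixed: H <- H * (W'V) / (W'WH), element-wise.
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(k)]
        # Update W with H fixed: W <- W * (VH') / (WHH'), element-wise.
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]
        errors.append(squared_error(V, W, H))
    return W, H, errors

terms = ["desk", "check", "park", "walk"]
V = [  # made-up term-by-document matrix: 4 terms, 5 reviews
    [1, 1, 0, 0, 1],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
]
W, H, errors = nmf(V, k=2)
```

Because the result depends on the random starting values, running `nmf` with several seeds and keeping the solution with the lowest final error is a simple way to guard against poor local optima.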

Challenges and Solutions

There are several known issues with this approach:

  • Negative values: Least squares may produce negative values—these are set to zero in NMF.
  • Local optima or non-convergence: Use multiple random starts and choose the best-fitting solution.
  • Slow convergence on large data: Use a smaller subset of the data to get good starting values.

Choosing the Number of Factors (K)

Determining the right number of dimensions is difficult. NMF is computationally expensive, so we can’t easily experiment with many values of K. More dimensions will usually reduce error, but without a prediction task or out-of-sample evaluation, it’s hard to assess the quality of the approximation. Some ways to choose K:

  • Use PCA to get a rough idea
  • Inspect the interpretability of the dimensions
  • Avoid dimensions that are hard to explain or unrelated to any theme

Practical Applications of NMF in Text

NMF results can be used to:

  • Identify similar documents (based on H values)
  • Identify similar words (based on W values)
  • Detect hidden topics or themes in a corpus

NMF in Recommendation Systems

NMF can also be used for recommendation:

  • V becomes a matrix of user-item ratings (with missing values)
  • Missing values are ignored during estimation of W and H
  • Predicted values for the missing entries can be used for recommendation
  • A high predicted score means an item is a good candidate to recommend

Summary

NMF factorizes a document-term matrix by linking words and documents to a smaller set of dimensions. These latent dimensions give structure to the data and reveal how words and documents relate to each other. NMF helps discover patterns in text, allows for comparison across documents, and supports prediction tasks. Its non-negativity constraint makes it easier to interpret than PCA, and it’s a flexible tool in both text mining and recommendation systems.

3.2 NMF Application

In this practical example, we apply NMF to hotel review data with the following preprocessing steps:

  • Remove stopwords and apply stemming
  • Use only the first 10,000 reviews for computational speed
  • Remove words occurring fewer than 200 times
  • Remove empty documents
  • Generate a document-term matrix of size 9999 × 527 (documents by terms)

The professor applies NMF with 10 factors, meaning the data is summarized into 10 dimensions representing common themes in the reviews.

Structure and Challenges of the Data

The data is structured as a document-term matrix, which is very sparse, with many zeros and relatively few ones. This sparsity makes it challenging to directly identify meaningful structures. By reducing dimensionality, NMF attempts to represent reviews in terms of fewer shared underlying dimensions. The goal of NMF is to minimize mispredictions by approximating the original matrix with lower-dimensional representations. Each review is represented by weights on a set of underlying dimensions, and these dimensions indicate the likelihood of specific words appearing if the dimension is active. In practice, Alternating Least Squares (ALS) performs best, achieving faster convergence and better fits (lower prediction error) compared to other algorithms.

Results and Interpretation

The result of NMF consists of two matrices:

  • Basis matrix W : connects terms (words) to dimensions.
  • Coefficient matrix H: connects dimensions to documents (reviews).

To interpret the results clearly, we visualize these matrices:

Terminology: Words vs. Terms

In LDA, it’s important to distinguish between:

  • Words: Individual words appearing in sentences, potentially multiple times within a document.
  • Terms: Unique words in the entire corpus; each term is distinct and occurs at least once across all documents.

The core question LDA tries to answer is: What process generates the observed words in each document?

Topic and Word Probabilities

For each word in a document, we first determine which topic it belongs to. Each topic has a probability distribution over the words in the corpus. For instance, if a topic relates to the hotel’s location, words like walk, park, and city would have high probabilities. Words unrelated to the location (such as breakfast) would have lower probabilities. Each document is characterized by its own distribution over topics (topic probabilities). If we have 20 topics, each review will have 20 probabilities, indicating how likely each topic is to be covered. Above this, there is an overall topic probability distribution across all documents, reflecting the average likelihood of each topic appearing in a randomly selected review. For example, if guests frequently mention noise more than location, the noise topic will generally have a higher probability across reviews.

Dirichlet and Multinomial Distributions

Topic probabilities in LDA are drawn from a Dirichlet distribution. Sampling from this distribution results in a probability vector (with elements between 0 and 1) summing up to 1, representing the likelihood of each topic appearing. Given these probabilities, a single topic for each word is selected using a multinomial distribution. The same multinomial distribution method applies when selecting specific words from the word distribution for a given topic.
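The generative steps described above can be sketched with the standard library alone. The topic-word probabilities and the Dirichlet parameter below are made-up illustration values, and drawing a Dirichlet vector by normalizing independent gamma draws is a standard construction rather than something stated in the notes.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw a probability vector from a Dirichlet by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, vocab, alpha, n_words, rng):
    """Generate one document: draw topic probabilities, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)            # topic probabilities for this document
    words = []
    for _ in range(n_words):
        topic = rng.choices(range(len(theta)), weights=theta)[0]        # pick a topic
        word = rng.choices(vocab, weights=topic_word_probs[topic])[0]   # pick a word from it
        words.append(word)
    return theta, words

vocab = ["walk", "park", "city", "breakfast"]
topic_word_probs = [
    [0.35, 0.30, 0.30, 0.05],   # made-up "location" topic
    [0.05, 0.05, 0.10, 0.80],   # made-up "breakfast" topic
]
rng = random.Random(0)
theta, words = generate_document(topic_word_probs, vocab, alpha=[0.5, 0.5], n_words=8, rng=rng)
```

Estimating LDA runs this story in reverse: given only the observed words, it infers the topic probabilities and word distributions that most plausibly generated them.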

Notation and Parameters in LDA

LDA uses the following notation and parameters:

  • K: Number of topics.
  • βk: Probability distribution of words for topic k. If topic k is selected, βk determines the likelihood of each word occurring within that topic.
  • θn: Topic probability distribution for document n. This indicates the extent to which a document is about certain topics, such as location, service, or noise.
  • δ: Parameter controlling the prior distribution of βk (usually has minimal practical impact).
  • α: Parameter controlling the prior distribution of θn. It represents how important each topic is on average within the entire corpus and influences the sparsity of topics within individual documents.

Illustrative Example

Suppose we have K = 2 topics: “university” and “spare time,” with a vocabulary consisting of four words: lecture, school, party, and friends. The first two words relate mainly to university, the latter two to spare time.

  • β1 (university): (0.4, 0.4, 0.01, 0.19), meaning words like “lecture” or “school” have a 40% probability each under this topic.
  • β2 (spare time): (0.01, 0.01, 0.59, 0.39), indicating a 59% probability for “party” and 39% for “friends.”
  • θn = (0.9, 0.1) for a specific document, meaning that document is predominantly (90%) about the university topic and only 10% about spare time.

To calculate the probability of a particular word (e.g., “friends") in this document, we sum over topics:

$$P(\text{friends}) = 0.9 \times 0.19 + 0.1 \times 0.39 = 0.21$$

Thus, βk encodes how strongly words are associated with each topic, allowing words to belong to multiple topics simultaneously. Both βk (word-topic probabilities) and θn (document-topic probabilities) are drawn from Dirichlet distributions, ensuring that each set of probabilities sums to one. Topics can differ systematically in their average sizes or importance.
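The word probability from the illustrative example can be checked with a short Python sketch. It uses the numbers given in the example; the variable and function names are only illustrative.

```python
vocab = ["lecture", "school", "party", "friends"]

# Word distributions per topic (beta) and the document's topic mix (theta_n),
# taken from the illustrative example.
beta = {
    "university": [0.40, 0.40, 0.01, 0.19],
    "spare time": [0.01, 0.01, 0.59, 0.39],
}
theta_n = {"university": 0.9, "spare time": 0.1}

def word_probability(word):
    """P(word | document): sum over topics of P(topic) * P(word | topic)."""
    j = vocab.index(word)
    return sum(theta_n[topic] * beta[topic][j] for topic in theta_n)

p_friends = word_probability("friends")   # 0.9 * 0.19 + 0.1 * 0.39
print(round(p_friends, 4))
```

Summing `word_probability` over all four words gives 1, which confirms that mixing the topic distributions again yields a valid probability distribution over the vocabulary.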

LDA as a Soft Clustering Method

LDA can be viewed as a soft clustering method. Unlike hard clustering, where each document is assigned exclusively to a single cluster, soft clustering allows documents (and words) to belong to multiple topics simultaneously. Documents can discuss combinations of topics (e.g., location, service, or noise), which is more realistic and computationally efficient than hard clustering. Hard clustering with multiple topics requires a combinatorial number of clusters to represent combinations of topics, whereas soft clustering efficiently represents mixed-topic documents with fewer clusters.

Estimation and Implementation of LDA

To estimate an LDA model, several steps are required:

  1. Fix the number of topics (K).
  2. Estimate the probability distributions of words within topics (βk).
  3. Estimate the topic distributions within documents (θn).
  4. Estimate the parameter α, governing how topics are distributed across documents. A large α would mean that each document covers all topics with similar probabilities, which is unrealistic in practice. Typically, topic distributions are sparse—documents often cover only a few topics intensively.

Estimation is typically done through algorithms such as Gibbs sampling or variational inference. These iterative algorithms rely on conditional distributions, continually updating one set of parameters given current estimates of the others, until convergence is reached. For example, consider estimating the probability of observing a specific word given two topics: if topic 1 has a 40% chance (α = 0.4) and topic 2 a 60% chance (α = 0.6), and a particular word has probabilities 0.2 under topic 1 and 0.1 under topic 2, we calculate:

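$$P(\text{topic 1} \mid \text{word}) = \frac{0.4 \times 0.2}{(0.4 \times 0.2) + (0.6 \times 0.1)} = \frac{0.08}{0.14} \approx 0.57$$

Thus, this word would most likely be associated with topic 1, even though topic 2 is more common overall.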
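The conditional probability used in this example is an application of Bayes' rule, and it is the kind of quantity an iterative sampler evaluates repeatedly. A minimal Python sketch, using the numbers from the example (the function name is an assumption, not part of any library):

```python
def topic_posterior(prior, likelihood):
    """P(topic | word) for each topic, given P(topic) and P(word | topic)."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)
    return [j / total for j in joint]

# Topic 1: 40% chance, word probability 0.2; topic 2: 60% chance, word probability 0.1.
posterior = topic_posterior(prior=[0.4, 0.6], likelihood=[0.2, 0.1])
print([round(p, 3) for p in posterior])   # first entry is 0.08 / 0.14
```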

Choosing the Number of Topics

A critical challenge is deciding the optimal number of topics. Increasing the number of topics generally improves in-sample fit, but makes the model more complex. The trade-off involves balancing model fit and complexity. To evaluate topic models, we use measures like:

  • Perplexity: Lower perplexity indicates a better fit.
  • Coherence: Indicates how well the identified topics reflect real co-occurrences of words in the data. High coherence means that words frequently appear together as expected by the topic structure.
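To make the perplexity measure concrete, the sketch below uses one common definition: the exponential of the negative average log-probability that the model assigns to the observed words. The notes do not spell out the formula, so this definition and the example probabilities are assumptions for illustration only.

```python
import math

def perplexity(word_probs):
    """exp(-average log-probability) of the observed words under a model."""
    avg_log_prob = sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(-avg_log_prob)

# A model that spreads probability evenly over four words is as uncertain
# as a uniform guess among four options (perplexity of about 4).
uniform_model = perplexity([0.25] * 8)

# A model that assigns higher probability to the words that actually occur
# gets a lower perplexity, i.e. a better fit.
sharper_model = perplexity([0.5, 0.4, 0.5, 0.4])

print(uniform_model, sharper_model)
```

In practice perplexity is usually computed on held-out documents, so that adding more topics is not rewarded for simply fitting the training data more closely.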

this means it is roughly four times more likely to find actor in the context of director than in that of music. To further illustrate, we normalize the raw frequencies so that each row (representing a focal word) sums to 1, as shown in Table 5.

Table 5: Normalized Word Co-occurrence Matrix (Row-wise Totals)

           music    funni    action   director  actor    best     amaz     Total
music      0.01108  0.00262  0.00477  0.00662   0.00539  0.01724  0.00308  1
funni      0.00317  0.00745  0.00279  0.00317   0.00745  0.00428  0.00242  1
action     0.00554  0.00268  0.00769  0.00733   0.00554  0.01502  0.00268  1
director   0.00562  0.00222  0.00535  0.00340   0.02128  0.01972  0.00313  1
actor      0.00343  0.00392  0.00304  0.01597   0.00529  0.03683  0.00490  1
best       0.01017  0.00209  0.00763  0.01371   0.03414  0.01489  0.00154  1
amaz       0.00765  0.00497  0.00573  0.00917   0.01911  0.00650  0.00726  1
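The row-wise normalization behind Table 5 can be sketched as follows. The raw counts here are invented for illustration (they are not the counts from the movie review data); only the procedure of dividing each row by its total follows the text.

```python
def normalize_rows(counts):
    """Divide every co-occurrence count by its row total so each row sums to 1."""
    normalized = {}
    for focal, row in counts.items():
        total = sum(row.values())
        normalized[focal] = {context: c / total for context, c in row.items()}
    return normalized

# Hypothetical raw co-occurrence counts: focal word -> {context word: count}.
counts = {
    "actor": {"music": 10, "director": 40, "best": 50},
    "music": {"music": 30, "director": 20, "best": 50},
}
shares = normalize_rows(counts)

# Ratio of context probabilities for the focal word "actor".
ratio = shares["actor"]["director"] / shares["actor"]["music"]
print(shares["actor"], ratio)
```

Ratios of such normalized values are what allow statements like "it is several times more likely to find one word in the context of another".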

GloVe Objective Function: The final GloVe model is expressed as:

$$w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}$$

This equation states that the logarithm of the co-occurrence frequency $X_{ik}$ between word i and word k is approximated by three components:

  • The dot product $w_i^T \tilde{w}_k$, which measures how similar or related the two words are.
  • A bias term $b_i$ for the focal word.
  • A bias term $\tilde{b}_k$ for the context word.

In simple terms, if two words appear together very frequently, their embeddings (and bias terms) will combine to predict a high log co-occurrence count. If they rarely appear together, the prediction will be low. GloVe learns these word embeddings by minimizing the difference between the left- and right-hand sides of this equation for all word pairs. The model is estimated using a weighted least squares approach and ignores word pairs with zero observed co-occurrence to avoid issues with log(0).
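The weighted least squares objective described above can be sketched in plain Python. The notes do not give the weighting function, so the one used here (the count divided by `x_max`, raised to the power 0.75 and capped at 1, as in the original GloVe formulation) is an assumption, as are the toy counts and parameter values. The sketch only evaluates the loss; it does not perform the iterative optimization.

```python
import math

def glove_loss(X, W, W_ctx, b, b_ctx, x_max=100.0, power=0.75):
    """Weighted squared error between model predictions and log co-occurrences."""
    loss = 0.0
    for i, row in enumerate(X):
        for k, x in enumerate(row):
            if x == 0:
                continue  # zero co-occurrence: skipped, so log(0) never occurs
            weight = min(1.0, (x / x_max) ** power)  # assumed weighting function
            dot = sum(wi * wk for wi, wk in zip(W[i], W_ctx[k]))
            prediction = dot + b[i] + b_ctx[k]
            loss += weight * (prediction - math.log(x)) ** 2
    return loss

# Toy 2x2 co-occurrence matrix with zeros on the diagonal (hypothetical).
X = [[0.0, 4.0], [4.0, 0.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]   # word vectors fixed at zero for simplicity

# With both bias terms equal to log(4) / 2, every prediction equals log(4).
half = math.log(4.0) / 2
perfect_loss = glove_loss(X, zeros, zeros, [half, half], [half, half])

# With all parameters at zero, predictions are 0 and the loss is positive.
naive_loss = glove_loss(X, zeros, zeros, [0.0, 0.0], [0.0, 0.0])
print(perfect_loss, naive_loss)
```

Training amounts to adjusting the vectors and bias terms so that this loss decreases from something like `naive_loss` toward zero.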

Implementation Details: When training a GloVe model in R, two critical parameters must be set:

  1. Context Window: This determines how many words around a focal word are considered (e.g., a symmetric window of length 6).
  2. Embedding Dimensionality: This defines the number of dimensions for each word vector (e.g., 50).

The training procedure runs iteratively until the model’s loss function is minimized. After training, each word is represented by a numerical vector, where similar vectors indicate similar word meanings. Table 6 shows an example of the resulting word embedding values. These embeddings allow us to analyze semantic similarity among words. For instance, words that frequently appear in similar contexts will have similar vectors.

In conclusion, the GloVe model translates raw co-occurrence counts into word embeddings. These embeddings capture the statistical information about word usage in the corpus, revealing the latent semantic relationships between words.

Table 6: Matrix of Word Embedding Values (Example)

      music     funni     action    director
1     -0.11469   0.06212  -0.40551   0.
2     -0.03010   0.36766  -0.34198   0.
3     -0.27864  -0.34814  -0.07261   0.
4     -0.05495   0.01586   0.42751  -0.
5     -0.12153   0.17083  -0.00373  -0.
6     -0.43247  -0.40902  -0.19929   0.
7      0.00976  -0.57009  -0.34870   0.
8     -0.77751   0.14515  -0.00828  -0.
9     -0.09279  -0.04932  -0.79857  -0.
10    -0.56523  -0.35379  -0.48263  -0.

Implementation of the GloVe Model

To train a GloVe model in R, two key parameters must be specified:

  1. Context Window: This defines both the size and shape of the window (e.g., a symmetric window with a length of 6) that determines which neighboring words are considered as context.
  2. Embedding Dimensionality: This sets the size of the word embeddings (e.g., 50 dimensions).

Training the model is an iterative procedure. At each iteration, the model adjusts the word vectors and bias terms to lower the loss function. Once the model converges, it outputs word embeddings—numerical vectors representing each word. For example, one row of the embedding matrix might be:

Table 7: Matrix of Example Word Embedding Values

      music     funni     action    director
1     -0.11469   0.06212  -0.40551   0.
2     -0.03010   0.36766  -0.34198   0.
3     -0.27864  -0.34814  -0.07261   0.
4     -0.05495   0.01586   0.42751  -0.
5     -0.12153   0.17083  -0.00373  -0.
6     -0.43247  -0.40902  -0.19929   0.
7      0.00976  -0.57009  -0.34870   0.
8     -0.77751   0.14515  -0.00828  -0.
9     -0.09279  -0.04932  -0.79857  -0.
10    -0.56523  -0.35379  -0.48263  -0.

These embeddings capture the meaning of each word; words with similar meanings are expected to have similar vectors.

Interpreting Word Embeddings with Cosine Similarity

Since embeddings are high-dimensional vectors, you can think of them as points in a high-dimensional space. The cosine similarity quantifies the similarity between two word vectors as the cosine of the angle between them: it is close to 1 when the vectors point in the same direction and close to 0 when they are nearly orthogonal. In simple terms, words with similar meanings tend to point in the same direction and thus have a high cosine similarity. For example, consider the following normalized word similarity matrix calculated using cosine similarity:

Thus, by analyzing these embeddings, we can infer semantic similarities and differences among words.
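Cosine similarity itself is simple to compute. The sketch below is a plain Python version with made-up two-dimensional vectors; real embeddings would have many more dimensions (for example 50).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, chosen only to show the three typical cases.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])        # unrelated directions
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])        # opposite directions
print(same_direction, orthogonal, opposite)
```

Because the measure depends only on direction, two vectors of very different length can still have a cosine similarity close to 1.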

Getting Insights from Word Embeddings

To extract meaningful insights from word embeddings, it is crucial to examine the relationships between words. One effective way to do this is by using the analogy framework: A is to B as C is to D. In this framework, the idea is that the relationship between words can be captured through vector arithmetic. For example, consider the analogy:

king − man + woman ≈ queen.

This means that if you take the vector for king, subtract the vector for man, and then add the vector for woman, the result is a vector that is very similar to the vector for queen. Another example is:

walking − park + pool ≈ swimming,

which suggests that the difference between walking and park can be transformed into a difference between swimming and pool. In other words, you can add or subtract features from word embeddings to highlight specific attributes or semantic shifts.

We can apply the same idea to words in our movie review data. For example, if we take the embedding of actor, add the embedding for drama, and subtract the embedding for comedy, we aim to capture the characteristics of an actor’s performance in dramatic roles as opposed to comedic roles. The resulting vector should emphasize the traits that define drama over comedy.

The similarity scores between actor and various related words can then be analyzed to understand these relationships better. Table 11 below shows the cosine similarity scores of actor with other words in our dataset. Higher values indicate that the word is more similar to actor, while lower values suggest less similarity.

Table 11: Similarity Scores for Words Related to actor

Word      Similarity
actor     0.
cast      0.
support   0.
actors,   0.
actress   0.
talent    0.
act       0.
cast,     0.
actors.   0.
cast.     0.

In summary, by leveraging the concept that “A is to B as C is to D”, we can perform vector arithmetic to uncover semantic relationships in our data. This enables us to better understand the subtle differences in word meanings—for instance, how the context of an actor can shift when associated with drama rather than comedy. The cosine similarity between the resulting vectors provides a quantitative measure of these relationships, facilitating deeper insights into the semantic structure of the text.
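The analogy arithmetic can be sketched in Python with hand-made two-dimensional vectors. The vectors below are hypothetical and constructed so that the analogy works exactly; with trained embeddings the result is only approximately equal to the target word.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, embeddings):
    """Return the word closest to a - b + c, excluding the three input words."""
    target = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = {w: vec for w, vec in embeddings.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))

# Hypothetical embeddings: first dimension ~ "royalty", second ~ "gender".
toy_embeddings = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}
result = analogy("king", "man", "woman", toy_embeddings)   # king - man + woman
print(result)
```

With embeddings trained on the movie reviews, a call such as `analogy("actor", "comedy", "drama", embeddings)` would correspond to the actor + drama − comedy example from the text.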

4.1 Practice exam question

An analyst at Albert Heijn (AH, a large Dutch supermarket chain) wants to better understand the success of Picnic (a startup in grocery home delivery services), as they are more successful in acquiring customers for their home delivery service. To gain this insight, the analyst has collected review data from customers of both stores. The analyst wants to apply the GloVe algorithm to the data she collected. As a first step, the data needs to be cleaned.

(a) Provide one reason why you would want to remove stop words and one reason why you might not want to remove stop words when the aim is to apply the GloVe algorithm.

ANSWER:

(b) Explain why the adjustment of the word co-occurrence matrix is needed, even when the company name is mentioned in all reviews.

ANSWER:

(c) Which word analogy would you obtain from the data to learn how the service of Picnic differs from that of AH? Also explain why this analogy is informative about this question.

ANSWER:

5 week 6

5.1 Predicting Ratings

N-grams extend the bag-of-words concept by recognizing that adjacent words can be interrelated. When two words frequently appear next to each other, this combination can convey additional, relevant information. Using n-grams, we can identify frequent and meaningful word combinations. In addition, skip n-grams, which allow nonadjacent words to be grouped, provide further insights.

By analyzing both individual words and n-grams in reviews, we can predict whether a review is positive or negative. For example, we may select the top 50 words and bigrams from a text and then apply logistic regression using these words as features to model the sentiment of the review. The model’s output includes p-values, where a low p-value indicates a statistically significant association between the word and the review’s sentiment. A positive coefficient suggests that the word is more indicative of a positive review, that is, it corresponds to a higher predicted probability of a “happy” review.

It is important to note that the same word may appear both as a unigram (e.g., “recommend”) and as part of a bigram (e.g., “highly recommend”). When this occurs, both estimates are included in the model, resulting in a combined effect on the prediction. When moving from unigrams to bigrams, the interpretation should be made under the ceteris paribus assumption (i.e., holding all other factors constant): if a word appears both on its own and within a bigram, the effect of the bigram should be considered together with the effect of the unigram.
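The combined effect of a unigram and a bigram can be illustrated with a small Python sketch. It extracts unigrams and bigrams from a review and plugs them into a logistic model. The coefficients are hypothetical (not estimates from the course data), and the use of presence indicators rather than counts as features is an assumption.

```python
import math

def ngrams(tokens, n):
    """All runs of n adjacent tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def predict_positive(review, coefs, intercept=0.0):
    """Predicted probability of a positive review under a logistic model."""
    tokens = review.lower().split()
    features = set(ngrams(tokens, 1)) | set(ngrams(tokens, 2))
    # Every feature present in the review adds its coefficient to the log-odds.
    log_odds = intercept + sum(c for term, c in coefs.items() if term in features)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical coefficients for illustration only.
coefs = {"recommend": 0.8, "highly recommend": 0.6, "refund": -1.2}

p_unigram = predict_positive("I recommend it", coefs)          # only "recommend"
p_bigram = predict_positive("I highly recommend it", coefs)    # both terms apply
print(p_unigram, p_bigram)
```

In the second review both "recommend" and "highly recommend" are present, so the log-odds are 0.8 + 0.6 = 1.4: the bigram coefficient is interpreted on top of the unigram coefficient, not instead of it.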

5.2 Predicting Ratings Using Emotions

We can use a generalized linear model (GLM) to predict whether a review evaluation is positive or negative based on the emotional content of its words. The predictor variables are sentiment scores derived from a sentiment analysis. These scores include both specific emotions (such as anger, fear, surprise, and joy) and more global sentiments categorized as negative or positive.

The GLM results from the professor indicate that most emotions significantly affect the review evaluation. In the model, positive emotions have a positive coefficient (i.e., a positive estimate), while negative emotions have a negative coefficient. Notably, both surprise and trust show negative coefficients, suggesting that reviews expressing these emotions tend to be rated as more negative. This finding is surprising because, intuitively, these emotions might be expected to correlate with more positive evaluations.

One explanation for the negative coefficients of trust and surprise is the issue of double counting. That is, words associated with these specific emotions are simultaneously captured by the overall positive sentiment measure. When the overall positive and negative sentiment features are removed from the analysis, the effects of trust and surprise become insignificant. Furthermore, excluding the emotion scores leads to a stronger effect for the global positive and negative sentiment measures, resulting in larger estimated effect sizes.

In conclusion, when interpreting sentiment in a GLM, it is essential to consider the ceteris paribus assumption and recognize that the presence of overlapping variables (specific emotions and overall sentiments) can influence the interpretation of the coefficients.

5.3 Predicting using PCA Results

Principal Component Analysis (PCA) can be used to summarize review text data. In the example, 20 principal components are extracted to capture the variability in the review text and used as predictors in a logistic regression model to classify reviews as positive or negative. Many of these components are statistically significant. Although the PCA factors themselves do not have an inherent interpretation, we can gain insight by examining the top 10 words associated with each factor. For example, one component