

























































Summary text analytics from EUR


























































Word clouds can be used to summarize text data. A basic word cloud simply counts the number of times each word appears in the text. Words that occur more frequently are shown larger and are typically placed toward the center of the cloud. While a histogram is the traditional approach to visualize word frequencies, it can only show a limited number of words clearly. In contrast, word clouds can display a much larger set of words at once, providing a broader overview of the text content. Word clouds are therefore useful to quickly identify prominent words and get a general sense of what the text is about. However, they do not reveal much about the underlying relationships or structure of the text. To gain deeper insight, we can compare word clouds from different subsets of text. For example, comparing word usage between positive and negative customer reviews helps us explore which words are more typical for each group. However, since both happy and unhappy customers often talk about similar things, their most frequent words tend to overlap. This makes it difficult to identify what distinguishes the two groups. To address this, we can use a commonality cloud, which shows the words shared between both groups. Alternatively, a comparison cloud highlights the words that appear more frequently in one group than the other. Still, both approaches focus on frequent words rather than the most discriminative ones. The main issue with these visualizations is that they are dominated by very frequent words, which are not necessarily the most informative. To control for this, we can use TF-IDF (Term Frequency – Inverse Document Frequency).
$$\text{Relative Frequency} = \frac{\mathrm{TF}(\text{Group A})}{\mathrm{TF}(\text{Group B})}$$
Words that appear much more in Group A than Group B will have high values and are therefore more characteristic of Group A. Words used equally in both groups will receive a score around 1. This method helps identify the most diagnostic words for each group.
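As a minimal sketch of this ratio, the snippet below computes TF(Group A) / TF(Group B) for every word. It assumes term frequency means a word's share of all tokens in its group, and it uses a small smoothing constant `floor` (an assumption, not part of the notes) so that words missing from one group do not cause a division by zero. The example reviews are made up.

```python
from collections import Counter

def term_frequencies(docs):
    """Share of all tokens in `docs` accounted for by each word."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def relative_frequency(docs_a, docs_b, floor=1e-4):
    """Ratio TF(Group A) / TF(Group B) for every word seen in either group.

    `floor` is an assumed smoothing constant for words absent from a group.
    """
    tf_a, tf_b = term_frequencies(docs_a), term_frequencies(docs_b)
    vocab = set(tf_a) | set(tf_b)
    return {w: max(tf_a.get(w, 0.0), floor) / max(tf_b.get(w, 0.0), floor)
            for w in vocab}

# Hypothetical mini-corpus: Group A = positive reviews, Group B = negative reviews.
positive = ["great location friendly staff", "great breakfast"]
negative = ["noisy room rude staff", "room dirty"]
ratios = relative_frequency(positive, negative)

# Words with the largest ratios are the most characteristic of Group A.
top_for_positive = sorted(ratios, key=ratios.get, reverse=True)[:3]
```

Here `staff` is used equally often in both groups and therefore scores 1, while `great` scores far above 1 and `room` far below.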
Summary of the Video
Text visualizations often rely on four features:
MDS focuses on location and distance. It’s often compared to a map, where distances between objects represent their relationships. In this case, we aim to map words based on how "close" they are to each other in text. To build such a map, we must first define distances or similarities between words. Common similarity definitions include:
We can measure these relationships by counting how often words co-occur within a given context. For example, we begin by counting co-occurrences within full reviews:
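| | locat | night | nice | clean | bed | service |
| --- | --- | --- | --- | --- | --- | --- |
| locat | 22842 | 14234 | 13457 | 12553 | 12055 | 10459 |
| night | 14234 | 21337 | 14026 | 11916 | 14760 | 10914 |
| nice | 13457 | 14026 | 19596 | 11763 | 12708 | 9789 |
| clean | 12553 | 11916 | 11763 | 18120 | 10987 | 7501 |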
Table 1: Co-occurrence matrix of frequently used review words
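Counts like those in Table 1 could be produced along the following lines. This is a sketch under two assumptions: the context is the full review, and each word is counted at most once per review, so the diagonal holds the number of reviews that contain the word. The function name and the example reviews are illustrative.

```python
from collections import Counter
from itertools import combinations_with_replacement

def cooccurrence_counts(reviews, vocab):
    """Count in how many reviews each pair of vocabulary words appears together."""
    vocab = set(vocab)
    counts = Counter()
    for review in reviews:
        # Binary presence: a word counts once per review, however often it occurs.
        present = sorted(set(review.lower().split()) & vocab)
        for a, b in combinations_with_replacement(present, 2):
            counts[(a, b)] += 1
            if a != b:
                counts[(b, a)] += 1  # keep the matrix symmetric
    return counts

# Hypothetical, already stemmed reviews.
reviews = ["nice bed nice night", "clean bed", "night bed"]
counts = cooccurrence_counts(reviews, ["bed", "night", "nice", "clean"])
```

Narrowing the context (for example to sentences) only changes what is passed in as `reviews`.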
Words that co-occur frequently, such as night and bed, are expected to be positioned close to each other in the MDS map. The goal of MDS is to find positions for each word in a 2D space such that
$$\sum_{i,j} \left(\text{Distance}_{\text{map}}(i, j) - \text{Distance}_{\text{data}}(i, j)\right)^2$$
is minimized. However, co-occurrence values are a measure of similarity, not distance, so they must first be transformed. This involves:
best predict the data. For text, we use the word vector format—typically a binary format with 0s and 1s to indicate whether a word appears in a review or not. PCA then tries to predict the presence or absence of words in the reviews based on these components. A key decision when applying PCA is how many components to keep. For text data, this decision is more difficult, since text is very unstructured. Adding more components will result in a more detailed representation, but at some point, the additional components provide little new insight and might even make interpretation harder. Also, some interesting relationships could become harder to see when too many components are included. If we use variance explained as a criterion for selecting components, we must keep in mind that variance explained tends to be low for text data. This is because it’s hard to predict whether someone will write “great”, “excellent”, or “good” to describe the same thing—there is a lot of variability in word choice. Common tools to decide on the number of components include:
In practice, the scree plot and variance explained often don’t lead to clear decisions. The number of components depends on the goal of the analysis. If the focus is on accuracy, including many components might help (up to the point of overfitting). If the goal is interpretation, using more than 10 components might be too much and could confuse the audience, especially when visualizing. In the example shown by the professor, the top 5 components are used. For each component, the 10 most important words are shown. Many words appear across multiple components, and there’s no clear one-to-one link between specific words and components. Instead, the components are often related to broad groups of words. The idea of PCA is to find the directions in high-dimensional space that capture the most information. If we use 10 components, we try to summarize the content of a review using 10 numbers. This helps with prediction, but the interpretation is not always straightforward. Since the principal components exist in a high-dimensional space, we can change the coordinate system and find new directions that are easier to interpret. One such technique is varimax rotation. After applying this rotation, the interpretation improves: for example, one component might point toward words like desk, front, check, and time—words related to the front desk and check-in process. Another component might point toward words like park, walk, and nice—suggesting a more location-related or general satisfaction theme. With these rotated dimensions, we can also plot individual reviews. The location of each review in the plot shows how strongly it is related to each component. For example, a review placed far to the left on the x-axis may focus more on front desk service. In the professor’s example, two reviews are discussed. One is located further to the left and downward, while another is closer to the x-axis. 
Even though review 38933 is closer to the x-axis, the factor score of review 24561 is larger in absolute value—meaning it is more strongly related to that component. We can also visualize other components—for example, dimensions 3 and 4. In this plot, only a few words are strongly related to these components. On the left (x-axis / dimension 3), we see words like time, walk, and location. On the vertical (y-axis / dimension 4), we see words like nice, service, and night. A more advanced plot, called a biplot, shows both the reviews and the words that define each component. This helps interpret what each dimension represents and how individual reviews relate to those dimensions.
Summary
To interpret PCA components, we look at the words that are most strongly associated with each dimension. Each review can be described based on its position in this latent space, and the component scores can be used as input in further analyses, for example to study whether certain components are more related to positive or negative reviews. Rotation improves interpretability in follow-up analyses. PCA works with word vector input and therefore only captures information at the global review level. It does not capture the local structure or the distances between words in the original text.
| MDS | Principal Components Analysis |
| --- | --- |
| Summarizes structure from similarity data | Summarizes structure underlying word vector data |
| Visualization in 2 dimensions | Results in many dimensions, one per component |
| Single picture showing it all | (Too) many pictures needed to give overall picture |
| Groups of words represent common themes in reviews | Directions in the graph represent (presence of) themes |
| Reviews cannot be positioned or summarized in "MDS" space | Reviews can be summarized with scores on the various components |
| | A review is the sum of its parts |
| | Provides features that can be used in, for example, prediction tasks |
Table 3: Comparison between MDS and Principal Components Analysis (PCA)
1.1.1 Question 1
An analyst at Amazon wants to better understand the success of Zalando, as they are more successful in selling clothes and shoes than Amazon. To gain insight, the analyst has collected review data from customers of both stores. The analyst wants to understand what is driving the success (measured by having a positive review) for Zalando, relative to Amazon. From a competitive perspective, it is also interesting to know what the major weaknesses are of Zalando, as Amazon can then try to avoid those and take over part of the market by focusing on the weaknesses of its competitor. Explain what analysis you would do and how you would efficiently summarize the findings relevant for the two questions raised above. Note that your summary has to be efficient, but that you are not restricted to a single “summary measure”
1.1.2 Answer Question 1
The analyst wants to understand what is driving the success (measured by having a positive review) for Zalando, relative to Amazon. The relative strength of Zalando versus Amazon can be obtained by counting words for positive reviews or for positive sentences. This shows what people are enthusiastic about for both companies. To identify the relative strengths, the ratio of words for Amazon vs Zalando has to be shown, as otherwise the common themes will be shown. The strengths of Zalando can then be visualized using the relative frequency for the size of the words. From a competitive perspective, it is also interesting to know what the major weaknesses are of Zalando, as Amazon can then try to avoid those and take over part of the market by focusing on the weaknesses of its competitor. Similarly, the weaknesses can be shown. Note: Studying words in positive sentences could also work, but success is measured with having a positive review according to the question
1.1.3 Question 2
Principal components analysis allows a researcher to efficiently summarize data with very many variables. The following graph depicts results that were generated using reviews on games. Provide interpretation of this graph in terms of
1.1.4 Answer Question 2
Improved Sentiment Scoring Procedure
The professor introduces a more nuanced scoring approach:
Summary
Sentiment analysis often relies on sentiment dictionaries: lists of words with predefined valence or emotion scores. A basic analysis counts how many positive and negative words are used in a review. To improve the analysis, we can correct for the context around each sentiment word, such as negations and amplifiers. The resulting sentiment score provides a summary of the review’s tone and can often be linked to the review’s star rating. By going beyond simple word counts and including context, we get a more accurate measure of sentiment.
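The idea of correcting a dictionary score for context can be sketched as follows. Everything specific here is an assumption made for illustration: the tiny word lists, the amplifier weights, and the two-word look-back window are hypothetical and not the professor's actual procedure or any real sentiment lexicon.

```python
# Hypothetical mini-dictionaries; a real analysis would use a full lexicon.
POSITIVE = {"good", "great", "clean", "friendly"}
NEGATIVE = {"bad", "dirty", "rude", "noisy"}
NEGATIONS = {"not", "never", "no"}
AMPLIFIERS = {"very": 1.5, "really": 1.5, "extremely": 2.0}

def sentiment_score(text, window=2):
    """Sum word valences, flipping or scaling them based on preceding words."""
    tokens = text.lower().split()
    score = 0.0
    for i, token in enumerate(tokens):
        if token in POSITIVE:
            value = 1.0
        elif token in NEGATIVE:
            value = -1.0
        else:
            continue
        # Inspect up to `window` words before the sentiment word.
        for prev in tokens[max(0, i - window):i]:
            if prev in NEGATIONS:
                value = -value              # "not clean" counts as negative
            elif prev in AMPLIFIERS:
                value *= AMPLIFIERS[prev]   # "very good" counts extra
        score += value
    return score
```

With these assumed settings, "the room was not clean" scores -1.0 where a plain word count would have scored it +1. Punctuation is not handled, so the input is assumed to be cleaned text.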
To get a more detailed and nuanced view of sentiment, we can zoom in on the sentence and word level within reviews. Why do this? Sentiment at the full review level often doesn’t give us much more than a simple classification—positive or negative. It doesn’t tell us what people liked or disliked specifically. To gain deeper insights, we can look at:
When we look at the word frequencies of negative reviews, it’s often hard to extract meaningful insights. In contrast, analyzing the word frequencies of just the negative sentences provides a clearer picture of the specific things people complain about. However, even sentence-level analysis can surface generic or uninformative words, like hotel or bad, which don’t really tell us much. These words appear frequently and therefore dominate the analysis.
Relative Frequencies for More Contrast
To reduce the impact of these very common words, we can go beyond absolute frequency and instead focus on relative frequency. Since most reviews have more positive sentences than negative ones, absolute counts will naturally be biased toward the positive side. A better approach is to look at the difference or ratio of word usage between positive and negative sentences. For example:
Relative frequency (positive) − Relative frequency (negative)
This helps identify words that are specifically used in one type of sentence more than the other. It highlights which words are strongly associated with either positivity or negativity. Still, while this approach surfaces typical sentiment-related words—like bad, issue, or worst—these don’t always give us a clear idea of what exactly people are referring to.
Zooming In on Nouns Using POS Tagging
To get more specific, we can look at the types of words people use when expressing sentiment. The most informative words are often nouns—they represent the actual things people are talking about or evaluating. To detect nouns in sentences, we use Part-of-Speech (POS) tagging, which is a natural language processing technique that assigns a grammatical role (e.g., noun, verb, adjective) to each word. POS tagging helps us focus our analysis on the things people mention in their feedback. By combining sentence-level sentiment analysis with POS tagging, we can isolate the nouns that are frequently mentioned in negative or positive sentences. This allows us to understand exactly what aspects of a service or product people are complaining or praising. We can visualize this using word clouds—one for nouns in negative sentences and one for nouns in positive sentences. We can also do a similar analysis by tagging emotional content in sentences and examining which nouns are associated with each emotion.
Summary
Sentiment analysis at the full review level is limited in detail. By identifying sentiment at the sentence level and focusing on nouns, we gain more specific insights into what consumers are actually evaluating. Nouns represent the aspects of the service or product being discussed, and combining them with sentence-level sentiment or emotion gives a clearer view of the customer’s perspective.
This section demonstrates an application of Latent Dirichlet Allocation (LDA) using hotel review data. Before applying LDA, the reviews are preprocessed by removing stopwords and performing stemming. For computational speed, we only use the first 10,000 reviews. After preprocessing, a term-document matrix of dimensions 10,000 × 903 (documents by words) is created, keeping even infrequent words in the dataset.
Estimating LDA: Parameters and Settings
To estimate the LDA model, we need to specify several parameters:
Determining the Number of Topics
We select the optimal number of topics using measures such as perplexity and coherence. Perplexity assesses the fit of the model; typically, a higher perplexity score indicates poorer predictive power. However, in practice, perplexity tends to favor very large topic numbers, which can reduce interpretability. For the hotel review data, a highly complex model (e.g., 50 topics) may not be practical. Alternatively, the coherence measure, which evaluates how well the identified topics align with real-word co-occurrences in the text, often provides more interpretable results. In the example, the coherence plot suggests using around 30 topics as it balances interpretability and complexity. It is also worth noting that the training data typically has higher coherence than the validation data, indicating some degree of overfitting.
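To make the perplexity measure concrete, the sketch below uses its common definition: the exponential of the negative average log-probability that the model assigns to held-out words. The input is assumed to be the list of model probabilities for each held-out word, and the function name is illustrative.

```python
import math

def perplexity(word_probabilities):
    """exp(-(1/N) * sum(log p_i)) over the N held-out words."""
    n = len(word_probabilities)
    log_likelihood = sum(math.log(p) for p in word_probabilities)
    return math.exp(-log_likelihood / n)

# A model that gives every held-out word probability 0.25 is as uncertain
# as a uniform choice among 4 words.
uniform_case = perplexity([0.25, 0.25, 0.25, 0.25])
```

Lower values mean the model predicts the held-out words better, which is why a higher perplexity indicates poorer predictive power.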
it using a smaller number of latent dimensions. These resulting matrices are often referred to as factors and loadings. However, PCA has a limitation in the context of text: the factors and loadings can include negative values. While this can make sense in some domains (e.g., scoring below average on a dimension), in text data it becomes less interpretable. For example, a negative value for the word service is hard to understand—while it’s natural to say a word is absent, it’s harder to say a word is “negatively present.”
NMF: Enforcing Non-Negativity
NMF solves this problem by enforcing non-negativity—all values in the resulting factor matrices are 0 or positive. Like PCA, the input is still a matrix V (such as a document-term matrix), where:
Optimization and Estimation
Formally, the goal is to minimize the Euclidean distance (sum of squared errors) between the original matrix V and the product W H. We can also add regularization to:
Challenges and Solutions
There are several known issues with this approach:
Choosing the Number of Factors (K)
Determining the right number of dimensions is difficult. NMF is computationally expensive, so we can’t easily experiment with many values of K. More dimensions will usually reduce error, but without a prediction task or out-of-sample evaluation, it’s hard to assess the quality of the approximation. Some ways to choose K:
Practical Applications of NMF in Text
NMF results can be used to:
NMF in Recommendation Systems
NMF can also be used for recommendation:
Summary
NMF factorizes a document-term matrix by linking words and documents to a smaller set of dimensions. These latent dimensions give structure to the data and reveal how words and documents relate to each other. NMF helps discover patterns in text, allows for comparison across documents, and supports prediction tasks. Its non-negativity constraint makes it easier to interpret than PCA, and it’s a flexible tool in both text mining and recommendation systems.
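The factorization V ≈ WH can be sketched in plain Python. Note the swap: this sketch uses the multiplicative update rules of Lee and Seung because they are short to write down and keep all values non-negative; it is not necessarily the algorithm used in the course. The toy matrix, the number of iterations, and the small constant `eps` are assumptions.

```python
import random

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def squared_error(v, w, h):
    """Sum of squared differences between V and the product W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Approximate V (n x m) by W (n x k) times H (k x m), all entries >= 0."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h

# Toy document-term matrix: two documents per "theme", two words per theme.
V = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
W, H = nmf(V, k=2)
```

Each row of `W` gives a document's weights on the K dimensions, and each row of `H` gives the weights of the words on one dimension.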
In this practical example, we apply NMF to hotel review data with the following preprocessing steps:
The professor applies NMF with 10 factors, meaning the data is summarized into 10 dimensions representing common themes in the reviews.
Structure and Challenges of the Data
The data is structured as a document-term matrix, which is very sparse, with many zeros and relatively few ones. This sparsity makes it challenging to directly identify meaningful structures. By reducing dimensionality, NMF attempts to represent reviews in terms of fewer shared underlying dimensions. The goal of NMF is to minimize mispredictions by approximating the original matrix with lower-dimensional representations. Each review is represented by weights on a set of underlying dimensions, and these dimensions indicate the likelihood of specific words appearing if the dimension is active. In practice, Alternating Least Squares (ALS) performs best, achieving faster convergence and better fits (lower prediction error) compared to other algorithms.
Results and Interpretation
The result of NMF consists of two matrices:
To interpret the results clearly, we visualize these matrices:
Terminology: Words vs. Terms
In LDA, it’s important to distinguish between:
The core question LDA tries to answer is: What process generates the observed words in each document?
Topic and Word Probabilities
For each word in a document, we first determine which topic it belongs to. Each topic has a probability distribution over the words in the corpus. For instance, if a topic relates to the hotel’s location, words like walk, park, and city would have high probabilities. Words unrelated to the location (such as breakfast) would have lower probabilities. Each document is characterized by its own distribution over topics (topic probabilities). If we have 20 topics, each review will have 20 probabilities, indicating how likely each topic is to be covered. Above this, there is an overall topic probability distribution across all documents, reflecting the average likelihood of each topic appearing in a randomly selected review. For example, if guests frequently mention noise more than location, the noise topic will generally have a higher probability across reviews.
Dirichlet and Multinomial Distributions
Topic probabilities in LDA are drawn from a Dirichlet distribution. Sampling from this distribution results in a probability vector (with elements between 0 and 1) summing up to 1, representing the likelihood of each topic appearing. Given these probabilities, a single topic for each word is selected using a multinomial distribution. The same multinomial distribution method applies when selecting specific words from the word distribution for a given topic.
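The two sampling steps described above can be sketched with the standard library. A Dirichlet draw is obtained by normalizing independent Gamma draws, and `random.choices` plays the role of a single multinomial draw. The vocabulary, the topic-word probabilities, and the concentration values in `alpha` are made-up illustrations.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw a probability vector from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha, topic_word_probs, vocab, rng):
    """Generate one document: draw topic probabilities, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)  # document-topic probabilities
    words = []
    for _ in range(n_words):
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        word = rng.choices(vocab, weights=topic_word_probs[topic])[0]
        words.append(word)
    return theta, words

# Hypothetical two-topic setup: topic 0 = location, topic 1 = breakfast.
vocab = ["walk", "park", "city", "breakfast"]
topic_word_probs = [
    [0.35, 0.30, 0.30, 0.05],  # location topic
    [0.05, 0.05, 0.10, 0.80],  # breakfast topic
]
rng = random.Random(1)
theta, words = generate_document(8, [0.5, 0.5], topic_word_probs, vocab, rng)
```

Estimation runs this logic in reverse: given only the words, it infers the topic probabilities and word distributions that most plausibly generated them.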
Notation and Parameters in LDA
LDA uses the following notation and parameters:
Illustrative Example
Suppose we have K = 2 topics, “university” and “spare time,” with a vocabulary consisting of four words: lecture, school, party, and friends. The first two words relate mainly to university, the latter two to spare time.
To calculate the probability of a particular word (e.g., “friends”) in this document, we sum over topics:
$$P(\text{friends}) = 0.9 \times 0.19 + 0.1 \times 0.39 = 0.21$$
Thus, $\beta_k$ encodes how strongly words are associated with each topic, allowing words to belong to multiple topics simultaneously. Both $\beta_k$ (word-topic probabilities) and $\theta_n$ (document-topic probabilities) are drawn from Dirichlet distributions, ensuring that each set of probabilities sums to one. Topics can differ systematically in their average sizes or importance.
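The sum over topics can be checked with a few lines of code. The values are taken from the calculation above; which topic each weight belongs to is an assumption (document-topic probabilities 0.9 and 0.1, and probabilities 0.19 and 0.39 for "friends" under the first and second topic).

```python
def word_probability(theta, beta, word):
    """P(word) in a document: sum over topics of P(topic) * P(word | topic)."""
    return sum(theta_k * beta_k[word] for theta_k, beta_k in zip(theta, beta))

theta = [0.9, 0.1]                              # document-topic probabilities
beta = [{"friends": 0.19}, {"friends": 0.39}]   # word-topic probabilities
p_friends = word_probability(theta, beta, "friends")  # about 0.21
```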
LDA as a Soft Clustering Method
LDA can be viewed as a soft clustering method. Unlike hard clustering, where each document is assigned exclusively to a single cluster, soft clustering allows documents (and words) to belong to multiple topics simultaneously. Documents can discuss combinations of topics (e.g., location, service, or noise), which is more realistic and computationally efficient than hard clustering. Hard clustering with multiple topics requires a combinatorial number of clusters to represent combinations of topics, whereas soft clustering efficiently represents mixed-topic documents with fewer clusters.
Estimation and Implementation of LDA
To estimate an LDA model, several steps are required:
Estimation is typically done through algorithms such as Gibbs sampling or variational inference. These iterative algorithms rely on conditional distributions, continually updating one set of parameters given current estimates of the others, until convergence is reached. For example, consider estimating the probability of observing a specific word given two topics: if topic 1 has a 40% chance (α = 0.4) and topic 2 a 60% chance (α = 0.6), and a particular word has probabilities 0.2 under topic 1 and 0.1 under topic 2, we calculate:
$$P(\text{topic 1} \mid \text{word}) = \frac{0.4 \times 0.2}{0.4 \times 0.2 + 0.6 \times 0.1} = \frac{0.08}{0.14} \approx 0.57$$
Thus, this word would most likely be associated with topic 1.
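This conditional probability is an application of Bayes' rule, which can be sketched as below using the numbers from the example (topic probabilities 0.4 and 0.6, word probabilities 0.2 and 0.1). The function name is illustrative.

```python
def topic_posterior(topic_probs, word_given_topic):
    """P(topic | word) for every topic, via Bayes' rule."""
    joint = [p * q for p, q in zip(topic_probs, word_given_topic)]
    total = sum(joint)  # P(word), summed over topics
    return [j / total for j in joint]

posterior = topic_posterior([0.4, 0.6], [0.2, 0.1])
# posterior[0] = 0.08 / 0.14, roughly 0.57, so topic 1 is the more likely source.
```

Gibbs sampling repeatedly draws a topic for each word from a conditional distribution of this kind, updating the counts after each draw.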
Choosing the Number of Topics
A critical challenge is deciding the optimal number of topics. Increasing the number of topics generally improves in-sample fit, but makes the model more complex. The trade-off involves balancing model fit and complexity. To evaluate topic models, we use measures like:
this means it is roughly four times more likely to find actor in the context of director than in that of music. To further illustrate, we normalize the raw frequencies so that each row (representing a focal word) sums to 1, as shown in Table 5.
Table 5: Normalized Word Co-occurrence Matrix (Row-wise Totals)

| | music | funni | action | director | actor | best | amaz | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| music | 0.01108 | 0.00262 | 0.00477 | 0.00662 | 0.00539 | 0.01724 | 0.00308 | 1 |
| funni | 0.00317 | 0.00745 | 0.00279 | 0.00317 | 0.00745 | 0.00428 | 0.00242 | 1 |
| action | 0.00554 | 0.00268 | 0.00769 | 0.00733 | 0.00554 | 0.01502 | 0.00268 | 1 |
| director | 0.00562 | 0.00222 | 0.00535 | 0.00340 | 0.02128 | 0.01972 | 0.00313 | 1 |
| actor | 0.00343 | 0.00392 | 0.00304 | 0.01597 | 0.00529 | 0.03683 | 0.00490 | 1 |
| best | 0.01017 | 0.00209 | 0.00763 | 0.01371 | 0.03414 | 0.01489 | 0.00154 | 1 |
| amaz | 0.00765 | 0.00497 | 0.00573 | 0.00917 | 0.01911 | 0.00650 | 0.00726 | 1 |
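The row-wise normalization behind Table 5 can be sketched as follows. The input is assumed to be a raw co-occurrence matrix stored as a list of rows; the small example matrix is made up. Note that Table 5 shows only a few columns of a much larger vocabulary, which is why the displayed values in each row add up to far less than the total of 1.

```python
def normalize_rows(matrix):
    """Divide every entry by its row total so that each row sums to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        # Rows without any co-occurrences are left as zeros.
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized

raw_counts = [[2, 6], [1, 3]]           # hypothetical raw co-occurrence counts
shares = normalize_rows(raw_counts)     # [[0.25, 0.75], [0.25, 0.75]]
```

After normalization, entries in different rows become comparable: each one is the share of a focal word's co-occurrences that involve a given context word.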
GloVe Objective Function: The final GloVe model is expressed as:
$$w_i^\top \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}$$
This equation states that the logarithm of the co-occurrence frequency Xik between word i and word k is approximated by three components:
In simple terms, if two words appear together very frequently, their embeddings (and bias terms) will combine to predict a high log co-occurrence count. If they rarely appear together, the prediction will be low. GloVe learns these word embeddings by minimizing the difference between the left- and right-hand sides of this equation for all word pairs. The model is estimated using a weighted least squares approach and ignores word pairs with zero observed co-occurrence to avoid issues with log(0).
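The weighted least squares loss can be sketched as below. The weighting function with `x_max = 100` and `alpha = 0.75` follows the defaults of the original GloVe paper and is an assumption here, since the notes do not state which weights are used. The data layout (a dictionary of co-occurrence counts keyed by word index pairs) is also just an illustrative choice.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Down-weight rare pairs and cap the weight of very frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(cooc, w, w_tilde, b, b_tilde):
    """Weighted squared difference between w_i . w~_k + b_i + b~_k and log X_ik.

    cooc maps (i, k) index pairs to co-occurrence counts X_ik.
    """
    loss = 0.0
    for (i, k), x in cooc.items():
        if x <= 0:
            continue  # pairs that never co-occur are skipped: log(0) is undefined
        prediction = sum(a * c for a, c in zip(w[i], w_tilde[k])) + b[i] + b_tilde[k]
        loss += glove_weight(x) * (prediction - math.log(x)) ** 2
    return loss

# Hypothetical 2-word example with 2-dimensional vectors and zero biases.
vectors = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.0]
example_loss = glove_loss({(0, 0): 100.0, (0, 1): 0.0}, vectors, vectors, biases, biases)
```

Training adjusts the vectors and bias terms to make this loss as small as possible.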
Implementation Details: When training a GloVe model in R, two critical parameters must be set:
The training procedure runs iteratively until the model’s loss function is minimized. After training, each word is represented by a numerical vector, where similar vectors indicate similar word meanings. For example, one row of the resulting word embedding matrix might look like: These embeddings allow us to analyze semantic similarity among words. For instance, words that appear frequently together in similar contexts will have similar vectors.
In conclusion, the GloVe model translates raw co-occurrence counts into word embeddings. These embeddings capture the statistical information about word usage in the corpus, revealing the latent semantic relationships between words.
Table 6: Matrix of Word Embedding Values (Example)
| | music | funni | action | director |
| --- | --- | --- | --- | --- |
| 1 | -0.11469 | 0.06212 | -0.40551 | 0. |
| 2 | -0.03010 | 0.36766 | -0.34198 | 0. |
| 3 | -0.27864 | -0.34814 | -0.07261 | 0. |
| 4 | -0.05495 | 0.01586 | 0.42751 | -0. |
| 5 | -0.12153 | 0.17083 | -0.00373 | -0. |
| 6 | -0.43247 | -0.40902 | -0.19929 | 0. |
| 7 | 0.00976 | -0.57009 | -0.34870 | 0. |
| 8 | -0.77751 | 0.14515 | -0.00828 | -0. |
| 9 | -0.09279 | -0.04932 | -0.79857 | -0. |
| 10 | -0.56523 | -0.35379 | -0.48263 | -0. |
To train a GloVe model in R, two key parameters must be specified:
Training the model is an iterative procedure. At each iteration, the model adjusts the word vectors and bias terms to lower the loss function. Once the model converges, it outputs word embeddings—numerical vectors representing each word. For example, one row of the embedding matrix might be:
Table 7: Matrix of Example Word Embedding Values
| | music | funni | action | director |
| --- | --- | --- | --- | --- |
| 1 | -0.11469 | 0.06212 | -0.40551 | 0. |
| 2 | -0.03010 | 0.36766 | -0.34198 | 0. |
| 3 | -0.27864 | -0.34814 | -0.07261 | 0. |
| 4 | -0.05495 | 0.01586 | 0.42751 | -0. |
| 5 | -0.12153 | 0.17083 | -0.00373 | -0. |
| 6 | -0.43247 | -0.40902 | -0.19929 | 0. |
| 7 | 0.00976 | -0.57009 | -0.34870 | 0. |
| 8 | -0.77751 | 0.14515 | -0.00828 | -0. |
| 9 | -0.09279 | -0.04932 | -0.79857 | -0. |
| 10 | -0.56523 | -0.35379 | -0.48263 | -0. |
These embeddings capture the meaning of each word; words with similar meanings are expected to have similar vectors.
Since embeddings are high-dimensional vectors, you can think of them as points in a high-dimensional space. The cosine similarity quantifies the similarity between two word vectors by measuring the angle between them. In simple terms, words with similar meanings tend to point in the same direction and thus have a high cosine similarity. For example, consider the following normalized word similarity matrix calculated using cosine similarity:
similarity. Thus, by analyzing these embeddings, we can infer semantic similarities and differences among words.
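Cosine similarity itself is a short computation: the dot product of two vectors divided by the product of their lengths. The sketch below uses made-up two-dimensional vectors rather than real embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (1 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # 1: identical direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # 0: unrelated directions
```

Because the measure depends only on direction, a vector and a scaled copy of it are treated as identical.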
To extract meaningful insights from word embeddings, it is crucial to examine the relationships between words. One effective way to do this is by using the analogy framework: A is to B as C is to D. In this framework, the idea is that the relationship between words can be captured through vector arithmetic. For example, consider the analogy:
king − man + woman ≈ queen.
This means that if you take the vector for king, subtract the vector for man, and then add the vector for woman, the result is a vector that is very similar to the vector for queen. Another example is: walking − park + pool ≈ swimming,
which suggests that the difference between walking and park can be transformed into a difference between swimming and pool. In other words, you can add or subtract features from word embeddings to highlight specific attributes or semantic shifts. We can apply the same idea to words in our movie review data. For example, if we take the embedding of actor, add the embedding for drama, and subtract the embedding for comedy, we aim to capture the characteristics of an actor’s performance in dramatic roles as opposed to comedic roles. The resulting vector should emphasize the traits that define drama over comedy. The similarity scores between actor and various related words can then be analyzed to understand these relationships better. Table 11 below shows the cosine similarity scores of actor with other words in our dataset. Higher values indicate that the word is more similar to actor, while lower values suggest less similarity.
Table 11: Similarity Scores for Words Related to actor
| Word | Similarity |
| --- | --- |
| actor | 0. |
| cast | 0. |
| support | 0. |
| actors, | 0. |
| actress | 0. |
| talent | 0. |
| act | 0. |
| cast, | 0. |
| actors. | 0. |
| cast. | 0. |
In summary, by leveraging the concept that “A is to B as C is to D”, we can perform vector arithmetic to uncover semantic relationships in our data. This enables us to better understand the subtle differences in word meanings—for instance, how the context of an actor can shift when associated with drama rather than comedy. The cosine similarity between the resulting vectors provides a quantitative measure of these relationships, facilitating deeper insights into the semantic structure of the text.
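The analogy framework can be sketched as vector arithmetic followed by a nearest-neighbour search on cosine similarity. The two-dimensional "embeddings" below are invented so that the arithmetic is easy to follow by hand (first coordinate: royalty, second: gender); real GloVe vectors have many more dimensions and will not line up this neatly.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, embeddings):
    """Return the word closest to vector(a) - vector(b) + vector(c).

    The three input words are excluded from the candidates.
    """
    target = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine_similarity(target, embeddings[w]))

# Hypothetical toy embeddings, not estimated from any data.
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "apple": [-0.5, 0.2],
}
result = analogy("king", "man", "woman", toy)  # "queen"
```

For the movie review example, the same function would be called as `analogy("actor", "comedy", "drama", embeddings)` on the estimated embeddings.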
An analyst at Albert Heijn (AH, a large Dutch supermarket chain) wants to better understand the success of Picnic (a startup in grocery home delivery services), as they are more successful in acquiring customers for their home delivery service. To gain this insight, the analyst has collected review data from customers of both stores. The analyst wants to apply the GloVe algorithm to the data she collected. As a first step, the data needs to be cleaned.
(a) Provide one reason why you would want to remove stop words and one reason why you might not want to remove stop words when the aim is to apply the GloVe algorithm.
(b) Explain why the adjustment of the word co-occurrence matrix is needed, even when the company name is mentioned in all reviews.
ANSWER:
(c) Which word analogy would you obtain from the data to learn how the service of Picnic differs from that of AH? Also explain why this analogy is informative about this question.
ANSWER:
5 week 6
N-grams extend the bag-of-words concept by recognizing that adjacent words can be interrelated. When two words frequently appear next to each other, this combination can convey additional, relevant information. Using n-grams, we can identify frequent and meaningful word combinations. In addition, skip n-grams, which allow for nonadjacent words to be grouped, provide further insights. By analyzing both individual words and n-grams in reviews, we can predict whether a review is positive or negative. For example, we may select the top 50 words and bigrams from a text and then apply logistic regression using these words as features to model the sentiment of the review. The model’s output includes p-values, where a low p-value indicates a strong association between the word and the review’s sentiment. A positive coefficient suggests that the word is more indicative of a positive review. It is important to note that the same word may appear both as a unigram (e.g., “recommend”) and as part of a bigram (e.g., “highly recommend”). When this occurs, both estimates are included in the model, resulting in a combined effect on the prediction. Thus, a positive coefficient corresponds to a higher predicted probability of a “happy” or positive review. When moving from unigrams to bigrams, the interpretation should be made under the ceteris paribus assumption (i.e., holding all other factors constant). If a word appears both on its own and within a bigram, the effect of the bigram should be considered together with the effect of the unigram.
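Extracting n-grams and skip bigrams, and selecting the most frequent ones as candidate features, can be sketched as follows. The function names, the `max_skip` parameter, and the example reviews are illustrative assumptions; the logistic regression step itself is not shown.

```python
from collections import Counter

def ngrams(tokens, n):
    """All runs of n adjacent tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skip_bigrams(tokens, max_skip):
    """Ordered word pairs with at most `max_skip` words in between."""
    pairs = []
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs

def top_features(reviews, k=50):
    """The k most frequent unigrams and bigrams across all reviews."""
    counts = Counter()
    for review in reviews:
        tokens = review.lower().split()
        counts.update(ngrams(tokens, 1))
        counts.update(ngrams(tokens, 2))
    return [feature for feature, _ in counts.most_common(k)]

# Hypothetical reviews.
reviews = ["would highly recommend", "highly recommend this hotel"]
features = top_features(reviews, k=3)
```

Each selected feature then becomes a column in the regression data, indicating whether (or how often) it occurs in a review. In this example both `("recommend",)` and `("highly", "recommend")` are selected, which is exactly the overlap between unigram and bigram discussed above.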
We can use a generalized linear model (GLM) to predict whether a review evaluation is positive or negative based on the emotional content of its words. The predictor variables are sentiment scores derived from a sentiment analysis. These scores include both specific emotions (such as anger, fear, surprise, and joy) and more global sentiments categorized as negative or positive. The GLM results from the professor indicate that most emotions significantly affect the review evaluation. In the model, positive emotions have a positive coefficient (i.e., a positive estimate), while negative emotions have a negative coefficient. Notably, both surprise and trust show negative coefficients, suggesting that reviews expressing these emotions tend to be rated as more negative. This finding is surprising because, intuitively, these emotions might be expected to correlate with more positive evaluations. One explanation for the negative coefficients of trust and surprise is the issue of double counting. That is, words associated with these specific emotions are simultaneously captured by the overall positive sentiment measure. When the overall positive and negative sentiment features are removed from the analysis, the effects of trust and surprise become insignificant. Furthermore, excluding the emotion scores leads to a stronger effect for the global positive and negative sentiment measures, resulting in larger estimated effect sizes. In conclusion, when interpreting sentiment in a GLM, it is essential to consider the ceteris paribus assumption and recognize that the presence of overlapping variables (specific emotions and overall sentiments) can influence the interpretation of the coefficients.
Principal Component Analysis (PCA) can be used to summarize review text data. In the example, 20 principal components are extracted to capture the variability in the review text and used as predictors in a logistic regression model to classify reviews as positive or negative. Many of these components are statistically significant. Although the PCA factors themselves do not have an inherent interpretation, we can gain insight by examining the top 10 words associated with each factor. For example, one component