Summary Text Analytics

Lucas Vercauteren

April 2025

1 Week 2

Word Clouds

Word clouds can be used to summarize text data. A basic word cloud simply counts the number of times each word appears in the text. Words that occur more frequently are shown larger and are typically placed toward the center of the cloud.
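The counting step behind a basic word cloud can be sketched as follows. This is a minimal illustration, not the course's implementation: the example text, the helper name `word_counts`, and the simple regular-expression tokenizer are all assumptions.

```python
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text, split it into words, and count each word."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Hypothetical example text, not taken from the course data.
text = "Nice room, nice staff, clean room."
counts = word_counts(text)

# The most frequent words would be drawn largest in the cloud.
top_words = counts.most_common(2)
```

A plotting library would then map each count to a font size; only the counting is shown here.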

While a histogram is the traditional approach to visualize word frequencies, it can only show a limited number of words clearly. In contrast, word clouds can display a much larger set of words at once, providing a broader overview of the text content.

Word clouds are therefore useful to quickly identify prominent words and get a general sense of what the text is about. However, they do not reveal much about the underlying relationships or structure of the text.

To gain deeper insight, we can compare word clouds from different subsets of text. For example, comparing word usage between positive and negative customer reviews helps us explore which words are more typical for each group. However, since both happy and unhappy customers often talk about similar things, their most frequent words tend to overlap. This makes it difficult to identify what distinguishes the two groups.

To address this, we can use a commonality cloud, which shows the words shared between both groups. Alternatively, a comparison cloud highlights the words that appear more frequently in one group than the other. Still, both approaches focus on frequent words rather than the most discriminative ones.
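One simple way to compute the quantities behind these two clouds is sketched below. The word counts are hypothetical, and the sizing rules (smaller share for the commonality cloud, difference in share for the comparison cloud) are assumptions chosen for illustration; plotting packages may use different formulas.

```python
from collections import Counter

# Hypothetical word counts for two groups of reviews.
positive = Counter({"room": 40, "staff": 30, "friendly": 20, "location": 10})
negative = Counter({"room": 45, "staff": 25, "dirty": 20, "location": 10})

def relative(counts):
    """Convert raw counts to shares of all words in the group."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

pos_share, neg_share = relative(positive), relative(negative)

# Commonality cloud: words present in both groups, sized by the smaller share.
commonality = {w: min(pos_share[w], neg_share[w])
               for w in pos_share.keys() & neg_share.keys()}

# Comparison cloud: difference in share; the sign says which group uses the word more.
vocabulary = pos_share.keys() | neg_share.keys()
comparison = {w: pos_share.get(w, 0.0) - neg_share.get(w, 0.0) for w in vocabulary}
```

In this toy data "room" dominates the commonality scores, while "friendly" and "dirty" get the largest comparison scores in opposite directions.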

The main issue with these visualizations is that they are dominated by very frequent words, which are not necessarily the most informative. To control for this, we can use TF-IDF (Term Frequency – Inverse Document Frequency).

• Term Frequency (TF) counts how often a word appears across all documents.
• Document Frequency (DF) counts how many documents contain the word.

Some words are common across nearly all documents (e.g., "the", "and"), while others appear frequently in only a few. The latter are more distinctive and thus more descriptive of specific content. TF-IDF gives higher scores to words that occur frequently in a specific subset but not across all documents.

For example, to identify words that best describe positive reviews, we first calculate term frequencies in this subset, then weight them by the inverse document frequency computed over all reviews. This highlights words that are both common in positive reviews and rare in the full set.
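The two steps for a subset can be sketched as follows. The reviews and labels are made up, the function name `tf_idf` is hypothetical, and the weighting TF divided by DF is one reading of "adjust using the inverse document frequency", not necessarily the exact scheme used in the course.

```python
from collections import Counter

# Hypothetical tokenized reviews with a label for each.
reviews = [
    (["great", "location", "great", "staff"], "positive"),
    (["great", "breakfast", "the", "room"], "positive"),
    (["the", "room", "smelled", "of", "smoking"], "negative"),
    (["the", "dog", "next", "door", "barked"], "negative"),
]

# DF: number of reviews (in the full set) containing each word.
df = Counter()
for tokens, _ in reviews:
    df.update(set(tokens))

def tf_idf(label):
    """TF within one subset, divided by DF over all reviews."""
    tf = Counter()
    for tokens, review_label in reviews:
        if review_label == label:
            tf.update(tokens)
    return {word: count / df[word] for word, count in tf.items()}

positive_scores = tf_idf("positive")
```

Here "great" scores highest for the positive subset, while "the" is pushed down because it appears in three of the four reviews.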

Alternative weighting schemes also exist, such as multiplying TF by the logarithm of the IDF instead of by the IDF itself.
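To see how the two weightings differ, the sketch below compares them on two made-up words. It assumes the direct form is TF × 1/DF and the log form is TF × log(N/DF), with N the number of documents; the source does not spell out either formula, so both function names and definitions are illustrative.

```python
import math

n_documents = 4  # hypothetical corpus size

def weight_direct(tf, df):
    """TF multiplied by the inverse document frequency 1/DF."""
    return tf / df

def weight_log(tf, df, n=n_documents):
    """TF multiplied by log(N/DF), a common log-scaled IDF."""
    return tf * math.log(n / df)

# A word that occurs 6 times and appears in every document.
everywhere_direct = weight_direct(6, 4)
everywhere_log = weight_log(6, 4)

# A word that occurs 3 times, all in one document.
rare_direct = weight_direct(3, 1)
rare_log = weight_log(3, 1)
```

Under the log form, a word that appears in every document gets a weight of exactly zero, however often it occurs, whereas the direct form only shrinks it.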

TF shows how many times a word appears in all reviews, while DF shows in how many individual reviews the word appears. For instance, if the word "clean" appears in more reviews than "night" or "nice", but those other words have higher TFs, it suggests that "clean" is used less frequently per review. This reveals interesting patterns in word usage: people may mention "nice" multiple times when expressing satisfaction, whereas they only mention "clean" once.
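The pattern described above can be reproduced on a toy example. The reviews are invented, and the ratio TF/DF (average uses per review that mentions the word) is introduced here only to make the comparison explicit.

```python
from collections import Counter

# Hypothetical tokenized reviews.
reviews = [
    ["nice", "room", "nice", "view", "nice", "staff"],
    ["clean", "room"],
    ["clean", "and", "quiet"],
    ["clean", "bathroom", "nice", "pool", "nice", "bar"],
]

tf = Counter()  # total occurrences across all reviews
df = Counter()  # number of reviews containing the word
for tokens in reviews:
    tf.update(tokens)
    df.update(set(tokens))

# Average number of uses per review that mentions the word.
uses_per_review = {word: tf[word] / df[word] for word in tf}
```

In this data "clean" has the higher DF (3 reviews versus 2) but "nice" has the higher TF (5 versus 3), so "nice" is used 2.5 times per mentioning review and "clean" only once.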

Looking at the histogram of TF-IDF values, we find that words like "smoking" or "dog" stand out in negative reviews, indicating annoyance, and words related to location appear frequently too. By definition, TF-IDF values are never negative.

To create even stronger contrasts between groups (e.g., 1-star vs 5-star reviews), we can use the relative frequency of words across the groups. For example, we can calculate:

\[
\text{Relative Frequency} = \frac{\text{TF}(\text{Group A})}{\text{TF}(\text{Group B})}
\]
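A minimal sketch of this ratio is shown below. The term frequencies are hypothetical, and the add-one smoothing is an assumption that is not part of the formula above; it is included only so that words missing from Group B do not cause a division by zero.

```python
from collections import Counter

# Hypothetical term frequencies for two groups of reviews.
tf_one_star = Counter({"room": 50, "dirty": 30, "smoking": 12, "friendly": 2})
tf_five_star = Counter({"room": 60, "dirty": 1, "friendly": 40})

def relative_frequency(tf_a, tf_b, smoothing=1):
    """TF(Group A) / TF(Group B) per word; smoothing (an assumption) avoids division by zero."""
    vocabulary = tf_a.keys() | tf_b.keys()
    return {w: (tf_a[w] + smoothing) / (tf_b[w] + smoothing) for w in vocabulary}

ratios = relative_frequency(tf_one_star, tf_five_star)
most_one_star = max(ratios, key=ratios.get)
```

Ratios well above 1 mark words typical of Group A, and ratios well below 1 mark words typical of Group B. If the groups differ a lot in size, dividing each TF by its group's total word count first may give a fairer comparison.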
