



















































































A comprehensive exam crafted to evaluate mastery over natural language processing (NLP) and text mining using KNIME’s Text Processing extension. The exam includes tasks such as text preprocessing, tokenization, POS tagging, stemming/lemmatization, sentiment analysis, keyword extraction, topic modeling (LDA), document similarity, and vectorization techniques like TF-IDF and word embeddings. Learners demonstrate the ability to configure text-based workflows, perform large-scale document analysis, and integrate NLP components into analytics pipelines.
Typology: Exams




















































































Question 1. Which term best describes the process of converting raw text into a structured Document object in KNIME?
A) Tokenization
B) Parsing
C) String to Document conversion
D) Stemming
Answer: C
Explanation: The Strings To Document node transforms raw strings into KNIME’s Document data type, making the text ready for downstream processing.

Question 2. In text mining, the collection of all documents used for analysis is called a:
A) Lexicon
B) Corpus
C) Vocabulary
D) Token set
Answer: B
Explanation: A corpus (plural corpora) is the complete set of documents that constitute the dataset for text mining.

Question 3. Which of the following is a characteristic of unstructured data?
A) Fixed schema
B) Stored in relational tables
C) No predefined data model
D) Primary keys
Answer: C
Explanation: Unstructured data such as emails or PDFs lack a predefined schema, unlike structured data in databases.

Question 4. The primary goal of stemming is to:
A) Convert words to their dictionary forms
B) Reduce words to a common root by truncation
C) Identify part‑of‑speech tags
D) Detect named entities
Answer: B
Explanation: Stemming removes affixes to produce a truncated root form, without guaranteeing a valid language word.

Question 5. Which technique retains the semantic meaning of a word rather than just its root?
A) Stemming
B) Lemmatization
C) Tokenization
D) Bag‑of‑Words creation
Answer: B
Explanation: Lemmatization uses morphological analysis to return the correct dictionary (lemma) form, preserving meaning.

Question 6. In KNIME, which node is used to parse PDF files into Document objects?
A) File Reader
B) Tika Parser
C) CSV Reader
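
The contrast drawn in Questions 4 and 5 between stemming and lemmatization can be sketched in a few lines of plain Python. This is only a toy illustration, not how KNIME's nodes work internally: the suffix list, the `crude_stem` and `lookup_lemma` helpers, and the small lemma dictionary are all hypothetical and chosen for the example.

```python
# Toy comparison of stemming (truncation) and lemmatization (dictionary lookup).
# The suffix list and the lemma table are illustrative assumptions only.

SUFFIXES = ("ies", "ing", "ed", "es", "s")  # checked in this order


def crude_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# A lemmatizer needs linguistic knowledge; here it is a hand-written table.
LEMMAS = {"studies": "study", "running": "run", "better": "good", "mice": "mouse"}


def lookup_lemma(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)


words = ["studies", "running", "better", "mice"]
stems = [crude_stem(w) for w in words]
lemmas = [lookup_lemma(w) for w in words]

for word, stem, lemma in zip(words, stems, lemmas):
    print(f"{word:10} stem={stem:10} lemma={lemma}")
```

The stems ("stud", "runn") are truncated forms that are not dictionary words, while the lemmas ("study", "run", "good", "mouse") are valid words, which is the distinction the two questions test.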
B) Document length
C) Word order and context
D) Vocabulary size
Answer: C
Explanation: BoW treats documents as unordered collections of terms, discarding sequence information.

Question 10. Which node in KNIME creates the Term‑by‑Document matrix from Document objects?
A) Document Vector
B) Bag of Words Creator
C) TF/IDF Calculator
D) Text Filter
Answer: B
Explanation: The Bag of Words Creator node builds the matrix that counts term occurrences per document.

Question 11. Which of the following is NOT a typical step in the CRISP‑DM process for text mining?
A) Business Understanding
B) Data Collection
C) Model Deployment
D) Word Embedding Generation
Answer: D
Explanation: Word embedding generation is a technique within modeling, not a separate CRISP‑DM phase.
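
The loss of word order in the Bag‑of‑Words model, and the term‑by‑document counts mentioned in Question 10, can be shown with a minimal Python sketch. It assumes simple whitespace tokenization and uses a hypothetical helper, `term_document_matrix`; it is not a reproduction of any KNIME node.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    # Assumption: lowercase + whitespace split is enough for this illustration.
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)  # Counter returns 0 for absent terms
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


docs = ["the dog bites the man", "the man bites the dog"]
vocab, tdm = term_document_matrix(docs)

print(vocab)
for row in tdm:
    print(row)
```

Both sentences produce the same count vector even though they mean different things, because only term counts are kept and the sequence is discarded.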
Question 12. What is the purpose of a stop‑word list in text preprocessing?
A) To highlight important terms
B) To remove high‑frequency, low‑information words
C) To create synonyms
D) To perform part‑of‑speech tagging
Answer: B
Explanation: Stop‑words (e.g., “the”, “and”) are filtered out because they carry little discriminative power.

Question 13. Which node would you use to perform part‑of‑speech (POS) tagging on Document objects?
A) POS Tagger
B) Named Entity Recognizer
C) Lemma Annotation
D) Tokenizer
Answer: A
Explanation: The POS Tagger node assigns grammatical categories (noun, verb, etc.) to each token.

Question 14. Named Entity Recognition (NER) is primarily used to:
A. Reduce dimensionality of the term matrix
B. Identify and classify proper nouns like people, locations, organizations
C. Convert words to their base forms
D. Compute TF/IDF scores
Answer: B
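
The stop‑word filtering described in Question 12 can be sketched as follows. The stop‑word set here is a tiny hand‑picked example rather than any standard list, and `remove_stop_words` is a hypothetical helper, not a KNIME API.

```python
import re

# Illustrative stop-word set; real lists are much longer and language-specific.
STOP_WORDS = {"the", "and", "a", "of", "is", "in", "to"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into alphanumeric tokens, drop stop-words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in stop_words]


kept = remove_stop_words("The cat and the dog sat in the garden")
print(kept)
```

Only the content‑bearing tokens remain, which is why stop‑word removal typically shrinks the vocabulary without losing much discriminative information.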
D. Naïve Bayes smoothing
Answer: B
Explanation: SVD (used in Latent Semantic Analysis) reduces matrix dimensions while preserving important patterns.

Question 18. Latent Dirichlet Allocation (LDA) is used for:
A. Sentiment scoring
B. Topic modeling
C. Part‑of‑speech tagging
D. Document clustering based on Euclidean distance
Answer: B
Explanation: LDA discovers latent topics by modeling each document as a mixture of topics.

Question 19. Which node would you select to generate word embeddings such as Word2Vec within a KNIME workflow?
A. Word2Vec Learner
B. Document Vector
C. Bag of Words Creator
D. TF/IDF Calculator
Answer: A
Explanation: The Word2Vec Learner node trains word embedding models that capture contextual similarity.

Question 20. When visualizing term importance, a tag cloud (word cloud) primarily encodes information through:
A. Color hue
B. Font size proportional to term weight
C. Position on the plot
D. Underlining
Answer: B
Explanation: In a word cloud, larger fonts represent higher frequency or weight.

Question 21. Which preprocessing step would convert “KNIME’s” to “knime” (lowercase and remove punctuation)?
A. Stemming
B. Lemmatization
C. Case conversion and punctuation removal
D. Synonym mapping
Answer: C
Explanation: Lowercasing and stripping punctuation standardizes the token.

Question 22. In KNIME, the composite data type that stores both the text and its annotations (e.g., POS tags, NER) is called:
A. String
B. Document
C. Table Row
D. Image
Answer: B
Explanation: Document objects encapsulate raw text plus any attached annotations.

Question 23. Which of the following is a limitation of the Bag‑of‑Words model?
Explanation: Lexicon‑based methods use word‑level sentiment scores from resources like SentiWordNet.

Question 26. Which node would you use to filter out tokens shorter than three characters?
A. Token Filter
B. Document Filter
C. String Manipulation
D. Row Filter
Answer: A
Explanation: The Token Filter node can remove tokens based on length criteria.

Question 27. The term “bag‑of‑ngrams” extends the BoW model by:
A. Including n‑grams (sequences of consecutive tokens) as features
B. Storing word order information
C. Reducing dimensionality automatically
D. Using binary weighting only
Answer: A
Explanation: Bag‑of‑ngrams adds sequences of n consecutive tokens (e.g., bigrams) to capture limited context.

Question 28. Which of the following best describes the role of a “Document Viewer” node in KNIME?
A. Generates TF/IDF scores
B. Displays enriched text with annotations for interactive inspection
C. Performs clustering on document vectors
D. Trains a classification model
Answer: B
Explanation: The Document Viewer shows the text together with its annotations (POS tags, named entities, etc.) for manual review.

Question 29. In the context of text mining, “dimensionality reduction” is primarily performed to:
A. Increase the number of features
B. Remove stop‑words
C. Decrease computational load and mitigate the curse of dimensionality
D. Convert text to uppercase
Answer: C
Explanation: Reducing dimensions (e.g., via SVD) makes models faster and less prone to over‑fitting.

Question 30. Which node is used to calculate TF/IDF weights after a Bag‑of‑Words matrix has been created?
A. TF/IDF Calculator
B. Document Vector
C. Word2Vec Learner
D. Column Filter
Answer: A
Explanation: The TF/IDF Calculator node transforms raw term frequencies into weighted scores.

Question 31. When performing hierarchical clustering on document vectors, which linkage method measures the maximum distance between elements of two clusters?
A. Single linkage
Question 34. Which preprocessing step would most directly address the issue of “noisy” numeric strings like “12345” appearing in a text corpus?
A. Stop‑word removal
B. Number filtering in Token Filter
C. Lemmatization
D. Synonym mapping
Answer: B
Explanation: Token Filter can be configured to exclude tokens that consist solely of digits.

Question 35. The main difference between “homonyms” and “synonyms” is that:
A. Homonyms share the same spelling or pronunciation but differ in meaning; synonyms have different forms but similar meanings.
B. Homonyms are always nouns, synonyms are verbs.
C. Homonyms are language‑specific; synonyms are universal.
D. There is no difference; they are interchangeable terms.
Answer: A
Explanation: Homonyms (e.g., “bank” – riverbank vs. financial institution) differ in meaning; synonyms (e.g., “big”, “large”) share meaning.

Question 36. Which node would you use to create a custom list of stop‑words in KNIME?
A. Stop Word Filter (custom)
B. String Manipulation
C. CSV Reader
D. Table Row to Variable
Answer: A
Explanation: The Stop Word Filter node can be supplied with a user‑defined list to filter out specific tokens.

Question 37. In LDA topic modeling, the parameter “α” controls:
A. Number of topics
B. Document‑topic density (how many topics per document)
C. Word‑topic density (how many words per topic)
D. Learning rate of the algorithm
Answer: B
Explanation: α (alpha) is the Dirichlet prior for per‑document topic distribution; lower α yields fewer topics per document.

Question 38. Which of the following is NOT a typical output of the LDA Topic Extractor node?
A. Document‑topic distribution matrix
B. Word‑topic distribution matrix
C. Sentiment polarity scores
D. Top N words per topic
Answer: C
Explanation: LDA provides topic probabilities, not sentiment analysis results.

Question 39. When using the Word2Vec model, the vector representation of a word is derived from:
A. Its frequency across the corpus
B. Its surrounding context windows during training
C. Its morphological root
A. Hold‑out validation with 90% training data
B. Leave‑One‑Out Cross‑Validation
C. K‑Fold Cross‑Validation (e.g., 5‑fold)
D. Random sampling without replacement
Answer: C
Explanation: K‑Fold CV balances bias and variance, providing reliable estimates with limited data.

Question 43. Which of the following best describes “log frequency” weighting?
A. Applying natural logarithm to raw term counts to dampen the effect of very frequent terms
B. Converting frequencies to binary values
C. Multiplying TF by IDF
D. Using base‑10 logarithm of document length
Answer: A
Explanation: Log frequency reduces the impact of high raw counts while retaining ranking information.

Question 44. The “Document Filter” node in KNIME can be used to:
A. Remove documents that do not meet a length threshold
B. Tokenize text into words
C. Compute TF/IDF scores
D. Train a neural network
Answer: A
Explanation: Document Filter can drop entire documents based on metadata or content criteria such as length.
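
The dampening effect of log frequency weighting from Question 43 can be checked numerically. The sketch below assumes the common `1 + ln(count)` variant (with 0 for absent terms); the exam text only says that a natural logarithm is applied, so the exact formula and the `log_frequency` helper are assumptions made for illustration.

```python
import math


def log_frequency(count):
    """Assumed variant: 1 + ln(count) for count > 0, and 0 for absent terms."""
    return 1.0 + math.log(count) if count > 0 else 0.0


raw_counts = [0, 1, 10, 100]
weights = [log_frequency(c) for c in raw_counts]

for count, weight in zip(raw_counts, weights):
    print(f"raw count {count:>3} -> log-frequency weight {weight:.3f}")
```

A term that occurs 100 times gets a weight of roughly 5.6 rather than 100, so very frequent terms no longer dominate, while the ordering of the counts is preserved.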
Question 45. Which of the following is an advantage of using lemmatization over stemming?
A. Faster processing time
B. Produces valid dictionary words, preserving semantics
C. Removes stop‑words automatically
D. Generates n‑grams automatically
Answer: B
Explanation: Lemmatization returns the proper base form (lemma), maintaining correct meaning.

Question 46. In KNIME, which node would you employ to merge multiple text columns into a single Document column?
A. Column Combiner
B. Strings To Document
C. Concatenate
D. Column Appender
Answer: B
Explanation: Strings To Document can take one or more string columns and create a single Document object.

Question 47. Which term describes the process of assigning a probability distribution over topics for each document in LDA?
A. Topic allocation
B. Document‑topic inference
C. Word embedding
D. Sentiment scoring
B. Synonym mapping / custom dictionary
C. Tokenization
D. POS tagging
Answer: B
Explanation: A custom synonym dictionary consolidates multiple variants into a single canonical form.

Question 51. In the context of text mining, the term “corpus size” usually refers to:
A. Number of unique words
B. Number of documents in the collection
C. Total number of characters
D. Number of stop‑words
Answer: B
Explanation: Corpus size is measured by the count of individual documents.

Question 52. Which node can be used to export the Term‑by‑Document matrix to a CSV file for external analysis?
A. CSV Writer
B. Table Writer
C. Document Exporter
D. Bag of Words Creator
Answer: A
Explanation: CSV Writer writes tabular data, such as a TDM, to a CSV file.

Question 53. Which metric is commonly used to assess the quality of a topic model?
A. Silhouette coefficient
B. Perplexity
C. Accuracy
D. ROC AUC
Answer: B
Explanation: Perplexity measures how well a probabilistic model predicts a held‑out set; lower values indicate better fit.

Question 54. Which of the following statements about “tokenization” is true?
A. It removes all punctuation from text.
B. It splits text into meaningful units such as words or sub‑words.
C. It assigns POS tags to each word.
D. It calculates TF/IDF scores.
Answer: B
Explanation: Tokenization is the process of breaking raw text into discrete tokens (words, symbols, etc.).

Question 55. In KNIME, which node allows you to apply a pre‑trained machine‑learning model to new text data?
A. Model Reader
B. Predictor (e.g., Decision Tree Predictor)
C. Document Vector
D. Text Exporter
Answer: B
Explanation: Predictor nodes take a trained model as input and generate predictions on incoming feature vectors.
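
The perplexity metric from Question 53 can be illustrated with a small calculation. To keep the example short, the sketch swaps the topic model for a much simpler unigram model with add‑one smoothing; the `unigram_perplexity` helper and the toy token lists are hypothetical, and building the vocabulary from both training and held‑out tokens is a simplification.

```python
import math
from collections import Counter


def unigram_perplexity(train_tokens, heldout_tokens):
    """Perplexity = exp(-average log-probability per held-out token)."""
    counts = Counter(train_tokens)
    # Simplification: the vocabulary includes held-out tokens so that
    # add-one smoothing gives every token a non-zero probability.
    vocab_size = len(set(train_tokens) | set(heldout_tokens))
    total = len(train_tokens)
    log_prob = 0.0
    for tok in heldout_tokens:
        prob = (counts[tok] + 1) / (total + vocab_size)
        log_prob += math.log(prob)
    return math.exp(-log_prob / len(heldout_tokens))


train = ["a", "a", "b", "b"]
familiar = unigram_perplexity(train, ["a", "b"])    # tokens seen in training
unfamiliar = unigram_perplexity(train, ["c", "c"])  # tokens never seen

print(f"familiar held-out text:   {familiar:.2f}")
print(f"unfamiliar held-out text: {unfamiliar:.2f}")
```

The held‑out text that resembles the training data scores a lower perplexity (about 2) than the unfamiliar text (about 7), matching the rule that lower values indicate a better fit.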