This exam tests skills in processing text data within KNIME, focusing on techniques like tokenization, text mining, natural language processing (NLP), and sentiment analysis.
Typology: Exams
Question 1. Which KNIME node is primarily used to convert a plain text column into a Document object?
A) CSV Reader
B) Strings to Document
C) Table Row to Variable
D) Column Filter
Answer: B
Explanation: The Strings to Document node wraps raw string data into KNIME’s Document data type, enabling downstream text-processing nodes to operate on the text.

Question 2. In the context of text mining, the term “corpus” refers to:
A) A single document after tokenization
B) The entire collection of documents being analyzed
C) A list of stop words
D) The matrix of term frequencies
Answer: B
Explanation: A corpus is the complete set of documents that constitute the data source for a text-mining project.

Question 3. Which of the following best describes the difference between stemming and lemmatization?
A) Stemming removes punctuation; lemmatization adds it
B) Stemming produces a valid dictionary word; lemmatization truncates the word
C) Stemming truncates words mechanically; lemmatization uses morphological analysis to return the base dictionary form
D) Stemming is language-independent; lemmatization works only for English
Answer: C
Explanation: Stemming applies heuristic rules to cut off word endings, often yielding non-dictionary fragments, whereas lemmatization uses linguistic knowledge to map a word to its proper base form.
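To make the contrast in Question 3 concrete, the following is a minimal Python sketch, not a KNIME workflow. The suffix list, the three-character minimum, and the small lemma table are assumptions invented for illustration; real stemmers and lemmatizers use far richer rules and dictionaries.

```python
# Toy illustration only: the suffix rules and the lemma table below are
# made-up assumptions, not the rules of any real stemmer or lemmatizer.

SUFFIXES = ("ies", "ing", "ed", "es", "s")  # checked in this order

# Hypothetical mini-dictionary mapping inflected forms to dictionary forms.
LEMMA_TABLE = {
    "studies": "study",
    "running": "run",
    "better": "good",
    "mice": "mouse",
}


def crude_stem(word):
    """Chop the first matching suffix off mechanically (no linguistic knowledge)."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


def lookup_lemma(word):
    """Return the dictionary form if known, otherwise the lowercased word."""
    lowered = word.lower()
    return LEMMA_TABLE.get(lowered, lowered)


if __name__ == "__main__":
    for w in ["studies", "running", "better", "mice"]:
        print(f"{w:10s} stem={crude_stem(w):8s} lemma={lookup_lemma(w)}")
```

In this sketch the stemmer turns "studies" into the fragment "stud", while the lookup returns the dictionary word "study"; irregular forms such as "mice" are only handled by the lookup.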
Question 4. Which node in KNIME’s Text Processing extension computes TF/IDF weights for a Bag-of-Words matrix?
A) TF/IDF Calculator
B) Document Vector
C) Bag of Words Creator
D) Term Frequency Counter
Answer: A
Explanation: The TF/IDF Calculator node takes a term-by-document matrix and applies the TF/IDF weighting scheme to each cell.

Question 5. In a Term-by-Document Matrix, rows represent:
A) Documents
B) Terms (tokens)
C) Classes
D) Stop words
Answer: B
Explanation: Each row corresponds to a distinct term (or token) while each column corresponds to a document; the cell values are frequencies or weighted counts.

Question 6. Which of the following is NOT a typical preprocessing step in KNIME’s text workflow?
A) Case conversion to lower case
B) Removing HTML tags
C) Adding random noise to the text
D) Removing punctuation
Answer: C
Explanation: Adding random noise would degrade data quality; preprocessing aims to clean and standardize text.

Question 7. The Named Entity Recognition (NER) node in KNIME can identify which of these entity types by default?
A) Chemical formulas only
D) Stop Word Filter
Answer: A
Explanation: The POS Tagger node annotates each token with its grammatical category (noun, verb, adjective, etc.).

Question 11. The Bag-of-Words model loses which type of information?
A) Term frequency
B) Document length
C) Word order and context
D) Vocabulary size
Answer: C
Explanation: BoW treats documents as unordered collections of terms, discarding syntactic order and contextual relationships.

Question 12. When using the Tika Parser node, which file formats can be read directly without additional converters?
A) CSV, JSON, XML only
B) PDF, Microsoft Word, HTML, plain text
C) JPEG, PNG, GIF
D) Audio files (MP3, WAV)
Answer: B
Explanation: The Tika Parser leverages Apache Tika to extract text from many document types, including PDFs, Word files, HTML, and plain text.

Question 13. Which of the following best describes the purpose of dimensionality reduction (e.g., SVD) on a term-document matrix?
A) Increase the number of terms for better accuracy
B) Reduce computational cost and mitigate sparsity while preserving major patterns
C) Convert text to binary format
D) Remove stop words automatically
Answer: B
Explanation: Techniques like Singular Value Decomposition compress the high-dimensional sparse matrix into a lower-dimensional space, retaining the dominant semantic structure.

Question 14. In KNIME, the Document Vector node converts a Bag-of-Words model into:
A) A list of raw strings
B) A numeric vector per document suitable for machine learning
C) An image representation of the text
D) A PDF file
Answer: B
Explanation: The node creates a numeric vector (often TF/IDF weighted) for each document, enabling downstream classifiers or clustering algorithms.

Question 15. Which of the following is a synonym handling technique in KNIME?
A) Tokenizer
B) Synonym Mapper
C) NER Annotator
D) Stop Word Filter
Answer: B
Explanation: The Synonym Mapper node replaces tokens with a canonical form based on a provided synonym list.

Question 16. The Topic Extractor node in KNIME implements which algorithm by default?
A) K-Means clustering
B) Latent Dirichlet Allocation (LDA)
C) Decision Tree classification
D) Naïve Bayes sentiment analysis
Answer: B
Explanation: The node runs LDA to discover latent topics and provides document-topic and term-topic distributions.
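The idea behind Questions 11 and 14, one numeric vector per document with word order discarded, can be sketched in plain Python. This is an illustration under simplifying assumptions (whitespace tokenization, raw counts as weights) and does not reproduce the KNIME Document Vector node; the function names are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def build_vocabulary(docs):
    """Collect every distinct token in the corpus, sorted for a stable column order."""
    return sorted({token for doc in docs for token in tokenize(doc)})


def to_count_vector(doc, vocabulary):
    """Map one document to a vector of raw term counts, one entry per vocabulary term."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocabulary]


if __name__ == "__main__":
    corpus = ["dog bites man", "man bites dog", "dog barks"]
    vocab = build_vocabulary(corpus)
    print(vocab)
    for doc in corpus:
        print(doc, "->", to_count_vector(doc, vocab))
```

Note that "dog bites man" and "man bites dog" map to the same vector, which is exactly the loss of word order that the Bag-of-Words model implies.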
Question 20. The Word2Vec Learner node in KNIME learns word embeddings based on:
A) Co-occurrence matrix factorization only
B) Skip-gram or Continuous Bag-of-Words (CBOW) neural models
C) Rule-based synonym dictionaries
D) TF/IDF weighting
Answer: B
Explanation: Word2Vec implements either the Skip-gram or CBOW architecture to produce dense vector representations of words.

Question 21. Which of the following preprocessing steps would most directly improve the performance of a Named Entity Recognition node?
A) Removing all numbers from the text
B) Converting all characters to uppercase
C) Normalizing whitespace and removing HTML tags
D) Applying stemming before NER
Answer: C
Explanation: Clean, well-structured text (proper whitespace, no HTML markup) helps the NER model correctly identify entity boundaries.

Question 22. In a Bag-of-Words model, the term “document frequency” (DF) of a word is defined as:
A) The total number of times the word appears across all documents
B) The number of documents that contain the word at least once
C) The inverse of the term frequency
D) The logarithm of the term’s TF/IDF score
Answer: B
Explanation: Document frequency counts how many distinct documents include the term, not how many times it occurs overall.

Question 23. Which node would you use to remove tokens shorter than three characters from a Document?
A) Token Length Filter
B) Stop Word Filter
C) Tokenizer
D) Document Cleaner
Answer: A
Explanation: The Token Length Filter node discards tokens based on length criteria, such as removing very short words.

Question 24. When applying LDA, the hyperparameter α primarily controls:
A) The number of topics
B) The sparsity of document-topic distributions
C) The vocabulary size
D) The learning rate of the algorithm
Answer: B
Explanation: α (alpha) is the Dirichlet prior for per-document topic proportions; a smaller α yields sparser topic assignments per document.

Question 25. Which KNIME node can be used to visualize term frequencies as a tag cloud?
A) Word Cloud (Tag Cloud)
B) Bar Chart
C) Scatter Plot
D) Histogram
Answer: A
Explanation: The Word Cloud (also called Tag Cloud) node creates a visual where term size reflects frequency or weight.

Question 26. The Document Vector to Table node is useful for:
A) Converting numeric vectors back into raw text
B) Exporting vector representations as a flat table for external modeling tools
C) Merging multiple documents into a single PDF
Explanation: Homonyms are words with identical spelling (or pronunciation) but distinct senses, e.g., “bank” (financial) vs. “bank” (river side).

Question 30. The TF/IDF weighting scheme reduces the impact of:
A) Rare terms that appear in only a few documents
B) Common terms that appear in many documents
C) Terms that have high term frequency in a single document
D) All of the above
Answer: B
Explanation: IDF penalizes terms that occur in many documents, lowering their overall weight.

Question 31. Which KNIME node would you employ to automatically generate a list of the most frequent terms across a corpus?
A) Term Frequency Counter
B) Frequency Distribution
C) Text Statistics
D) Top K Terms
Answer: D
Explanation: The Top K Terms node extracts the K most frequent terms from a Document collection.

Question 32. When constructing a workflow for sentiment analysis, which sequence of nodes is most appropriate?
A) File Reader → Document Vector → TF/IDF → Naïve Bayes
B) File Reader → Strings to Document → Document Cleaner → POS Tagger → Sentiment Lexicon → Decision Tree
C) File Reader → Strings to Document → Document Cleaner → Stop Word Filter → TF/IDF → Naïve Bayes → Scorer
D) File Reader → PDF Parser → Word2Vec Learner → K-Means
Answer: C
Explanation: The typical pipeline cleans text, removes stop words, computes TF/IDF, then feeds the vectors into a classifier (e.g., Naïve Bayes) for sentiment scoring.

Question 33. Which of the following is NOT a valid composite data type used in the KNIME Text Processing extension?
A) Document
B) Term
C) Token
D) Image
Answer: D
Explanation: Document, Term, and Token are specific to text processing; Image is a separate data type unrelated to text nodes.

Question 34. In a text clustering workflow, the k-Medoids algorithm differs from k-Means primarily because:
A) k-Medoids uses actual data points as cluster centers, making it robust to outliers
B) k-Medoids works only on binary data
C) k-Medoids requires a TF/IDF matrix, while k-Means does not
D) k-Medoids automatically determines the optimal number of clusters
Answer: A
Explanation: k-Medoids selects representative objects (medoids) from the dataset, reducing sensitivity to extreme values.

Question 35. Which node would you use to assign a custom tag (e.g., “review”) to each Document for later filtering?
A) Document Tagger
B) Tag Creator
C) Metadata Writer
D) Column Appender
Answer: A
Question 39. Which of the following best describes the purpose of a stop word list?
A) To highlight important domain-specific terms
B) To remove high-frequency, low-information words that do not contribute to the semantics of the text
C) To replace all nouns with their synonyms
D) To convert all verbs into infinitive form
Answer: B
Explanation: Stop words (e.g., “the”, “and”) are removed because they occur frequently but carry little discriminative power.

Question 40. When using the Topic Extractor node, the “Number of Topics” parameter controls:
A) The maximum number of words per topic
B) How many distinct latent topics the algorithm will attempt to discover
C) The size of the output document vectors
D) The depth of the underlying decision tree
Answer: B
Explanation: This parameter tells LDA how many topic distributions to learn from the corpus.

Question 41. Which node would you use to split a large corpus into training and testing sets while preserving the original document structure?
A) Partitioning
B) Row Splitter
C) Document Splitter
D) Sample
Answer: C
Explanation: The Document Splitter node divides a collection of Document objects into two or more subsets, maintaining their metadata.
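As a small illustration of the stop word filtering described in Question 39, the sketch below removes listed words from a token sequence in plain Python. The stop word set is a tiny hypothetical list chosen for the example, not the list shipped with KNIME or any other tool.

```python
# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"the", "and", "a", "of", "is", "to", "in"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens whose lowercase form is not in the stop word list."""
    return [token for token in tokens if token.lower() not in stop_words]


if __name__ == "__main__":
    sentence = "The plot of the film is thin"
    print(remove_stop_words(sentence.split()))
```

The comparison is done on the lowercased token so that "The" and "the" are treated alike, while the surviving tokens keep their original casing.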
Question 42. In the context of word embeddings, the term “context window size” refers to:
A) The dimensionality of the embedding vectors
B) The number of surrounding words considered during training
C) The number of topics extracted by LDA
D) The size of the corpus in megabytes
Answer: B
Explanation: The context window defines how many neighboring tokens on each side are used to predict a target word in models like Word2Vec.

Question 43. Which of the following is a common evaluation metric for a text classification model?
A) Silhouette score
B) Perplexity
C) Accuracy, Precision, Recall, F1-Score
D) Cosine similarity
Answer: C
Explanation: Classification tasks are assessed with metrics such as accuracy, precision, recall, and the F1-score.

Question 44. The Document Normalizer node in KNIME can perform which of the following actions?
A) Convert all characters to uppercase only
B) Apply L2 normalization to document vectors
C) Remove duplicate documents from a corpus
D) Generate a word cloud automatically
Answer: B
Explanation: This node normalizes numeric document vectors (e.g., scaling to unit length) which can improve similarity calculations.

Question 45. Which preprocessing technique would you apply to ensure that “USA” and “U.S.A.” are treated as the same token?
A) Stemming
C) Average linkage
D) Ward’s method
Answer: D
Explanation: Ward’s method minimizes the total within-cluster variance, often yielding compact and well-separated clusters.

Question 49. The Tag Cloud node in KNIME derives term size based on which default metric?
A) TF/IDF weight
B) Raw term frequency
C) Document length
D) Random assignment
Answer: B
Explanation: By default, the tag cloud scales words according to raw frequency, though other weighting schemes can be supplied.

Question 50. Which node can be used to export a processed corpus as a set of plain text files, one per document?
A) Document Writer
B) CSV Writer
C) Text Exporter
D) File Meta Data Writer
Answer: A
Explanation: The Document Writer node writes each Document object to a separate file (e.g., .txt) preserving the processed content.

Question 51. In text mining, a “lexicon” typically refers to:
A) A collection of stop words only
B) A curated dictionary of words with associated properties (e.g., sentiment polarity)
C) A set of PDF files to be parsed
D) A list of document IDs
Answer: B
Explanation: Lexicons provide word-level information such as sentiment scores, synonyms, or part-of-speech tags.

Question 52. Which of the following is a true statement about the Word2Vec model compared to traditional BoW?
A) Word2Vec captures semantic similarity between words, while BoW does not
B) Word2Vec produces sparse matrices, BoW produces dense vectors
C) Word2Vec requires document-level labels for training
D) BoW can represent word order, Word2Vec cannot
Answer: A
Explanation: Word2Vec learns continuous embeddings where semantically related words have similar vectors, a capability absent in BoW.

Question 53. When using the Document Enricher node to add POS tags, which column must contain the Document objects?
A) Any column of type String
B) A column of type Document (the only acceptable input)
C) A numeric column with token counts
D) A Boolean column indicating sentiment
Answer: B
Explanation: The Document Enricher operates on Document data types; other column types are ignored.

Question 54. The TF/IDF weighting scheme can be interpreted as:
A) A measure of term rarity across the corpus multiplied by its local importance within a document
B) The product of term length and document length
C) The sum of term frequencies across all documents
D) A binary indicator of term presence
Answer: A
Explanation: TF captures local importance; IDF reflects global rarity. Their product yields TF/IDF.
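The product described in Question 54 can be worked through in a few lines of Python. This is a sketch under stated assumptions: TF is taken as the relative frequency of a term within its document and IDF as log(N / DF), which is one common textbook variant; the exact formulas used by KNIME or other tools may differ, and the function names are hypothetical.

```python
import math
from collections import Counter


def document_frequency(docs_tokens):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))  # set() so repeats within a document count once
    return df


def tf_idf(docs_tokens):
    """Return one {term: weight} dict per document, with weight = TF * IDF.

    Assumed variant: TF = count / document length, IDF = ln(N / DF).
    """
    n_docs = len(docs_tokens)
    df = document_frequency(docs_tokens)
    weights = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    corpus = [["the", "cat"], ["the", "dog"]]
    for doc_weights in tf_idf(corpus):
        print(doc_weights)
```

In this two-document example "the" appears in every document, so its IDF is ln(2/2) = 0 and its weight vanishes, whereas "cat" and "dog" each keep a positive weight. This mirrors the point that IDF lowers the weight of terms that occur in many documents.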
Question 58. Which node can be used to convert a collection of Document objects into a CSV file containing one row per document and a column with the raw text?
A) Document to CSV
B) CSV Writer (configured with the Document column)
C) Document Table Exporter
D) Text Exporter
Answer: A
Explanation: The Document to CSV node extracts the text content of each Document and writes it to a CSV format.

Question 59. When applying a Naïve Bayes classifier to TF/IDF vectors, which assumption is made about the features?
A) Features are conditionally independent given the class label
B) Features follow a Gaussian distribution
C) Features are linearly correlated
D) Features are ordinal
Answer: A
Explanation: Naïve Bayes assumes feature independence conditioned on the class, simplifying probability calculations.

Question 60. Which of the following best describes the role of a Component in a KNIME text-mining workflow?
A) It bundles a set of nodes into a reusable, encapsulated unit with its own input/output ports
B) It converts Document objects into images
C) It automatically performs hyperparameter tuning
D) It visualizes term frequencies as a 3-D plot
Answer: A
Explanation: Components allow modular design, letting you reuse a sub-workflow (e.g., preprocessing pipeline) across multiple projects.

Question 61. The Document Viewer node provides:
A) An interactive GUI to inspect token annotations, POS tags, and NER results within each Document
B) A statistical summary of term frequencies only
C) A way to export documents as PDFs
D) Automatic sentiment scores
Answer: A
Explanation: The viewer displays the full enriched Document, showing highlighted annotations for easy manual inspection.

Question 62. Which of the following is a standard method for handling out-of-vocabulary (OOV) words when using pre-trained Word2Vec embeddings?
A) Drop the entire document containing OOV words
B) Map OOV words to a special “unknown” token vector
C) Replace OOV words with their stems
D) Convert OOV words to uppercase
Answer: B
Explanation: Pre-trained embedding models typically include an “UNK” vector to represent unseen words.

Question 63. In a text classification workflow, why might you prefer a Linear SVM over a Decision Tree?
A) Linear SVM handles high-dimensional sparse data more effectively
B) Decision Trees are faster for large vocabularies
C) SVMs automatically generate word clouds
D) Decision Trees cannot process TF/IDF vectors
Answer: A
Explanation: Linear SVMs are well-suited to high-dimensional, sparse feature spaces common in text data, often yielding better generalization.

Question 64. Which node can be used to compute the perplexity of a language model built on a corpus?
A) Language Model Evaluator