























































































This exam validates proficiency in discovering patterns, trends, and relationships within large datasets. Topics include classification, clustering, association rules, anomaly detection, predictive modeling, and evaluation techniques. Certified candidates can apply data mining methods to solve complex analytical problems across multiple domains.
Typology: Exams

Question 1. Which linguistic level deals with the study of word formation such as prefixes and suffixes?
A) Syntax
B) Morphology
C) Semantics
D) Pragmatics
Answer: B
Explanation: Morphology is the branch of linguistics that analyzes the structure of words, including roots, prefixes, and suffixes.

Question 2. In the Penn Treebank corpus, what does the tag “NN” represent?
A) Proper noun singular
B) Noun, singular or mass
C) Verb, base form
D) Adjective, comparative
Answer: B
Explanation: “NN” is the Penn Treebank tag for a noun that is singular or a mass noun.

Question 3. Which of the following is the most appropriate regular expression to capture an email address of the form [email protected]?
A) \w+@\w+\.\w+
B) [a-z]+[0-9]
C) \d{3}-\d{2}-\d{4}
D) \s+
Answer: A
Explanation: The pattern \w+@\w+\.\w+ matches one or more word characters, an “@”, a domain name, a dot, and a top‑level domain.

Question 4. When performing tokenization, which step should be applied first to ensure case‑insensitive matching?
A) Lemmatization
B) Stop‑word removal
C) Lowercasing
D) Stemming
Answer: C
Explanation: Lowercasing converts all characters to the same case, allowing subsequent steps to treat “Apple” and “apple” identically.

Question 5. Which stemming algorithm is known for its aggressive reduction of words and is often considered too harsh for English?
A) Porter
B) Snowball
C) Lancaster
D) Krovetz
Answer: C
Explanation: The Lancaster stemmer applies very aggressive rules, often stripping more characters than desired.

Question 6. What is the main difference between stemming and lemmatization?
A) Stemming uses dictionaries; lemmatization does not.
B) Lemmatization returns the base form; stemming only truncates.
C) Stemming works on characters; lemmatization works on phonemes.
B) The term is rare across documents.
C) The term is a stop‑word.
D) The term occurs only in the current document.
Answer: B
Explanation: Inverse Document Frequency is higher for terms that appear in fewer documents, emphasizing rare but potentially discriminative words.

Question 10. Which word‑embedding technique explicitly incorporates sub‑word information?
A) Word2Vec CBOW
B) GloVe
C) FastText
D) Latent Semantic Analysis
Answer: C
Explanation: FastText represents words as a sum of character n‑gram vectors, enabling it to handle unseen or morphologically rich words.

Question 11. In Word2Vec, what is the primary objective of the Skip‑gram model?
A) Predict the current word from surrounding context.
B) Predict surrounding context words from the current word.
C) Reduce dimensionality of the term‑document matrix.
D) Cluster words based on co‑occurrence frequency.
Answer: B
Explanation: Skip‑gram learns embeddings by using a target word to predict its neighboring context words.
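The Skip‑gram objective in Question 11 can be illustrated by the training pairs it is fed. The following is a minimal sketch, not Word2Vec's actual implementation: the function name `skipgram_pairs` and the `window` parameter are hypothetical names chosen for this example, and real toolkits add details such as subsampling and negative sampling.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token within `window` positions.

    Hypothetical helper for illustration; each pair is one training example
    in which the target word is used to predict a neighboring context word.
    """
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = "the quick brown fox".split()
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

With a window of 1, each word is paired only with its immediate neighbors, so "quick" yields the pairs ("quick", "the") and ("quick", "brown"). CBOW would use the same windows in the opposite direction, predicting the target from its context.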
Question 12. How is cosine similarity between two embedding vectors computed?
A) Dot product divided by the product of their magnitudes.
B) Euclidean distance subtracted from 1.
C) Sum of absolute differences.
D) Ratio of their L1 norms.
Answer: A
Explanation: Cosine similarity = (A·B) / (||A|| × ||B||), measuring the angle between vectors irrespective of magnitude.

Question 13. Which dimensionality‑reduction technique is commonly used to visualize high‑dimensional word embeddings in 2‑D space?
A) PCA
B) LDA
C) t‑SNE
D) K‑Means
Answer: C
Explanation: t‑Distributed Stochastic Neighbor Embedding (t‑SNE) preserves local structure and is popular for visualizing embeddings.

Question 14. What is the main advantage of using a Dependency Parser over a Constituency Parser?
A) It generates phrase‑structure trees.
B) It directly captures head‑dependent relations.
C) It requires less annotated data.
D) It can only parse English.
Answer: B
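The cosine similarity formula from Question 12, (A·B) / (||A|| × ||B||), can be sketched in plain Python. This is an illustrative sketch only: the function name `cosine_similarity` and the choice to raise an error for zero vectors are assumptions made for this example, and the toy vectors are not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The angle is undefined for a zero vector (assumed handling).
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

Scaling a vector does not change the result, which is why [1, 2] and [2, 4] score as identical in direction: the measure depends on the angle, not the magnitude.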
D) Beginning of a location entity.
Answer: B
Explanation: “B‑PER” follows the BIO scheme, marking the first token of a person‑type entity.

Question 18. Which of the following is a common error type in NER systems?
A) Over‑stemming of proper nouns.
B) Mis‑classifying dates as organizations.
C) Ignoring stop‑words.
D) Tokenizing punctuation as separate words.
Answer: B
Explanation: NER models sometimes confuse temporal expressions with organization names due to similar lexical cues.

Question 19. What problem does the vanishing gradient affect most in standard RNNs?
A) Overfitting on training data.
B) Inability to learn long‑range dependencies.
C) Excessive memory usage.
D) Slow convergence due to large gradients.
Answer: B
Explanation: Vanishing gradients cause the error signal to shrink exponentially, preventing the network from learning dependencies over many time steps.

Question 20. Which gating mechanism in LSTM controls how much of the previous cell state is retained?
A) Input gate
B) Forget gate
C) Output gate
D) Reset gate
Answer: B
Explanation: The forget gate decides which information from the prior cell state should be discarded.

Question 21. In a GRU cell, which component replaces both the input and forget gates of an LSTM?
A) Update gate
B) Reset gate
C) Candidate activation
D) Output gate
Answer: A
Explanation: The GRU’s update gate determines how much of the previous hidden state is kept, merging the functions of LSTM’s input and forget gates.

Question 22. Which type of attention allows each token to attend to all other tokens in the same sequence?
A) Causal attention
B) Self‑attention
C) Cross‑attention
D) Local attention
Answer: B
Explanation: Self‑attention computes attention scores among all token pairs within a single sequence.

Question 23. In the Transformer architecture, what is the purpose of positional encodings?
Question 26. Which pre‑trained model is encoder‑only and designed primarily for masked language modeling?
A) GPT‑2
B) BERT
C) T5
D) Transformer‑XL
Answer: B
Explanation: BERT uses a bidirectional encoder and is trained with a masked language modeling objective.

Question 27. Which model architecture follows an encoder‑decoder paradigm and is commonly used for text‑to‑text tasks?
A) BERT
B) GPT‑3
C) T5
D) RoBERTa
Answer: C
Explanation: T5 (Text‑to‑Text Transfer Transformer) treats every NLP problem as a text generation task using an encoder‑decoder.

Question 28. In transfer learning for NLP, what is the typical first step after loading a pre‑trained model?
A) Training from scratch on the target dataset.
B) Freezing all layers and only training the classifier head.
C) Removing the tokenizer.
D) Converting the model to a rule‑based system.
Answer: B
Explanation: Fine‑tuning usually starts by freezing the majority of the pre‑trained layers and training a new task‑specific head.

Question 29. Which fine‑tuning strategy is most appropriate when only a few labeled examples are available in a new domain?
A) Full‑model retraining
B) Zero‑shot inference
C) Few‑shot prompt engineering
D) Data augmentation with synthetic labels
Answer: C
Explanation: Few‑shot prompt engineering leverages the model’s in‑context learning ability with a handful of examples.

Question 30. In prompt engineering, what does “zero‑shot” refer to?
A) Providing no examples and only a task description.
B) Using a single example.
C) Training a new model from scratch.
D) Removing all stop‑words from the prompt.
Answer: A
Explanation: Zero‑shot prompting asks the model to perform a task based solely on the instruction, without demonstrations.

Question 31. Which evaluation metric is most suitable for assessing abstractive summarization quality?
A) BLEU
B) ROUGE‑L
C) Perplexity
A) Jaccard similarity
B) Cosine similarity
C) Hamming distance
D) Manhattan distance
Answer: B
Explanation: Cosine similarity efficiently measures angular closeness of high‑dimensional dense vectors.

Question 35. Which preprocessing step is essential before applying a regular expression that matches dates in the format “DD/MM/YYYY”?
A) Stemming
B) Lowercasing
C) Removing punctuation
D) Normalizing whitespace
Answer: D
Explanation: Normalizing whitespace ensures consistent spacing, allowing the regex to reliably locate date patterns.

Question 36. What is the main purpose of stop‑word removal in text classification pipelines?
A) To increase vocabulary size.
B) To reduce noise and dimensionality.
C) To improve tokenization speed.
D) To enhance stemming accuracy.
Answer: B
Explanation: Removing high‑frequency, low‑information words reduces feature space and often improves classifier performance.
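The stop‑word removal step described in Question 36 can be sketched as a simple filter applied after lowercasing and tokenization. This is a minimal illustration under stated assumptions: the `STOP_WORDS` set is a tiny hypothetical list (real pipelines use much longer, library‑provided lists), and `remove_stop_words` is a name invented for this example.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}


def remove_stop_words(text):
    """Lowercase the text, split it into alphanumeric tokens, drop stop-words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


filtered = remove_stop_words("The plot of the movie is a triumph")
print(filtered)
```

Eight input tokens shrink to three content words here, which shows the reduction in feature space that the explanation refers to. Whether this helps depends on the task: for sentiment analysis, for instance, removing negations such as "not" can hurt.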
Question 37. Which of the following best describes the “BIO” tagging scheme used in sequence labeling?
A) Binary Indicator Output
B) Begin‑Inside‑Outside
C) Bag‑of‑Instances‑Ordered
D) Backward‑Inference‑Optimization
Answer: B
Explanation: “BIO” stands for Begin, Inside, Outside, indicating the position of a token within a named entity.

Question 38. In the context of word embeddings, what does the term “distributional hypothesis” state?
A) Words with similar meanings occur in similar contexts.
B) All words follow a normal distribution in vector space.
C) Embeddings must be normally distributed for training stability.
D) Frequency distribution determines embedding dimensionality.
Answer: A
Explanation: The distributional hypothesis posits that semantic similarity can be inferred from contextual co‑occurrence patterns.

Question 39. Which algorithm can be used to automatically learn the optimal number of topics in a corpus?
A) K‑means clustering
B) Latent Dirichlet Allocation with hierarchical Dirichlet process
C) Principal Component Analysis
D) Naïve Bayes classifier
C) Multi‑head attention
D) Cross‑attention
Answer: B
Explanation: Causal (or masked) attention prevents a token from attending to future positions, ensuring autoregressive generation.

Question 43. Which loss function is typically used for training a binary classification head on top of BERT for sentiment analysis?
A) Categorical cross‑entropy
B) Binary cross‑entropy
C) Hinge loss
D) Mean squared error
Answer: B
Explanation: Binary cross‑entropy measures the error for two‑class problems and is standard for sigmoid‑based classifiers.

Question 44. What is the main benefit of using sub‑word tokenization (e.g., WordPiece) in transformer models?
A) Reduces model depth.
B) Handles out‑of‑vocabulary words efficiently.
C) Guarantees grammatical correctness.
D) Eliminates the need for positional encodings.
Answer: B
Explanation: Sub‑word tokenizers break rare words into known pieces, allowing the model to represent unseen words.

Question 45. Which of the following best describes “domain adaptation” in NLP?
A) Translating text from one language to another.
B) Fine‑tuning a model on data from a specific industry to improve performance there.
C) Converting a rule‑based system into a neural one.
D) Reducing model size for mobile deployment.
Answer: B
Explanation: Domain adaptation tailors a pre‑trained model to a target domain (e.g., medical) using domain‑specific data.

Question 46. In a language model, what does a high perplexity score indicate?
A) The model predicts the test set well.
B) The model is overfitting.
C) The model struggles to predict the data.
D) The model has too many parameters.
Answer: C
Explanation: Higher perplexity means the model assigns lower probability to the actual tokens, indicating poorer predictive ability.

Question 47. Which technique can mitigate the “catastrophic forgetting” problem when fine‑tuning LLMs on new tasks?
A) Gradient clipping
B) Elastic weight consolidation
C) Dropout regularization
D) Learning rate decay
Answer: B
Explanation: Elastic weight consolidation penalizes changes to important weights, preserving knowledge from the original task.
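The penalty described in Question 47 can be sketched numerically. Elastic weight consolidation adds a quadratic term, (λ/2) Σ F_i (θ_i − θ*_i)², to the new task's loss, where θ* are the weights learned on the original task and F_i is an importance estimate (typically the diagonal of the Fisher information). The sketch below uses flat Python lists in place of real model parameters; the function names and all numeric values are illustrative assumptions, not taken from any library.

```python
def ewc_penalty(params, old_params, fisher, lam):
    """Quadratic EWC penalty: (lam / 2) * sum(F_i * (theta_i - theta_star_i)^2)."""
    return 0.5 * lam * sum(
        f * (p - p_old) ** 2
        for p, p_old, f in zip(params, old_params, fisher)
    )


def ewc_loss(task_loss, params, old_params, fisher, lam=1.0):
    """New-task loss plus the penalty for drifting from the old-task weights."""
    return task_loss + ewc_penalty(params, old_params, fisher, lam)


old_params = [0.0, 2.0]   # weights after the original task (made-up values)
fisher = [4.0, 0.1]       # first weight is "important", second is not

# Moving the important weight by 1.0 costs far more than moving the other one.
move_important = ewc_penalty([1.0, 2.0], old_params, fisher, lam=0.5)
move_unimportant = ewc_penalty([0.0, 3.0], old_params, fisher, lam=0.5)
total = ewc_loss(0.3, [1.0, 2.0], old_params, fisher, lam=0.5)
```

Because the penalty scales with F_i, the optimizer remains free to change weights that mattered little for the original task while being discouraged from changing the ones that mattered most.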
Answer: B
Explanation: Large parameter counts lead to computationally intensive inference, causing latency issues in real‑time applications.

Question 51. Which embedding evaluation technique measures how well vectors capture analogical relationships (e.g., king – man + woman ≈ queen)?
A) Cosine similarity on word pairs
B) Word intrusion test
C) Word analogy task
D) Nearest neighbor classification
Answer: C
Explanation: The word analogy task evaluates embeddings on solving proportional relationships using vector arithmetic.

Question 52. What is the primary purpose of “masking” in BERT’s pre‑training objective?
A) To reduce vocabulary size.
B) To force the model to predict missing words from context.
C) To prevent overfitting on the training data.
D) To enable sequence‑to‑sequence translation.
Answer: B
Explanation: Masked language modeling hides random tokens, requiring BERT to infer them using bidirectional context.

Question 53. Which of the following best describes “few‑shot prompting” with GPT‑3?
A) Providing a large labeled dataset for fine‑tuning.
B) Supplying a short description and a few examples within the prompt.
C) Training the model from scratch on a new language.
D) Removing all punctuation from the input.
Answer: B
Explanation: Few‑shot prompting includes a task description plus a handful of input‑output examples to guide generation.

Question 54. In the context of tokenization, what does “byte‑pair encoding (BPE)” aim to achieve?
A) Compress the text size for storage.
B) Learn a set of sub‑word units based on frequency of character pair merges.
C) Encode tokens as fixed‑length binary strings.
D) Remove all vowels from words.
Answer: B
Explanation: BPE iteratively merges the most frequent adjacent character pairs, creating a sub‑word vocabulary.

Question 55. Which loss function is commonly used when training a model for next‑sentence prediction (NSP) in BERT?
A) Binary cross‑entropy
B) Categorical cross‑entropy
C) Hinge loss
D) Mean absolute error
Answer: A
Explanation: NSP is a binary classification task (whether two sentences follow each other), so binary cross‑entropy is applied.

Question 56. What does the term “gradient clipping” refer to in training deep NLP models?