Data Types and Machine Learning Models in Data Science, Exams of Advanced Education

Data Types and Machine Learning Models in Data Science

Typology: Exams

2025/2026

Available from 04/18/2026

lectben
lectben ๐Ÿ‡บ๐Ÿ‡ธ

5

(1)

7.7K documents

1 / 2

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Data Types and Machine Learning
Models in Data Science
Tabular data - CORRECT ANSWER โœ”โœ”โœ” Rows represent observations and columns
represent variables. This is considered traditional social science data. Example:
Employee records with variables like age, salary, and job title.
Text data - CORRECT ANSWER โœ”โœ”โœ” Rather than numbers, the data is text. It
needs to be converted into tabular data for processing. Techniques like tokenization
(e.g., Chat GPT-3 tokenizer, tiktoken, Byte pair encoding) are used for this purpose.
Example: Analyzing employee emails for sentiment or topic classification.
Image data - CORRECT ANSWER โœ”โœ”โœ” AI processes images as matrices of pixel
values. An image consists of a matrix of numbers between 0 and 255, indicating a
pixel's intensity. Example: Identifying defective products using machine vision on
factory lines.
Audio data - CORRECT ANSWER โœ”โœ”โœ” Audio data is often converted to arrays
based on waveform, quantifying the shape into numbers. Concepts related to audio
include pitch (measured in hertz), amplitude (loudness/quietness), and timbre
(uniqueness of sound). Example: Transcribing customer service calls for analysis.
Supervised machine learning - CORRECT ANSWER โœ”โœ”โœ” Involves training an
algorithm on a labeled dataset for predictions or decisions. Input data (only
features) is paired with output data (labels or targets). The algorithm learns
patterns between the input and output. Accuracy is a primary metric, measuring the
proportion of correctly classified instances. Example: Email spam detection where
emails are labeled as spam or not spam.
Unsupervised machine learning - CORRECT ANSWER โœ”โœ”โœ” The algorithm is trained
on an unlabeled dataset. It identifies patterns or relationships without guidance. The
algorithm might cluster similar data points based on shared features and can also
reduce data dimensionality. Evaluation is challenging due to the absence of a
'ground truth'. Example: Customer segmentation for targeted marketing campaigns.
Reinforcement learning - CORRECT ANSWER โœ”โœ”โœ” Has an agent interacting with an
environment to learn optimal decisions. The agent receives rewards or penalties
based on actions and updates its model to learn beneficial actions. The goal is to
maximize total reward. Total reward during training is often used for evaluation,
along with benchmarking against a baseline. Example: Training a robot to navigate
a warehouse while avoiding obstacles.
Generative AI - CORRECT ANSWER โœ”โœ”โœ” New examples of data that are useful.
Large Language Models (LLM) - CORRECT ANSWER โœ”โœ”โœ” Models that require large
amounts of data (billions) and parameters (billions).
Generative Pre-trained Transformer (GPT) - CORRECT ANSWER โœ”โœ”โœ” The algorithm
name for ChatGPT.
pf2

Partial preview of the text

Download Data Types and Machine Learning Models in Data Science and more Exams Advanced Education in PDF only on Docsity!

Data Types and Machine Learning

Models in Data Science

Tabular data - CORRECT ANSWER โœ”โœ”โœ” Rows represent observations and columns represent variables. This is considered traditional social science data. Example: Employee records with variables like age, salary, and job title. Text data - CORRECT ANSWER โœ”โœ”โœ” Rather than numbers, the data is text. It needs to be converted into tabular data for processing. Techniques like tokenization (e.g., Chat GPT-3 tokenizer, tiktoken, Byte pair encoding) are used for this purpose. Example: Analyzing employee emails for sentiment or topic classification. Image data - CORRECT ANSWER โœ”โœ”โœ” AI processes images as matrices of pixel values. An image consists of a matrix of numbers between 0 and 255, indicating a pixel's intensity. Example: Identifying defective products using machine vision on factory lines. Audio data - CORRECT ANSWER โœ”โœ”โœ” Audio data is often converted to arrays based on waveform, quantifying the shape into numbers. Concepts related to audio include pitch (measured in hertz), amplitude (loudness/quietness), and timbre (uniqueness of sound). Example: Transcribing customer service calls for analysis. Supervised machine learning - CORRECT ANSWER โœ”โœ”โœ” Involves training an algorithm on a labeled dataset for predictions or decisions. Input data (only features) is paired with output data (labels or targets). The algorithm learns patterns between the input and output. Accuracy is a primary metric, measuring the proportion of correctly classified instances. Example: Email spam detection where emails are labeled as spam or not spam. Unsupervised machine learning - CORRECT ANSWER โœ”โœ”โœ” The algorithm is trained on an unlabeled dataset. It identifies patterns or relationships without guidance. The algorithm might cluster similar data points based on shared features and can also reduce data dimensionality. Evaluation is challenging due to the absence of a 'ground truth'. Example: Customer segmentation for targeted marketing campaigns. Reinforcement learning - CORRECT ANSWER โœ”โœ”โœ” Has an agent interacting with an environment to learn optimal decisions. The agent receives rewards or penalties based on actions and updates its model to learn beneficial actions. The goal is to maximize total reward. Total reward during training is often used for evaluation, along with benchmarking against a baseline. Example: Training a robot to navigate a warehouse while avoiding obstacles. Generative AI - CORRECT ANSWER โœ”โœ”โœ” New examples of data that are useful. Large Language Models (LLM) - CORRECT ANSWER โœ”โœ”โœ” Models that require large amounts of data (billions) and parameters (billions). Generative Pre-trained Transformer (GPT) - CORRECT ANSWER โœ”โœ”โœ” The algorithm name for ChatGPT.

Positional Encoding - CORRECT ANSWER โœ”โœ”โœ” Scheme for maintaining word order in the model. Self-Attention - CORRECT ANSWER โœ”โœ”โœ” Biases the model to focus on which words it should pay attention to, attempting to provide context. Data Type for ChatGPT - CORRECT ANSWER โœ”โœ”โœ” Tokenized (tiktoken) text data. Algorithms for ChatGPT - CORRECT ANSWER โœ”โœ”โœ” Includes supervised learning (deep neural networks using attention & transformers), unsupervised (word embeddings), and reinforcement learning. Application for ChatGPT - CORRECT ANSWER โœ”โœ”โœ” Web-based chat-bot application. Pre-training - CORRECT ANSWER โœ”โœ”โœ” The model is pre-trained on general language using diverse text data unrelated to specific tasks. Instruction fine-tuning - CORRECT ANSWER โœ”โœ”โœ” Fine-tuning using full scripts of Friends to capture character-specific language. RLHF - CORRECT ANSWER โœ”โœ”โœ” Trained evaluators interact with the LLM to rate dialogue matching character tones. Task-specific metrics - CORRECT ANSWER โœ”โœ”โœ” Metrics like BLEU and ROUGE used for summarization tasks or translation, not for nuanced language. Examples of task-specific metrics - CORRECT ANSWER โœ”โœ”โœ” Using ROUGE or BLEU for summarization tasks. Research Benchmark - CORRECT ANSWER โœ”โœ”โœ” Large sets of Q&A covering many topics for model evaluation. LLM-Self evaluation - CORRECT ANSWER โœ”โœ”โœ” Fast and easy to implement but expensive; useful when evaluation is simpler than the task. Human evaluation - CORRECT ANSWER โœ”โœ”โœ” Most reliable but slow and expensive, especially with expert evaluators. Crowdsourced evaluations - CORRECT ANSWER โœ”โœ”โœ” Provide general skills in ranking but are less useful for task-specific model selection. Expert linguists - CORRECT ANSWER โœ”โœ”โœ” Review translation quality in human evaluations. Platforms for evaluations - CORRECT ANSWER โœ”โœ”โœ” Examples include LMYSYS or chatbot-arena