Data Types and Machine Learning Models in Data Science | Exams Advanced Education


Tabular data - CORRECT ANSWER ✔✔✔ Rows represent observations and columns represent variables. This is considered traditional social science data. Example: Employee records with variables like age, salary, and job title.

Text data - CORRECT ANSWER ✔✔✔ The data is text rather than numbers. It must be converted into a numeric (tabular) form before processing. Tokenization (e.g., the GPT-3 tokenizer, tiktoken, byte-pair encoding) is used for this purpose. Example: Analyzing employee emails for sentiment or topic classification.
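To make the tokenization step concrete, here is a toy sketch of byte-pair encoding in plain Python. It is not the tiktoken implementation or the GPT-3 tokenizer: it merges the most frequent adjacent pair within a single string, and the function names (`toy_bpe`, `encode`) are made up for illustration.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent token pair, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def toy_bpe(text, num_merges):
    """Start from single characters and apply up to `num_merges` merges."""
    tokens = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
    return tokens


def encode(tokens):
    """Map each distinct token to an integer ID (sorted order) and encode."""
    vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
    return [vocab[tok] for tok in tokens], vocab
```

The integer IDs returned by `encode` are the numeric form a model would actually consume; production tokenizers learn their merges from a large corpus rather than from one string.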

Image data - CORRECT ANSWER ✔✔✔ AI processes images as matrices of pixel values. An 8-bit grayscale image is a matrix of numbers between 0 and 255, each indicating a pixel's intensity; a color image has one such matrix per color channel. Example: Identifying defective products using machine vision on factory lines.
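As a rough illustration of an image as a matrix, the sketch below stores a tiny grayscale image as nested lists and flags dark pixels. The 3x3 image, the threshold of 50, and the function names are assumptions chosen for the example, not part of any real inspection system.

```python
# A hypothetical 3x3 grayscale image: 255 is white, 0 is black.
IMAGE = [
    [255, 255, 255],
    [255, 0, 255],
    [255, 255, 255],
]


def normalize(image):
    """Scale 0-255 intensities to the 0.0-1.0 range often used as model input."""
    return [[pixel / 255 for pixel in row] for row in image]


def dark_pixel_fraction(image, threshold=50):
    """Return the share of pixels darker than `threshold` (a crude defect signal)."""
    pixels = [pixel for row in image for pixel in row]
    return sum(pixel < threshold for pixel in pixels) / len(pixels)
```

A real machine-vision pipeline would feed the normalized matrix to a trained model; the threshold rule only shows that the input is ordinary numeric data.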

Audio data - CORRECT ANSWER ✔✔✔ Audio data is often converted to arrays based on the waveform, quantifying its shape as numbers. Related concepts include pitch (frequency, measured in hertz), amplitude (loudness/quietness), and timbre (the quality that makes a sound distinctive). Example: Transcribing customer service calls for analysis.
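The sketch below shows one way a waveform becomes an array of numbers, and how amplitude and pitch can be read back from it. It assumes a pure synthetic sine tone; the zero-crossing pitch estimate is a deliberately simple stand-in and would not work well on real speech.

```python
import math


def sine_wave(freq_hz, amplitude, sample_rate, duration_s):
    """Sample a sine tone into a list of floats (the 'array' form of audio)."""
    n = int(sample_rate * duration_s)
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * t / sample_rate)
        for t in range(n)
    ]


def peak_amplitude(samples):
    """Largest absolute sample value, a simple proxy for loudness."""
    return max(abs(s) for s in samples)


def estimate_pitch_hz(samples, sample_rate):
    """Estimate pitch by counting upward zero crossings per second."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * sample_rate / len(samples)
```

For example, `estimate_pitch_hz(sine_wave(440, 0.5, 8000, 1.0), 8000)` should land within a couple of hertz of 440. Timbre is not captured here, since it depends on the mix of overtones rather than on a single frequency.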

Supervised machine learning - CORRECT ANSWER ✔✔✔ Involves training an algorithm on a labeled dataset to make predictions or decisions. Input data (features) is paired with output data (labels or targets), and the algorithm learns the patterns that link inputs to outputs. For classification, accuracy is a primary metric: the proportion of correctly classified instances. Example: Email spam detection where emails are labeled as spam or not spam.
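A minimal sketch of the supervised workflow, using a 1-nearest-neighbor classifier as a stand-in for whatever algorithm is actually chosen. The two features per email (say, link count and exclamation-mark count) and all data values are invented for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def predict_1nn(train_features, train_labels, x):
    """Predict the label of the closest labeled training example."""
    nearest = min(
        range(len(train_features)),
        key=lambda i: squared_distance(train_features[i], x),
    )
    return train_labels[nearest]


def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labeled training set: features paired with labels.
TRAIN_X = [(0, 0), (1, 0), (5, 4), (6, 5)]
TRAIN_Y = ["not spam", "not spam", "spam", "spam"]

# Hypothetical held-out emails with known labels, used only for evaluation.
TEST_X = [(0, 1), (5, 5), (2, 1)]
TEST_Y = ["not spam", "spam", "spam"]

PREDICTIONS = [predict_1nn(TRAIN_X, TRAIN_Y, x) for x in TEST_X]
```

Calling `accuracy(TEST_Y, PREDICTIONS)` compares predictions with the held-out labels; here the third email is misclassified, so two of three predictions are correct.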

Unsupervised machine learning - CORRECT ANSWER ✔✔✔ The algorithm is trained on an unlabeled dataset. It identifies patterns or relationships without guidance. The algorithm might cluster similar data points based on shared features and can also reduce data dimensionality. Evaluation is challenging due to the absence of a 'ground truth'. Example: Customer segmentation for targeted marketing campaigns.
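Clustering can be sketched with a small k-means loop. This is one common clustering method, not the only one; the customer features (two made-up numeric columns) and the choice to seed centroids with the first `k` points are assumptions that keep the example deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda c: squared_distance(point, centroids[c]))


def kmeans(points, k, iterations=10):
    """Alternate between assigning points to centroids and updating centroids."""
    centroids = [points[i] for i in range(k)]  # deterministic seeding (assumption)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_centroid(p, centroids)].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            mean_point(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [nearest_centroid(p, centroids) for p in points]
    return labels, centroids


# Hypothetical customers described by two unlabeled features.
CUSTOMERS = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
LABELS, CENTROIDS = kmeans(CUSTOMERS, k=2)
```

No labels are supplied anywhere; the segment numbers in `LABELS` come purely from how the points group together, which is also why there is no ground truth to score them against.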

Reinforcement learning - CORRECT ANSWER ✔✔✔ An agent interacts with an environment to learn optimal decisions. The agent receives rewards or penalties based on its actions and updates its model to favor beneficial actions. The goal is to maximize total reward. Total reward during training is often used for evaluation, along with benchmarking against a baseline. Example: Training a robot to navigate a warehouse while avoiding obstacles.
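The agent, reward, and update loop can be sketched with tabular Q-learning, one standard reinforcement learning algorithm. The environment is a made-up five-cell corridor (a very reduced stand-in for the warehouse), and the reward values, learning rate, discount, and exploration rate are illustrative assumptions.

```python
import random

N_STATES = 5          # corridor cells 0..4
GOAL = N_STATES - 1   # reaching the last cell ends the episode
ACTIONS = (-1, +1)    # action 0 moves left, action 1 moves right


def step(state, action_index):
    """Apply an action; return (next_state, reward, done)."""
    next_state = min(max(state + ACTIONS[action_index], 0), GOAL)
    if next_state == GOAL:
        return next_state, 1.0, True   # reward for reaching the goal
    return next_state, -0.01, False    # small penalty for every other step


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a Q-table with epsilon-greedy exploration; also log total rewards."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    episode_rewards = []
    for _episode in range(episodes):
        state, total = 0, 0.0
        for _step in range(100):  # cap episode length
            if rng.random() < epsilon:
                action = rng.randrange(2)                         # explore
            else:
                action = max((0, 1), key=lambda a: q[state][a])   # exploit
            next_state, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best future value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state, total = next_state, total + reward
            if done:
                break
        episode_rewards.append(total)
    return q, episode_rewards


def greedy_policy(q):
    """Best action index for each non-goal state according to the Q-table."""
    return [max((0, 1), key=lambda a: q[s][a]) for s in range(GOAL)]
```

The `episode_rewards` list is the "total reward during training" signal mentioned above; comparing it with a baseline such as a random policy is one way to judge whether learning has helped.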

Generative AI - CORRECT ANSWER ✔✔✔ AI that produces new examples of data that are useful.

Large Language Models (LLM) - CORRECT ANSWER ✔✔✔ Models that require very large amounts of training data (billions of tokens) and very large numbers of parameters (billions).

Generative Pre-trained Transformer (GPT) - CORRECT ANSWER ✔✔✔ The name of the model family behind ChatGPT.


Positional Encoding - CORRECT ANSWER ✔✔✔ Scheme for maintaining word order in the model.

Self-Attention - CORRECT ANSWER ✔✔✔ Lets the model weight which words it should pay attention to, providing context.

Data Type for ChatGPT - CORRECT ANSWER ✔✔✔ Tokenized (tiktoken) text data.

Algorithms for ChatGPT - CORRECT ANSWER ✔✔✔ Includes supervised learning (deep neural networks using attention and transformers), unsupervised learning (word embeddings), and reinforcement learning.

Application for ChatGPT - CORRECT ANSWER ✔✔✔ Web-based chatbot application.

Pre-training - CORRECT ANSWER ✔✔✔ The model is pre-trained on general language using diverse text data unrelated to specific tasks.

Instruction fine-tuning - CORRECT ANSWER ✔✔✔ Fine-tuning using full scripts of Friends to capture character-specific language.

RLHF (reinforcement learning from human feedback) - CORRECT ANSWER ✔✔✔ Trained evaluators interact with the LLM and rate how well its dialogue matches the characters' tones.

Task-specific metrics - CORRECT ANSWER ✔✔✔ Metrics like BLEU and ROUGE, used for summarization or translation tasks but poorly suited to nuanced language.

Examples of task-specific metrics - CORRECT ANSWER ✔✔✔ Using ROUGE or BLEU for summarization tasks.

Research Benchmark - CORRECT ANSWER ✔✔✔ Large sets of Q&A covering many topics for model evaluation.

LLM self-evaluation - CORRECT ANSWER ✔✔✔ Fast and easy to implement but expensive; useful when evaluation is simpler than the task.

Human evaluation - CORRECT ANSWER ✔✔✔ Most reliable but slow and expensive, especially with expert evaluators.

Crowdsourced evaluations - CORRECT ANSWER ✔✔✔ Provide a ranking of general skills but are less useful for task-specific model selection.

Expert linguists - CORRECT ANSWER ✔✔✔ Review translation quality in human evaluations.

Platforms for evaluations - CORRECT ANSWER ✔✔✔ Examples include LMSYS Chatbot Arena.
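As a rough sketch of the Self-Attention card, the code below computes scaled dot-product attention over a few toy word vectors. It is a simplification under stated assumptions: the same vectors serve as queries, keys, and values (a real transformer applies learned projections first), and no positional encoding is added.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Return (attention weights, context vectors) for a list of word vectors."""
    d = len(embeddings[0])
    weights = []
    for query in embeddings:
        # Similarity of this word to every word, scaled by sqrt(dimension).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in embeddings
        ]
        weights.append(softmax(scores))
    # Each output is a weighted average of all word vectors.
    outputs = [
        [sum(w * value[j] for w, value in zip(row, embeddings)) for j in range(d)]
        for row in weights
    ]
    return weights, outputs
```

Each row of `weights` says how much one word attends to every other word, which is how the surrounding context gets mixed into that word's representation.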
