

Data Types and Machine Learning Models in Data Science
Typology: Exams


Tabular data - CORRECT ANSWER: Rows represent observations and columns represent variables. This is considered traditional social science data. Example: Employee records with variables like age, salary, and job title.

Text data - CORRECT ANSWER: The data is text rather than numbers. It must be converted into a numeric (tabular) form for processing. Tokenization techniques (e.g., the GPT-3 tokenizer, tiktoken, byte pair encoding) are used for this purpose. Example: Analyzing employee emails for sentiment or topic classification.

Image data - CORRECT ANSWER: AI processes images as matrices of pixel values. An image consists of a matrix of numbers between 0 and 255 (for 8-bit images), each indicating a pixel's intensity. Example: Identifying defective products using machine vision on factory lines.

Audio data - CORRECT ANSWER: Audio data is often converted to arrays based on the waveform, quantifying its shape as numbers. Concepts related to audio include pitch (measured in hertz), amplitude (loudness or quietness), and timbre (the characteristic quality of a sound). Example: Transcribing customer service calls for analysis.

Supervised machine learning - CORRECT ANSWER: Involves training an algorithm on a labeled dataset to make predictions or decisions. Input data (features only) is paired with output data (labels or targets), and the algorithm learns the patterns that link input to output. Accuracy, the proportion of correctly classified instances, is a primary metric for classification. Example: Email spam detection where emails are labeled as spam or not spam.

Unsupervised machine learning - CORRECT ANSWER: The algorithm is trained on an unlabeled dataset and identifies patterns or relationships without guidance. It might cluster similar data points based on shared features, and it can also reduce data dimensionality. Evaluation is challenging due to the absence of a 'ground truth'. Example: Customer segmentation for targeted marketing campaigns.
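
The accuracy metric named in the supervised machine learning card can be illustrated with a minimal sketch. The function name `accuracy` and the four email labels are hypothetical, chosen only to mirror the spam detection example; they do not come from the source.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    # Count positions where the prediction equals the true label.
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical labels for four emails (illustrative data, not from the source).
true_labels = ["spam", "not spam", "spam", "not spam"]
predicted_labels = ["spam", "not spam", "not spam", "not spam"]

# Three of the four predictions match, so the accuracy is 0.75.
print(accuracy(true_labels, predicted_labels))
```

Accuracy can mislead when one class is much more common than the other, so in practice it is usually read alongside other metrics.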
Reinforcement learning - CORRECT ANSWER: An agent interacts with an environment to learn optimal decisions. The agent receives rewards or penalties based on its actions and updates its model to learn which actions are beneficial. The goal is to maximize total reward. Total reward during training is often used for evaluation, along with benchmarking against a baseline. Example: Training a robot to navigate a warehouse while avoiding obstacles.

Generative AI - CORRECT ANSWER: Models that generate new, useful examples of data.

Large Language Models (LLM) - CORRECT ANSWER: Models that require very large amounts of training data (on the order of billions) and have billions of parameters.

Generative Pre-trained Transformer (GPT) - CORRECT ANSWER: The name of the algorithm underlying ChatGPT.
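
The reinforcement learning card mentions evaluating by total reward and benchmarking against a baseline. The sketch below illustrates only that evaluation step, not learning itself. Everything in it is an assumption made for illustration: a five-cell corridor stands in for the warehouse, the reward values (10 at the goal, -1 per other step) are arbitrary, and the two hand-written policies are hypothetical.

```python
GOAL = 4        # rightmost cell of a hypothetical 5-cell corridor (cells 0..4)
MAX_STEPS = 20  # episode length limit (assumed value)


def step(position, action):
    """Apply an action (+1 = right, -1 = left); return (new_position, reward, done)."""
    new_position = min(max(position + action, 0), GOAL)
    if new_position == GOAL:
        return new_position, 10, True   # reaching the goal ends the episode
    return new_position, -1, False      # every other step costs 1


def total_reward(policy):
    """Run one episode from cell 0 and return the sum of rewards received."""
    position, total = 0, 0
    for t in range(MAX_STEPS):
        action = policy(position, t)
        position, reward, done = step(position, action)
        total += reward
        if done:
            break
    return total


def always_right(position, t):
    """Candidate policy: always move toward the goal."""
    return 1


def alternate(position, t):
    """Baseline policy: move right on even steps and left on odd steps."""
    return 1 if t % 2 == 0 else -1


print("candidate:", total_reward(always_right))  # -1 - 1 - 1 + 10 = 7
print("baseline: ", total_reward(alternate))     # never reaches the goal: -20
```

A learning agent would adjust its policy between episodes to push its total reward above the baseline; here the policies are fixed so the comparison stays deterministic.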
Positional Encoding - CORRECT ANSWER: Scheme for maintaining word order in the model.

Self-Attention - CORRECT ANSWER: Mechanism that weights which words the model should pay attention to, in order to provide context.

Data Type for ChatGPT - CORRECT ANSWER: Tokenized (tiktoken) text data.

Algorithms for ChatGPT - CORRECT ANSWER: Includes supervised learning (deep neural networks using attention and transformers), unsupervised learning (word embeddings), and reinforcement learning.

Application for ChatGPT - CORRECT ANSWER: Web-based chatbot application.

Pre-training - CORRECT ANSWER: The model is pre-trained on general language using diverse text data unrelated to specific tasks.

Instruction fine-tuning - CORRECT ANSWER: Fine-tuning the pre-trained model on targeted examples, e.g., full scripts of Friends to capture character-specific language.

RLHF - CORRECT ANSWER: Reinforcement learning from human feedback. Trained evaluators interact with the LLM and rate its output, e.g., how well dialogue matches character tones.

Task-specific metrics - CORRECT ANSWER: Metrics like BLEU and ROUGE, used for summarization or translation tasks but not suited to nuanced language.

Examples of task-specific metrics - CORRECT ANSWER: Using ROUGE or BLEU for summarization tasks.

Research Benchmark - CORRECT ANSWER: Large sets of Q&A covering many topics, used for model evaluation.

LLM self-evaluation - CORRECT ANSWER: Fast and easy to implement but can be expensive to run; useful when evaluation is simpler than the task.

Human evaluation - CORRECT ANSWER: Most reliable but slow and expensive, especially with expert evaluators.

Crowdsourced evaluations - CORRECT ANSWER: Provide a ranking of general skills but are less useful for task-specific model selection.

Expert linguists - CORRECT ANSWER: Review translation quality in human evaluations.

Platforms for evaluations - CORRECT ANSWER: Examples include the LMSYS Chatbot Arena.
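
To make the task-specific metrics cards more concrete, here is a minimal sketch of ROUGE-1, the unigram-overlap variant of ROUGE. It assumes lowercase whitespace tokenization and omits the stemming and other variants (such as ROUGE-L) that full implementations provide. The function name `rouge1` and the example sentences are hypothetical.

```python
from collections import Counter


def rouge1(reference, candidate):
    """Return (recall, precision) of unigram overlap between two texts."""
    # Assumption: lowercase whitespace tokenization, no stemming.
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per word (clipped overlap).
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / sum(ref_counts.values()) if ref_counts else 0.0
    precision = overlap / sum(cand_counts.values()) if cand_counts else 0.0
    return recall, precision


reference = "the cat sat on the mat"  # hypothetical human-written summary
candidate = "the cat sat"             # hypothetical model output

recall, precision = rouge1(reference, candidate)
print(recall)     # 3 of the 6 reference words are covered -> 0.5
print(precision)  # all 3 candidate words appear in the reference -> 1.0
```

Because the score only counts shared words, a candidate can score well while missing tone or meaning, which matches the cards' point that these metrics do not capture nuanced language.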