Data Mining & Machine Learning
CS 37300
Purdue University
August 23, 2022
Bruno Ribeiro
Course overview
Topics
- Elements of data science algorithms
- Machine Learning
- Data Mining
- Statistics
- Statistical basics and background
- Data preparation and exploration
- Predictive modeling
- Methodology, evaluation
- Descriptive modeling
Syllabus / Logistics
- Syllabus and ALL necessary information (slides, notes, links) will be posted on
our website
- https://www.cs.purdue.edu/homes/ribeirob/courses/Fall2022/
Workload
- Homeworks (6 theory + programming assignments)
- Six assignments combining written/math exercises and programming
assignments in Python
- Python is an important language to learn in data mining, data
science, and machine learning
- Late policy: No Late Homework (Grade = zero after deadline)
- Submission on Gradescope
- Firm deadlines (6:00pm), with a grace period until 6:00am the next day during which no late penalty applies
- Lowest homework score will be dropped from the average
- Do not skip a homework early: Save for emergencies
- Exams
- Midterm and final exam
Grading
Brightspace: https://purdue.brightspace.com/d2l/home/599255
- Attendance: 5%
- ML Competition (Kaggle Competition): up to +5% (extra credit)
- Homework: 45% (the lowest homework grade is dropped from the
average)
- A homework missed because of a serious, documented medical or family
emergency is automatically counted as the zero grade that is dropped
(i.e., it is discarded from the average).
Additional extensions (beyond one missed homework) will be granted if
the documented emergency persists for 2+ homeworks.
- Students are advised not to skip a homework for non-emergency
reasons: if an emergency then happens, the student will have two zero
grades, and one of them will count towards the average.
- Midterm: 20%
- Final exam: 30%
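The weights above can be turned into a small calculation. The following is only a sketch under stated assumptions: the function names `homework_average` and `course_grade` are hypothetical, all scores are assumed to be on a 0-100 scale, and the Kaggle extra credit is assumed to be added as percentage points (capped at 5) on top of the weighted total. The slides do not specify how the extra credit is applied.

```python
def homework_average(homeworks):
    """Average of the homework scores after dropping the single lowest one."""
    if len(homeworks) < 2:
        raise ValueError("need at least two homework scores to drop one")
    kept = sorted(homeworks)[1:]  # drop the lowest score
    return sum(kept) / len(kept)


def course_grade(homeworks, midterm, final, attendance, extra_credit=0.0):
    """Weighted course grade in percent (assumes every input is in [0, 100])."""
    # Assumption: extra credit is additive and capped at +5 points.
    bonus = min(max(extra_credit, 0.0), 5.0)
    return (0.05 * attendance
            + 0.45 * homework_average(homeworks)
            + 0.20 * midterm
            + 0.30 * final
            + bonus)
```

Under these assumptions, a student with four perfect homeworks and a single zero keeps a homework average of 100, while a second zero pulls it down to 80. This is the reason for saving the dropped homework for emergencies.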
Computing Resources
Software needed and cluster usage manual
https://www.cs.purdue.edu/homes/ribeirob/courses/Fall2022/howto/cluster-how-to.html
Course introduction
Machine Learning
- Machine learning : How can we build computer systems that
automatically improve with experience? (Mitchell 2006)
[Figure: machine learning and related fields: databases, artificial intelligence, visualization, statistics]
[Figure: Data -> Selection -> Target data -> Preprocessing -> Processed data -> Learning -> Patterns -> Interpretation/evaluation -> Knowledge]
The machine learning process
Machine Learning Process
- Application setup:
knowledge
- Data selection
- Choose data sources
- Identify relevant attributes
- Sample data
- Data preprocessing
- Remove noise or outliers
- Handle missing values
- Account for time or other
changes
- Data transformation
- Find useful features
- Reduce dimensionality
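The preprocessing and transformation steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the course's prescribed method: the helper names are hypothetical, and the specific choices (median imputation, an interquartile-range rule for outliers, dropping constant columns as a crude form of dimensionality reduction) are assumptions picked for brevity.

```python
import statistics


def impute_median(rows):
    """Handle missing values: replace None with the column median."""
    cols = list(zip(*rows))
    medians = [statistics.median(v for v in col if v is not None) for col in cols]
    return [[m if v is None else v for v, m in zip(row, medians)] for row in rows]


def drop_outliers(rows, col, k=1.5):
    """Remove outliers: keep rows within k * IQR of the quartiles of column `col`."""
    values = [row[col] for row in rows]
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [row for row in rows if low <= row[col] <= high]


def drop_constant_columns(rows):
    """Reduce dimensionality (crudely): drop columns whose variance is zero."""
    cols = list(zip(*rows))
    keep = [i for i, col in enumerate(cols) if statistics.pvariance(col) > 0]
    return [[row[i] for i in keep] for row in rows], keep


# Toy data: a missing value, one extreme row, and a constant middle column.
raw = [
    [1.0, 5.0, 10.0],
    [2.0, 5.0, None],
    [3.0, 5.0, 12.0],
    [4.0, 5.0, 11.0],
    [100.0, 5.0, 13.0],
]
clean, kept_columns = drop_constant_columns(drop_outliers(impute_median(raw), col=0))
```

The order of the steps matters: here the median is computed before the extreme row is removed, so the imputed value still reflects that row. A real pipeline would have to decide this deliberately.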
Machine Learning Process
- Data representation: Describe the data
- Task specification: Outline the goal(s)
- Knowledge representation: Describe the rules
- Learning technique:
- Search: Identify a rule
- Evaluation function: Estimate confidence
- Prediction technique: Apply the rule
- Data mining system: Do above in combination
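To make the components above concrete, here is a minimal sketch that maps each one onto a one-feature threshold classifier (a "decision stump"). The choice of model and the names `predict`, `accuracy`, and `learn_stump` are assumptions made for illustration. Data representation: a list of numbers with 0/1 labels. Knowledge representation: a pair `(threshold, label_above)`. Evaluation function: training accuracy. Search: exhaustive enumeration of candidate rules.

```python
def predict(rule, x):
    """Prediction technique: apply the rule (threshold, label_above) to x."""
    threshold, label_above = rule
    return label_above if x > threshold else 1 - label_above


def accuracy(rule, xs, ys):
    """Evaluation function: fraction of examples the rule labels correctly."""
    return sum(predict(rule, x) == y for x, y in zip(xs, ys)) / len(xs)


def learn_stump(xs, ys):
    """Search: enumerate candidate rules and return the best-scoring one."""
    points = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    thresholds = [(a + b) / 2 for a, b in zip(points, points[1:])]
    candidates = [(t, label) for t in thresholds for label in (0, 1)]
    if not candidates:
        raise ValueError("need at least two distinct feature values")
    return max(candidates, key=lambda rule: accuracy(rule, xs, ys))


# Toy task: predict a binary label from a single numeric attribute.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 1, 1]
rule = learn_stump(xs, ys)
```

Swapping any one component (a richer rule language, a different score, a greedy instead of exhaustive search) yields a different learning algorithm, which is the sense in which a data mining system is these pieces in combination.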
Complexities
- Data size: vastly larger or changing rapidly
- Data representation: can affect ability to learn and interpret models
- Knowledge representation: needs to capture more subtle forms of
probabilistic dependence
- Search space: vastly larger
- Evaluation functions: difficult to assess confidence in model utility
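The growth of the search space can be illustrated by counting candidate rules over d binary attributes. The three rule families below are chosen for illustration and are not taken from the slides, and the function name is hypothetical. The counts themselves are standard combinatorics.

```python
def hypothesis_space_sizes(d):
    """Number of distinct rules over d binary attributes, per rule family."""
    return {
        # Test one attribute for value 0 or value 1.
        "single-attribute tests": 2 * d,
        # Each attribute is required to be 1, required to be 0, or ignored.
        "conjunctions": 3 ** d,
        # One output bit for each of the 2**d possible inputs.
        "all boolean functions": 2 ** (2 ** d),
    }


for d in (3, 5, 10):
    sizes = hypothesis_space_sizes(d)
    print(d, sizes["single-attribute tests"], sizes["conjunctions"],
          len(str(sizes["all boolean functions"])), "digits")
```

Richer knowledge representations therefore make exhaustive search infeasible very quickly, which is why practical learners rely on heuristic or greedy search.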