Data Mining & Machine Learning at Purdue University: Lecture Notes

These lecture notes give an overview of the CS37300 course on Data Mining & Machine Learning at Purdue University. They cover the syllabus, textbooks, workload, exams, computing resources, and Python resources. They also explain the machine learning process and the elements of data mining and machine learning algorithms, and give an example of survival bias.

Data Mining & Machine Learning
CS37300
Purdue University
August 23, 2022
Bruno Ribeiro


Course overview

Topics

  • Elements of data science algorithms
    • Machine Learning
    • Data Mining
    • Statistics
  • Statistical basics and background
  • Data preparation and exploration
  • Predictive modeling
  • Methodology, evaluation
  • Descriptive modeling

Syllabus / Logistics

  • The syllabus and ALL necessary information (slides, notes, links) will be posted on our website:
    • https://www.cs.purdue.edu/homes/ribeirob/courses/Fall2022/

Workload

  • Homework (6 theory + programming assignments)
    • Six assignments combining written/math exercises and programming assignments in Python
    • Python is an important language to learn for data mining, data science, and machine learning
  • Late policy: no late homework (grade = zero after the deadline)
    • Submission on Gradescope
    • Firm deadlines (6:00pm), with no late penalty until 6:00am the next day
  • The lowest homework score will be dropped from the average
    • Do not skip a homework early: save the drop for emergencies
  • Exams
    • Midterm and final exam

Grading

  • Grades will be posted on Brightspace: https://purdue.brightspace.com/d2l/home/599255
  • Attendance: 5%
  • ML Competition (Kaggle Competition): up to +5% (extra credit)
  • Homework: 45% (the lowest homework grade will be dropped from the average)
    • A homework missed because of a serious and documented medical or family emergency will automatically be counted as the dropped zero grade (i.e., discarded from the average). Additional extensions (beyond one missed homework) will be granted if the documented emergency persists for 2+ homeworks.
    • Students are advised not to drop a homework for non-emergency reasons: if an emergency then happens, the student will have two zero grades, and one of them will count towards the average.
  • Midterm: 20%
  • Final exam: 30%
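
The weighting above can be made concrete with a short sketch. It assumes that every score is a percentage between 0 and 100, that all homeworks carry equal weight, and that the extra credit is simply added to the weighted total. The function name `course_grade` and these conventions are illustrative assumptions, not the official grade computation.

```python
def course_grade(homeworks, midterm, final, attendance, kaggle_bonus=0.0):
    """Weighted course grade in percent (illustrative assumptions only).

    homeworks: list of homework scores (0-100); the lowest one is dropped.
    midterm, final, attendance: scores in 0-100.
    kaggle_bonus: extra credit in percentage points, capped at 5.
    """
    if len(homeworks) < 2:
        raise ValueError("need at least two homework scores to drop one")
    kept = sorted(homeworks)[1:]              # drop the single lowest score
    hw_avg = sum(kept) / len(kept)
    bonus = min(max(kaggle_bonus, 0.0), 5.0)  # extra credit is "up to +5%"
    return (0.45 * hw_avg + 0.20 * midterm + 0.30 * final
            + 0.05 * attendance + bonus)


# One missed homework: the zero is dropped and does not hurt the average.
one_zero = course_grade([90, 90, 90, 90, 90, 0],
                        midterm=80, final=80, attendance=100)

# Two missed homeworks: only one zero is dropped, the other one counts.
two_zeros = course_grade([90, 90, 90, 90, 0, 0],
                         midterm=80, final=80, attendance=100)
```

Under these assumptions the homework average falls from 90 to 72 when a second zero is recorded, which costs 0.45 × 18 = 8.1 percentage points of the final grade. This is the reason for saving the dropped homework for emergencies.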

Computing Resources

  • Scholar Cluster
    • Software needed and cluster usage manual: https://www.cs.purdue.edu/homes/ribeirob/courses/Fall2022/howto/cluster-how-to.html

Course introduction

Machine Learning

  • Machine learning: How can we build computer systems that automatically improve with experience? (Mitchell 2006)

[Figure: related fields: Databases, Artificial Intelligence, Visualization, Statistics]

[Figure: The machine learning process. Data → Selection → Target data → Preprocessing → Processed data → Learning → Patterns → Interpretation/evaluation → Knowledge]

Machine Learning Process

  1. Application setup:
    • Acquire relevant domain knowledge
    • Assess user goals
  2. Data selection:
    • Choose data sources
    • Identify relevant attributes
    • Sample data
  3. Data preprocessing:
    • Remove noise or outliers
    • Handle missing values
    • Account for time or other changes
  4. Data transformation:
    • Find useful features
    • Reduce dimensionality
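
The preprocessing and transformation steps can be sketched on a toy dataset using only the Python standard library. Everything here is an assumption made for illustration: the field names `age` and `income`, the made-up values, the choice of median imputation for missing values, the median-absolute-deviation rule for outliers, and min-max scaling as a stand-in for finding useful features. Dimensionality reduction is not covered.

```python
from statistics import median

# Toy records (made-up values); None marks a missing value.
records = [
    {"age": 23, "income": 41_000},
    {"age": 35, "income": None},
    {"age": 29, "income": 52_000},
    {"age": 41, "income": 58_000},
    {"age": 38, "income": 1_000_000},   # implausibly large: treated as an outlier
    {"age": None, "income": 47_000},
]


def impute_median(rows, field):
    """Handle missing values: replace None with the median of observed values."""
    fill = median(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


def drop_outliers(rows, field, k=5.0):
    """Remove outliers: keep rows within k median absolute deviations of the median."""
    values = [r[field] for r in rows]
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return [r for r in rows if abs(r[field] - center) <= k * mad]


def min_max_scale(rows, field):
    """Rescale a field to the [0, 1] range so features share a common scale."""
    values = [r[field] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1               # avoid division by zero for constant fields
    return [{**r, field: (r[field] - lo) / span} for r in rows]


clean = impute_median(records, "age")
clean = impute_median(clean, "income")
clean = drop_outliers(clean, "income")
clean = min_max_scale(clean, "age")
clean = min_max_scale(clean, "income")
```

The order matters in this sketch: imputation runs first so that the outlier rule sees no missing values, and scaling runs last so that the outlier does not compress the remaining values into a narrow range.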

Machine Learning Process

  • Data representation: Describe the data
  • Task specification: Outline the goal(s)
  • Knowledge representation: Describe the rules
  • Learning technique:
    • Search: Identify a rule
    • Evaluation function: Estimate confidence
  • Prediction technique: Apply the rule
  • Data mining system: Do above in combination
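
One way to see how these elements fit together is a tiny threshold classifier (a decision stump). The sketch below uses made-up data and hypothetical function names, and it interprets the evaluation function as plain training accuracy, which is only one of many possible choices.

```python
# Data representation: each example is (feature value, binary label).
# The values below are made up for illustration.
train = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 1), (5.0, 1), (6.0, 1), (3.5, 1)]

# Task specification: predict the label (0 or 1) of a new feature value.


# Knowledge representation: a rule "predict 1 if x > threshold, else 0".
def predict(threshold, x):
    """Prediction technique: apply the rule to one feature value."""
    return 1 if x > threshold else 0


def accuracy(threshold, data):
    """Evaluation function: fraction of examples the rule labels correctly."""
    return sum(predict(threshold, x) == y for x, y in data) / len(data)


def learn(data):
    """Search: try midpoints between consecutive distinct values, keep the best."""
    xs = sorted({x for x, _ in data})
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return max(candidates, key=lambda t: accuracy(t, data))


# Data mining system: run the pieces above in combination.
threshold = learn(train)
train_accuracy = accuracy(threshold, train)
new_prediction = predict(threshold, 4.5)
```

Accuracy measured on the training data tends to be optimistic, so a real system would estimate confidence on data that was not used during the search.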

Complexities

  • Data size: vastly larger or changing rapidly
  • Data representation: can affect ability to learn and interpret models
  • Knowledge representation: needs to capture more subtle forms of probabilistic dependence

  • Search space: vastly larger
  • Evaluation functions: difficult to assess confidence in model utility
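
To give a feel for how quickly a search space grows, consider a deliberately simple knowledge representation: conjunctive rules over d binary attributes, where each attribute is required to be true, required to be false, or ignored. There are 3^d such rules. This representation and the function name `conjunctive_rules` are illustrative assumptions, not taken from the slides.

```python
from itertools import product


def conjunctive_rules(d):
    """Enumerate every conjunctive rule over d binary attributes.

    Each attribute is constrained to True, constrained to False,
    or left unconstrained (None), so there are 3**d rules in total.
    """
    return list(product((True, False, None), repeat=d))


for d in (2, 5, 10):
    # The enumerated count agrees with the closed form 3**d.
    print(d, len(conjunctive_rules(d)), 3 ** d)

# Enumeration stops being practical long before d reaches realistic sizes,
# so for larger d the count is computed directly instead.
rules_for_30_attributes = 3 ** 30
```

Even this restricted rule language has more than 200 trillion candidates at 30 attributes, so exhaustive search is rarely an option and learning techniques rely on heuristic or greedy search instead.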