













































An overview of the CS102 course on working with data. It highlights the explosion in data-driven scientific discovery, business practices, medicine, education, politics, and societal interventions, and the growing ability to collect data across many domains. The document covers the promises of working with data, data tools and techniques, pitfalls in working with data, and data systems and platforms, including basic data manipulation and analysis, data mining, and machine learning. It also discusses using data to build models and make predictions, with examples of data analysis and data mining techniques.
Typology: Lecture notes














































§ Explosion in data-driven scientific discovery, business practices, medicine, education, politics, societal interventions, …
§ And it’s just the beginning
    Ø Ability to collect data across many domains will continue to accelerate
    Ø Data analysis techniques will continue to improve

“Data is the oil of the 21st century”

(1) Collect data
    Via computers, sensors, people, events, …
(2) Do something with it
    Make decisions, confirm hypotheses, gain insights, predict future, …

“Data Science” = Going from (1) to (2)

§ Promises of working with data
    Applications and services
§ Data tools and techniques
    Database management systems
    Data mining and machine learning
§ Pitfalls in working with data
    Correlation and causation
    Underfitting and overfitting
    Privacy and a few others
§ Data systems and platforms

44,000 sensors, over 2 billion measurements
Physical, chemical, biological …
(1) Collect and curate data
(2) Do something with it

§ Weather prediction
§ Medical diagnosis
§ Financial markets
§ Resource management
§ Computational social science
§ Smart buildings and cities
§ The list goes on and on, and it’s still early days

§ Basic Data Manipulation and Analysis
    Performing well-defined computations or asking well-defined questions (“queries”)
§ Data Mining
    Looking for patterns in data
§ Machine Learning
    Using data to build models and make predictions
§ Data Visualization
    Graphical depiction of data
§ Data Collection and Preparation

Basic Data Manipulation and Analysis
Performing well-defined computations or asking well-defined questions (“queries”)
§ Average January low temperature for each country over last 20 years
§ Number of items over $100 bought by females between ages 20 and 30
§ Frequency of specific medicine relieving specific symptoms
§ The ten stocks whose price varied the most over the past year
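
Two of the queries above can be sketched in a few lines of Python. This is only an illustration, not the tooling used in the course: the record layouts, the sample values, and the function names `count_expensive_items` and `average_low_by_country` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical purchase records: (buyer_gender, buyer_age, item_price_usd)
purchases = [
    ("F", 25, 120.0),
    ("F", 31, 250.0),
    ("M", 22, 150.0),
    ("F", 20, 99.0),
    ("F", 29, 101.5),
]

def count_expensive_items(records, gender="F", min_age=20, max_age=30, min_price=100.0):
    """Count items priced above min_price bought by the given gender and age range."""
    return sum(
        1
        for g, age, price in records
        if g == gender and min_age <= age <= max_age and price > min_price
    )

# Hypothetical January low temperatures: (country, year, temp_celsius)
january_lows = [
    ("A", 2020, -3.0),
    ("A", 2021, -5.0),
    ("B", 2020, 10.0),
    ("B", 2021, 12.0),
]

def average_low_by_country(records):
    """Group records by country and return the mean temperature per country."""
    totals = defaultdict(lambda: [0.0, 0])  # country -> [running sum, count]
    for country, _year, temp in records:
        totals[country][0] += temp
        totals[country][1] += 1
    return {country: s / n for country, (s, n) in totals.items()}

print(count_expensive_items(purchases))      # 2
print(average_low_by_country(january_lows))  # {'A': -4.0, 'B': 11.0}
```

Both functions answer a precisely stated question by filtering, grouping, and aggregating the records that are already present, which is what distinguishes a query from pattern discovery or prediction.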

Data Mining
Looking for patterns in data
§ Items X,Y,Z are bought together frequently
§ People who like movie X also like movie Y
§ Patients who respond well to medicines X and Y also respond well to medicine Z
§ Students going to the same university are frequently online friends
§ Wealthier people are moving from cities to suburbs
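
The first pattern above ("items X, Y, Z are bought together frequently") can be illustrated by counting how often each group of items appears in the same basket. The sketch below is a brute-force count over hypothetical baskets. The function name `frequent_itemsets` and the `min_support` threshold are assumptions made for illustration, and this is not necessarily the algorithm presented in the course.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets; each set holds the items bought together once.
baskets = [
    {"X", "Y", "Z"},
    {"X", "Y"},
    {"X", "Y", "Z", "W"},
    {"W"},
]

def frequent_itemsets(baskets, size, min_support):
    """Return every item group of the given size that occurs in at least
    min_support baskets, mapped to the number of baskets containing it."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes each group a canonical tuple, e.g. ('X', 'Y', 'Z').
        for combo in combinations(sorted(basket), size):
            counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}

print(frequent_itemsets(baskets, size=3, min_support=2))  # {('X', 'Y', 'Z'): 2}
print(frequent_itemsets(baskets, size=2, min_support=3))  # {('X', 'Y'): 3}
```

Unlike a query, the question here does not name the items in advance: the code enumerates all candidate groups and lets the data show which ones occur together often.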

Machine Learning
Using data to build models and make predictions
§ Customers who are women over age 20 are likely to respond to an advertisement
§ Students with good grades are predicted to do well on the SAT
§ The temperature of a city can be estimated as the average of its nearby cities, unless some of the cities are on the coast or in the mountains

Roughly: Basic data analysis and data mining give answers from the available data, while machine learning uses the available data to make predictions about missing or future data

§ Regression
§ Classification
§ Clustering
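
As a rough illustration of "using available data to predict missing data", the sketch below shows two tiny models in Python. The first estimates a city's temperature as the plain average of its neighbors, as in the example above. The second fits a straight line by ordinary least squares, one simple form of regression. All numbers, and the names `estimate_from_neighbors`, `fit_line`, and `predict`, are hypothetical. The grade and score values are made up to show the mechanics and say nothing about real students.

```python
from statistics import mean

def estimate_from_neighbors(neighbor_temps):
    """Estimate a city's temperature as the average of nearby cities' temperatures.
    Assumes none of the neighbors is coastal or mountainous (the caveat in the text)."""
    return mean(neighbor_temps)

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept for one input variable."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input value."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical temperatures (Celsius) of two nearby inland cities.
print(estimate_from_neighbors([10.0, 14.0]))  # 12.0

# Hypothetical training data: grade averages and test scores.
grades = [2.0, 3.0, 4.0]
scores = [1000.0, 1200.0, 1400.0]
model = fit_line(grades, scores)
print(predict(model, 3.5))  # 1300.0
```

In both cases the output is a value that is not in the data: the temperature of an unmeasured city, or the score for a grade average that was never observed. That is the contrast with the query and data mining examples, which only report on records that exist.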