Introduction to Data Science, Exams of Advanced Education

An overview of the field of data science, covering key concepts such as anomaly detection, APIs, artificial intelligence, big data, data analysis, data analytics, data engineering, data preparation, data science, data-driven decision making, descriptive analyses, dimensionality reduction, distributions, heat maps, programming languages for data science, machine learning, predictive models, project management, quantitative analysis, data scraping, and self-generated data. It discusses the data science pathway, including planning, wrangling, modeling, and applying data science techniques. The document also touches on the importance of research ethics, the use of tools and applications for data science, and the role of training and testing data in building predictive models. Overall, this document serves as a comprehensive introduction to the field of data science, highlighting its key components, methodologies, and applications.

Typology: Exams

2023/2024

Available from 10/22/2024

Examproff

Data Science Foundation Fundamentals
with Complete Solutions
According to the example calculation in the video, what information do you have to have
in order to use calculus? - ANSWER-a function that describes the relationship between
price and sales
Feedback
In order to use calculus to find the best price for maximizing revenue, you must first
have a formula that says how sales are related to price.
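The feedback above can be made concrete with a minimal sketch. It assumes a hypothetical linear relationship between price and sales (sales = a - b * price); the coefficients a and b are invented for illustration and do not come from the video.

```python
# Hypothetical demand curve: sales = a - b * price.
# The values of a and b are assumptions made up for this illustration.
def sales(price, a=1000.0, b=20.0):
    return a - b * price


def revenue(price, a=1000.0, b=20.0):
    # Revenue is price times the number of units sold at that price.
    return price * sales(price, a, b)


def best_price(a=1000.0, b=20.0):
    # revenue(p) = a*p - b*p**2, so d(revenue)/dp = a - 2*b*p.
    # Setting the derivative to zero gives p = a / (2*b).
    return a / (2.0 * b)


p = best_price()
print("best price:", p, "revenue at that price:", revenue(p))
```

Without the function linking sales to price there is nothing to differentiate, which is why the formula has to come first.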
Actionable Insights: - ANSWER-Data and data science are for doing. Focus on things that
are controllable (specific) and practical (ROI?): the impact must be large enough to
justify the effort. You want to build up: have sequential steps.
Agency of Algorithms and Decision Makers - ANSWER-Recommendations: the algorithm
proposes and you can accept or reject. "Based on your shopping patterns, may we suggest
XYZ"; "based on what you've read, you may like this." Your own past behavior is used to
give you recommendations.
Human in the Loop: humans make and implement decisions. Self-driving cars. You are there
if needed to intervene or make the final decision.
Human Accessible: the algorithm makes the decision, but you need to be able to understand
how it reached the decision. Online mortgage applications.
Machine-Centric: machine talks to other machines. Smart watch talks to phone. It's the
Internet of Things.
Aggregating Models - ANSWER-Any one guess may be high or low. When you
combine (central limit theorem) several different models, the errors tend to cancel out
and you end up with a composite estimate that's generally closer to the true value.
Takes extra time and effort but gives you multiple perspectives, compensating for
weaknesses and building on strengths. You can find the signal amid the noise. More stable.
Many eyes on the same problem.
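As a rough illustration of errors cancelling out, the sketch below averages several noisy estimates of one quantity. The true value, the noise level, and the number of "models" are all assumptions chosen for the example; real models would not be simple random draws.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

TRUE_VALUE = 100.0  # hypothetical quantity the models try to estimate

# Five hypothetical models, each off by its own independent random error.
estimates = [TRUE_VALUE + random.gauss(0, 10) for _ in range(5)]

# How far each individual model is from the truth.
individual_errors = [abs(e - TRUE_VALUE) for e in estimates]

# The composite estimate is the plain average of the individual estimates.
composite = mean(estimates)
composite_error = abs(composite - TRUE_VALUE)

print("average individual error:", mean(individual_errors))
print("composite error:", composite_error)
```

The composite error can never exceed the average individual error, and when some models guess high while others guess low it is usually much smaller.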
Analyst - ANSWER-Day to day data tasks
Web analytics, SQL, visualizations.
Good for business decision-making.
Anomaly detection - ANSWER-The process of identifying rare or unexpected items or
events in a data set that do not conform to other items in the data set. This can bring
serendipity: unexpected insights and untapped potential or value.
Finding anomalies: it can be fraud, process failure, or potential value. All have in common:
they are outliers; they don't follow expected patterns.
Regression
Bayesian Analysis
Hierarchical Clustering

Neural Networks
Dealing with rare events - leads to unbalanced models.
Difficult data (biometrics, multimedia)
API: Application Programming Interface - ANSWER-Isn't a source of data but rather a way of sharing data: it can take data from one application to another or from a server to your computer. It's the thing that routes the data, translates it, and gets it ready for use. It allows you to access data and include it in your data science programming. JSON is used here: JavaScript Object Notation (can be used in Python and Java).
Social APIs (Twitter, Facebook)
Utilities (Dropbox, Google)
Commerce (Stripe, Mailchimp, Slack)
It can become a process or an app.
What kind of data can be accessed with APIs? Both proprietary and open data.
Area Chart: - ANSWER-Similar to line charts, except the areas under the lines are filled in.
Artificial Intelligence - ANSWER-Algorithms that learn from data; broadly: machine learning.
Strong or General AI: a replica of the human brain that can solve any cognitive task.
Weak or Narrow AI: algorithms that focus on specific, well-defined tasks.
You can't do AI without data science.
Bar chart: - ANSWER-Shows values over time or for different items; can show a distribution.
Bayes' Theorem - ANSWER-Posterior probability as a function of the likelihood, the prior probability, and the probability of getting the data you found. Used for medical diagnosis: if a person tests positive on a test that is 90% effective, what is the probability the person has the disease?
Big Data: - ANSWER-Defined by unusual volume, velocity, and variety. You can do big data without the full toolkit of data science.
Bubble Charts - ANSWER-A type of scatter plot with circular symbols used to rank your data with bubbles. Can be overlaid on a map.
Business Intelligence - ANSWER-Getting insights to do something better in your business. Emphasizes speed, accessibility, and insight. Often relies on structured dashboards. Data science helps set up BI and makes it possible. Business intelligence gives purpose to data science. Collect and clean data; build model outcomes; find trends and anomalies.
C, C++, Java - ANSWER-General purpose languages for back end and maximum speed. (JSON)
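The medical-diagnosis question in the Bayes' Theorem entry above can be worked through with a short sketch. The entry only says the test is "90% effective"; the 1% base rate of the disease and the 10% false-positive rate used below are assumptions added for illustration.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) by Bayes' Theorem.

    prior               -- P(disease) before testing (the base rate)
    sensitivity         -- P(positive | disease), the likelihood
    false_positive_rate -- P(positive | no disease)
    """
    # Probability of getting the data you found: a positive test overall.
    p_positive = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / p_positive


# Assumed numbers: 1% of people have the disease, the test catches 90% of
# real cases, and it wrongly flags 10% of healthy people.
print(posterior(prior=0.01, sensitivity=0.90, false_positive_rate=0.10))
```

Under these assumed numbers the posterior is about 8%, far below 90%, because healthy people who test positive greatly outnumber sick people who test positive when the disease is rare.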

Creating Data - ANSWER-Natural observation.
Informal discussions
Formal interviews
Surveys (closed-ended questions)
Words > numbers. Be as open ended as possible. Start with the big picture and then narrow it down: start general and move to more specific.
Experiments: A/B testing: two versions of a website, which one is more effective?
D3 - ANSWER-D3.js (Data-Driven Documents) is a JavaScript library for producing dynamic, interactive data visualizations in web browsers. It makes use of Scalable Vector Graphics, HTML5, and Cascading Style Sheets standards. You can create charts and graphs for browsers. Many libraries are available, and you can build on other people's work for your own. NOT so easy to use and learn!
Data Analysis (past) - ANSWER-Predecessor to data analytics: used in science and by statisticians (insurance and finance). Collecting data was difficult, expensive, and slow.
Data Analytics (present) - ANSWER-Spread from science to the business world. Data is inexpensive and readily available. Better tools (Excel, Tableau, R). Easy, inexpensive, and fast.
Data Availability - ANSWER-In House: fastest way to start; restrictions may not apply; may be able to talk to the people who created the data. Issues: not well documented and maintained; may not exist.
Open Data: data that is freely available to the public. Can be government, scientific, and social media data.
Data engineer - ANSWER-Developers, architects
Focus on hardware and software
Data preparation - ANSWER-80% of project time is typically spent on data prep.
Column = variable
Row = case/observation
One sheet per file
Each file has one level of observations (vendor address file, order file)
Tidy Data: each column represents a variable.
Data Science - ANSWER-The skills and techniques for dealing with challenging data. Not mutually exclusive from AI. You can do data science without AI, machine learning, big data, predictive analytics, or prescriptive analytics.
Data science methods can contribute to business intelligence by which tasks? - ANSWER-Data cleaning, data modeling (outcomes), finding trends and anomalies in the data

Data science pathway - ANSWER-
1) Planning: define goals; organize resources (right computers, software); coordinate people; schedule the project.
2) Wrangling: get data; clean data (so it fits into the program); explore (visualizations); refine data.
3) Modeling: create the statistical model; validate it; evaluate the model; refine the model.
4) Applying: present the model; deploy the model; revisit the model (how well is it performing?); archive the assets.
Data visualization can be considered an example of what? - ANSWER-data science without big data
Feedback
Creative data visualization often requires substantial computer programming and mathematical skills, and so can be considered data science, even if it doesn't require all three Vs of big data.
Data-Driven Decision Making (Future) - ANSWER-Democratization of data: you won't have to collect it anymore - it will be available to anyone in the company to test hypotheses and run experiments.
Descriptive Analyses - ANSWER-Used to simplify data into manageable levels. It's like cleaning up the mess in your data to find clarity in the meaning of what you have. Three general steps:
1) Visualize the data - make a graph, bell curve, histogram. Positive skew: the tail is at the end (most of the numbers are at the low end). Negative skew: the tail is at the beginning (most of the numbers are at the high end). It could be U-shaped: most of the data is at the far left or right.
2) Compute univariate descriptive statistics (mean - average or balance point; mode - most common value; median - splits the data into two equal halves).
3) Go to measures of association, or the connection between variables. Measures of variability can be: range - distance from the lowest to the highest number; quartiles or IQR - splits the data into 25% groups; the variance and the standard deviation, both used in statistics. Associations can be shown with scatterplots. Numerical measures can be the correlation coefficient or regression analysis.
Regardless: the data must be representative of the larger group. Must be attentive to outliers; open-ended scores can dramatically affect the data.
Descriptive work - ANSWER-Counting the frequency of topics on social media. Cluster analysis of a customer database.
Deviation - ANSWER-Shows how data points relate to each other and whether a data point differs from the mean, so we can see if it is normal or unusual. Easiest to see in line graphs.
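The univariate statistics and measures of variability listed under Descriptive Analyses above can be computed with Python's standard `statistics` module. This is a minimal sketch; the sample values are made up, with one deliberately large number to show how an outlier pulls the mean above the median.

```python
import statistics

# Hypothetical sample; 15 is a deliberate outlier at the high end.
data = [2, 3, 3, 4, 5, 6, 7, 9, 15]

mean_value = statistics.mean(data)      # average, or balance point
median_value = statistics.median(data)  # splits the data into two halves
mode_value = statistics.mode(data)      # most common value

value_range = max(data) - min(data)     # lowest to highest number
q1, q2, q3 = statistics.quantiles(data, n=4)  # cut points for 25% groups
iqr = q3 - q1                           # spread of the middle 50%
std_dev = statistics.stdev(data)        # sample standard deviation

print("mean:", mean_value, "median:", median_value, "mode:", mode_value)
print("range:", value_range, "IQR:", iqr, "std dev:", round(std_dev, 2))

# A mean above the median hints at positive skew (tail at the high end).
print("positive skew suspected:", mean_value > median_value)
```

Comparing the mean with the median is only a quick hint about skew; plotting a histogram, as step 1 suggests, remains the more reliable check.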

Feature Selection and Creation - ANSWER-A feature is a variable or dimension in the data. You can get data from the features of the data. You can combine these features to create a new feature. (Dimension reduction is often used as part of getting the data ready so you can then start looking at which features to include in the models you're creating.) Methods: correlation; stepwise regression (takes all potential variables and looks at correlations); lasso and ridge regression. You should be able to control the variable; look at the ROI; ask whether it is sensible (does it make sense to select that variable?).
Forms of mathematics - ANSWER-Probability, linear algebra, calculus, and regression.
Choose Procedures: you can choose the procedures by judging the fit between your questions, your data, and your procedure.
Diagnose Problems: know what to do when it fails or gives you impossible results.
Heat Map - ANSWER-A heatmap is a graphical representation of data that uses a system of color-coding to represent different values. Heatmaps are used in various forms of analytics but are most commonly used to show user behaviour on specific webpages or webpage templates.
Heat Maps - ANSWER-Show the rate or density of the data, from high to low. They can range across colors or multiple densities.
How do expert systems mimic the decision-making of experts? - ANSWER-by explicitly listing decisions and outcomes in a logical chain like a flow chart
Feedback
An expert system spells out every step in a decision tree like a flow chart.
Infographics - ANSWER-Adobe Illustrator now has a chart rendering tool.
Interpretability: - ANSWER-First: identify who is going to use the data. If a machine: it is enough that the algorithm works - it doesn't need to understand the principles. If a human: they can take the information and apply it to new information. You are telling a story - make sense of your findings to make recommendations.
Languages for Data Science - ANSWER-Python: most popular for data science and machine learning. General purpose, easy to learn. Works great with large data.
R: programming language specifically for data analysis, popular among scientists and researchers. Works natively.
SQL
Java
Julia
Scala
Matlab
Expand functionality with packages.

Legal issues: - ANSWER-Privacy laws:
GDPR
HIPAA: Health Insurance Portability and Accountability Act
FERPA: Family Educational Rights and Privacy Act
Line Charts - ANSWER-A chart that plots continuously distributed data points to compare trends over time. Place time measurements on the X axis.
Machine Learning - ANSWER-The ability of algorithms to learn from data and improve their function in the future. Memorization is easy; spotting patterns is hard; new situations are challenging. Machine learning really can't be done without data science. Sub-discipline of data science.
Machine Learning Specialists - ANSWER-Extensive work in computer science and mathematics.
Deep Learning
Artificial Intelligence
Math for Data Science: Algebra - ANSWER-Allows you to scale up: your solution should deal efficiently with many instances at once.
Generalize: your solution should apply not just to a few specific cases but to cases that vary in arbitrary ways.
Elementary algebra (linear regression); linear algebra (works with vectors and matrices).
Choose Procedures: know which algorithms will work best with your data to answer your questions.
Resolve Problems: know what to do when things don't go as expected so you respond thoughtfully.
Math for Data Science: Calculus - ANSWER-Involved anytime you are trying to do maximization or minimization. Revenue maximization.
Math for Data Science: Optimization and the combinatorial explosion - ANSWER-Combinatorial explosion: as the number of units and the number of possibilities rise, growth is explosive - it gets out of hand. You can use Excel, calculus, or optimization (linear programming). Using Solver for Excel you can do optimization. It is a way to optimize various combinations to determine how you can get everything done with the best possible revenue outcomes.
MLaaS: Machine Learning as a Service - ANSWER-SaaS: software as a service: making software accessible through the internet instead of using it from your desktop.
MLaaS: a way of making the entire process of data science, machine learning, and artificial intelligence easier to access, easier to set up, and easier to get going. Azure ML, Amazon Machine Learning, IBM Watson. They put the analysis where the data is stored. They give you very flexible computing resources: you can rent hardware as needed. It's a way to democratize the process.

Risk of Disease (Bayesian). Classification of Photos. Correlation.
Work done by Data Science Researchers: predictions that involve difficult data (unstructured); sophisticated models (neural networks).
Can do prediction without data science: clean quantitative data sets; common models.
Predictive analytics focuses on predicting what's likely to happen in the future. On the other hand, prescriptive analytics focuses on which of these? - ANSWER-identifying cause-and-effect relationships in your data
Feedback
Prescriptive analytics focuses on cause-and-effect relationships so you can determine the best actions to bring about your goals.
Predictive Models: - ANSWER-Find and use relevant past data. Model the outcome. Apply to new data. Validate the model against new data.
Useful: predicting whether someone will develop an illness, recover, or pay off a loan.
Two meanings of prediction: One is trying to predict future events, that is, using presently available data to predict something that will happen later, such as using past medical records to predict future health. The other, possibly more common, use is prediction of alternative events, that is, approximating how a human would perform the same task.
Methods:
Classification methods: k-nearest neighbors, nearest centroid classification.
Decision trees: a way of tracking the most influential data in determining where a particular case is going to end up.
Neural networks: a form of machine learning that has proven to be immensely adaptive and powerful.
Regression analysis: gives you an understandable equation to predict a single outcome based on multiple predictor variables. This can be using the amount of time a person spends on your website to predict their purchase volume. They are very flexible with data; they can be flexible models and easy to interpret.
Prescriptive Analytics - ANSWER-Cause-and-effect relationships.
Observed Correlation (the effect is likely when the cause is present). Correlation coefficient.
Temporal Precedence (the cause comes BEFORE the effect).
No other explanation: the connection can't be accounted for by anything else.
Gold Standard: RCT: randomized controlled trial. Difficult to do.
A/B testing: web applications: you have one offer on one version of a website and another offer on another version - which gets more clicks?
What-If simulations: if this is true, then what would we expect?
Optimization Models: if we spend time and money, what will maximize the outcome? Another name for this: mathematical programming.
Cross-lag correlations and quasi-experiments are a useful way to approximate cause-and-effect relationships, as are what-if simulations and optimization models.
Iteration is critical. Test it over and over again.
You can have prescriptive analytics without data science. Causality may be impossible to establish, but prescriptive analytics can get you "close enough."
Project Managers - ANSWER-Manage the project.
Big Picture: frame business-relevant questions.
Must "speak data" - may not be able to do the work themselves.
A data science manager oversees the entire project and helps place it in a business context.
Python - ANSWER-Its best quality is its use for data science. One of the easiest languages to learn, read, clean, and fix. You can use libraries to visualize (matplotlib), though it is dull in its aesthetics. You can use Pandas (data structures and a data analysis tool) and Seaborn (makes it sexy).
GGPlot2: a plotting system requiring minimal code. Very bland.
Bokeh: creates interactive charts and graphs.
Pygal: interactive SVGs with minimal lines of code.
Geoplot
Python & R - ANSWER-Programming languages for data manipulation and modeling
Quantitative Analysts (Quants) - ANSWER-Use data science to scientifically investigate investment hypotheses and build models to predict investment outcomes. Changed the stock market. They have been replaced by high-frequency automated intelligent trading algorithms, which account for most of the trading done on Wall Street now.
Ranking - ANSWER-Two or more variables that show a greater than, less than, or equal to relationship.
Research ethics on gathering your own data - ANSWER-
1) Informed consent: When you're gathering data from people, they need to know what you want from them, and they also need to know what you're going to do with it so they can make an informed decision about whether they want to participate.
2) Privacy: You need to keep identifiers to a minimum. Don't gather information that you don't need, and keep the data confidential and protected.
Researcher - ANSWER-Focus on domain-specific research. Physics and genetics are common. More statistical expertise.
Scatterplot - ANSWER-a graphed cluster of dots, each of which represents the values of two variables. You can see a lot of information to the mix

you at every step of the project pathway, from framing questions, to choosing data and algorithms, and interpreting and applying your results.
Feedback
Goals influence every step of the data science pathway, from planning to wrangling to modeling to applying.
The generation of implicit rules - ANSWER-Implicit rules are rules that help the algorithms function. They are the rules that the algorithms develop by analyzing the test data. And they're implicit because they cannot be easily described to humans.
Time Series - ANSWER-Tracking a data metric over time. Independent variable on an axis.
Tools for Data Science: Applications - ANSWER-Apps: more common, more accessible; good for exploring, good for sharing.
Most common: spreadsheets (universal; Excel, Google Sheets; good for browsing and exporting).
SQL: Structured Query Language: access data stored in databases.
Visualization: Tableau, PowerBI, Qlik: interactive data exploration.
Apps for data analysis (point and click - makes analysis easier for non-specialists to conduct): SPSS, JASP, jamovi. Good for democratizing data.
Let the tools and techniques follow the question.
Trend analysis - ANSWER-Figure out the path your data is on, so you can inform decisions about whether to stay on the current path or whether changes need to be made.
Autocorrelation: today's value is associated with yesterday's. You are looking for consistency in change.
You can have linear growth, exponential growth, logarithmic growth (rate diminishes), sigmoid, or sinusoidal.
Change points are changes in the resting state of the data, and you may look at historical events that can explain those changes. You use R for this.
Decomposition: taking the trend over time and breaking it down into several separate elements.
All start with plotting the dots and connecting them.
Types of Data - ANSWER-Part to Whole, Distribution, Nominal Comparison, Time-Series, Correlation, Ranking, Deviation
Validating Models - ANSWER-Principle: check your work. Will it work with anything else? How?
Training data and testing data: in a dataset, a training set is used to build up a model, while a testing (or validation) set is used to validate the model built. There are two types of testing data: cross-validation splits the training data, using one part to create the model and another to test it; holdout testing data (or holdout validation) is the 20% of the data that you set aside (never looked at or touched) and apply to the model just once to see how it functions.
What is a "posterior probability" in Bayes' Theorem? - ANSWER-A posterior probability is the probability of the cause, such as a disease, given the effect, such as a positive medical test for the disease.
Feedback
Bayes' Theorem combines the probability of a hypothesis (the "prior") with the likelihood of the data given the hypothesis and the base rate of the cause to get the posterior probability, or probability of the hypothesis given the data.
What is a major advantage of understanding the algebra behind data science procedures? - ANSWER-You will better understand how to diagnose problems and respond when things don't work as expected.
Feedback
Data doesn't always match the assumptions and requirements of algorithms, so things can go wrong. Understanding the algebra behind the algorithms can help you respond to problems intelligently.
What is one of the rare qualities that creates such a high demand for data scientists? - ANSWER-the ability to find order, meaning, and value in unstructured data
Feedback
Data scientists are valuable because they are able to find value in unstructured data, but they're also able to predict outcomes and automate processes.
What is the difference between "general AI" and "narrow AI"? - ANSWER-General AI attempts to build a general-purpose thinking machine, while narrow AI focuses on algorithms for specific tasks like translating language.
Feedback
General AI has historically focused on creating machines that can solve any problem, but narrow AI, where most of the recent technical growth has occurred, focuses on well-defined, specific problems.
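The training, cross-validation, and holdout ideas from the Validating Models entry above can be sketched in plain Python. Everything here is an assumption made for illustration: the synthetic data (y roughly 3x + 5 plus noise), the straight-line model, and the choice of five folds, which is one common way to do cross-validation rather than the only one.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical data set: y is roughly 3*x + 5 plus random noise.
data = [(x, 3.0 * x + 5.0 + random.gauss(0, 1)) for x in range(100)]
random.shuffle(data)


def fit_line(rows):
    """Least-squares straight line; returns (slope, intercept)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(model, rows):
    slope, intercept = model
    return sum((y - (slope * x + intercept)) ** 2 for x, y in rows) / len(rows)


# Holdout: set aside 20% of the data and do not touch it while modeling.
split = int(len(data) * 0.8)
train, holdout = data[:split], data[split:]


def cross_validate(rows, k=5):
    """k-fold cross-validation using the training data only."""
    fold_size = len(rows) // k
    scores = []
    for i in range(k):
        test_fold = rows[i * fold_size:(i + 1) * fold_size]
        train_folds = rows[:i * fold_size] + rows[(i + 1) * fold_size:]
        scores.append(mean_squared_error(fit_line(train_folds), test_fold))
    return sum(scores) / k


cv_error = cross_validate(train)

# Fit on all the training data, then apply the holdout set just once.
final_model = fit_line(train)
holdout_error = mean_squared_error(final_model, holdout)

print("cross-validation error:", round(cv_error, 3))
print("holdout error:", round(holdout_error, 3))
```

If the holdout error is close to the cross-validation error, that suggests the model generalizes to data it has never seen; a much larger holdout error would point to overfitting.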