data science document, Lecture notes of Data Mining

Devi Ahilya Vishwavidyalaya Data Mining

Introduction to Data Science / What is Data Science? (10 Marks) Data Science is an interdisciplinary field that uses statistics, mathematics, computer science, and domain knowledge to extract meaningful information and insights from structured and unstructured data. Definition Data Science is the process of collecting, cleaning, analyzing, and interpreting large volumes of data to support decision-making, prediction, and automation. Key Components of Data Science Data Collection – Gathering data from databases, sensors, social media, web, etc. Data Processing – Cleaning and transforming raw data. Data Analysis – Finding patterns using statistics and ML. Visualization – Presenting results using graphs and dashboards. Decision Making – Using insights for business or scientific decisions. Characteristics Handles Big Data Uses Machine Learning and AI Supports predictive and prescriptive analysis Works with structured, semi-structured, and unstructured data Example Netflix re

Typology: Lecture notes

2025/2026

Uploaded on 02/13/2026

nishita-agrawal-1 🇮🇳

2 documents

INTRODUCTION TO DATA SCIENCE

LECTURE NOTES

UNIT - 1

Introduction to data science

Data science:

Data science is the domain of study that deals with vast volumes of data using modern tools and techniques to find unseen patterns, derive meaningful information, and make business decisions. Data science uses complex machine learning algorithms to build predictive models. The data used for analysis can come from many different sources and be presented in various formats.

Data science is about the extraction, preparation, analysis, visualization, and maintenance of information. It is a cross-disciplinary field that uses scientific methods and processes to draw insights from data.

The Data Science Lifecycle

Data science’s lifecycle consists of five distinct stages, each with its own tasks:

Capture: Data Acquisition, Data Entry, Signal Reception, Data Extraction. This stage involves gathering raw structured and unstructured data.

Maintain: Data Warehousing, Data Cleansing, Data Staging, Data Processing, Data Architecture. This stage covers taking the raw data and putting it in a form that can be used.

Process: Data Mining, Clustering/Classification, Data Modeling, Data Summarization. Data scientists take the prepared data and examine its patterns, ranges, and biases to determine how useful it will be in predictive analysis.

Analyze: Exploratory/Confirmatory, Predictive Analysis, Regression, Text Mining, Qualitative Analysis. This is the real meat of the lifecycle. This stage involves performing the various analyses on the data.

Communicate: Data Reporting, Data Visualization, Business Intelligence, Decision Making. In this final step, analysts prepare the analyses in easily readable forms such as charts, graphs, and reports.

Evolution of Data Science: Growth & Innovation

Data science was born from the idea of merging applied statistics with computer science. The resulting field of study would use the extraordinary power of modern computing. Scientists realized they could not only collect data and solve statistical problems but also use that data to solve real-world problems and make reliable fact-driven predictions.

1962: American mathematician John W. Tukey first articulated the data science dream. In his now-famous article “The Future of Data Analysis,” he foresaw the inevitable emergence of a new field nearly two decades before the first personal computers. While Tukey was ahead of his time, he was not alone in his early appreciation of what would come to be known as “data science.”

1977: The theories and predictions of “pre” data scientists like Tukey and Naur became more concrete with the establishment of The International Association for Statistical Computing (IASC), whose mission was “to link traditional statistical methodology, modern computer technology, and the knowledge of domain experts in order to convert data into information and knowledge.”

1980s and 1990s: Data science began taking more significant strides with the emergence of the first Knowledge Discovery in Databases (KDD) workshop and the founding of the International Federation of Classification Societies (IFCS).

1994: Business Week published a story on the new phenomenon of “Database Marketing.” It described the process by which businesses were collecting and leveraging enormous amounts of data to learn more about their customers, competition, or advertising techniques.

• Data Scientist
• Data Architect
• Statistician
• Business Analyst
• Data and Analytics Manager

1. Data Analyst

Data analysts are responsible for a variety of tasks, including visualisation, munging, and processing of massive amounts of data. They also have to run queries on databases from time to time. One of the most important skills of a data analyst is optimization.

A few important roles and responsibilities of a data analyst include:

• Extracting data from primary and secondary sources using automated tools
• Developing and maintaining databases
• Performing data analysis and making reports with recommendations

To become a data analyst: SQL, R, SAS, and Python are some of the sought-after technologies for data analysis.

2. Data Engineers

Data engineers build and test scalable Big Data ecosystems for businesses so that data scientists can run their algorithms on data systems that are stable and highly optimized. Data engineers also update existing systems with newer or upgraded versions of current technologies to improve the efficiency of the databases.

A few important roles and responsibilities of a data engineer include:

• Designing and maintaining data management systems

• Testing machine learning systems
• Developing apps/products based on client requirements

To become a machine learning engineer: you should be well versed in technologies like Java, Python, JS, etc. Secondly, you should have a strong grasp of statistics and mathematics.

5. Data Scientist

Data scientists have to understand the challenges of a business and offer the best solutions using data analysis and data processing. For instance, they are expected to perform predictive analysis and run a fine-toothed comb through unstructured or disorganized data to offer actionable insights.

A few important roles and responsibilities of a data scientist include:

• Identifying data collection sources for business needs
• Processing, cleansing, and integrating data
• Automating data collection and management processes
• Using data science techniques/tools to improve processes

To become a data scientist, you have to be an expert in R, MATLAB, SQL, Python, and other complementary technologies.

6. Data Architect

A data architect creates the blueprints for data management so that databases can be easily integrated, centralized, and protected with the best security measures. They also ensure that the data engineers have the best tools and systems to work with.

A few important roles and responsibilities of a data architect include:

• Developing and implementing the overall data strategy in line with the business/organization
• Identifying data collection sources in line with the data strategy
• Collaborating with cross-functional teams and stakeholders for the smooth functioning of database systems

• Planning and managing end-to-end data architecture

To become a data architect: this requires expertise in data warehousing, data modelling, extraction, transformation, and loading (ETL), etc. You must also be well versed in Hive, Pig, Spark, etc.

7. Statistician

A statistician, as the name suggests, has a sound understanding of statistical theories and data organization. Not only do they extract and offer valuable insights from data clusters, but they also help create new methodologies for the engineers to apply.

A few important roles and responsibilities of a statistician include:

• Collecting, analyzing, and interpreting data
• Analyzing data, assessing results, and predicting trends/relationships using statistical methodologies/tools
• Designing data collection processes

To become a statistician: you need SQL, data mining, and the various machine learning technologies.

8. Business Analyst

The role of business analysts is slightly different from other data science jobs. While they do have a good understanding of how data-oriented technologies work and how to handle large volumes of data, they also separate the high-value data from the low-value data.

A few important roles and responsibilities of a business analyst include:

• Understanding the business of the organization
• Conducting detailed business analysis: outlining problems, opportunities, and solutions
• Working on improving existing business processes

To become a business analyst: you need an understanding of business finance and business intelligence, as well as IT technologies such as data modelling and data visualization tools.

• Once you have created this data, you then need to collect it somewhere and in a format that is useful for your model. This will depend on what method you will be using in the modelling phase, but it will involve figuring out how you will feed the data into your model.
• The final part of this is to perform any pre-processing steps to ensure that the data is clean enough for the modelling method to work. This may involve removing outliers (or choosing to keep them), handling null values (deciding whether a null value is itself a measure or should be imputed with the average), or standardising the measures.

Modelling

• The next part, and often the most fun and exciting part, is the modelling phase of the Data Science project. The format this will take will depend primarily on what the problem is and how you defined success in the first step, and secondarily on how you processed the data.
• Unfortunately, this is often the part that takes the least amount of time in any Data Science project, especially as many frameworks and libraries exist, such as sklearn, statsmodels, and tensorflow, that can be readily utilised.
• You should have selected the method that you will use to model your data in the problem definition stage, and this may include simple graphical exploration, regression, classification, or clustering.

Evaluation

• Once you have created and implemented your models, you need to know how to evaluate them. Again, this goes back to the problem formulation stage, where you will have defined your measure of success, but this is often one of the most important stages.
• Depending on how you processed your data and set up your model, you may have a holdout dataset or testing dataset that can be used to evaluate your model. On this dataset, you are aiming to see how well your model performs in terms of both accuracy and reliability.

Deployment

Finally, once you have robustly evaluated your model and are satisfied with the results, you can deploy it into production. This can mean a variety of things: using the insights from the model to make changes in your business, using the model to check whether changes that have been made were successful, or deploying the model somewhere to continually receive and evaluate live data.
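The pre-processing and evaluation steps described above can be illustrated with a small sketch. It uses only the Python standard library rather than the frameworks named in the notes, and everything in it (the toy data, the function names, the 25% holdout fraction, the choice of a straight-line model and mean absolute error) is an assumption made for illustration, not part of the original notes. Outlier handling is left out to keep the example short.

```python
import random
import statistics


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def standardise(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold out a fraction of them for evaluation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(rows):
    """Least-squares fit of y = slope * x + intercept on (x, y) pairs."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_mean, y_mean = statistics.fmean(xs), statistics.fmean(ys)
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in rows)
    denominator = sum((x - x_mean) ** 2 for x in xs)
    slope = numerator / denominator
    return slope, y_mean - slope * x_mean


def mean_absolute_error(rows, slope, intercept):
    """Average absolute gap between predictions and true values."""
    return statistics.fmean(abs(y - (slope * x + intercept)) for x, y in rows)


# Toy data (made up for this sketch); one x value is missing.
raw_x = [1.0, 2.0, None, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [3.1, 4.9, 7.2, 9.0, 11.1, 12.8, 15.2, 17.0]

x = standardise(impute_mean(raw_x))               # pre-processing
train, test = train_test_split(list(zip(x, y)))   # holdout split
slope, intercept = fit_line(train)                # modelling
print("holdout MAE:", mean_absolute_error(test, slope, intercept))  # evaluation
```

The model is fitted only on the training rows and scored only on the held-out rows, which is what makes the holdout score an estimate of performance on unseen data.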

6. Image Recognition

Data science is also used in image recognition. For example, when we upload a photo of ourselves with a friend on Facebook, Facebook suggests tags for who is in the picture. This is done with the help of machine learning and data science. When an image is recognized, the faces in it are compared against the profiles of one's Facebook friends, and if a face in the picture matches someone's profile, Facebook suggests auto-tagging.

7. Targeting Recommendation

Targeted recommendation is one of the most important applications of data science. Whatever users search for on the Internet, they will then see related posts everywhere. For example, suppose I want a mobile phone, so I search for it on Google, and afterwards I change my mind and decide to buy offline. Data science helps the companies that are paying for advertisements for that phone: everywhere on the Internet, on social media, websites, and apps, I will see recommendations for the phone I searched for, which pushes me to buy online.

8. Airline Routing Planning

With the help of data science, the airline sector is also growing; for example, it becomes easier to predict flight delays. It also helps to decide whether to fly directly to the destination or to halt in between: a flight can take a direct route from Delhi to the U.S.A., or it can halt along the way before reaching the destination.

9. Data Science in Gaming

In most games where a user plays against a computer opponent, data science concepts are used together with machine learning, so that the computer improves its performance with the help of past data. Many games, such as chess and EA Sports titles, use data science concepts.

10. Medicine and Drug Development

The process of creating a medicine is very difficult and time-consuming and has to be done with full discipline, because it is a matter of someone's life. Without data science, developing a new medicine or drug takes a lot of time, resources, and money, but with the help of data science it becomes easier, because the likely success rate can be predicted from biological data or factors. Algorithms based on data science forecast how a drug will react in the human body without lab experiments.

11. In Delivery Logistics

Various logistics companies such as DHL and FedEx make use of data science. It helps these companies find the best route for shipping their products, the best time for delivery, the best mode of transport to reach the destination, and so on.

12. Autocomplete

The autocomplete feature is an important application of data science: the user types just a few letters or words and is offered a completion of the whole line. In Google Mail, when we are writing a formal mail to someone, the autocomplete feature offers an efficient way to complete the whole line. Autocomplete is also widely used in search engines, on social media, and in various apps.
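As a rough illustration of the idea, the sketch below suggests completions by matching the typed prefix against previously seen phrases and ranking them by how often they occurred. This is a deliberately simple stand-in: production systems such as the one in Google Mail are far more sophisticated, and the phrase history and function names here are invented for the example.

```python
from collections import Counter


def build_index(phrases):
    """Count how often each phrase was typed in the past (case-insensitive)."""
    return Counter(p.lower() for p in phrases)


def autocomplete(index, prefix, limit=3):
    """Return up to `limit` known phrases starting with `prefix`, most frequent first."""
    prefix = prefix.lower()
    matches = [(phrase, count) for phrase, count in index.items()
               if phrase.startswith(prefix)]
    # Sort by descending frequency, then alphabetically to break ties.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [phrase for phrase, _ in matches[:limit]]


# Hypothetical history of phrases a user has typed before.
history = [
    "thank you for your email",
    "thank you for your time",
    "thank you for your email",
    "thanks and regards",
    "please find attached",
]
index = build_index(history)
print(autocomplete(index, "thank"))
```

The "past data" here is just a frequency count, which is enough to show why more common completions are offered first.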

Data security issues

What is Data Security?

Data security is the process of protecting corporate data and preventing data loss through unauthorized access. This includes protecting your data from attacks that can encrypt or destroy data, such as ransomware, as well as attacks that can modify or corrupt your data. Data security also ensures data is available to anyone in the organization who has access to it.

Some industries require a high level of data security to comply with data protection regulations. For example, organizations that process payment card information must use and store payment card data securely, and healthcare organizations in the USA must secure protected health information (PHI) in line with HIPAA.

Data Security vs Data Privacy

Data privacy is the distinction between data in a computer system that can be shared with third parties (non-private data) and data that cannot be shared with third parties (private data). There are two main aspects to enforcing data privacy:

• Access control: ensuring that anyone who tries to access the data is authenticated to confirm their identity, and authorized to access only the data they are allowed to access.
• Data protection: ensuring that even if unauthorized parties manage to access the data, they cannot view it or cause damage to it. Data protection methods include encryption, which prevents anyone from viewing data if they do not have a private encryption key, and data loss prevention mechanisms, which prevent users from transferring sensitive data outside the organization.

Data security has many overlaps with data privacy. The same mechanisms used to ensure data privacy are also part of an organization’s data security strategy. The primary difference is that data privacy mainly focuses on keeping data confidential, while data security mainly focuses on protecting from malicious activity.
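The two halves of access control described above, authentication (confirming identity) and authorization (checking what that identity may access), can be sketched as follows. The user store, role table, resource names, and function names are all hypothetical; a real system would keep them in a database and use an established identity framework. Only standard-library calls are used.

```python
import hashlib
import hmac
import os


def hash_password(password, salt):
    """Derive a password hash with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


# Hypothetical in-memory user store and role-to-resource permission table.
_salt = os.urandom(16)
USERS = {
    "asha": {"salt": _salt, "hash": hash_password("s3cret", _salt), "role": "analyst"},
}
PERMISSIONS = {
    "analyst": {"sales_report"},
    "admin": {"sales_report", "payroll"},
}


def authenticate(username, password):
    """Authentication: does the password match the stored hash for this user?"""
    user = USERS.get(username)
    if user is None:
        return False
    candidate = hash_password(password, user["salt"])
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(candidate, user["hash"])


def authorize(username, resource):
    """Authorization: is this user's role allowed to access the resource?"""
    role = USERS[username]["role"]
    return resource in PERMISSIONS.get(role, set())


def read_resource(username, password, resource):
    """Grant access only if both checks pass."""
    if not authenticate(username, password):
        raise PermissionError("authentication failed")
    if not authorize(username, resource):
        raise PermissionError("not authorized for " + resource)
    return f"contents of {resource}"


print(read_resource("asha", "s3cret", "sales_report"))
```

Keeping the two checks separate mirrors the distinction in the text: a correctly authenticated analyst is still refused the payroll resource.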

infects corporate devices and encrypts data, making it useless without the decryption key. Attackers display a ransom message asking for payment to release the key, but in many cases, even paying the ransom is ineffective and the data is lost.

• Data Loss in the Cloud

Many organizations are moving data to the cloud to facilitate easier sharing and collaboration. However, when data moves to the cloud, it is more difficult to control and prevent data loss. Users access data from personal devices and over unsecured networks. It is all too easy to share a file with unauthorized parties, either accidentally or maliciously.

• SQL Injection

SQL injection (SQLi) is a common technique used by attackers to gain illicit access to databases, steal data, and perform unwanted operations. It works by adding malicious code to a seemingly innocent database query.

Common Data Security Solutions and Techniques:

Data Discovery and Classification

• Modern IT environments store data on servers, endpoints, and cloud systems. Visibility over data flows is an important first step in understanding what data is at risk of being stolen or misused.
• To properly protect your data, you need to know the type of data, where it is, and what it is used for. Data discovery and classification tools can help.
• Data detection is the basis for knowing what data you have. Data classification allows you to create scalable security solutions by identifying which data is sensitive and needs to be secured.

Data Masking

• Data masking lets you create a synthetic version of your organizational data, which you can use for software testing, training, and other purposes that don’t require the real data.
• The goal is to protect data while providing a functional alternative when needed.
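The SQL injection technique described earlier in this section can be demonstrated with Python's built-in `sqlite3` module. The sketch below is illustrative only: the table, its rows, and the function names are made up. It contrasts a query built by string concatenation with one that passes the input as a bound parameter.

```python
import sqlite3

# In-memory database with a made-up table, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)


def find_user_unsafe(name):
    """Vulnerable: user input is pasted straight into the SQL text."""
    query = "SELECT name, secret FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(name):
    """Safer: the input is passed as a bound parameter and never parsed as SQL."""
    return conn.execute(
        "SELECT name, secret FROM users WHERE name = ?", (name,)
    ).fetchall()


# A seemingly innocent lookup value that actually carries extra SQL.
malicious = "nobody' OR '1'='1"

print(find_user_unsafe(malicious))  # the injected OR clause matches every row
print(find_user_safe(malicious))    # no user has this literal name, so no rows
```

With concatenation, the quote characters in the input end the string literal early and the rest is executed as SQL; with a bound parameter, the whole input is treated as a single value.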

UNIT –II

DATA COLLECTION AND PREPROCESSING

DATA COLLECTION:

Data collection is the process of collecting, measuring, and analyzing different types of information using a set of standard validated techniques. The main objective of data collection is to gather information-rich and reliable data and analyze it to make critical business decisions. Once the data is collected, it goes through a rigorous process of data cleaning and data processing to make it truly useful for businesses.

There are two main methods of data collection in research, based on the information that is required, namely:

• Primary Data Collection
• Secondary Data Collection

Primary Data Collection Methods

Primary data refers to data collected from first-hand experience directly from the main source. It refers to data that has never been used in the past. The data gathered by primary data collection methods are generally regarded as the best kind of data in research.

The methods of collecting primary data can be further divided into quantitative data collection methods (which deal with factors that can be counted) and qualitative data collection methods (which deal with factors that are not necessarily numerical in nature).

Here are some of the most common primary data collection methods:

1. Interviews

Interviews are a direct method of data collection. An interview is simply a process in which the interviewer asks questions and the interviewee responds to them. It provides a high degree of flexibility because questions can be adjusted and changed at any time according to the situation.

2. Observations

In this method, researchers observe a situation around them and record the findings. It can be used to evaluate the behaviour of different people in controlled (everyone knows they are being observed) and uncontrolled (no one knows they are being observed) situations.

3. Surveys and Questionnaires

Surveys and questionnaires provide a broad perspective from large groups of people. They can be conducted face-to-face, mailed, or even posted on the Internet to get respondents from anywhere in the world.

4. Focus Groups

A focus group is similar to an interview, but it is conducted with a group of people who all have something in common. The data collected is similar to that from in-person interviews, but focus groups offer a better understanding of why a certain group of people thinks in a particular way.

5. Oral Histories

Oral histories also involve asking questions, like interviews and focus groups. However, an oral history is defined more precisely, and the data collected is linked to a single phenomenon. It involves collecting the opinions and personal experiences of people about a particular event that they were involved in.

Secondary Data Collection Methods

Secondary data refers to data that has already been collected by someone else. It is much less expensive and easier to collect than primary data. Here are some of the most common secondary data collection methods:
