







































































It's a pdf of lecture notes of computer science








































































(Autonomous) Dundigal, Hyderabad- 500043
The volume of data that one has to deal with has exploded to unimaginable levels in the past decade, and at the same time, the price of data storage has steadily fallen. Private companies and research institutions capture terabytes of data about their users' interactions, business, and social media, and also from sensors in devices such as mobile phones and automobiles. The challenge of this era is to make sense of this sea of data. This is where big data analytics comes into the picture. Big data analytics largely involves collecting data from different sources, managing it in a way that makes it available to be consumed by analysts, and finally delivering data products useful to the organization's business.
achieve the objectives. A decision model, especially one built using the Decision Model and Notation standard, can be used.
Data Understanding − The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.
Data Preparation − The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.
Modeling − In this phase, various modeling techniques are selected and applied and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, it is often required to step back to the data preparation phase.
Evaluation − At this stage in the project, you have built a model (or models) that appears to have high quality, from a data analysis perspective. Before proceeding to final deployment of the model, it is important to evaluate the model thoroughly and review the steps executed to construct the model, to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.
Deployment − Creation of the model is generally not the end of the project.
Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that is useful to the customer. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data scoring (e.g. segment allocation) or data mining process.
In many cases, it will be the customer, not the data analyst, who will carry out the deployment steps. Even if the analyst deploys the model, it is important for the customer to understand upfront the actions which will need to be carried out in order to actually make use of the created models.
SEMMA Methodology:
SEMMA is another methodology developed by SAS for data mining modeling. It stands for Sample, Explore, Modify, Model, and Assess. Here is a brief description of its stages −
Sample − The process starts with data sampling, e.g., selecting the dataset for modeling. The dataset should be large enough to contain sufficient information to retrieve, yet small enough to be used efficiently. This phase also deals with data partitioning.
Explore − This phase covers the understanding of the data by discovering anticipated and unanticipated relationships between the variables, and also abnormalities, with the help of data visualization.
Modify − The Modify phase contains methods to select, create and transform variables in preparation for data modeling.
Model − In the Model phase, the focus is on applying various modeling (data mining) techniques to the prepared variables in order to create models that possibly provide the desired outcome.
Assess − The evaluation of the modeling results shows the reliability and usefulness of the created models.
The main difference between CRISP-DM and SEMMA is that SEMMA focuses on the modeling aspect, whereas CRISP-DM gives more importance to the stages of the cycle prior to modeling, such as understanding the business problem to be solved and understanding and preprocessing the data to be used as input to, for example, machine learning algorithms.
Big Data Life Cycle:
In today's big data context, the previous approaches are either incomplete or suboptimal. For example, the SEMMA methodology completely disregards data collection and the preprocessing of different data sources. These stages normally constitute most of the work in a successful big data project. A big data analytics cycle can be described by the following stages −
Business Problem Definition
Research
Human Resources Assessment
Data Acquisition
Data Munging
Data Storage
using a two-arrow system, one arrow for upvoting and the other for downvoting. This would imply a response variable of the form y ∈ {positive, negative}. In order to combine both data sources, a decision has to be made to make these two response representations equivalent. This can involve converting the first data source's response representation to the second form, considering one star as negative and five stars as positive. This process often requires a large time allocation to be delivered with good quality.
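The conversion described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and dropping the intermediate ratings (two to four stars) is an assumption, since the text only says how one-star and five-star ratings are mapped.

```python
def stars_to_label(stars):
    """Map a star rating to the binary response used by the up/down-vote source."""
    if stars == 1:
        return "negative"
    if stars == 5:
        return "positive"
    # Assumption: ratings of 2 to 4 stars are treated as ambiguous and dropped.
    return None


def combine_sources(star_reviews, vote_reviews):
    """Merge (text, stars) pairs with (text, label) pairs into one labelled list."""
    combined = []
    for text, stars in star_reviews:
        label = stars_to_label(stars)
        if label is not None:
            combined.append((text, label))
    combined.extend(vote_reviews)
    return combined


if __name__ == "__main__":
    stars = [("great product", 5), ("it is fine", 3), ("broke in a day", 1)]
    votes = [("would buy again", "positive")]
    print(combine_sources(stars, votes))
```

Other choices, such as treating four stars as positive as well, change how much data survives the merge, which is part of why this step takes time to do well.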
Data Storage
Once the data is processed, it sometimes needs to be stored in a database. Big data technologies offer plenty of alternatives regarding this point. The most common alternative is using the Hadoop Distributed File System (HDFS) for storage, together with Hive, which provides users a limited version of SQL known as the Hive Query Language. From the user's perspective, this allows most analytics tasks to be done in much the same way as in traditional BI data warehouses. Other options to be considered are MongoDB and Redis for storage, along with Spark for processing the stored data. This stage of the cycle depends on the knowledge of the available human resources, in terms of their ability to implement different architectures. Modified versions of traditional data warehouses are still being used in large-scale applications. For example, Teradata and IBM offer SQL databases that can handle terabytes of data; open source solutions such as PostgreSQL and MySQL are also still being used for large-scale applications. Even though there are differences in how the different storage systems work in the background, from the client side, most solutions provide a SQL API. Hence, a good understanding of SQL is still a key skill for big data analytics. A priori, this stage seems to be the most important topic; in practice, this is not true. It is not even an essential stage. It is possible to implement a big data solution that works with real-time data; in this case, we only need to gather data to develop the model and then implement it in real time. So there would be no need to formally store the data at all.
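To illustrate why SQL remains useful whatever the storage backend, here is a minimal sketch using Python's built-in sqlite3 module. SQLite only stands in for a real warehouse here, and the table and column names are hypothetical; the same kind of aggregate query could be sent to most of the systems mentioned above.

```python
import sqlite3

# An in-memory SQLite database stands in for the real data store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0)],
)

# A typical analytics query: total sales per region.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

print(rows)
```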
Exploratory Data Analysis
Once the data has been cleaned and stored in a way that insights can be retrieved from it, the data exploration phase is mandatory. The objective of this stage is to understand the data; this is normally done with statistical techniques and by plotting the data. This is a good stage at which to evaluate whether the problem definition makes sense or is feasible.
Data Preparation for Modeling and Assessment
This stage involves reshaping the cleaned data retrieved previously and using statistical preprocessing for missing values imputation, outlier detection, normalization, feature extraction and feature selection.
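Two of the preprocessing steps named above, missing-value imputation and outlier detection, can be sketched with the standard library alone. Mean imputation and a z-score threshold are just one simple choice among many, and the function names and the threshold of 2.0 are assumptions made for this example.

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def flag_outliers(values, threshold=2.0):
    """Flag values whose z-score exceeds the threshold (assumes non-zero spread)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [abs((v - mu) / sigma) > threshold for v in values]


if __name__ == "__main__":
    print(impute_mean([1.0, None, 3.0]))
    print(flag_outliers([10, 10, 10, 10, 10, 10, 10, 10, 10, 100]))
```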
Modeling
The prior stage should have produced several datasets for training and testing, for example, a predictive model. This stage involves trying different models with the aim of solving the business problem at hand. In practice, it is normally desired that the model also give some insight into the business. Finally, the best model or combination of models is selected by evaluating its performance on a held-out dataset.
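Selecting a model on a held-out dataset can be sketched as follows. The two candidate models (a constant mean predictor and a least-squares line), the toy data, and the use of mean squared error are all assumptions chosen to keep the example small.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Baseline: always predict the mean of the training responses."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Simple least-squares regression line through the training data."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given data."""
    return mean([(model(x) - y) ** 2 for x, y in zip(xs, ys)])


def select_best(fitters, train, held_out):
    """Fit every candidate on the training set; return the name with lowest held-out error."""
    scores = {name: mse(fit(*train), *held_out) for name, fit in fitters.items()}
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    train = ([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
    held_out = ([5, 6], [10.1, 11.9])
    best, scores = select_best({"mean": fit_mean, "line": fit_line}, train, held_out)
    print(best, scores)
```

The held-out data is never used for fitting, so the comparison reflects how each candidate behaves on data it has not seen.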
Implementation
In this stage, the data product developed is implemented in the data pipeline of the company. This involves setting up a validation scheme that runs while the data product is working, in order to track its performance. For example, in the case of implementing a predictive model, this stage would involve applying the model to new data and, once the response is available, evaluating the model.
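One way to picture such a validation scheme is a thin wrapper that stores each prediction until the true response arrives and then updates an error metric. The class below is a hypothetical sketch, not a description of any particular monitoring tool; the choice of mean absolute error is also an assumption.

```python
class MonitoredModel:
    """Wraps a prediction function and tracks its error as true responses arrive."""

    def __init__(self, model):
        self.model = model
        self.pending = {}  # record id -> prediction still awaiting its true response
        self.errors = []   # absolute errors of predictions that have been resolved

    def predict(self, record_id, features):
        prediction = self.model(features)
        self.pending[record_id] = prediction
        return prediction

    def record_outcome(self, record_id, actual):
        prediction = self.pending.pop(record_id)
        self.errors.append(abs(prediction - actual))

    def mean_absolute_error(self):
        if not self.errors:
            return None
        return sum(self.errors) / len(self.errors)


if __name__ == "__main__":
    monitored = MonitoredModel(lambda x: 2 * x)  # toy model for illustration
    monitored.predict("a", 3)
    monitored.record_outcome("a", 7)
    print(monitored.mean_absolute_error())
```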
Big Data Analytics – Methodology:
In terms of methodology, big data analytics differs significantly from the traditional statistical approach of experimental design. Analytics starts with data. Normally we model the data in a way that explains a response. The objectives of this approach are to predict the response behavior or to understand how the input variables relate to a response. Normally in statistical experimental designs, an experiment is developed and data is retrieved as a result. This allows generating data in a way that can be used by a statistical model, where certain assumptions hold, such as independence, normality, and randomization.
In big data analytics, we are presented with the data. We cannot design an experiment that fulfills our favorite statistical model. In large-scale applications of analytics, a large amount of work (normally 80% of the effort) is needed just for cleaning the data, so that it can be used by a machine learning model. We don't have a unique methodology to follow in real large-scale applications. Normally, once the business problem is defined, a research stage is needed to design the methodology to be used. However, some general guidelines are worth mentioning and apply to almost all problems.
One of the most important tasks in big data analytics is statistical modeling, meaning supervised and unsupervised classification or regression problems. Once the data is cleaned, preprocessed, and available for modeling, care should be taken to evaluate different models with reasonable loss metrics; then, once the model is implemented, further evaluation should be carried out and the results reported. A common pitfall in predictive modeling is to just implement the model and never measure its performance.
storage or in reporting general business results. Data warehousing is by no means simple; it is just different from what a data scientist does. Many organizations struggle hard to find competent data scientists in the market. It is, however, a good idea to select prospective data analysts and teach them the relevant skills to become data scientists. This is by no means a trivial task and would normally involve the person doing a master's degree in a quantitative field, but it is definitely a viable option. The basic skills a competent data analyst must have are listed below −
Business understanding
SQL programming
Report design and implementation
Dashboard development
Big Data Analytics - Data Scientist:
The role of a data scientist is normally associated with tasks such as predictive modeling, developing segmentation algorithms, recommender systems, A/B testing frameworks and often working with raw unstructured data. The nature of their work demands a deep understanding of mathematics, applied statistics and programming. There are a few skills common between a data analyst and a data scientist, for example, the ability to query databases. Both analyze data, but the decision of a data scientist can have a greater impact in an organization. Here is a set of skills a data scientist normally needs to have −
Programming in a statistical package such as R, Python, SAS, SPSS, or Julia
Able to clean, extract, and explore data from different sources
Research, design, and implementation of statistical models
Deep statistical, mathematical, and computer science knowledge

In big data analytics, people normally confuse the role of a data scientist with that of a data architect. In reality, the difference is quite simple. A data architect defines the tools and the architecture in which the data will be stored, whereas a data scientist uses this architecture. Of course, a data scientist should be able to set up new tools if needed for ad-hoc projects, but the infrastructure definition and design should not be part of their task.
Big Data Analytics - Problem Definition:
Throughout this tutorial, we will develop a project. Each subsequent chapter in this tutorial deals with a part of the larger project in the mini-project section. This is intended to be an applied tutorial section that provides exposure to a real-world problem. In this case, we start with the problem definition of the project.
Project Description
The objective of this project would be to develop a machine learning model to predict the hourly salary of people using their curriculum vitae (CV) text as input. Using the framework defined above, it is simple to define the problem. We can define X = {x1, x2, …, xn} as the CVs of users, where each feature can be, in the simplest way possible, the number of times a given word appears in the CV. The response is real valued: we are trying to predict the hourly salary of individuals in dollars. These two considerations are enough to conclude that the problem presented can be solved with a supervised regression algorithm.
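The word-count features described above can be sketched as follows. The tokenizer (lowercase runs of letters), the function names, and the tiny vocabulary are assumptions made for illustration; a real project would need more careful text processing.

```python
import re
from collections import Counter


def word_counts(cv_text):
    """Count how many times each lowercase word appears in a CV."""
    return Counter(re.findall(r"[a-z]+", cv_text.lower()))


def to_feature_vector(cv_text, vocabulary):
    """Turn a CV into a list of counts, one entry per vocabulary word."""
    counts = word_counts(cv_text)
    return [counts[word] for word in vocabulary]


if __name__ == "__main__":
    vocabulary = ["python", "sql", "java"]  # hypothetical vocabulary
    print(to_feature_vector("Python developer. Python and SQL.", vocabulary))
```

Each CV becomes one row of X, and the corresponding hourly salary becomes the real-valued response for a regression model.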
Problem Definition
Problem Definition is probably one of the most complex and heavily neglected stages in the big data analytics pipeline. In order to define the problem a data product would solve, experience is mandatory. Most data scientist aspirants have little or no experience in this stage. Most big data problems can be categorized in the following ways −
Supervised classification
Supervised regression
Unsupervised learning
Learning to rank

Let us now learn more about these four concepts.
Supervised Classification
Given a matrix of features X = {x1, x2, ..., xn}, we develop a model M to predict different classes defined as y = {c1, c2, ..., cn}. For example, given transactional data of customers in an insurance company, it is possible to develop a model that will predict whether a client will churn or not. This is a binary classification problem, where the target variable has two classes: churn and not churn.
Data formats: Data can mean many different things, and there are many ways to classify it. Two of the more common are:
Primary and Secondary: Primary data is data that you collect or generate. Secondary data is created by other researchers, and could be their primary data, or the data resulting from their research.
Qualitative and Quantitative: Qualitative refers to text, images, video, sound recordings, observations, etc. Quantitative refers to numerical data.
For management purposes, data can typically be sorted into five main categories. The category that you choose will then affect the choices that you make throughout the rest of your data management plan.
Observational
Captured in real time.
Cannot be reproduced or recaptured; sometimes called 'unique data'.
Examples include sensor readings, telemetry, survey results, images, and human observation.

Experimental
Data from lab equipment and under controlled conditions.
Often reproducible, but can be expensive to reproduce.
Examples include gene sequences, chromatograms, magnetic field readings, and spectroscopy.

Simulation
Data generated from test models studying actual or theoretical systems.
Models and metadata, where the input is more important than the output data.
Examples include climate models, economic models, and systems engineering.

Derived or compiled
The results of data analysis, or data aggregated from multiple sources.
Reproducible (but very expensive).
Examples include text and data mining, compiled databases, and 3D models.

Reference or canonical
Fixed or organic collections of datasets, usually peer-reviewed, and often published and curated.
Examples include gene sequence databanks, census data, and chemical structures.
Data can come in many forms. Some common ones are text, numeric, multimedia, models, audio, code, software, discipline specific (e.g., FITS in astronomy, CIF in chemistry), video, and instrument.
Data format in information technology may refer to:
Data type, constraint placed upon the interpretation of data in a type system
Signal (electrical engineering), a format for signal data used in signal processing
Recording format, a format for encoding data for storage on a storage medium
File format, a format for encoding data for storage in a computer file
Container format (digital), a format for encoding data for storage by means of a standardized audio/video codecs file format
Content format, a format for representing media content as data
Audio format, a format for encoded sound data
Video format, a format for encoded video data
Recommended Digital Data Formats:
Text, Documentation, Scripts: XML, PDF/A, HTML, Plain Text.
Still Image: TIFF, JPEG 2000, PNG, JPEG/JFIF, DNG (digital negative), BMP, GIF.
Geospatial: Shapefile (SHP, DBF, SHX), GeoTIFF, NetCDF.
Graphic Image:
raster formats: TIFF, JPEG 2000, PNG, JPEG/JFIF, DNG, BMP, GIF.
vector formats: Scalable Vector Graphics, AutoCAD Drawing Interchange Format, Encapsulated PostScript, Shapefiles.
cartographic: Most complete data, GeoTIFF, GeoPDF, GeoJPEG2000, Shapefile.
Audio: WAVE, AIFF, MP3, MXF, FLAC.
Video: MOV, MPEG-4, AVI, MXF.
Database: XML, CSV, TAB.
Parsing and Transformation:
In the data transformation process, data are transformed from one format to another format that is more appropriate for data mining.
Some Data Transformation Strategies:
Smoothing: Smoothing is a process of removing noise from the data.
Aggregation: Aggregation is a process where summary or aggregation operations are applied to the data.
Generalization: In generalization, low-level data are replaced with higher-level concepts by climbing concept hierarchies.
Normalization: Normalization scales attribute data so that they fall within a small specified range, such as 0.0 to 1.0.
Attribute Construction: In attribute construction, new attributes are constructed from the given set of attributes.
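Two of the strategies listed above can be sketched in a few lines of Python. A moving average is used here as one simple smoothing technique, and min-max scaling as one way to normalize into the range 0.0 to 1.0; both are choices made for this example, and the function names are hypothetical.

```python
def moving_average(values, window=3):
    """Smooth a series by averaging each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max] (assumes max > min)."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]


if __name__ == "__main__":
    print(moving_average([1, 2, 3, 4, 5]))
    print(min_max_normalize([10, 20, 30]))
```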
A scalable online transaction processing system or database management system is one that can be upgraded to process more transactions by adding new processors, devices and storage, and which can be upgraded easily and transparently without shutting it down.
Some early peer-to-peer (P2P) implementations of Gnutella had scaling issues. Each node flooded its queries to all peers. The demand on each peer would increase in proportion to the total number of peers, quickly overrunning the peers' limited capacity. Other P2P systems like BitTorrent scale well because the demand on each peer is independent of the total number of peers. There is no centralized bottleneck, so the system may expand indefinitely without the addition of supporting resources (other than the peers themselves).
The distributed nature of the Domain Name System allows it to work efficiently even when all hosts on the worldwide Internet are served, so it is said to "scale well".
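The difference between the two P2P designs can be made concrete with a toy calculation. Both functions below are simplified models written for this example, not measurements of Gnutella or BitTorrent, and the parameter values are hypothetical.

```python
def flooding_load_per_peer(num_peers, queries_per_peer):
    """Messages one peer handles when every peer floods its queries to all others."""
    return (num_peers - 1) * queries_per_peer


def bounded_load_per_peer(num_peers, max_connections):
    """Hypothetical design in which a peer talks to at most `max_connections` others."""
    return min(num_peers - 1, max_connections)


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        print(n, flooding_load_per_peer(n, 5), bounded_load_per_peer(n, 50))
```

In the flooding model the per-peer load grows linearly with the network size, while in the bounded model it levels off, which is the sense in which the second design scales well.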
Data Cleaning: Data cleaning is a technique that is applied to remove the noisy data and correct the inconsistencies in data. Data cleaning involves transformations to correct the wrong data. Data cleaning is performed as a data preprocessing step while preparing the data for a data warehouse.
KDP (the Knowledge Discovery Process) is a process of finding knowledge in data; it does this by using data mining methods
(algorithms) in order to extract useful knowledge from large amounts of data.
Knowledge Discovery Process (KDP)
Data consistency refers to the usability of data:
Consistency is the degree to which a set of measures are equivalent across systems. Inconsistency occurs when two data items in the data set contradict each other: e.g., a customer is recorded in two different systems as having two different current addresses, and only one of them can be correct. Fixing inconsistency is not always possible: it requires a variety of strategies, e.g., deciding which data were recorded more recently, which data source is likely to be most reliable (the latter knowledge may be specific to a given organization), or simply trying to find the truth by testing both data items (e.g., calling up the customer).
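The first strategy mentioned above, preferring the most recently recorded value, can be sketched like this. The record layout (a dictionary per customer with `address` and `recorded_on` fields) and the function name are assumptions for illustration only.

```python
from datetime import date


def resolve_by_recency(system_a, system_b):
    """For customers present in both systems, keep the most recently recorded address.

    Returns the resolved addresses and the ids of customers whose records conflicted.
    """
    resolved, conflicts = {}, []
    for customer_id in sorted(system_a.keys() & system_b.keys()):
        a, b = system_a[customer_id], system_b[customer_id]
        if a["address"] != b["address"]:
            conflicts.append(customer_id)
        newest = max(a, b, key=lambda record: record["recorded_on"])
        resolved[customer_id] = newest["address"]
    return resolved, conflicts


if __name__ == "__main__":
    crm = {"c1": {"address": "1 Old Rd", "recorded_on": date(2020, 1, 1)}}
    billing = {"c1": {"address": "2 New St", "recorded_on": date(2023, 6, 1)}}
    print(resolve_by_recency(crm, billing))
```

Recency is only a heuristic: the newer record can still be the wrong one, which is why the conflicting ids are returned for follow-up.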
Point-in-time consistency:
Point-in-time consistency is an important property of backup files and a critical objective of
software that creates backups. It is also relevant to the design of disk memory systems,
specifically relating to what happens when they are unexpectedly shut down.
As a relevant backup example, consider a website with a database such as the online
encyclopedia Wikipedia, which needs to be operational around the clock, but also must be
backed up with regularity to protect against disaster. Portions of Wikipedia are constantly being
updated every minute of every day. Meanwhile, Wikipedia's database is stored on servers in the
form of one or several very large files which require minutes or hours to back up.
These large files - as with any database - contain numerous data structures which reference each
other by location. For example, some structures are indexes which permit the database subsystem
to quickly find search results. If the data structures cease to reference each other properly, then
the database can be said to be corrupted.
Counter example:
The importance of point-in-time consistency can be illustrated with what would happen if a
backup were made without it.
Assume Wikipedia's database is a huge file, which has an important index located 20% of the
way through, and saves article data at the 75% mark. Consider a scenario where an editor comes
and creates a new article at the same time a backup is being performed, which is being made as a
simple "file copy" which copies from the beginning to the end of the large file(s) and doesn't
consider data consistency - and at the time of the article edit, the backup is 50% complete. The new article
is added to the article space (at the 75% mark) and a corresponding index entry is added (at the
20% mark).
Because the backup is already halfway done and the index already copied, the backup will be
written with the article data present, but with the index reference missing. As a result of the
inconsistency, this file is considered corrupted.
In real life, a real database such as Wikipedia's may be edited thousands of times per hour, and
references are virtually always spread throughout the file and can number into the millions,
billions, or more. A sequential "copy" backup would literally contain so many small corruptions
that the backup would be completely unusable without a lengthy repair process which could
provide no guarantee as to the completeness of what has been recovered.
A backup process which properly accounts for data consistency ensures that the backup is a
snapshot of how the entire database looked at a single moment. In the given Wikipedia example,
it would ensure that the backup was written without the added article at the 75% mark, so that the
article data would be consistent with the index data previously written.
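The scenario above can be reproduced with a small simulation. The database is modelled as a list of ten blocks, with the index in block 2 and the article data in block 7 (roughly the 20% and 75% marks); these positions, the block model, and all function names are assumptions made for this sketch.

```python
INDEX_BLOCK = 2    # roughly 20% of the way through a 10-block file
ARTICLE_BLOCK = 7  # roughly the 75% mark


def make_db():
    """A toy database: ten blocks, one holding the index and one the articles."""
    db = [()] * 10
    db[INDEX_BLOCK] = ("Existing",)
    db[ARTICLE_BLOCK] = ("Existing",)
    return db


def add_article(db, title):
    """An edit touches two places: the article data and the index."""
    db[ARTICLE_BLOCK] = db[ARTICLE_BLOCK] + (title,)
    db[INDEX_BLOCK] = db[INDEX_BLOCK] + (title,)


def sequential_copy(db, edit_at, edit):
    """Copy blocks front to back while the live database is edited mid-copy."""
    backup = []
    for i in range(len(db)):
        if i == edit_at:
            edit(db)
        backup.append(db[i])
    return backup


def snapshot_copy(db, edit_at, edit):
    """Freeze the state first, then copy the frozen view; later edits are not seen."""
    frozen = list(db)
    backup = []
    for i in range(len(frozen)):
        if i == edit_at:
            edit(db)
        backup.append(frozen[i])
    return backup


def is_consistent(blocks):
    """The backup is consistent if every article has an index entry and vice versa."""
    return set(blocks[INDEX_BLOCK]) == set(blocks[ARTICLE_BLOCK])


if __name__ == "__main__":
    def edit(db):
        add_article(db, "New")

    print(is_consistent(sequential_copy(make_db(), 5, edit)))
    print(is_consistent(snapshot_copy(make_db(), 5, edit)))
```

The sequential copy captures the new article but not its index entry, while the snapshot copy contains neither, matching the consistent outcome described above.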
Disk caching systems:
Point-in-time consistency is also relevant to computer disk subsystems.
Specifically, operating systems and file systems are designed with the expectation that the
computer system they are running on could lose power, crash, fail, or otherwise cease operating
at any time. When properly designed, they ensure that data will not be unrecoverably corrupted if
the power is lost. Operating systems and file systems do this by ensuring that data is written to a
hard disk in a certain order, and rely on that in order to detect and recover from unexpected
shutdowns.
On the other hand, rigorously writing data to disk in the order that maximizes data integrity also
impacts performance. A process of write caching is used to consolidate and re-sequence write
operations such that they can be done faster by minimizing the time spent moving disk heads.
Data consistency concerns arise when write caching changes the sequence in which writes are
carried out, because there exists the possibility of an unexpected shutdown that violates the
operating system's expectation that all writes will be committed sequentially.
For example, in order to save a typical document or picture file, an operating system might write
the following records to a disk in the following order: