










Introduction to Data Science Using R











In this chapter we will also look at the five disruptions that data science has caused in the marketplace. Once the context and its importance are understood, it is easy to simplify and demonstrate what data science actually is. We will also compare traditional architecture with data science and understand the importance of signal detection, which we study in chapter 2. The machine learning techniques that help with signal detection are studied from chapter 8 onwards, although a few machine learning concepts are covered in this chapter. This chapter also discusses solution architecture and the three critical components required for any solution.
The five disruptive products launched in the marketplace are discussed below:
1. A very simple Japanese app
2. Healthcare app
3. Coursera
4. Sensory device in the agriculture sector
5. Autonomous car
The first is a very simple Japanese app that helps two people discover each other. Every individual answers a set of questions. The answers yield a characteristics score that indicates whether the person likes music or books and captures viewpoints on philosophy, religion, etc. Whatever the parameters are, the questions have to be answered, and each person gets a score attached to each question answered.
The other score attached to the device is location. If you carry the device while walking on the street, it tells you how many people with similar scores are within a 1 km radius. The app enables strangers to look each other up and have coffee, chat or get to know one another better. Using the similarity score and location, they are able to discover one another. Disruption: An app that capitalized on the new social norms of today’s casual meetups. It revolutionized the way people find others with similar tastes and interests. It uses data to find patterns and clusters in a huge set of entries and presents them to users in a meaningful way, which is the ‘right match’ in this case. Turning data into insights.
FIGURE 1.1 Japanese dating app
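The matching idea described above (a similarity score combined with a 1 km radius) can be sketched in R. This is a minimal illustration, not the app’s actual method: the user names, scores, coordinates, the choice of cosine similarity and the 0.95 threshold are all assumptions made for the example.

```r
# Hypothetical users: questionnaire scores (1-5) and GPS coordinates.
users <- data.frame(
  name       = c("Aiko", "Ben", "Chika", "Daichi"),
  music      = c(5, 4, 1, 5),
  books      = c(3, 3, 5, 2),
  philosophy = c(4, 5, 1, 4),
  lat        = c(35.6595, 35.6620, 35.6600, 35.7100),
  lon        = c(139.7005, 139.7040, 139.7010, 139.8100),
  stringsAsFactors = FALSE
)

# Cosine similarity between two score vectors (1 means identical direction).
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Great-circle (haversine) distance in kilometres.
haversine_km <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * 6371 * asin(sqrt(a))
}

# Return the users who are both similar enough and close enough to `who`.
find_matches <- function(users, who, radius_km = 1, min_similarity = 0.95) {
  score_cols <- c("music", "books", "philosophy")
  me <- users[users$name == who, ]
  others <- users[users$name != who, ]
  others$similarity <- apply(others[, score_cols], 1, cosine_similarity,
                             b = unlist(me[, score_cols]))
  others$distance_km <- haversine_km(me$lat, me$lon, others$lat, others$lon)
  keep <- others$similarity >= min_similarity & others$distance_km <= radius_km
  others[keep, c("name", "similarity", "distance_km")]
}

matches <- find_matches(users, "Aiko")
print(matches)
```

In this made-up data, one user is nearby but has very different scores and another has similar scores but is several kilometres away, so only the user who satisfies both conditions is returned.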
The second is in the healthcare space. In this healthcare app, a heart implant communicates information such as heart rate and the condition of the heart to your mobile phone in real time. The mobile app also communicates remotely with the doctor. Disruption: Fewer visits to the clinic and lower non-medical costs. Continuous monitoring of organ health replaces one-time data captured during a physician visit. This presents an opportunity to track patterns, with a higher chance of identifying an anomaly and hence acting early.
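To make “identifying an anomaly” from continuous monitoring concrete, here is a small R sketch. It is only an illustration under stated assumptions: the readings are invented, and the rule (compare each reading with the median of the previous eight readings, in units of the median absolute deviation) is one simple choice among many, not the method used by any real device.

```r
# Hypothetical heart-rate readings (beats per minute), one per minute.
heart_rate <- c(72, 74, 71, 73, 75, 72, 70, 74, 73, 118, 72, 71)

# Flag a reading when it lies far from the median of the preceding window,
# measured in units of the median absolute deviation (MAD) of that window.
flag_anomalies <- function(x, window = 8, threshold = 5) {
  flagged <- rep(FALSE, length(x))
  for (i in seq_along(x)) {
    if (i <= window) next                      # not enough history yet
    history <- x[(i - window):(i - 1)]
    centre <- median(history)
    spread <- mad(history)
    if (spread > 0 && abs(x[i] - centre) / spread > threshold) {
      flagged[i] <- TRUE
    }
  }
  flagged
}

anomalies <- which(flag_anomalies(heart_rate))
print(anomalies)               # positions of the flagged readings
print(heart_rate[anomalies])   # the flagged values themselves
```

Using the median and MAD rather than the mean and standard deviation keeps a single spike from distorting the baseline used for the readings that follow it.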
to continuously improve the quality of the content delivered by collecting data on the focus and topics of interest of thousands of students from across the world. By redesigning the UX and fine-tuning content, Coursera disrupted the way online education was delivered by its predecessors such as Khan Academy, MIT OCW, etc.
The fourth disruptive product is in the agriculture sector. Agriculture is a big part of the Netherlands’ economy. They make the world’s best cheese and butter. One of the problems farmers face there is understanding the health of cows that are carrying calves. They have therefore attached a sensory device to the cows’ ears, through which farmers can monitor their cows’ health remotely (the data is communicated via satellite). Disruption: The sensors help with cattle health monitoring, and action can be taken immediately if the cattle are unwell. This helps with timely detection of disease and, through prediction, helps prevent its spread to other cows.
FIGURE 1.4 Sensor-equipped cows in the Netherlands
Lastly, the autonomous car. An autonomous car is special in that it moves without a driver. A device tracks and scans the surroundings of the car at high speed. It has the intelligence to process all kinds of real-time information and communicates it back to the steering wheel.
CHAPTER 1 Data Science: Key Concepts
Disruption: By processing data from images and supplementary sensors, self-driving cars create a virtual world through which they navigate. By making reaction time millions of times faster than a human’s, they aim to eliminate accidents driven by human error and to reduce traffic congestion. The result is a significant improvement in time and fuel efficiency whilst saving lives.
FIGURE 1.5 Google’s autonomous car
A look at all five uses shows one thing in common: a data product working behind the scenes, silently humming. To create a data product, a data science process is needed that will unearth patterns from the data and create a bigger product. So the five examples from everyday life, how our health gets taken care of, how we learn, how we fall in love, how we farm and how we drive, are all increasingly touched by data products. Data science needs to be an integral part of any organization; otherwise there is a very high probability that it will lose the marketplace. One of the biggest secrets of winners is that they are able to see patterns faster. So what most companies are aiming at today is a core team that uses data science techniques to process all the structured and unstructured data, looks for patterns in it, and acts on them in real time.
It is similar to an iceberg floating on water. Most organizations see just the tip of the iceberg. For example, they know only how much they are selling.
The goal of a data scientist is to use data to discover signals that cause changes and that ultimately have an impact on the revenue of the firm. Even for a data scientist, it is humanly impossible to analyze big data. With the aid of a computer it can be done easily, yet a computer can only compute what has been programmed into it. So how do data scientists cope with a scenario in which analysis of the data requires the computer to pick up the ‘trends’ on its own? This is where machine learning comes in. Machine learning is an application of artificial intelligence that enables computing systems to perform tasks through a process of “self-learning”, without being specifically programmed for them. As data scientists cannot pinpoint exactly what sorts of patterns the computer should recognize, machine learning comes in extremely handy. It lets the computer adapt automatically to new patterns and signals in data, while “learning” or recognizing previous trends and data computations. When Google’s search bar “auto-completes” your query as you type, it is an example of machine learning: the Google server has learnt to give you ‘predictions’ of what you might want to search for, based on your previous search history. We will now familiarize ourselves with five techniques.
Segmentation involves breaking data into chunks based on shared characteristics. The analyst then picks the clusters through an iterative process, looking for uniqueness between segments. We could segment on a demographic, need or behavior basis, among others. The statistical techniques used for segmentation are K-means, hierarchical clustering and discriminant analysis, as shown in figure 1.7. Some business questions that are answered by segmentation are:
FIGURE 1.7 A real-life customer segmentation case study
FIGURE 1.8 Behavioral components considered for fleet card segmentation
FIGURE 1.9 Dimensions of fleet behavior measured and segmented
FIGURE 1.12 Segmentation in banking industry
The greater the similarity within a group, the better the cluster. The greater the dissimilarity between groups, the more distinct the clusters. One clustering technique is k-means, which separates data into the best-suited groups based on the information the algorithm already has. To produce effective data mining results, one first has to specify the number of clusters, K, to be created. Each cluster has a centre point called the centroid. Assigning every observation to the nearest centroid creates K clusters. One then recalculates the centroid of each cluster as the mean of the observations assigned to it. This becomes the new centroid, and the two steps are repeated until convergence, that is, until the assignments stop changing.
FIGURE 1.13 The mathematics behind clustering
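The two alternating k-means steps described above (assign each observation to its nearest centroid, then recompute each centroid as the mean of its members) can be written out by hand in R. The sketch below is for illustration only: the six observations are made up, and the function name simple_kmeans and its choice of the first K rows as starting centroids are assumptions of this example. In practice the built-in kmeans function, which uses random starts, would be used instead.

```r
# Six hypothetical 2-D observations that form two visible groups.
obs <- matrix(
  c(1.0, 1.0,
    1.5, 2.0,
    1.0, 0.5,
    8.0, 8.0,
    9.0, 9.5,
    8.5, 9.0),
  ncol = 2, byrow = TRUE
)

# A hand-written k-means loop (illustrative, not a replacement for kmeans()).
# Starting centroids are simply the rows listed in `init_rows`.
simple_kmeans <- function(x, k, init_rows = seq_len(k), max_iter = 100) {
  centroids <- x[init_rows, , drop = FALSE]
  assignment <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 1: squared distance from every observation to every centroid,
    # then assign each observation to the nearest centroid.
    sq_dist <- sapply(seq_len(k), function(j) {
      rowSums(sweep(x, 2, centroids[j, ])^2)
    })
    new_assignment <- max.col(-sq_dist, ties.method = "first")
    # Convergence: stop when no observation changes cluster.
    if (identical(new_assignment, assignment)) break
    assignment <- new_assignment
    # Step 2: recompute each centroid as the mean of its members.
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centroids)
}

fit <- simple_kmeans(obs, k = 2)
print(fit$cluster)   # cluster label for each observation
print(fit$centers)   # final centroid of each cluster
```

Both starting centroids deliberately lie in the same group here, so the first pass misassigns one observation and the second pass corrects it, which shows why the two steps have to be repeated.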
TECHNIQUE 2: UNSTRUCTURED TEXT MINING
The second technique is unstructured text mining. Here we use data to discover signals and process changes that create an impact. Let’s take a store manager as an example. A store manager gets a lot of feedback, much of which comes in as unstructured data. It is very important to process this into a structure. Text mining can process all this feedback and give a glimpse of what is called sentiment analysis: how many people like the store and how many don’t. Another example of text mining is in the health care domain. For example, a patient goes to a doctor and gets admitted to a hospital. He comes in contact with a doctor, a lab technician and a nurse. The doctor makes a record of the state of his health, the lab technician tests his blood and writes down some inferences, and the nurse regularly checks his vital parameters and makes notes on his condition. All this data can now be run through a text mining activity to triangulate the state of the patient by mashing up the three datasets: the doctor’s dataset, the lab technician’s dataset and the nurse’s dataset.
FIGURE 1.14 Real-world unstructured text mining in health care
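The store-feedback example can be illustrated with a very small lexicon-based sentiment count in base R. This is a sketch under stated assumptions: the comments and the two word lists are invented for the example, and real sentiment analysis relies on much larger lexicons or trained models.

```r
# Hypothetical free-text feedback received by a store manager.
feedback <- c(
  "Great staff and clean store, love the fresh produce",
  "Long queues and rude cashier, terrible experience",
  "Good prices but the parking is bad",
  "Friendly service, excellent bakery"
)

# Tiny illustrative word lists (assumptions, not a real sentiment lexicon).
positive_words <- c("great", "clean", "love", "fresh", "good",
                    "friendly", "excellent")
negative_words <- c("long", "rude", "terrible", "bad")

# Turn one comment into a score: positive word count minus negative count.
score_comment <- function(text) {
  cleaned <- tolower(gsub("[^[:alpha:] ]", " ", text))  # drop punctuation
  words <- strsplit(cleaned, " +")[[1]]
  sum(words %in% positive_words) - sum(words %in% negative_words)
}

scores <- vapply(feedback, score_comment, integer(1), USE.NAMES = FALSE)
sentiment <- ifelse(scores > 0, "positive",
                    ifelse(scores < 0, "negative", "neutral"))

# Structured summary of the unstructured comments.
print(data.frame(score = scores, sentiment = sentiment))
print(table(sentiment))
```

The point of the sketch is the change of form: free text goes in, and a table that a manager can count and track comes out.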
Another example of text mining is in the insurance domain. Auto insurance companies deal with a large number of claims every day, from collision damage, fire and theft damage, and accidental damage. A large amount of time and money is spent identifying fraudulent claims. One of the datasets used is historical claims, coverage and settlements. A text mining solution
Recommenders are one aspect of data analytics that most of us have come in contact with, in some form or other, in our day-to-day activities. When you buy items from a grocery store, you often get coupons to use on your next purchase. When you visit a salon, you could be offered a combo deal with a manicure or pedicure along with your haircut, rather than just a haircut. This generally happens because the company knows that people who buy one set of products, e.g., eggs and milk, could buy bread and cheese next, and hence the recommendation happens via coupons. In the past, newcomers to the salon have generally tried a manicure along with a haircut, so when you go for the first time the company markets both services to you.
FIGURE 1.16 Forecasting
FIGURE 1.17 Recommenders
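The grocery example (people who buy eggs and milk often go on to buy bread and cheese) can be sketched as a simple co-occurrence count in base R. The baskets below are invented, and the function name recommend is an assumption of this example; production recommenders use far more data and more refined methods such as association rules or collaborative filtering.

```r
# Hypothetical past shopping baskets.
baskets <- list(
  c("eggs", "milk", "bread"),
  c("eggs", "milk", "bread", "cheese"),
  c("eggs", "milk", "cheese"),
  c("milk", "bread"),
  c("eggs", "butter")
)

# Recommend the items that most often appear alongside the items already
# bought. The value returned for each item is the share of matching baskets
# that also contained it.
recommend <- function(baskets, owned, top_n = 2) {
  relevant <- Filter(function(b) all(owned %in% b), baskets)
  if (length(relevant) == 0) return(NULL)
  extras <- unlist(lapply(relevant, setdiff, y = owned))
  if (length(extras) == 0) return(NULL)
  share <- sort(table(extras) / length(relevant), decreasing = TRUE)
  head(share, top_n)
}

recs <- recommend(baskets, owned = c("eggs", "milk"))
print(recs)
```

A coupon for the top-ranked items is then the “recommendation” the customer sees at the till.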
Snapshot of Machine Learning Techniques
We have just seen a few machine learning techniques in the section above. Below is a graph that displays a snapshot of six techniques, with a brief idea of what falls under each. We will study machine learning as a whole topic in chapter 8.
FIGURE 1.18 Machine learning reference architecture
The reference architecture for any of these projects has three layers. The machine learning reference architecture is a design that can process raw data to provide a big picture that combines all the major and minor aspects of the data being analyzed. Real-time data analysis is carried out using an optimization process,
1. The Store Layer captures and stores data: one can store the data in Hadoop, Hive, HANA or any other database. More important than storing is what is done with the data and how signals are extracted from it, and that is where data science comes in.
2. The Sense Layer: in this layer a text-mining model or a scoring model is used to detect a pattern. It then mines the collected data for historical trends and patterns that act as reference points; this pattern is then monetized.
3. The Respond Layer: compares previous trends with the latest data collected, to predict an outcome and to recommend the next decision to be taken. This is where the analysis of the data is presented in relatable
2. To visualize the data we use the “ggplot” command (from the ggplot2 package). On the X axis we plot average purchase value and on the Y axis average items purchased, for two cities, Bangalore and Chennai.
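A plot of this kind might be produced as follows. This is a hedged sketch, not the book’s code: the data frame purchases and its column names are hypothetical, the values are simulated, and the plot is only drawn if the ggplot2 package is installed.

```r
# Simulated purchase data for the two cities (values are made up).
set.seed(1)
purchases <- data.frame(
  city                = rep(c("Bangalore", "Chennai"), each = 20),
  avg_purchase_value  = c(rnorm(20, 1500, 200), rnorm(20, 900, 150)),
  avg_items_purchased = c(rnorm(20, 12, 2), rnorm(20, 7, 1.5))
)

# Draw the scatter plot only when ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(purchases,
              aes(x = avg_purchase_value,
                  y = avg_items_purchased,
                  colour = city)) +
    geom_point(size = 2) +
    labs(x = "Average purchase value",
         y = "Average items purchased",
         colour = "City")
  print(p)
}
```

Mapping city to colour lets the two groups be compared on the same axes.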
on keyword usage frequencies in a visualization format, producing some accurate and some unexpected findings. At the foundation of each micro-genre is a broader genre, like romance, comedy or drama. From there Netflix has other advanced filters, such as location, time period and actors, to get to the precise category. Netflix uses algorithms to cater to its customer base: a human analyzes the films, and the result is then put through a Netflix process that brings out certain patterns. For example, micro-genres about friendship and marriage were among the most common. He also discovered that “drama” and “classic” were among Netflix’s most widely used adjectives for movies.
You will learn how to download R in chapter 2. Here you can have a quick look at some R commands specific to the clustering technique and its visualization.
1. To segment the dataset we use the “kmeans” command to find clusters. To view the cluster statistics we print the fitted object, here stored as “fit”.
3. Another way of visualizing is in 3D, as shown in the figure below, using the “scatterplot3d” command (from the scatterplot3d package).
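The clustering and 3D visualization commands named in these steps can be tried on simulated data as follows. This is a minimal sketch under stated assumptions: the data frame customers, its three columns and the choice of two clusters are hypothetical, and the 3D plot is only drawn if the scatterplot3d package is installed.

```r
# Simulated customer data with two underlying groups (values are made up).
set.seed(123)
customers <- data.frame(
  avg_purchase_value  = c(rnorm(25, 1500, 100), rnorm(25, 600, 100)),
  avg_items_purchased = c(rnorm(25, 12, 1), rnorm(25, 5, 1)),
  visits_per_month    = c(rnorm(25, 8, 1), rnorm(25, 3, 1))
)

# Scale the columns so that no single variable dominates the distances,
# then segment with kmeans; nstart = 25 tries several random starts.
scaled <- scale(customers)
fit <- kmeans(scaled, centers = 2, nstart = 25)

# Printing the fitted object shows cluster sizes, centres and assignments.
print(fit)

# Within-cluster similarity versus between-cluster separation.
cat("Total within-cluster sum of squares:", fit$tot.withinss, "\n")
cat("Between-cluster sum of squares:     ", fit$betweenss, "\n")

# 3D view of the clusters, drawn only when scatterplot3d is available.
if (requireNamespace("scatterplot3d", quietly = TRUE)) {
  scatterplot3d::scatterplot3d(
    x = customers$avg_purchase_value,
    y = customers$avg_items_purchased,
    z = customers$visits_per_month,
    color = c("steelblue", "tomato")[fit$cluster],
    pch = 16,
    xlab = "Average purchase value",
    ylab = "Average items purchased",
    zlab = "Visits per month",
    main = "Customer segments"
  )
}
```

A small within-cluster sum of squares alongside a large between-cluster sum of squares indicates tight, well-separated clusters.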