I- A few common concepts and terms in the big data world:
i. Relational database management system (RDBMS): Stores structured data in a predetermined schema (tables) and scales vertically through large SMP servers or horizontally through clustering software. These databases are usually easy to create, access, and extend. The standard language for relational database interoperability is the Structured Query Language (SQL).
ii. Non-relational database: A database that does not store data in tables but makes it accessible through special query APIs. Such databases are commonly called NoSQL ("Not Only SQL"). They do not impose a fixed schema, they follow the BASE model (basically available, soft state, eventually consistent), and they use sharding (horizontal partitioning) to scale horizontally. Examples are MongoDB and CouchDB, both document stores (MongoDB groups documents into collections, while CouchDB stores documents directly in a database). NoSQL databases commonly use the JavaScript Object Notation (JSON) data format (BSON, binary JSON, in MongoDB), and many work as a key-value store (KVS), i.e., a collection of values whose data types are not known in advance (while an RDBMS stores data in tables knowing exactly the data type).
iii. Programming language: A formally constructed language designed to communicate instructions to a machine. The main ones for data science applications are Java, C, C++, C#, R, and Matlab. Scala is another language that is becoming extremely popular, and it is an example of a functional language.
iv. Hadoop: Open source software for analyzing huge amounts of data on a distributed system. Its primary storage system is the Hadoop Distributed File System (HDFS), which replicates the data and distributes the copies across different nodes. It is written in Java.
It is a core technology in the big data revolution and stores data in its native raw format. It can be used for several purposes (Dull, 2014), such as a simple data staging or landing platform complementary to the existing EDW (as an enterprise data hub, i.e., EDH), or for managing data (even small data), transforming it into a specific format in HDFS and sending it back to the EDW, thus lowering costs while increasing processing power. Furthermore, it can integrate external data sources and archive data (either on-premises or in the cloud), reducing the burden on a standard EDW.
v. MapReduce: Software for processing huge amounts of data in parallel.
vi. Flume: A service to gather, aggregate, and move chunks of data from several sources to a centralized system.
vii. Cassandra: An open source database system for analyzing large amounts of data on a distributed system. It is characterized by high performance and high availability with no single point of failure (i.e., a part of the system that, if it fails, stops the whole system). It fosters data denormalization, which means grouping data or adding redundant information in order to optimize database performance.
viii. Distributed system: Multiple terminals communicating with each other. The problem is divided into many tasks, which are assigned to the terminals. It is a highly scalable system, since further nodes can be added.
ix. Google File System: A proprietary distributed file system for efficiently managing large datasets.
x. HBase: An open source, non-relational (column-oriented) database built on top of HDFS. It is very useful for real-time random read and write access to data, as well as for storing sparse data (small specific chunks of data within a vast amount of it). Its Google counterpart, on which it is modeled, is Bigtable.
xi. Enterprise Data Warehouse (EDW): A system used for analysis and reporting that consists of central repositories of integrated data from a wide spectrum of different sources. An EDW is typically fed through extract-transform-load (ETL), the most representative case of bulk data movement. Three other important examples of these systems are data marts (i.e., subsets of the EDW extracted in order to address a specific question), online analytical processing (OLAP), used for multidimensional low-frequency analytical queries, and online transaction processing (OLTP), used instead for high-volume, fast transactional data processing. The wider system that also includes a set of servers, storage, operating systems, databases, business intelligence, data mining, etc., is called a data warehouse appliance (DWA).
xii. Resilient Distributed Datasets (RDD): A logical collection of data partitioned across machines. The best-known example is Spark, an open source cluster computing framework designed to accelerate analytics on Hadoop thanks to its multi-stage in-memory primitives (that is, basic data types defined in programming languages or built in with their support). It is reported to run up to 100 times faster than Hadoop MapReduce, but its disadvantage is that it does not provide its own distributed storage system.
xiii. Hive: An additional example of EDW infrastructure that facilitates data summarization, ad hoc queries, and specific analyses.
xiv. Pig: A platform for processing huge amounts of data through a native programming language called Pig Latin. It can run sequences of MapReduce jobs at the same time.
xv. Scripting language: A programming language that supports scripts, which are pieces of code written for a run-time environment that interprets (rather than compiles) them and automates the execution of tasks.
The main ones in the big data field are Python, JavaScript, PHP, Perl, Ruby, and Visual Basic Script.
xvi. Data mart: A subset of the data warehouse used for a specific purpose. Data marts are therefore department-specific or related to a single line of business (LoB). The next level is the virtual data mart, i.e., a virtual layer that creates various views of data slices; in other words, instead of physically creating a data mart, it just takes a snapshot of the data. The final evolution is the data lake, a massive repository of unstructured data with very large computational capability. Hence, data marts physically create repositories (slices) of data, virtual data marts leave the data where it is and create virtual constructs (reducing the cost of transferring and replicating it), while data lakes work like virtual data marts but with any kind of data format.
II- Data Analysis
Data analysis is a process of collecting, transforming, cleaning, and modeling data with the goal of discovering the required information. The results are then communicated, suggesting conclusions and supporting decision-making. Data visualization is at times used to portray the data so that useful patterns are easier to discover. The terms data modeling and data analysis are used interchangeably here.
The data analysis process consists of the following phases, which are iterative in nature:
i. Data Requirements Specification: The data required for analysis is based on a question or an experiment. Based on the requirements of those directing the analysis, the data necessary as
III- Data visualization techniques
Data analysts can choose data visualization techniques, such as tables and charts, which help in communicating the message clearly and efficiently to the users. The analysis tools provide facilities to highlight the required information with color codes and formatting in tables and charts.
The requirement analysis is primarily grouped into five elements, namely:
- Functional requirements
- Non-functional requirements
- BI and analytics use cases
- Data exploration
- Agile
1. Functional requirements – These are the requirements for the big data solution to be developed, including all the functional features, business rules, system capabilities, and processes, along with assumptions and constraints. Although the functional requirements contain detailed information, they lack a 360-degree view. For example, suppose a channel management dashboard should be generated every day. When the requirements are collated for the channel dashboard, they may fail to look at all aspects of channel management, resulting in a partial analysis.
2. Non-functional requirements – These define how the developed system should work. Apart from usability, reliability, performance, and supportability, there are many other aspects that the solution should consider and take care of. Some of the important requirements are:
a. Security – Multiple levels of security, such as firewalls, network isolation, user authentication, encryption at rest using keys, encryption of data in transit using SSL, end-user training, intrusion protection, and intrusion detection systems (IDS), are key requirements for many modern data lakes.
b. Compliance – As big data solutions mature, various industry-standard compliances and regulations are taking center stage. The ever-increasing tangle of standards, rules, regulations, and contractual obligations multiplies the risk of non-compliance. Regulations like HIPAA (healthcare) and GDPR (European Union) protect customer privacy, while other regulations mandate that organizations keep track of customers’ information for a variety of reasons, such as fraud prevention. Complying with older laws may therefore end up in conflict with, or in violation of, newer ones.
c. Cloud platform – The selection of a cloud platform is specific to each organization; however, adherence to compliance and regulations, security, data governance, technology footprint, roadmap and partnership, migration supportability, regional availability of components and services, and cost are the prime factors when selecting a cloud service provider.
d. Self-serve data prep – An emerging concept that enables business users, analysts, or data scientists to analyze and prepare datasets so that they can be used further without relying on data specialists.
e. Latency, volume, and archival – Latency is defined as how long it takes for a business user to get data from the application into the data lake/data mart. Data volume is how much daily data is extracted from a source application to the data lake. It also covers the historical data that is required in the data lake/data mart to cater to the data needs of business users. Older data that is infrequently used needs to be taken out of the data marts/data lake. This process of periodically extracting data from the data mart/data lake to low-cost storage is part of data archival.
While we focus on functional and non-functional requirements, there are other important facets that define the success of a big data engagement. BI use cases and analytics patterns are the game changers and act as a nucleus that ensures the big data engagement is fully accepted by the business community and that there are no surprises while it is being implemented.
3. Use case – These are grouped into 2 categories namely BI and Analytics use cases, depending on the requirements. 3.1 BI Use-case – A use-case defines the action to achieve a particular goal along with the required features so that the particular KPIs can be defined and tracked. This section starts where the functional requirements end. It covers the detailed view of functional requirements by enlisting all the use cases whether they are used or not in the engagement. By doing so, we will end up listing all the use cases by creating a complete 360-degree view of the solution. Further, a use case is divided into multiple subsections and each subsection has its own detailed analysis. For example, as part of insurance, channel management is one of the popular use cases that many BI applications offer. The channel management has sub-sections like; - Sales from various channels for specific products - Sales behavior of sales associates, agents, and partners - Impact of rewards on various sales associates and partners - Partner retention strategy - Claims by each of the channel - Revisiting the product strategy based on the business expansion and underwriting processes which is based on the claim’s ratios…many more These granular requirements for each of the use cases ensure that there are no gaps in understanding the use-case and its patterns. If any of these are missed during requirements and taken up at a later part of the program, they may derail the schedule and result in cost overrun. 3.2 Analytics Use case: The first step for an analytics model is the identification of business use cases. These use cases are different from BI use cases focusing primarily on analytical needs. Building a requirements model to specify a use case at the beginning of analytics is the key aspect. 
It means, just defining the use case is not enough as there is a need to explore these use cases with the following critical items; - Business objective and measure - Characteristics like business processes, relationships, and dependencies - Selection and preparation of the data - Data validation To illustrate, product optimization and pricing are some of the popular use cases in insurance. Its business objective is to build and optimize a product that is best suited for dynamic and risky market conditions i.e., at what price the product with its features can be sold.
As per Bill Wake’s INVEST model, these user stories should be independent, negotiable, valuable, estimable, small and testable so that these can be modularized for effective implementation. The user stories from product backlog are prioritized before being added to sprint backlog during the sprint planning and “burned down” over the duration of the sprint. Also, the dependencies, story points, the capacity of the team, productivity and timeliness are discussed during the sprint planning. Finally, after the implementation of all the stories and sprints, the backlog completion will be flagged as completed. IV- The three primary sources of Big Data Social data comes from the Likes, Tweets & Retweets, Comments, Video Uploads, and general media that are uploaded and shared via the world’s favorite social media platforms. This kind of data provides invaluable insights into consumer behavior and sentiment and can be enormously influential in marketing analytics. The public web is another good source of social data, and tools like Google Trends can be used to good effect to increase the volume of big data. Machine data is defined as information which is generated by industrial equipment, sensors that are installed in machinery, and even web logs which track user behavior. This type of data is expected to grow exponentially as the internet of things grows ever more pervasive and expands around the world. Sensors such as medical devices, smart meters, road cameras, satellites, games and the rapidly growing Internet Of Things will deliver high velocity, value, volume and variety of data in the very near future. Transactional data is generated from all the daily transactions that take place both online and offline. Invoices, payment orders, storage records, delivery receipts – all are characterized as transactional data yet data alone is almost meaningless, and most organizations struggle to make sense of the data that they are generating and how it can be put to good use. 
V- Data sampling Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative subset of data points to identify patterns and trends in the larger data set being examined. It enables data scientists, predictive modelers and other data analysts to work with a small, manageable amount of data about a statistical population to build and run analytical models more quickly, while still producing accurate findings. Advantages and challenges of data sampling Sampling can be particularly useful with data sets that are too large to efficiently analyze in full -- for example, in big data analytics applications or surveys. Identifying and analyzing a representative sample is more efficient and cost-effective than surveying the entirety of the data or population. An important consideration, though, is the size of the required data sample and the possibility of introducing a sampling error. In some cases, a small sample can reveal the most important information about a data set. In others, using a larger sample can increase the likelihood of accurately representing the data as a whole, even though the increased size of the sample may impede ease of manipulation and interpretation. Types of data sampling methods There are many different methods for drawing samples from data; the ideal one depends on the data set and situation.
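As a rough illustration of what drawing a sample can look like in practice, the sketch below implements three common probability-based approaches: simple random, stratified, and systematic sampling. It uses only the Python standard library. The function names, the `region` field, and the 200-row data set are assumptions made up for this example; they do not come from any particular tool.

```python
import random
from collections import defaultdict

def simple_random_sample(rows, k, seed=0):
    """Pick k rows uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, k)

def stratified_sample(rows, key, fraction, seed=0):
    """Randomly sample the same fraction from each subgroup defined by key(row)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))  # at least one row per subgroup
        sample.extend(rng.sample(members, k))
    return sample

def systematic_sample(rows, interval, start=0):
    """Take every `interval`-th row, beginning at index `start`."""
    return rows[start::interval]

# Hypothetical data set: 200 rows, each tagged with a made-up region.
rows = [{"id": i, "region": "north" if i % 4 == 0 else "south"} for i in range(200)]

random_rows = simple_random_sample(rows, 20)
strat_rows = stratified_sample(rows, key=lambda r: r["region"], fraction=0.1)
system_rows = systematic_sample(rows, interval=10)  # every 10th of 200 rows
```

Fixing the seed keeps the random draws reproducible. The stratified version keeps each subgroup's share of the sample equal to its share of the population, which a simple random draw does not guarantee.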
Sampling can be based on probability, an approach that uses random numbers that correspond to points in the data set to ensure that there is no correlation between points chosen for the sample. Further variations in probability sampling include:
- Simple random sampling: Software is used to randomly select subjects from the whole population.
- Stratified sampling: Subsets of the data sets or population are created based on a common factor, and samples are randomly collected from each subgroup.
- Cluster sampling: The larger data set is divided into subsets (clusters) based on a defined factor, then a random sampling of clusters is analyzed.
- Multistage sampling: A more complicated form of cluster sampling, this method also involves dividing the larger population into a number of clusters. Second-stage clusters are then broken out based on a secondary factor, and those clusters are then sampled and analyzed. This staging could continue as multiple subsets are identified, clustered and analyzed.
- Systematic sampling: A sample is created by setting an interval at which to extract data from the larger population -- for example, selecting every 10th row in a spreadsheet of 200 items to create a sample size of 20 rows to analyze.
Sampling can also be based on nonprobability, an approach in which a data sample is determined and extracted based on the judgment of the analyst. As inclusion is determined by the analyst, it can be more difficult to extrapolate whether the sample accurately represents the larger population than when probability sampling is used. Nonprobability data sampling methods include:
- Convenience sampling: Data is collected from an easily accessible and available group.
- Consecutive sampling: Data is collected from every subject that meets the criteria until the predetermined sample size is met.
- Purposive or judgmental sampling: The researcher selects the data to sample based on predefined criteria.
- Quota sampling: The researcher ensures equal representation within the sample for all subgroups in the data set or population.
Once generated, a sample can be used for predictive analytics. For example, a retail business might use data sampling to uncover patterns about customer behavior and predictive modeling to create more effective sales strategies.
VI- Types and Characteristics of Data
The “five V’s of big data” and their impact on the business are volume, velocity, variety, veracity and value.
Volume: If we see big data as a pyramid, volume is the base. The volume of data that companies manage skyrocketed around 2012, when they began collecting more than three million pieces of data every day. “Since then, this volume doubles about every 40 months.”
Velocity: In addition to managing data, companies need that information to flow quickly, as close to real time as possible. So much so that the MetLife executive stressed that: “Velocity can be more important than volume because it can give us a bigger competitive advantage. Sometimes it’s better to have limited data in real time than lots of data at a low speed.”
Variety: A company can obtain data from many different sources: from in-house devices to smartphone GPS technology or what people are saying on social networks. The importance of these sources of information varies depending on the nature of the business. For example, a mass-market service or product should be more aware of social networks than an industrial business.
Veracity: The fourth V is veracity, which in this context is equivalent to quality. We have all the data, but could we be missing something? Are the data “clean” and accurate? Do they really have something to offer?
Value: Finally, the V for value sits at the top of the big data pyramid. This refers to the ability to transform a tsunami of data into business.
VIII- Missing values Missing values are the Achilles’s heel for a data scientist. If not handled properly, the entire analysis will be futile and provide misleading results which could potentially harm the business stakeholders. Types of Missing Data:
- D.B Rubin (1976) classified missing data problems into three categories. In his theory every data point has some likelihood of being missing. The process that governs these probabilities is called the ‘missing data mechanism’ or ‘response mechanism’. The model for the process is called the ‘missing data model’ or ‘response model’.
- Rubin’s distinction sets the conditions under which a missing data handling method can provide valid statistical inferences.
Missing Completely at Random (MCAR)
If the probability of being missing is the same for all cases, then the data are said to be missing completely at random (MCAR). This effectively implies that the causes of the missing data are unrelated to the data. Apart from the obvious loss of information, it is then safe to ignore many of the complexities that arise because of the missing data. Most simple fixes only work under the restrictive and often unrealistic MCAR assumption.
Example: Suppose you estimate the gross annual income of households within a certain population using questionnaires. In the case of MCAR, the missingness is completely random, as if some questionnaires were lost by mistake.
Missing at Random (MAR)
If the probability of being missing is the same only within groups defined by the observed data, then the data are missing at random (MAR). MAR is more general and more realistic than MCAR. Modern missing data methods generally start from the MAR assumption.
Example: Suppose some household income information is missing. In the case of MAR, the missingness is random within subgroups of other observed variables. For instance, suppose you also collected data on the profession of each subject in the questionnaire and deduce that managers, VIPs, etc. are more likely not to share their income; then, within subgroups of the profession, missingness is random.
Missing Not at Random (MNAR)
If neither MCAR nor MAR holds, then we speak of missing not at random (MNAR). In the literature one can also find the term NMAR (not missing at random) for the same concept. MNAR means that the probability of being missing varies for reasons that are unknown to us. MNAR includes, for example, the possibility that a weighing scale produces more missing values for heavier objects, a situation that might be difficult to recognize and handle.
An example of MNAR in public opinion research occurs if those with weaker opinions respond less often. MNAR is the most complex case. Strategies to handle MNAR are to find more data about the causes of the missingness, or to perform what-if analyses to see how sensitive the results are under various scenarios.
Example: In the case of MNAR, the reason for missingness depends on the missing values themselves. For instance, suppose people don’t want to share their income because it is low and they are ashamed of it.
Ways to Handle Missing Values
When it comes to handling missing values, you can take the easy way or you can take the professional way.
The Easy Way:
- Ignore tuples with missing values: This approach is suitable only when the dataset is quite large and multiple values are missing within a tuple.
- It is an option only if the tuples containing missing values make up about 2% of the data or less. Works with MCAR.
- Drop missing values: Only ideal if you can afford to lose a bit of data.
- It is an option only if the missing values make up 2% of the whole dataset or less. Do not use this as your first approach.
- Leave it to the algorithm: Some algorithms can factor in the missing values and learn the best imputation values for the missing data based on the training loss reduction (e.g., XGBoost). Others have the option to just ignore them (e.g., LightGBM with use_missing=false). However, other algorithms throw an error about the missing values (e.g., scikit-learn’s LinearRegression).
- It is an option only if the missing values are about 5% or less. Works with MCAR.
The Professional Way:
The drawback of dropping missing values is that you lose an entire row just for a few missing values. That is a lot of valuable data. So instead of dropping the missing values, or ignoring whole tuples, try filling in the missing values with a well-calculated estimate. Professionals use two main methods of calculating missing values: imputation and interpolation.
Imputation
Imputation fills in a missing value with an estimate, such as the mean or median, computed from the other observed values in the dataset. The relationship of the data need not be linear.
Types of Imputation
Easy Imputations
Mean/Median Imputation, a.k.a. Constant Values Imputation
- Calculate the mean (or median) of the observed, non-missing values of the variable and use it to fill in the missing ones. Mean imputation has the advantage of keeping the same mean and the same sample size.
Advantages:
- Quick and easy
- Ideal for small numerical datasets
Disadvantages:
- Doesn’t factor in the correlations between features. It only works on the column level.
- Will give poor results on encoded categorical features (do NOT use it on categorical features).
- Not very accurate.
- Doesn’t account for the uncertainty in the imputations.
Most Frequent (Values) Imputation
- Most Frequent is another statistical strategy to impute missing values and works with categorical features (strings or numerical representations) by replacing missing data with the most frequent values within each column.
Advantages:
- Works well with categorical features.
Disadvantages:
- It also doesn’t factor in the correlations between features.
- It can introduce bias in the data.
Zeros Imputation
- It replaces the missing values with either zero or any constant value you specify.
- Perfect for when the null value does not add value to your analysis but requires an integer in order to produce results.
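The easy imputation strategies above (mean/median, most frequent, and zero or constant) can be sketched in a few lines of standard-library Python. This is only an illustrative sketch: the `impute` helper, its `strategy` names, and the sample columns are assumptions invented here, and missing entries are assumed to be represented as `None`.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean", fill_value=0):
    """Return a copy of `values` with None entries replaced.

    strategy: "mean", "median", "most_frequent", or "constant".
    """
    observed = [v for v in values if v is not None]  # non-missing entries only
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "most_frequent":
        fill = mode(observed)
    elif strategy == "constant":
        fill = fill_value
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

ages = [25, None, 31, 40, None, 24]          # hypothetical numeric column
colors = ["red", "blue", None, "red", None]  # hypothetical categorical column

ages_mean = impute(ages, "mean")
ages_median = impute(ages, "median")
colors_filled = impute(colors, "most_frequent")
ages_zero = impute(ages, "constant", fill_value=0)
```

Each strategy looks at a single column at a time, which is why none of them can account for correlations between features.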
Outliers can also come in different flavours, depending on the environment: point outliers, contextual outliers, or collective outliers. Point outliers are single data points that lie far from the rest of the distribution. Contextual outliers can be noise in data, such as punctuation symbols when performing text analysis or background noise when doing speech recognition. Collective outliers can be subsets of novelties in data, such as a signal that may indicate the discovery of new phenomena (as in figure B).
Most common causes of outliers in a data set:
- Data entry errors (human errors)
- Measurement errors (instrument errors)
- Experimental errors (data extraction or experiment planning/executing errors)
- Intentional (dummy outliers made to test detection methods)
- Data processing errors (data manipulation or data set unintended mutations)
- Sampling errors (extracting or mixing data from wrong or various sources)
- Natural (not an error, novelties in data)
In the process of producing, collecting, processing, and analyzing data, outliers can come from many sources and hide in many dimensions. Those that are not the product of an error are called novelties. Detecting outliers is of major importance for almost any quantitative discipline (e.g., physics, economics, finance, machine learning, cyber security). In machine learning and in any quantitative discipline, the quality of data is as important as the quality of a prediction or classification model.
When trying to detect outliers in a dataset, it is very important to keep in mind the context and try to answer the question: “Why do I want to detect outliers?” The meaning of your findings will be dictated by the context. Also, when starting an outlier detection quest, you have to answer two important questions about your dataset:
- Which and how many features am I taking into account to detect outliers? (univariate / multivariate)
- Can I assume a distribution(s) of values for my selected features? (parametric / non-parametric)
Some of the most popular methods for outlier detection are:
- Z-Score or Extreme Value Analysis (parametric)
- Probabilistic and Statistical Modeling (parametric)
- Linear Regression Models (PCA, LMS)
- Proximity Based Models (non-parametric)
- Information Theory Models
- High Dimensional Outlier Detection Methods (high dimensional sparse data)
Z-Score
The z-score, or standard score, of an observation is a metric that indicates how many standard deviations a data point is from the sample’s mean, assuming a Gaussian distribution. This makes the z-score a parametric method. Very frequently, data points are not well described by a Gaussian distribution; this problem can sometimes be addressed by applying transformations to the data, e.g., scaling it. By tagging or removing the data points that lie beyond a given threshold, we classify data into outliers and non-outliers. The z-score is a simple yet powerful method to get rid of outliers if you are dealing with parametric distributions in a low-dimensional feature space. For nonparametric problems, Dbscan and Isolation Forests can be good solutions.
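A minimal sketch of z-score tagging, using only the standard library. The threshold of 3 standard deviations is a common convention assumed here, not something fixed by the method, and the `readings` list is made-up data.

```python
from statistics import mean, stdev

def z_scores(values):
    """Number of sample standard deviations each value lies from the mean."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return [(v - mu) / sigma for v in values]

def flag_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Hypothetical sensor readings with one obviously extreme value.
readings = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 9.7, 10.0, 10.1, 9.9, 10.0, 25.0]
outliers = flag_outliers(readings)
```

Note that an extreme value inflates both the mean and the standard deviation it is measured against, so in very small samples even a gross outlier may stay below the threshold.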
Dbscan (Density-Based Spatial Clustering of Applications with Noise)
In machine learning and data analytics, clustering methods are useful tools that help us visualize and understand data better. Relationships between features, trends, and populations in a data set can be graphically represented via clustering methods like Dbscan, which can also be applied to detect outliers in nonparametric distributions in many dimensions. Dbscan is a density-based clustering algorithm: it finds neighbors by density (MinPts) within an n-dimensional sphere of radius ɛ. A cluster can be defined as the maximal set of density-connected points in the feature space. Dbscan then defines different classes of points:
- Core point: A is a core point if its neighborhood (defined by ɛ) contains at least MinPts points.
- Border point: C is a border point if it lies in a cluster and its neighborhood contains fewer than MinPts points, but it is still density-reachable from other points in the cluster.
- Outlier: N is an outlier if it lies in no cluster and is neither density-reachable from nor density-connected to any other point. Thus this point will have “its own cluster”.
Isolation Forests
For Isolation Forests, an outlier score can be computed for each observation:
$$s(x, n) = 2^{-E(h(x)) / c(n)}$$
where h(x) is the path length of the sample x, E(h(x)) is its average over the trees in the forest, c(n) is the average path length of an unsuccessful search in a binary search tree, and n is the number of external nodes. Each observation thus receives a score ranging from 0 to 1, with 1 meaning more outlyingness and 0 meaning more normality. A threshold can then be specified (e.g., 0.55 or 0.60).
Tip: In scikit-learn, the score is shifted by 0.5 and reversed, so it returns values from -0.5 to 0.5; bigger is less abnormal, and smaller is more abnormal.
Conclusions:
Z-Score pros:
- It is a very effective method if you can describe the values in the feature space with a gaussian distribution. (Parametric)
- The implementation is very easy using pandas and scipy.stats libraries. Z-Score cons:
- It is only convenient to use in a low dimensional feature space, in a small to medium sized dataset.
- Is not recommended when distributions can not be assumed to be parametric. Dbscan pros:
- It is a very effective method when the distribution of values in the feature space cannot be assumed.
- It works well if the feature space for searching outliers is multidimensional (i.e., 3 or more dimensions).
- Scikit-learn’s implementation is easy to use and the documentation is superb.
- Visualizing the results is easy and the method itself is very intuitive.
Dbscan cons:
- The values in the feature space need to be scaled accordingly.
- Selecting the optimal parameters eps, MinPts and metric can be difficult, since the method is very sensitive to all three parameters.
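The core, border and outlier definitions given for Dbscan above can be illustrated with a minimal pure-Python sketch. It is not Scikit-learn's implementation: the function name `classify_points`, the sample points, and the choice of Euclidean distance are assumptions made for illustration, and the neighborhood count is assumed to include the point itself.

```python
import math


def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border' or 'outlier' (hypothetical helper)."""

    def neighbors(i):
        # Indices of all points within distance eps of point i (itself included).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    neighborhoods = [neighbors(i) for i in range(len(points))]
    # A core point has at least min_pts points in its eps-neighborhood.
    is_core = [len(nb) >= min_pts for nb in neighborhoods]

    labels = []
    for i, nb in enumerate(neighborhoods):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in nb):
            # Not dense enough itself, but inside the neighborhood of a core point.
            labels.append("border")
        else:
            labels.append("outlier")
    return labels


# Made-up 2-D data: a small square, one point hanging off it, one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (8.0, 8.0)]
labels = classify_points(points, eps=1.1, min_pts=3)
for point, label in zip(points, labels):
    print(point, label)
```

Changing `eps` or `min_pts` slightly can flip labels, which is the parameter sensitivity listed among the Dbscan cons. The features should also be scaled first, since the distance treats all dimensions alike.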
XI- Big data is classified in three ways:
◦ Structured Data
◦ Unstructured Data
◦ Semi-Structured Data
- These three terms, while technically applicable at all levels of analytics, are paramount in big data. Understanding where the raw data comes from and how it has to be treated before analyzing it only becomes more important when working with the volume of big data. Because there’s so much of it, information extraction needs to be efficient to make the endeavor worthwhile.
- The structure of the data is the key to not only how to go about working with it, but also what insights it can produce. All data goes through a process called extract, transform, load (ETL) before it can be analyzed. It’s a very literal term: data is harvested, formatted to be readable by an application, and then stored for use. The ETL process for each structure of data varies.
Structured Data
- Structured data is the easiest to work with. It is highly organized with dimensions defined by set parameters.
- Think spreadsheets; every piece of information is grouped into rows and columns. Specific elements defined by certain variables are easily discoverable. It’s all your quantitative data: age, billing, contact, address, expenses, debit/credit card numbers.
- Because structured data is already tangible numbers, it’s much easier for a program to sort through and collect data.
- Structured data follows schemas: essentially road maps to specific data points. These schemas outline where each datum is and what it means.
- A payroll database will lay out employee identification information, pay rate, hours worked, how compensation is delivered, etc. The schema will define each one of these dimensions for whatever application is using it. The program won’t have to dig into the data to discover what it actually means; it can go straight to work collecting and processing it.
Working With It
- Structured data is the easiest type of data to analyze because it requires little to no preparation before processing. A user might need to cleanse data and pare it down to only relevant points, but it won’t need to be interpreted or converted too deeply before a true inquiry can be performed.
- One of the major perks of using structured data is the streamlined process of merging enterprise data with relational data. Because pertinent data dimensions are usually defined and specific elements are in a uniform format, very little preparation needs to be done to make all sources compatible.
- The ETL process for structured data stores the finished product in what is called a data warehouse. These databases are highly structured and filtered for the specific analytics purpose the initial data was harvested for.
- Relational databases are easily-queried datasets. They allow users to find external information and either study it standalone or integrate it with their internal data for more context. Relational database management systems use SQL, or Structured Query Language, to access data, providing a uniform language across a network of data platforms and sources.
- This standardization enables scalability in data processing. Time spent on defining data sources and making them cooperate with each other is reduced, expediting the delivery of actionable insight.
- The quantitative nature and readability of this classification also grant compatibility with almost any relevant source of information. The amount of data used is limited only by what the user can get their hands on.
- Unfortunately, there’s only so much structured data available, and it denotes a slim minority of all data in existence.
Unstructured Data
- Not all data is as neatly packed and sorted, with instructions on how to use it, as structured data is. The consensus is that no more than 20% of all data is structured.
- So what’s the remaining four-fifths of all the information out there? Since it isn’t structured, we naturally call this unstructured data. Unstructured data is all your unorganized data.
- You might be able to figure out why it constitutes so much of the modern data library. Almost everything you do with a computer generates unstructured data. No one is transcribing their phone calls or assigning semantic tags to every tweet they send.
- While structured data saves time in an analytical process, taking the time and effort to give unstructured data some level of readability is cumbersome.
- For structured data, the ETL process is very simple. It is simply cleansed and validated in the transform stage before loading into a database. But for unstructured data, that second step is much more complicated.
- To gain anything resembling useful information, the dataset needs to be interpretable. But the effort can be much more rewarding than processing unstructured data’s simpler counterpart. As they say in sports, you get out what you put in.
Working With It
- The hardest part of analyzing unstructured data is teaching an application to understand the information it’s extracting. More often than not, this means translating it into some form of structured data.
- This isn’t easy and the specifics of how it is done vary from format to format and with the end goal of the analytics. Methods like text parsing, natural language processing and developing content hierarchies via taxonomy are common.
- Almost universally, it involves a complex algorithm blending the processes of scanning, interpreting and contextualizing functions.
- This brings us to an important point: context is nearly as important as, if not just as important as, the information wrung out of the data. Alissa Lorentz, then the vice president of creative, marketing and design at Augify, explained in a guest article for Wired: a query on an unstructured data set might yield the number 31, but without context it’s meaningless. It could be “the number of days in a month, the amount of dollars a stock increased…, or the number of items sold today.”
- The contextual aspect is what makes unstructured data ubiquitous in big data: merging internal data with external context makes it more meaningful. The more context (and data in general), the more accurate any sort of model or analysis is.
- This context can be created from unstructured datasets, like NoSQL databases, or human dictation. We can tell applications and AI what data means. In fact, you’ve probably been
Subtypes of Data
- Though not formally considered big data, there are subtypes of data that hold some level of pertinence to the field of analytics. Often, these refer to the origin of the data, such as geospatial (locational), machine (operational logging), social media or event-triggered.
XII- What is Weight of Evidence (WOE)?
The weight of evidence tells the predictive power of an independent variable in relation to the dependent variable. Since it evolved from the credit scoring world, it is generally described as a measure of the separation of good and bad customers. "Bad Customers" refers to the customers who defaulted on a loan, and "Good Customers" refers to the customers who paid back the loan.
$$\text{WOE} = \ln\left(\frac{\text{Distribution of Goods}}{\text{Distribution of Bads}}\right)$$
Distribution of Goods - % of Good Customers in a particular group
Distribution of Bads - % of Bad Customers in a particular group
ln - Natural Log
Positive WOE means Distribution of Goods > Distribution of Bads
Negative WOE means Distribution of Goods < Distribution of Bads
Hint: The log of a number greater than 1 is positive; the log of a number less than 1 is negative.
Many people do not understand the terms goods/bads because they come from a background other than credit risk. It helps to understand the concept of WOE in terms of events and non-events. It is calculated by taking the natural logarithm (log to base e) of the ratio of % of non-events to % of events.
$$\text{WOE} = \ln\left(\frac{\%\text{ of non-events}}{\%\text{ of events}}\right)$$
Steps of Calculating WOE
1. For a continuous variable, split the data into 10 parts (or fewer, depending on the distribution).
2. Calculate the number of events and non-events in each group (bin).
3. Calculate the % of events and % of non-events in each group.
4. Calculate WOE by taking the natural log of the ratio of % of non-events to % of events.
Note: For a categorical variable, you do not need to split the data (ignore Step 1 and follow the remaining steps).
Terminologies related to WOE
1. Fine Classing: Create 10 or 20 bins/groups for a continuous independent variable and then calculate the WOE and IV of the variable.
2. Coarse Classing: Combine adjacent categories with similar WOE scores.
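The calculation steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `weight_of_evidence` and the bin counts are hypothetical, and "% of events" is assumed to mean the share of all events that fall into a given bin (likewise for non-events). The binning itself (Step 1) is assumed to be done already.

```python
import math


def weight_of_evidence(bins):
    """Compute WOE per bin from {label: (events, non_events)} (hypothetical helper)."""
    total_events = sum(events for events, _ in bins.values())
    total_non_events = sum(non_events for _, non_events in bins.values())

    woe = {}
    for label, (events, non_events) in bins.items():
        if events == 0 or non_events == 0:
            # The log is undefined when either count in a bin is zero.
            raise ValueError(f"bin {label!r} has zero events or zero non-events")
        pct_events = events / total_events              # % of events in this bin
        pct_non_events = non_events / total_non_events  # % of non-events in this bin
        woe[label] = math.log(pct_non_events / pct_events)
    return woe


# Made-up counts per bin: (events, non_events)
bins = {"low": (10, 100), "medium": (20, 100), "high": (70, 200)}
woe = weight_of_evidence(bins)
for label, value in woe.items():
    print(f"{label}: {value:.3f}")
```

With these counts the "low" and "medium" bins get a positive WOE (their share of non-events exceeds their share of events) and the "high" bin gets a negative one.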
Usage of WOE
Weight of Evidence (WOE) helps to transform a continuous independent variable into a set of groups or bins based on the similarity of the dependent variable distribution, i.e. the number of events and non-events.
For continuous independent variables: First, create bins (categories / groups) for a continuous independent variable, then combine categories with similar WOE values and replace the categories with WOE values.
For categorical independent variables: Combine categories with similar WOE and then create new categories of the independent variable with continuous WOE values. In other words, use WOE values rather than raw categories in your model. The transformed variable will be a continuous variable with WOE values. It is the same as any continuous variable.
Why combine categories with similar WOE?
It is because categories with similar WOE have almost the same proportion of events and non-events. In other words, the behavior of both categories is the same.
Rules related to WOE
- Each category (bin) should have at least 5% of the observations.
- Each category (bin) should be non-zero for both non-events and events.
- The WOE should be distinct for each category. Similar groups should be aggregated.
- The WOE should be monotonic, i.e. either growing or decreasing with the groupings.
- Missing values are binned separately.
XIII- Variable Selection
- Variable selection is a collection of candidate model variables tested for significance during model training. Candidate model variables are also known as independent variables, predictors, attributes, model factors, covariates, regressors, features, or characteristics.
- Variable selection is a parsimonious process that aims to identify a minimal set of predictors for maximum gain (predictive accuracy). This approach is the opposite of data preparation, where as many meaningful variables as possible are added to the mining view. These opposing requirements are achieved using optimization; that is, finding the minimal selection bias under the given constraints.
- The key objective is to find the right set of variables so that the scorecard model can not only rank customers by their likelihood of bad debt but also estimate the probability of their bad debt. This usually means selecting statistically significant variables in the predictive model and having a balanced set of predictors (usually 8-15 is considered a good balance) to converge to a 360-degree customer view. In addition to customer-specific risk characteristics, we should also consider including systematic risk factors to account for economic drifts and volatilities.
- Easier said than done; when selecting variables, there are a number of limitations. First, the model will usually contain some highly predictive variables whose use is prohibited by legal, ethical or regulatory rules. Second, some variables might not be available or might be of poor quality during the modeling or production stages. In addition, there might be important variables that have not been recognized as such, for example, because of a biased population sample, or because their model effect would be counter-intuitive as a result of multicollinearity. And finally, the business will always have the last word and might insist that only business-sound variables are included or request monotonically increasing or decreasing effects.
- All of these constraints are potential sources of bias, which gives the data scientists a challenging task to minimise the selection bias. Typical preventive measures during variable selection include:
- Collaboration with experts in the field to identify the important variables.
- Awareness of any problems in relation to data source, reliability or mismeasurement.