

Brief summary of EDA showcasing the steps required. Contains a definition of EDA and its importance in data analysis, with a description of each step in the protocol.
Typology: Cheat Sheet


Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an essential step in the data analysis process. It involves examining, summarizing, and visualizing the main characteristics of a dataset to extract insights and identify patterns.

Fundamentals of EDA include:

Collection: Data may come from various sources such as surveys, web scraping, or sensors.
Cleaning: Once the data has been collected, it needs to be cleaned. This involves removing any errors, duplicates, or outliers from the data.
Exploration: The next step is to explore the data by computing summary statistics such as mean, median, mode, standard deviation, variance, and correlation.
Visualization: Data visualization is an essential part of EDA. It involves creating graphs, charts, and histograms to help understand the distribution of the data.
Hypothesis Testing: Once the data has been explored and visualized, the next step is to test hypotheses. This involves using statistical tests to determine whether there is a significant relationship between variables.
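The summary statistics named under Exploration can be sketched with pandas. The DataFrame and its column names (`hours_studied`, `exam_score`) are invented for illustration and are not part of the original notes.

```python
import pandas as pd

# Hypothetical toy data; the column names and values are assumptions.
df = pd.DataFrame({
    "hours_studied": [1, 2, 2, 3, 4, 5],
    "exam_score": [52, 55, 60, 60, 70, 80],
})

scores = df["exam_score"]

summary = {
    "mean": scores.mean(),
    "median": scores.median(),
    "mode": scores.mode().tolist(),  # mode() returns a Series; ties give several values
    "std": scores.std(),             # sample standard deviation (ddof=1)
    "variance": scores.var(),        # sample variance (ddof=1)
}
print(summary)

# Pearson correlation between the two numerical columns
corr = df["hours_studied"].corr(df["exam_score"])
print(corr)
```

Note that pandas computes the sample (not population) standard deviation and variance by default.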
1. What is EDA?
The process of reviewing and cleaning data to derive insights (such as descriptive statistics and correlation) and generate hypotheses for experiments.

2. Describe the steps for initial EDA.
   1. View the data: call .head().
   2. Gather more information: call .info() to examine missing values, data types, and memory usage.
   3. Categorical columns of interest: call .value_counts().
   4. Numerical columns of interest: call .describe().
   5. Explore the distribution of numerical data with histograms.

3. Why is data validation important?
Analysis is only reliable when the data types are correctly specified and validated.

4. How to validate data types?
The .info() method gives an overview of the data type of each column in the DataFrame. If a column has the wrong data type, change it by calling .astype().
Updating data types: books['year'] = books['year'].astype(int)

5. How to validate categorical data?
Use .isin() to check whether values belong to a set of allowed categories. Prefix the expression with ~ to select the values that are not in the set.

6. How to validate numerical data?
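A minimal sketch of questions 2, 4 and 5 above, using a small invented `books` DataFrame. Only the name `books` and its `year` column appear in the notes. The other columns, the values, and the list of allowed genres are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data; only `books` and `year` come from the notes.
books = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genre": ["Fiction", "Non Fiction", "Fiction", "Poetry"],
    "rating": [4.5, 3.9, 4.1, 4.8],
    "year": [2010.0, 2015.0, 2015.0, 2019.0],  # stored as float by mistake
})

# Initial EDA
print(books.head())                   # first rows
books.info()                          # dtypes, non-null counts, memory usage
print(books["genre"].value_counts())  # frequency of each category
print(books["rating"].describe())     # count, mean, std, min, quartiles, max

# Validate data types: convert `year` from float to int
books["year"] = books["year"].astype(int)

# Validate categorical data: which rows fall inside or outside the allowed set?
allowed = ["Fiction", "Non Fiction"]  # assumed set of valid categories
in_allowed = books["genre"].isin(allowed)
valid_books = books[in_allowed]
unexpected = books[~in_allowed]       # ~ negates the boolean mask
print(unexpected)
```

`.astype(int)` raises an error if the column contains missing values, so missing data usually has to be handled before the conversion.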
7. What is the method for grouping data by category?
Use .groupby() to group data by category.

8. Can aggregating functions be used with .groupby()?
Yes, they can be used to provide a statistical summary of each group.
9. What are the most commonly used aggregating functions?
A. Sum: .sum()
B. Count: .count()
C. Minimum: .min()
D. Maximum: .max()
E. Variance: .var()
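Grouping and aggregating can be sketched as follows. The data and column names are invented for illustration. `.agg()` is not mentioned in the notes; it is shown here as one convenient pandas way to apply several of the listed functions at once.

```python
import pandas as pd

# Hypothetical data; column names and values are assumptions.
books = pd.DataFrame({
    "genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction", "Poetry"],
    "rating": [4.0, 5.0, 3.0, 4.0, 4.5],
})

# One aggregating function per call
mean_by_genre = books.groupby("genre")["rating"].mean()
print(mean_by_genre)

# Several aggregating functions at once with .agg()
stats_by_genre = books.groupby("genre")["rating"].agg(
    ["sum", "count", "min", "max", "var"]
)
print(stats_by_genre)
```

A group with a single row (here "Poetry") has an undefined sample variance, so `.var()` returns NaN for it.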
12. Why is it important to deal with missing data?
- It affects the distribution of the data.
- The data becomes less representative of the population, as certain groups are disproportionately represented.
- It can result in drawing incorrect conclusions.

13. What is the method for checking missing values?
df.isna().sum()

14. What are the strategies for addressing missing data?
A) Drop missing values if they account for 5% or less of total values.
B) Impute the mean, median, or mode (replace the missing values).
C) Impute by sub-group, e.g. different experience levels have different median salaries.

15. How to drop missing values?
threshold = len(dataframe) * 0.
print(threshold)
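A hedged sketch of questions 13 to 15, assuming the threshold in question 15 is the 5% of rows named in strategy A. The `salaries` DataFrame, its columns, and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 40 rows; column names and values are assumptions.
salaries = pd.DataFrame({
    "experience": ["Junior", "Senior"] * 20,
    "country": ["US"] * 39 + [np.nan],           # 1 missing value
    "salary": [50.0, 90.0] * 17 + [np.nan] * 6,  # 6 missing values
})

# Check missing values per column
missing = salaries.isna().sum()
print(missing)

# Assumed threshold: 5% of the number of rows (strategy A)
threshold = len(salaries) * 0.05
print(threshold)

# Columns that have missing values, but no more than the threshold
cols_to_drop = missing[(missing > 0) & (missing <= threshold)].index.tolist()

# A) drop the rows that have missing values in those columns
salaries = salaries.dropna(subset=cols_to_drop)

# C) impute the remaining gaps with the median salary of each experience level
group_median = salaries.groupby("experience")["salary"].transform("median")
salaries["salary"] = salaries["salary"].fillna(group_median)

print(salaries.isna().sum())
```

Strategy B would replace the group-wise median with a single value, for example `salaries["salary"].fillna(salaries["salary"].median())`.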