Exploratory Data Analysis (EDA) - A Comprehensive Guide, Cheat Sheet of Advanced Data Analysis

BRIEF SUMMARY OF EDA SHOWCASING THE STEPS REQUIRED. CONTAINS A DEFINITION OF EDA AND ITS IMPORTANCE IN DATA ANALYSIS, WITH A DESCRIPTION OF EACH STEP IN THE PROTOCOL

Typology: Cheat Sheet

2022/2023

Uploaded on 03/07/2023

madlan7

Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an essential step in the data analysis process. It involves
examining, summarizing, and visualizing the main characteristics of a dataset to
extract insights and identify patterns. Fundamentals of EDA include:
Collection:
  Data may come from various sources such as surveys, web scraping, or sensors.
Cleaning:
  Once the data has been collected, it needs to be cleaned. This involves removing
  errors and duplicates and handling outliers.
Exploration:
  The next step is to explore the data by computing summary statistics such as mean,
  median, mode, standard deviation, variance, and correlation.
Visualization:
  Data visualization is an essential part of EDA. It involves creating graphs, charts, and
  histograms to help understand the distribution of the data.
Hypothesis Testing:
  Once the data has been explored and visualized, the next step is to test hypotheses. This
  involves using statistical tests to determine whether there is a significant relationship between
  variables.
1. What is EDA?
The process of reviewing and cleaning data to derive insights (such as descriptive statistics and
correlations) and generate hypotheses for experiments.
2. Describe the steps for initial EDA
  1. View the data: call .head()
  2. Gather more information: call .info() to examine missing values, data types, and memory
     usage
  3. For categorical columns of interest, call .value_counts()
  4. For numerical columns of interest, call .describe()
  5. Explore the distribution of numerical data using histograms
3. Why is data validation important?
Analysis is only reliable when the data types are correctly specified and validated.
4. How to validate data types?
The .info() method gives an overview of the data type of each column in the DataFrame.
If a column has the wrong data type, change it by calling the .astype() method.
Updating data types:
  books['year'] = books['year'].astype(int)
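5. How to validate categorical data?
Use the .isin() method to check whether values belong to a list of allowed categories. Prefix the
condition with ~ to select the values that are not in the list.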
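As an illustration, the sketch below flags a misspelled category. The genre values and the list of allowed categories are assumptions made up for the example.

```python
import pandas as pd

books = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genre": ["Fiction", "Non Fiction", "Ficton", "Childrens"],  # 'Ficton' is a deliberate typo
})
allowed = ["Fiction", "Non Fiction", "Childrens"]  # assumed list of valid categories

is_valid = books["genre"].isin(allowed)   # boolean Series: True where the value is allowed
valid_rows = books[is_valid]
invalid_rows = books[~is_valid]           # ~ negates the mask to keep the unexpected values

print(invalid_rows)
```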
6. How to validate numerical data?
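One possible approach, sketched here as an assumption rather than as the sheet's own answer, is to inspect a numeric column's range and flag values outside plausible bounds. The bounds 1900 and 2023 are hypothetical.

```python
import pandas as pd

# Hypothetical data; 2150 is an implausible publication year.
books = pd.DataFrame({"title": ["A", "B", "C"], "year": [2010, 2015, 2150]})

print(books["year"].min(), books["year"].max())   # quick look at the range

in_range = books["year"].between(1900, 2023)      # assumed bounds, inclusive on both ends
suspect = books[~in_range]                        # rows that fall outside the bounds
print(suspect)
```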
Data summarisation
7. What is the method for grouping data by category?
Use the .groupby() method to group data by category.
8. Can aggregating functions be used with .groupby()?
Yes, they can be used to provide a statistical summary of the data for each group.
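A minimal sketch of grouping followed by aggregation, using a hypothetical books table:

```python
import pandas as pd

# Hypothetical data; column names and values are assumptions.
books = pd.DataFrame({
    "genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction"],
    "rating": [4.0, 5.0, 3.0, 4.0],
    "year": [2010, 2012, 2015, 2019],
})

# Group rows by genre, then aggregate one column within each group.
mean_by_genre = books.groupby("genre")["rating"].mean()
count_by_genre = books.groupby("genre")["rating"].count()

print(mean_by_genre)
print(count_by_genre)
```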

9. What are the most commonly used aggregating functions?
A. Sum: .sum()
B. Count: .count()
C. Minimum: .min()
D. Maximum: .max()
E. Variance: .var()
10. What function allows for aggregating ungrouped data?
.agg() applies aggregation functions across a DataFrame.
11. Can you apply particular aggregations to columns of interest?
Yes, by passing a dictionary to .agg({ }):
  books.agg({'column': ['mean', 'std'], 'year': ['median']})
12. How to create named summary columns?
  books.groupby('genre').agg(
      mean_rating=('rating', 'mean'),
      std_rating=('rating', 'std'),
      median_year=('year', 'median'))
13. What is the correct method to visualise categorical summaries?
Use bar plots. Some plotting functions, such as seaborn's barplot, automatically calculate the mean of a quantitative variable for each category.
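The aggregation patterns above can be tried on a small table. The data below are hypothetical, and the dictionary example uses a rating column as an assumed stand-in for the column of interest.

```python
import pandas as pd

# Hypothetical data; column names and values are assumptions.
books = pd.DataFrame({
    "genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction"],
    "rating": [4.0, 5.0, 3.0, 4.0],
    "year": [2010, 2012, 2015, 2019],
})

# Ungrouped aggregation: different functions for different columns.
overall = books.agg({"rating": ["mean", "std"], "year": ["median"]})

# Named aggregation per genre: new_column=(source_column, function).
by_genre = books.groupby("genre").agg(
    mean_rating=("rating", "mean"),
    std_rating=("rating", "std"),
    median_year=("year", "median"),
)

print(overall)
print(by_genre)
```

The mean_rating column holds the per-category means that a mean-based bar plot would display.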

Data cleaning and imputation

14. Why is it important to deal with missing data?
- It affects the distribution of the data.
- The data become less representative of the population, because certain groups are disproportionately represented.
- It can result in drawing incorrect conclusions.
15. What is the method for checking missing values?
  df.isna().sum()
16. What are the strategies for addressing missing data?
A) Drop missing values if they account for 5% or less of total values
B) Impute the mean, median, or mode (replace the missing values)
C) Impute by sub-group, e.g. different experience levels have different median salaries
17. How to drop missing values?
  threshold = len(dataframe) * 0.
  print(threshold)
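The strategies above could be combined roughly as follows. This is a sketch under stated assumptions: the salaries table is made up, and the 0.05 factor is taken from the 5% rule in strategy A, not from the truncated threshold line.

```python
import pandas as pd

# Hypothetical salaries table with 40 rows; names and values are assumptions.
n = 20
salaries = pd.DataFrame({
    "experience": ["Entry"] * n + ["Senior"] * n,
    "salary": [50.0] * n + [100.0] * n,
    "city": ["X"] * (2 * n),
})
salaries.loc[[0, 1, 20, 21], "salary"] = float("nan")  # 4 missing salaries (10%)
salaries.loc[5, "city"] = None                         # 1 missing city (2.5%)

missing_counts = salaries.isna().sum()                 # missing values per column
threshold = len(salaries) * 0.05                       # assumed 5% rule from strategy A

# A) Drop rows with missing values in columns at or below the threshold.
cols_to_drop = missing_counts[(missing_counts > 0) & (missing_counts <= threshold)].index.tolist()
cleaned = salaries.dropna(subset=cols_to_drop)

# C) Impute the remaining gaps with the median of each experience level.
group_median = cleaned.groupby("experience")["salary"].transform("median")
cleaned = cleaned.assign(salary=cleaned["salary"].fillna(group_median))

print(missing_counts)
print(cleaned.isna().sum())
```

Strategy B would replace the last step with a single overall value, for example cleaned["salary"].fillna(cleaned["salary"].median()).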