Exploratory Data Analysis (EDA) - A Comprehensive Guide, Cheat Sheet of Advanced Data Analysis

University of Liverpool Advanced Data Analysis

BRIEF SUMMARY OF EDA TO SHOWCASE THE STEPS REQUIRED. CONTAINS A DEFINITION OF EDA AND ITS IMPORTANCE IN DATA ANALYSIS, WITH A DESCRIPTION OF EACH STEP IN THE PROTOCOL.

Typology: Cheat Sheet

2022/2023

Uploaded on 03/07/2023

madlan7 🇬🇧


Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an essential step in the data analysis process. It involves examining, summarizing, and visualizing the main characteristics of a dataset to extract insights and identify patterns. Fundamentals of EDA include:

Collection:
Data may come from various sources such as surveys, web scraping, or sensors.

Cleaning:
Once the data has been collected, it needs to be cleaned. This process involves removing any errors, duplicates, or outliers from the data.

Exploration:
The next step is to explore the data by computing summary statistics such as mean, median, mode, standard deviation, variance, and correlation.

Visualization:
Data visualization is an essential part of EDA. It involves creating graphs, charts, and histograms to help understand the distribution of the data.

Hypothesis Testing:
Once the data has been explored and visualized, the next step is to test hypotheses. This involves using statistical tests to determine whether there is a significant relationship between variables.

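The summary statistics named in the Exploration step can be sketched with pandas as follows. The DataFrame, its column names (hours_studied, exam_score), and its values are hypothetical and invented purely for illustration; this is a minimal sketch, not a prescribed workflow.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 58, 70, 77],
})

scores = df["exam_score"]

mean_score = scores.mean()          # arithmetic mean
median_score = scores.median()      # middle value
mode_score = scores.mode().iloc[0]  # most frequent value (first one if tied)
std_score = scores.std()            # sample standard deviation (ddof=1)
var_score = scores.var()            # sample variance (ddof=1)

# Pearson correlation between the two numerical columns.
correlation = df["hours_studied"].corr(df["exam_score"])

print(mean_score, median_score, mode_score, std_score, var_score, correlation)
```

Note that pandas uses the sample (n - 1) versions of the standard deviation and variance by default.
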
1. What is EDA?

The process of reviewing and cleaning data to derive insights (such as descriptive statistics and correlation) and generate hypotheses for experiments.

2. Describe the steps for initial EDA

1. View the data: call .head()

2. Gather more information: call .info() to examine missing values, data types, and memory usage

3. Categorical columns of interest: call .value_counts()

4. Numerical columns of interest: call .describe()

5. Explore the distribution of numerical data using histograms.

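The first four steps above can be sketched on a small DataFrame. The books table below, including its column names and values, is a hypothetical example made up for illustration; only .head(), .info(), .value_counts(), and .describe() come from the cheat sheet.

```python
import pandas as pd

# Hypothetical data, invented purely for illustration.
books = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "genre": ["Fiction", "Non Fiction", "Fiction", "Childrens", "Fiction"],
    "rating": [4.1, 3.8, 4.6, None, 4.3],
    "year": [2010, 2015, 2012, 2019, 2015],
})

# 1. View the first rows of the data.
first_rows = books.head()
print(first_rows)

# 2. Column data types, non-null counts, and memory usage.
books.info()

# 3. How often each category occurs in a categorical column.
genre_counts = books["genre"].value_counts()
print(genre_counts)

# 4. Summary statistics for the numerical columns.
summary = books.describe()
print(summary)
```

For step 5, and assuming matplotlib is installed, books["rating"].hist() would draw a histogram of the ratings.
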
3. Why is data validation important?

Proper analysis can only be performed when the data types are correctly specified and validated.

4. How to validate data types?

The .info() method gives an overview of the data type of each column in the DataFrame.

If a column has the wrong data type, change it by calling the .astype() method.

Updating data types:

books['year'] = books['year'].astype(int)

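As a minimal sketch of checking and updating a data type, assume a hypothetical books table whose year column was read in as text. The data is invented for illustration.

```python
import pandas as pd

# Hypothetical data: "year" was stored as text instead of integers.
books = pd.DataFrame({
    "title": ["A", "B", "C"],
    "year": ["2010", "2015", "2012"],
})

# Inspect the current data types.
print(books.dtypes)

# .astype() returns a new Series, so assign it back to the column.
books["year"] = books["year"].astype(int)

# The column now supports numerical operations.
year_range = books["year"].max() - books["year"].min()
print(books.dtypes)
print(year_range)
```

Note that .astype(int) raises an error if any value cannot be converted, for example a missing value or non-numeric text.
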
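5. How to validate categorical data?

Use the .isin() method to check whether each value belongs to a list of allowed categories. Prefix the result with ~ to select the values that are not in the list.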
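
A short sketch of validating a categorical column with .isin() and its negation. The books table and the list of allowed genres are hypothetical, chosen only to show one value that fails validation.

```python
import pandas as pd

# Hypothetical data; "fiction" (lower case) is a deliberately invalid entry.
books = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genre": ["Fiction", "Non Fiction", "fiction", "Childrens"],
})

# Assumed list of valid categories.
allowed_genres = ["Fiction", "Non Fiction", "Childrens"]

# Boolean Series: True where the value is one of the allowed categories.
is_valid = books["genre"].isin(allowed_genres)

valid_rows = books[is_valid]     # rows that pass validation
invalid_rows = books[~is_valid]  # ~ negates the mask: rows that fail

print(invalid_rows)
```
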

6. How to validate numerical data?

Data summarisation

7. What is the method for grouping data by category?

Use the .groupby() method to group data by category.

8. Can aggregating functions be used with .groupby()?

Yes, they can be used to provide a statistical summary of the data.
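
The grouping described above might look like the sketch below, which groups a hypothetical books table by genre and applies an aggregating function to one column. The data and column names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
books = pd.DataFrame({
    "genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction", "Childrens"],
    "rating": [4.0, 5.0, 3.0, 4.0, 4.5],
    "year": [2010, 2012, 2015, 2019, 2015],
})

# Group by category, select a column, then aggregate it.
mean_rating_by_genre = books.groupby("genre")["rating"].mean()
count_by_genre = books.groupby("genre")["rating"].count()

print(mean_rating_by_genre)
print(count_by_genre)
```

Selecting the column before aggregating keeps the result to the variable of interest rather than every numerical column.
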


9. What are the most commonly used aggregating functions?

A. Sum: .sum()
B. Count: .count()
C. Minimum: .min()
D. Maximum: .max()
E. Variance: .var()

10. What function allows for aggregating ungrouped data?

.agg() applies aggregation functions across a DataFrame.

11. Can you specify particular aggregations for columns of interest?

Yes, by passing a dictionary to .agg({ }):

books.agg({'column': ['mean', 'std'], 'year': ['median']})

12. How to create named summary columns?

books.groupby('genre').agg(mean_rating=('rating', 'mean'), std_rating=('rating', 'std'), median_year=('year', 'median'))

13. What is the correct method to visualise categorical summaries?

Bar plots, which automatically calculate the mean of a quantitative variable across grouped categorical data.
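
The three uses of .agg() described above can be sketched together. The books table is hypothetical; the named columns mean_rating, std_rating, and median_year follow the cheat sheet's own example.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
books = pd.DataFrame({
    "genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction", "Childrens"],
    "rating": [4.0, 5.0, 3.0, 4.0, 4.5],
    "year": [2010, 2012, 2015, 2019, 2015],
})

# Ungrouped data: apply several aggregations to the numerical columns.
overall = books[["rating", "year"]].agg(["mean", "std"])

# Dictionary form: choose the aggregations per column.
per_column = books.agg({"rating": ["mean", "std"], "year": ["median"]})

# Named aggregation: new_column_name=(source_column, function).
named = books.groupby("genre").agg(
    mean_rating=("rating", "mean"),
    std_rating=("rating", "std"),
    median_year=("year", "median"),
)

print(overall)
print(per_column)
print(named)
```

In the dictionary form, combinations that were not requested (such as the mean of year) appear as NaN in the result. A group with a single row, like Childrens here, has an undefined sample standard deviation, which also shows as NaN.
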

Data cleaning and imputation

14. Why is it important to deal with missing data?

- It affects the distribution of the data.
- The data will therefore be less representative of the population, as certain groups are disproportionately represented.
- It can result in drawing incorrect conclusions.

15. What is the method for checking missing values?

df.isna().sum()

16. What are the strategies for addressing missing data?

A) Drop missing values if they account for 5% or less of total values.
B) Impute the mean, median, or mode (replace the missing values).
C) Impute by sub-group, for example when different experience levels have different median salaries.

17. How to drop missing values?

threshold = len(dataframe) * 0.
print(threshold)
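
The strategies above can be sketched end to end on a hypothetical salaries table. The data, the column names, and the 0.05 multiplier are assumptions: the multiplier is taken from the 5% rule in strategy A, not from the truncated snippet. Using .dropna(subset=...) and .groupby().transform() is one possible way to implement the steps, not necessarily the one the original notes had in mind.

```python
import pandas as pd

# Hypothetical data: 20 rows, 4 missing salaries, 1 missing country.
salaries = pd.DataFrame({
    "experience": ["Junior", "Senior"] * 10,
    "salary": [30.0, 60.0, 32.0, 64.0, None, 62.0, 34.0, None, 30.0, 60.0] * 2,
    "country": ["UK"] * 19 + [None],
})

# Count missing values per column.
missing_counts = salaries.isna().sum()
print(missing_counts)

# Assumed 5% threshold (strategy A): 20 rows * 0.05 = 1 row.
threshold = len(salaries) * 0.05

# Columns with some missing values, but no more than the threshold.
cols_to_drop = missing_counts[
    (missing_counts > 0) & (missing_counts <= threshold)
].index

# Strategy A: drop the rows with missing values in those columns.
cleaned = salaries.dropna(subset=cols_to_drop)

# Strategy C: fill the remaining gaps with the median of each sub-group.
group_median = cleaned.groupby("experience")["salary"].transform("median")
cleaned["salary"] = cleaned["salary"].fillna(group_median)

print(cleaned.isna().sum())
```

For strategy B, a single value such as cleaned["salary"].median() could be passed to .fillna() instead of the per-group medians.
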
