



Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Schematic and complete data wrangling cheat sheet with R and Python
Typology: Cheat Sheet
1 / 6
This page cannot be seen from the preview
Don't miss anything!




On special offer
RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • [email protected] • 844-448-1212 • rstudio.com Syntax - Helpful conventions for wrangling
Source: local data frame [150 x 5] Sepal.Length Sepal.Width Petal.Length 1 5.1 3.5 1. 2 4.9 3.0 1. 3 4.7 3.2 1. 4 4.6 3.1 1. 5 5.0 3.6 1. .. ... ... ... Variables not shown: Petal.Width (dbl), Species (fctr)
iris %>% group_by(Species) %>% summarise(avg = mean(Sepal.Width)) %>% arrange(avg)
Reshaping Data - Change the layout of a data set Subset Observations (Rows) Subset Variables (Columns)
In a tidy data set: & Tidy Data - A foundation for wrangling in R
M * A
ww ww ww 1005 A 1013 A 1010 A 1010 A wwp 1101101007 45100945 wwp 1101101007 45100945 wwp 1101101007 45100945 wwp 1101101007 45100945 wppw 11010071007110 100945100945 wwwww 110110110110110 wwww
< (^) Less than != (^) Not equal to
(^) Greater than %in% (^) Group membership
== (^) Equal to is.na (^) Is NA <= (^) Less than or equal to !is.na (^) Is not NA
= (^) Greater than or equal to &,|,!,xor,any,all (^) Boolean operators Logic in R - ?Comparison, ?base::Logic
Helper functions for select - ?select select(iris, contains(".")) Select columns whose name contains a character string. select(iris, ends_with("Length")) Select columns whose name ends with a character string. select(iris, everything()) Select every column. select(iris, matches(".t.")) Select columns whose name matches a regular expression. select(iris, num_range("x", 1:5)) Select columns named x1, x2, x3, x4, x5. select(iris, one_of(c("Species", "Genus"))) Select columns whose names are in a group of names. select(iris, starts_with("Sepal")) Select columns whose name starts with a character string. select(iris, Sepal.Length:Petal.Width) Select all columns between Sepal.Length and Petal.Width (inclusive). select(iris, - Species) Select all columns except Species. Learn more with browseVignettes(package = c("dplyr", "tidyr")) • dplyr 0.4.0• tidyr 0.2.0 • Updated: 1/1 5 ww ww ww 1005 A 1013 A 1010 A 1010 A devtools::install_github("rstudio/EDAWR") for data sets
group_by( .data, ..., add = FALSE ) Returns copy of table grouped by … g_iris <- group_by(iris, Species) ungroup( x, … ) Returns ungrouped copy of table. ungroup(g_iris) wwwwww w Use group_by() to create a "grouped" copy of a table. dplyr functions will manipulate each "group" separately and then combine the results. mtcars %>% group_by(cyl) %>% summarise(avg = mean(mpg)) These apply summary functions to columns to create a new table of summary statistics. Summary functions take vectors as input and return one value (see back). VARIATIONS summarise_all() - Apply funs to every column. summarise_at() - Apply funs to specific columns. summarise_if() - Apply funs to all cols of one type. www www summarise (.data, …) Compute table of summaries. summarise(mtcars, avg = mean(mpg)) count (x, ..., wt = NULL, sort = FALSE) Count number of rows in each group defined by the variables in … Also tally (). count(iris, Species) RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more with browseVignettes(package = c("dplyr", "tibble")) • dplyr 0.7.0 • tibble 1.2.0 • Updated: 2017- Each observation , or case , is in its own row Each variable is in its own column & dplyr functions work with pipes and expect tidy data. In tidy data: pipes x %>% f(y) becomes f(x, y) filter( .data, … ) Extract rows that meet logical criteria. filter(iris, Sepal.Length > 7) distinct( .data, ..., .keep_all = FALSE ) Remove rows with duplicate values. distinct(iris, Species) sample_frac( tbl, size = 1, replace = FALSE, weight = NULL, .env = parent.frame() ) Randomly select fraction of rows. sample_frac(iris, 0.5, replace = TRUE) sample_n( tbl, size, replace = FALSE, weight = NULL, .env = parent.frame() ) Randomly select size rows. sample_n(iris, 10, replace = TRUE) slice( .data, … ) Select rows by position. slice(iris, 10:15) top_n( x, n, wt ) Select and order top n entries (by group if grouped data). top_n(iris, 5, Sepal.Width) Row functions return a subset of rows as a new table. See ?base::logic and ?Comparison for help.
= !is.na()! & < <= is.na() %in% | xor() arrange( .data, … ) Order rows by values of a column or columns (low to high), use with desc() to order from high to low. arrange(mtcars, mpg) arrange(mtcars, desc(mpg)) add_row(. data, ..., .before = NULL, .after = NULL ) Add one or more rows to a table. add_row(faithful, eruptions = 1, waiting = 1)
EXTRACT VARIABLES ADD CASES ARRANGE CASES Logical and boolean operators to use with filter() Column functions return a set of columns as a new vector or table. contains( match ) ends_with( match ) matches( match ) : , e.g. mpg:cyl
- , e.g, -Species num_range( prefix, range ) one_of( … ) starts_with( match ) pull( .data, var = -1 ) Extract column values as a vector. Choose by name or index. pull(iris, Sepal.Length)
Use these helpers with select (), e.g. select(iris, starts_with("Sepal")) These apply vectorized functions to columns. Vectorized funs take vectors as input and return vectors of the same length as output (see back). mutate( .data, … ) Compute new column(s). mutate(mtcars, gpm = 1/mpg) transmute( .data, … ) Compute new column(s), drop others. transmute(mtcars, gpm = 1/mpg) mutate_all( .tbl, .funs, … ) Apply funs to every column. Use with funs(). Also mutate_if(). mutate_all(faithful, funs(log(.), log2(.))) mutate_if(iris, is.numeric, funs(log(.))) mutate_at( .tbl, .cols, .funs, … ) Apply funs to specific columns. Use with funs() , vars() and the helper functions for select(). mutate_at(iris, vars( -Species), funs(log(.))) add_column( .data, ..., .before = NULL, .after = NULL ) Add new column(s). Also add_count() , add_tally(). add_column(mtcars, new = 1:32) rename( .data, … ) Rename columns. rename(iris, Length = Sepal.Length) MAKE NEW VARIABLES EXTRACT CASES wwwwww wwwwww wwwwww wwwwww wwwwww wwwwww wwww wwwww wwwwww www wwww w ww wwwwww summary function vectorized function Data Transformation with dplyr : : CHEAT SHEET A B C A B C select( .data, … ) Extract columns as a table. Also select_if(). select(iris, Sepal.Length, Species) wwww
Combine Data Sets Group Data Summarise Data Make New Variables
x1 x A 1 B 2 C 3 x1 x A T B F
a b
x1 x A 1 B 2 x1 x C 3 y z
RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • [email protected] • 844-448-1212 • rstudio.com devtools::install_github("rstudio/EDAWR") for data sets Learn more with browseVignettes(package = c("dplyr", "tidyr")) • dplyr 0.4.0• tidyr 0.2.0 • Updated: 1/1 5
F M A
with pandas Cheat Sheet http://pandas.pydata.org Syntax – Creating DataFrames Tidy Data – A foundation for wrangling in pandas
F M A
Reshaping Data – Change the layout of a data set M A F
M A
pd.melt(df) Gather columns into rows. df.pivot(columns='var', values='val') Spread rows into columns. pd.concat([df1,df2]) Append rows of DataFrames pd.concat([df1,df2], axis=1) Append columns of DataFrames df.sort_values('mpg') Order rows by values of a column (low to high). df.sort_values('mpg',ascending=False) Order rows by values of a column (high to low). df.rename(columns = {'y':'year'}) Rename the columns of a DataFrame df.sort_index() Sort the index of a DataFrame df.reset_index() Reset index of DataFrame to row numbers, moving index to columns. df.drop(columns=['Length','Height']) Drop columns from DataFrame Subset Observations (Rows) Subset Variables (Columns) a b c 1 4 7 10 2 5 8 11 3 6 9 12 df = pd.DataFrame( {"a" : [4 ,5, 6], "b" : [7, 8, 9], "c" : [10, 11, 12]}, index = [1, 2, 3]) Specify values for each column. df = pd.DataFrame( [[4, 7, 10], [5, 8, 11], [6, 9, 12]], index=[1, 2, 3], columns=['a', 'b', 'c']) Specify values for each row. a b c n v d 1 4 7 10 2 5 8 11 e 2 6 9 12 df = pd.DataFrame( {"a" : [4 ,5, 6], "b" : [7, 8, 9], "c" : [10, 11, 12]}, index = pd.MultiIndex.from_tuples( [('d',1),('d',2),('e',2)], names=['n','v'])) Create DataFrame with a MultiIndex Method Chaining Most pandas methods return a DataFrame so that another pandas method can be applied to the result. This improves readability of code. df = (pd.melt(df) .rename(columns={ 'variable' : 'var', 'value' : 'val'}) .query('val >= 200') ) df[df.Length > 7] Extract rows that meet logical criteria. df.drop_duplicates() Remove duplicate rows (only considers columns). df.head(n) Select first n rows. df.tail(n) Select last n rows. Logic in Python (and pandas) < Less than != Not equal to > Greater than df.column.isin( values ) Group membership == Equals pd.isnull( obj ) Is NaN <= Less than or equals pd.notnull( obj ) Is not NaN >= Greater than or equals &,|,~,^,df.any(),df.all() Logical and, or, not, xor, any, all http://pandas.pydata.org/ This cheat sheet inspired by Rstudio Data Wrangling Cheatsheet (https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) Written by Irv Lustig, Princeton Consultants df[['width','length','species']] Select multiple columns with specific names. df['width'] or df.width Select single column with specific name. df.filter(regex=' regex ') Select columns whose name matches regular expression regex. df.loc[:,'x2':'x4'] Select all columns between x2 and x4 (inclusive). df.iloc[:,[1,2,5]] Select columns in positions 1, 2 and 5 (first column is 0). df.loc[df['a'] > 10, ['a','c']] Select rows meeting logical condition, and only the specific columns. regex (Regular Expressions) Examples '.' Matches strings containing a period '.' 'Length$' Matches strings ending with word 'Length' '^Sepal' Matches strings beginning with the word 'Sepal' '^x[1-5]$' Matches strings beginning with 'x' and ending with 1,2,3,4, '^(?!Species$).'* Matches strings except the string 'Species' df.sample(frac=0.5) Randomly select fraction of rows. df.sample(n=10) Randomly select n rows. df.iloc[10:20] Select rows by position. df.nlargest(n, 'value') Select and order top n entries. df.nsmallest(n, 'value') Select and order bottom n entries.