data wrangling cheat sheet, Cheat Sheet of Computational and Statistical Data Analysis

Schematic and complete data wrangling cheat sheet with R and Python

Typology: Cheat Sheet

2018/2019
On special offer
30 Points
Discount

Limited-time offer


Uploaded on 09/02/2019

nicoth
nicoth 🇺🇸

4.3

(20)

262 documents

1 / 6

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Data Wrangling
with dplyr and tidyr
Cheat Sheet
RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • [email protected] • 844-448-1212 • rstudio.com
Syntax - Helpful conventions for wrangling
dplyr::tbl_df(iris)
Converts data to tbl class. tbl’s are easier to examine than
data frames. R displays only the data that fits onscreen:
dplyr::glimpse(iris)
Information dense summary of tbl data.
utils::View(iris)
View data set in spreadsheet-like display (note capital V).
Source: local data frame [150 x 5]
Sepal.Length Sepal.Width Petal.Length
1 5.1 3.5 1.4
2 4.9 3.0 1.4
3 4.7 3.2 1.3
4 4.6 3.1 1.5
5 5.0 3.6 1.4
.. ... ... ...
Variables not shown: Petal.Width (dbl),
Species (fctr)
dplyr::%>%
Passes object on left hand side as first argument (or .
argument) of function on righthand side.
"Piping" with %>% makes code more readable, e.g.
iris %>%
group_by(Species) %>%
summarise(avg = mean(Sepal.Width)) %>%
arrange(avg)
x %>% f(y) is the same as f(x, y)
y %>% f(x, ., z) is the same as f(x, y, z )
Reshaping Data - Change the layout of a data set
Subset Observations (Rows)
Subset Variables (Columns)
F
M
A
Each variable is saved
in its own column
F
M
A
In a tidy
data set: &
Tidy Data - A foundation for wrangling in R
Tidy data complements R’s vectorized
operations. R will automatically preserve
observations as you manipulate variables.
No other format works as intuitively with R.
F
A
M
M * A
*
tidyr::gather(cases, "year", "n", 2:4)
Gather columns into rows.
tidyr::unite(data, col, ..., sep)
Unite several columns into one.
dplyr::data_frame(a = 1:3, b = 4:6)
Combine vectors into data frame
(optimized).
dplyr::arrange(mtcars, mpg)
Order rows by values of a column
(low to high).
dplyr::arrange(mtcars, desc(mpg))
Order rows by values of a column
(high to low).
dplyr::rename(tb, y = year)
Rename the columns of a data
frame.
tidyr::spread(pollution, size, amount)
Spread rows into columns.
tidyr::separate(storms, date, c("y", "m", "d"))
Separate one column into several.
w
w
w
w
w
w
A
1005
A
1013
A
1010
A
1010
w
w
p
110
110
1007
45
45
1009
w
w
p
110
110
1007
45
45
1009
w
w
p
110
110
1007
45
45
1009
w
w
p
110
110
1007
45
45
1009
w
p
p
w
110
1007
1007
110
45
1009
1009
45
w
w
w
w
w
110
110
110
110
110
w
w
w
w
dplyr::filter(iris, Sepal.Length > 7)
Extract rows that meet logical criteria.
dplyr::distinct(iris)
Remove duplicate rows.
dplyr::sample_frac(iris, 0.5, replace = TRUE)
Randomly select fraction of rows.
dplyr::sample_n(iris, 10, replace = TRUE)
Randomly select n rows.
dplyr::slice(iris, 10:15)
Select rows by position.
dplyr::top_n(storms, 2, date)
Select and order top n entries (by group if grouped data).
<
Less than
!=
Not equal to
>
Greater than
%in%
Group membership
==
Equal to
is.na
Is NA
<=
Less than or equal to
!is.na
Is not NA
>=
Greater than or equal to
&,|,!,xor,any,all
Boolean operators
Logic in R - ?Comparison, ?base::Logic
dplyr::select(iris, Sepal.Width, Petal.Length, Species)
Select columns by name or helper function.
Helper functions for select - ?select
select(iris, contains("." ))
Select columns whose name contains a character string.
select(iris, ends_with("Length"))
Select columns whose name ends with a character string.
select(iris, everything())
Select every column.
select(iris, matches(".t ." ))
Select columns whose name matches a regular expression.
select(iris, num_range("x", 1:5))
Select columns named x1, x2, x3, x4, x5.
select(iris, one_of(c("Species", "Genus")))
Select columns whose names are in a group of names.
select(iris, starts_with("Sepal"))
Select columns whose name starts with a character string.
select(iris, Sepal.Length:Petal.Width)
Select all columns between Sepal.Length and Petal.Width (inclusive).
select(iris, -Species)
Select all columns except Species.
Learn more with browseVignettes(package = c("dplyr", "tidyr")) • dplyr 0.4.0• tidyr 0.2.0 • Updated: 1/15
w
w
w
w
w
w
A
1005
A
1013
A
1010
A
1010
devtools::install_github("rstudio/EDAWR") for data sets
pf3
pf4
pf5
Discount

On special offer

Partial preview of the text

Download data wrangling cheat sheet and more Cheat Sheet Computational and Statistical Data Analysis in PDF only on Docsity!

Data Wrangling

with dplyr and tidyr

Cheat Sheet

RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • [email protected] • 844-448-1212 • rstudio.com Syntax - Helpful conventions for wrangling

dplyr:: tbl_df(iris)
Converts data to tbl class. tbl’s are easier to examine than
data frames. R displays only the data that fits onscreen:
dplyr:: glimpse(iris)
Information dense summary of tbl data.
utils:: View(iris)
View data set in spreadsheet-like display (note capital V).

Source: local data frame [150 x 5] Sepal.Length Sepal.Width Petal.Length 1 5.1 3.5 1. 2 4.9 3.0 1. 3 4.7 3.2 1. 4 4.6 3.1 1. 5 5.0 3.6 1. .. ... ... ... Variables not shown: Petal.Width (dbl), Species (fctr)

dplyr:: %>%
Passes object on left hand side as first argument (or.
argument) of function on righthand side.
"Piping" with %>% makes code more readable, e.g.

iris %>% group_by(Species) %>% summarise(avg = mean(Sepal.Width)) %>% arrange(avg)

x %>% f(y) is the same as f(x, y)
y %>% f(x, ., z) is the same as f(x, y, z )

Reshaping Data - Change the layout of a data set Subset Observations (Rows) Subset Variables (Columns)

F M A
Each variable is saved
in its own column
F M A
Each observation is
saved in its own row

In a tidy data set: & Tidy Data - A foundation for wrangling in R

Tidy data complements R’s vectorized
operations. R will automatically preserve
observations as you manipulate variables.
No other format works as intuitively with R.
M A F

M * A

tidyr:: gather(cases, "year", "n", 2:4)
Gather columns into rows.
tidyr:: unite(data, col, ..., sep)
Unite several columns into one.
dplyr:: data_frame(a = 1:3, b = 4:6)
Combine vectors into data frame
(optimized).
dplyr:: arrange(mtcars, mpg)
Order rows by values of a column
(low to high).
dplyr::arrange(mtcars, desc(mpg) )
Order rows by values of a column
(high to low).
dplyr:: rename(tb, y = year)
Rename the columns of a data
frame.
tidyr:: spread(pollution, size, amount)
Spread rows into columns.
tidyr:: separate(storms, date, c("y", "m", "d"))
Separate one column into several.

ww ww ww 1005 A 1013 A 1010 A 1010 A wwp 1101101007 45100945 wwp 1101101007 45100945 wwp 1101101007 45100945 wwp 1101101007 45100945 wppw 11010071007110 100945100945 wwwww 110110110110110 wwww

dplyr:: filter(iris, Sepal.Length > 7)
Extract rows that meet logical criteria.
dplyr:: distinct(iris)
Remove duplicate rows.
dplyr:: sample_frac(iris, 0.5, replace = TRUE)
Randomly select fraction of rows.
dplyr:: sample_n(iris, 10, replace = TRUE)
Randomly select n rows.
dplyr:: slice(iris, 10:15)
Select rows by position.
dplyr:: top_n(storms, 2, date)
Select and order top n entries (by group if grouped data).

< (^) Less than != (^) Not equal to

(^) Greater than %in% (^) Group membership

== (^) Equal to is.na (^) Is NA <= (^) Less than or equal to !is.na (^) Is not NA

= (^) Greater than or equal to &,|,!,xor,any,all (^) Boolean operators Logic in R - ?Comparison, ?base::Logic

dplyr:: select(iris, Sepal.Width, Petal.Length, Species)
Select columns by name or helper function.

Helper functions for select - ?select select(iris, contains(".")) Select columns whose name contains a character string. select(iris, ends_with("Length")) Select columns whose name ends with a character string. select(iris, everything()) Select every column. select(iris, matches(".t.")) Select columns whose name matches a regular expression. select(iris, num_range("x", 1:5)) Select columns named x1, x2, x3, x4, x5. select(iris, one_of(c("Species", "Genus"))) Select columns whose names are in a group of names. select(iris, starts_with("Sepal")) Select columns whose name starts with a character string. select(iris, Sepal.Length:Petal.Width) Select all columns between Sepal.Length and Petal.Width (inclusive). select(iris, - Species) Select all columns except Species. Learn more with browseVignettes(package = c("dplyr", "tidyr")) • dplyr 0.4.0• tidyr 0.2.0 • Updated: 1/1 5 ww ww ww 1005 A 1013 A 1010 A 1010 A devtools::install_github("rstudio/EDAWR") for data sets

Summarise Cases

group_by( .data, ..., add = FALSE ) Returns copy of table grouped by … g_iris <- group_by(iris, Species) ungroup( x, … ) Returns ungrouped copy of table. ungroup(g_iris) wwwwww w Use group_by() to create a "grouped" copy of a table. dplyr functions will manipulate each "group" separately and then combine the results. mtcars %>% group_by(cyl) %>% summarise(avg = mean(mpg)) These apply summary functions to columns to create a new table of summary statistics. Summary functions take vectors as input and return one value (see back). VARIATIONS summarise_all() - Apply funs to every column. summarise_at() - Apply funs to specific columns. summarise_if() - Apply funs to all cols of one type. www www summarise (.data, …) Compute table of summaries. summarise(mtcars, avg = mean(mpg)) count (x, ..., wt = NULL, sort = FALSE) Count number of rows in each group defined by the variables in … Also tally (). count(iris, Species) RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more with browseVignettes(package = c("dplyr", "tibble")) • dplyr 0.7.0 • tibble 1.2.0 • Updated: 2017- Each observation , or case , is in its own row Each variable is in its own column & dplyr functions work with pipes and expect tidy data. In tidy data: pipes x %>% f(y) becomes f(x, y) filter( .data, … ) Extract rows that meet logical criteria. filter(iris, Sepal.Length > 7) distinct( .data, ..., .keep_all = FALSE ) Remove rows with duplicate values. distinct(iris, Species) sample_frac( tbl, size = 1, replace = FALSE, weight = NULL, .env = parent.frame() ) Randomly select fraction of rows. sample_frac(iris, 0.5, replace = TRUE) sample_n( tbl, size, replace = FALSE, weight = NULL, .env = parent.frame() ) Randomly select size rows. sample_n(iris, 10, replace = TRUE) slice( .data, … ) Select rows by position. slice(iris, 10:15) top_n( x, n, wt ) Select and order top n entries (by group if grouped data). top_n(iris, 5, Sepal.Width) Row functions return a subset of rows as a new table. See ?base::logic and ?Comparison for help.

= !is.na()! & < <= is.na() %in% | xor() arrange( .data, … ) Order rows by values of a column or columns (low to high), use with desc() to order from high to low. arrange(mtcars, mpg) arrange(mtcars, desc(mpg)) add_row(. data, ..., .before = NULL, .after = NULL ) Add one or more rows to a table. add_row(faithful, eruptions = 1, waiting = 1)

Group Cases

Manipulate Cases

EXTRACT VARIABLES ADD CASES ARRANGE CASES Logical and boolean operators to use with filter() Column functions return a set of columns as a new vector or table. contains( match ) ends_with( match ) matches( match ) : , e.g. mpg:cyl

- , e.g, -Species num_range( prefix, range ) one_of() starts_with( match ) pull( .data, var = -1 ) Extract column values as a vector. Choose by name or index. pull(iris, Sepal.Length)

Manipulate Variables

Use these helpers with select (), e.g. select(iris, starts_with("Sepal")) These apply vectorized functions to columns. Vectorized funs take vectors as input and return vectors of the same length as output (see back). mutate( .data, … ) Compute new column(s). mutate(mtcars, gpm = 1/mpg) transmute( .data, … ) Compute new column(s), drop others. transmute(mtcars, gpm = 1/mpg) mutate_all( .tbl, .funs, … ) Apply funs to every column. Use with funs(). Also mutate_if(). mutate_all(faithful, funs(log(.), log2(.))) mutate_if(iris, is.numeric, funs(log(.))) mutate_at( .tbl, .cols, .funs, … ) Apply funs to specific columns. Use with funs() , vars() and the helper functions for select(). mutate_at(iris, vars( -Species), funs(log(.))) add_column( .data, ..., .before = NULL, .after = NULL ) Add new column(s). Also add_count() , add_tally(). add_column(mtcars, new = 1:32) rename( .data, … ) Rename columns. rename(iris, Length = Sepal.Length) MAKE NEW VARIABLES EXTRACT CASES wwwwww wwwwww wwwwww wwwwww wwwwww wwwwww wwww wwwww wwwwww www wwww w ww wwwwww summary function vectorized function Data Transformation with dplyr : : CHEAT SHEET A B C A B C select( .data, … ) Extract columns as a table. Also select_if(). select(iris, Sepal.Length, Species) wwww

dplyr

dplyr:: group_by(iris, Species)
Group data into rows with the same value of Species.
dplyr:: ungroup(iris)
Remove grouping information from data frame.
iris %>% group_by(Species) %>% summarise(…)
Compute separate summary row for each group.

Combine Data Sets Group Data Summarise Data Make New Variables

ir ir
C
dplyr:: summarise(iris, avg = mean(Sepal.Length))
Summarise data into single row of values.
dplyr:: summarise_each(iris, funs(mean))
Apply summary function to each column.
dplyr:: count(iris, Species, wt = Sepal.Length)
Count number of rows with each unique value of
variable (with or without weights).
dplyr:: mutate(iris, sepal = Sepal.Length + Sepal. Width)
Compute and append one or more new columns.
dplyr:: mutate_each(iris, funs(min_rank))
Apply window function to each column.
dplyr:: transmute(iris, sepal = Sepal.Length + Sepal. Width)
Compute one or more new columns. Drop original columns.
Summarise uses summary functions , functions that
take a vector of values and return a single value, such as:
Mutate uses window functions , functions that take a vector of
values and return another vector of values, such as:
window
function
summary
function
dplyr:: first
First value of a vector.
dplyr:: last
Last value of a vector.
dplyr:: nth
Nth value of a vector.
dplyr:: n
# of values in a vector.
dplyr:: n_distinct
# of distinct values in
a vector.
IQR
IQR of a vector.
min
Minimum value in a vector.
max
Maximum value in a vector.
mean
Mean value of a vector.
median
Median value of a vector.
var
Variance of a vector.
sd
Standard deviation of a
vector.
dplyr:: lead
Copy with values shifted by 1.
dplyr:: lag
Copy with values lagged by 1.
dplyr:: dense_rank
Ranks with no gaps.
dplyr:: min_rank
Ranks. Ties get min rank.
dplyr:: percent_rank
Ranks rescaled to [0, 1].
dplyr:: row_number
Ranks. Ties got to first value.
dplyr:: ntile
Bin vector into n buckets.
dplyr:: between
Are values between a and b?
dplyr:: cume_dist
Cumulative distribution.
dplyr:: cumall
Cumulative all
dplyr:: cumany
Cumulative any
dplyr:: cummean
Cumulative mean
cumsum
Cumulative sum
cummax
Cumulative max
cummin
Cumulative min
cumprod
Cumulative prod
pmax
Element-wise max
pmin
Element-wise min
iris %>% group_by(Species) %>% mutate(…)
Compute new variables by group.

x1 x A 1 B 2 C 3 x1 x A T B F

  • (^) D T = x1 x2 x A 1 T B 2 F C 3 NA x1 x3 x A T 1 B F 2 D T NA x1 x2 x A 1 T B 2 F x1 x2 x A 1 T B 2 F C 3 NA D NA T x1 x A 1 B 2 C 3 x1 x B 2 C 3
  • (^) D 4 = x1 x B 2 C 3 x1 x A 1 B 2 C 3 D 4 x1 x A 1 x1 x A 1 B 2 C 3 B 2 C 3 D 4 x1 x2 x1 x A 1 B 2 B 2 C 3 C 3 D 4 Mutating Joins Filtering Joins Binding Set Operations
dplyr:: left_join(a, b, by = "x1")
Join matching rows from b to a.

a b

dplyr:: right_join(a, b, by = "x1")
Join matching rows from a to b.
dplyr:: inner_join(a, b, by = "x1")
Join data. Retain only rows in both sets.
dplyr:: full_join(a, b, by = "x1")
Join data. Retain all values, all rows.

x1 x A 1 B 2 x1 x C 3 y z

dplyr:: semi_join(a, b, by = "x1")
All rows in a that have a match in b.
dplyr:: anti_join(a, b, by = "x1")
All rows in a that do not have a match in b.
dplyr:: intersect(y, z)
Rows that appear in both y and z.
dplyr:: union(y, z)
Rows that appear in either or both y and z.
dplyr:: setdiff(y, z)
Rows that appear in y but not z.
dplyr:: bind_rows(y, z)
Append z to y as new rows.
dplyr:: bind_cols(y, z)
Append z to y as new columns.
Caution: matches rows by position.

RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • [email protected] • 844-448-1212 • rstudio.com devtools::install_github("rstudio/EDAWR") for data sets Learn more with browseVignettes(package = c("dplyr", "tidyr")) • dplyr 0.4.0• tidyr 0.2.0 • Updated: 1/1 5

F M A

Data Wrangling

with pandas Cheat Sheet http://pandas.pydata.org Syntax – Creating DataFrames Tidy Data – A foundation for wrangling in pandas

In a tidy
data set:

F M A

Each variable is saved
in its own column
Each observation is
saved in its own row
Tidy data complements pandas’s vectorized
operations. pandas will automatically preserve
observations as you manipulate variables. No
other format works as intuitively with pandas.

Reshaping Data – Change the layout of a data set M A F

M A

pd.melt(df) Gather columns into rows. df.pivot(columns='var', values='val') Spread rows into columns. pd.concat([df1,df2]) Append rows of DataFrames pd.concat([df1,df2], axis=1) Append columns of DataFrames df.sort_values('mpg') Order rows by values of a column (low to high). df.sort_values('mpg',ascending=False) Order rows by values of a column (high to low). df.rename(columns = {'y':'year'}) Rename the columns of a DataFrame df.sort_index() Sort the index of a DataFrame df.reset_index() Reset index of DataFrame to row numbers, moving index to columns. df.drop(columns=['Length','Height']) Drop columns from DataFrame Subset Observations (Rows) Subset Variables (Columns) a b c 1 4 7 10 2 5 8 11 3 6 9 12 df = pd.DataFrame( {"a" : [4 ,5, 6], "b" : [7, 8, 9], "c" : [10, 11, 12]}, index = [1, 2, 3]) Specify values for each column. df = pd.DataFrame( [[4, 7, 10], [5, 8, 11], [6, 9, 12]], index=[1, 2, 3], columns=['a', 'b', 'c']) Specify values for each row. a b c n v d 1 4 7 10 2 5 8 11 e 2 6 9 12 df = pd.DataFrame( {"a" : [4 ,5, 6], "b" : [7, 8, 9], "c" : [10, 11, 12]}, index = pd.MultiIndex.from_tuples( [('d',1),('d',2),('e',2)], names=['n','v'])) Create DataFrame with a MultiIndex Method Chaining Most pandas methods return a DataFrame so that another pandas method can be applied to the result. This improves readability of code. df = (pd.melt(df) .rename(columns={ 'variable' : 'var', 'value' : 'val'}) .query('val >= 200') ) df[df.Length > 7] Extract rows that meet logical criteria. df.drop_duplicates() Remove duplicate rows (only considers columns). df.head(n) Select first n rows. df.tail(n) Select last n rows. Logic in Python (and pandas) < Less than != Not equal to > Greater than df.column.isin( values ) Group membership == Equals pd.isnull( obj ) Is NaN <= Less than or equals pd.notnull( obj ) Is not NaN >= Greater than or equals &,|,~,^,df.any(),df.all() Logical and, or, not, xor, any, all http://pandas.pydata.org/ This cheat sheet inspired by Rstudio Data Wrangling Cheatsheet (https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) Written by Irv Lustig, Princeton Consultants df[['width','length','species']] Select multiple columns with specific names. df['width'] or df.width Select single column with specific name. df.filter(regex=' regex ') Select columns whose name matches regular expression regex. df.loc[:,'x2':'x4'] Select all columns between x2 and x4 (inclusive). df.iloc[:,[1,2,5]] Select columns in positions 1, 2 and 5 (first column is 0). df.loc[df['a'] > 10, ['a','c']] Select rows meeting logical condition, and only the specific columns. regex (Regular Expressions) Examples '.' Matches strings containing a period '.' 'Length$' Matches strings ending with word 'Length' '^Sepal' Matches strings beginning with the word 'Sepal' '^x[1-5]$' Matches strings beginning with 'x' and ending with 1,2,3,4, '^(?!Species$).'* Matches strings except the string 'Species' df.sample(frac=0.5) Randomly select fraction of rows. df.sample(n=10) Randomly select n rows. df.iloc[10:20] Select rows by position. df.nlargest(n, 'value') Select and order top n entries. df.nsmallest(n, 'value') Select and order bottom n entries.