Data Wrangling with dplyr and tidyr Cheat Sheet

RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • 844-448-1212 • rstudio.com

Syntax - Helpful conventions for wrangling

dplyr::tbl_df(iris)

Converts data to tbl class. tbl’s are easier to examine than data frames. R displays only the data that fits onscreen:

Source: local data frame [150 x 5]

   Sepal.Length Sepal.Width Petal.Length
1           5.1         3.5          1.4
2           4.9         3.0          1.4
3           4.7         3.2          1.3
4           4.6         3.1          1.5
5           5.0         3.6          1.4
..          ...         ...          ...
Variables not shown: Petal.Width (dbl), Species (fctr)

dplyr::glimpse(iris)

Information dense summary of tbl data.

utils::View(iris)

View data set in spreadsheet-like display (note capital V).
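As a hedged illustration of the conversion above: in recent dplyr versions tbl_df() is deprecated, so this sketch assumes as_tibble() (re-exported by dplyr) as the replacement. The variable name iris_tbl is hypothetical.

```r
library(dplyr)

# Assumption: as_tibble() stands in for the older tbl_df()
iris_tbl <- as_tibble(iris)

# Printing a tbl shows only the rows and columns that fit onscreen
print(iris_tbl)

# glimpse() prints one line per column: name, type, and first values
glimpse(iris_tbl)
```

The printed header and column types differ slightly between dplyr versions, so the output may not match the display shown above exactly.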

dplyr::%>%

Passes object on left-hand side as first argument (or . argument) of function on right-hand side.

"Piping" with %>% makes code more readable, e.g.

iris %>%
  group_by(Species) %>%
  summarise(avg = mean(Sepal.Width)) %>%
  arrange(avg)

x %>% f(y) is the same as f(x, y)

y %>% f(x, ., z) is the same as f(x, y, z)
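The two equivalences can be checked directly. This is a small sketch using base functions round() and paste() as stand-ins for f; the variable names are hypothetical.

```r
library(dplyr)  # provides %>%

x <- c(1.234, 5.678)

# x %>% f(y) is the same as f(x, y)
piped_round <- x %>% round(1)
plain_round <- round(x, 1)

# y %>% f(x, ., z) is the same as f(x, y, z)
piped_paste <- "b" %>% paste("a", ., "c")
plain_paste <- paste("a", "b", "c")

identical(piped_round, plain_round)
identical(piped_paste, plain_paste)
```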

Tidy Data - A foundation for wrangling in R

In a tidy data set, each variable is saved in its own column and each observation is saved in its own row.

Tidy data complements R’s vectorized operations. R will automatically preserve observations as you manipulate variables. No other format works as intuitively with R.

Subset Observations (Rows)

Subset Variables (Columns)

Reshaping Data - Change the layout of a data set

tidyr::gather(cases, "year", "n", 2:4)

Gather columns into rows.
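A minimal sketch of gather() on a made-up table. The cases data frame below is a hypothetical stand-in for the EDAWR data set of the same name, with invented values. tidyr 1.0.0 and later also provide pivot_longer() for this kind of reshaping.

```r
library(tidyr)

# Hypothetical data: one column per year (values invented for illustration)
cases <- data.frame(
  country = c("FR", "DE", "US"),
  `2011` = c(7000, 5800, 15000),
  `2012` = c(6900, 6000, 14000),
  `2013` = c(7000, 6200, 13000),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Collapse columns 2 to 4 into a key column "year" and a value column "n"
long <- gather(cases, "year", "n", 2:4)
long
```

Each of the three countries now appears once per year, so the result has nine rows and the columns country, year, and n.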

tidyr::unite(data, col, ..., sep)

Unite several columns into one.
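The generic signature above can be made concrete with a small sketch. The data frame dates and its values are hypothetical, chosen only to show how the sep argument joins the pieces.

```r
library(tidyr)

# Hypothetical data: year, month, and day in separate columns
dates <- data.frame(
  y = c(2014, 2015),
  m = c("01", "07"),
  d = c("15", "04"),
  stringsAsFactors = FALSE
)

# Paste y, m, and d into a single column called "date"
united <- unite(dates, "date", y, m, d, sep = "-")
united$date
```

By default unite() drops the input columns, so united keeps only the new date column.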

dplyr::data_frame(a = 1:3, b = 4:6)

Combine vectors into data frame (optimized).

dplyr::arrange(mtcars, mpg)

Order rows by values of a column (low to high).

dplyr::arrange(mtcars, desc(mpg))

Order rows by values of a column (high to low).
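Both orderings can be tried on the built-in mtcars data. A short sketch; the result names are hypothetical.

```r
library(dplyr)

# Ascending order: smallest mpg first
low_to_high <- arrange(mtcars, mpg)

# Descending order: wrap the column in desc()
high_to_low <- arrange(mtcars, desc(mpg))

head(low_to_high$mpg)
head(high_to_low$mpg)
```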

dplyr::rename(tb, y = year)

Rename the columns of a data frame.
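A brief sketch of rename(), where the new name goes on the left of the equals sign. The table tb and its values are hypothetical. It is built with tibble(), which is assumed here as the current replacement for data_frame() (deprecated in recent releases).

```r
library(dplyr)

# Hypothetical table; tibble() is assumed in place of data_frame()
tb <- tibble(year = 2011:2013, n = c(10, 20, 30))

# Syntax is new_name = old_name
renamed <- rename(tb, y = year)
names(renamed)
```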

tidyr::spread(pollution, size, amount)

Spread rows into columns.
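A minimal sketch of spread(), the inverse of gather(). The pollution data frame is a hypothetical stand-in for the EDAWR data set, with invented values. tidyr 1.0.0 and later also provide pivot_wider() for the same task.

```r
library(tidyr)

# Hypothetical data: one row per city and particle size
pollution <- data.frame(
  city = c("New York", "New York", "London", "London"),
  size = c("large", "small", "large", "small"),
  amount = c(23, 14, 22, 16),
  stringsAsFactors = FALSE
)

# Values of `size` become column names; `amount` fills the cells
wide <- spread(pollution, size, amount)
wide
```

The result has one row per city with separate large and small columns.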

tidyr::separate(storms, date, c("y", "m", "d"))

Separate one column into several.
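A small sketch of separate(). The storms data frame here is a hypothetical stand-in for the EDAWR data set, with invented values. By default separate() splits wherever it finds a non-alphanumeric character, so the hyphens in the dates act as delimiters.

```r
library(tidyr)

# Hypothetical data: dates stored as a single string column
storms <- data.frame(
  storm = c("Alberto", "Alex"),
  date = c("2000-08-03", "1998-07-27"),
  stringsAsFactors = FALSE
)

# Split `date` into year, month, and day columns
parts <- separate(storms, date, c("y", "m", "d"))
parts
```

The new columns are character vectors, and the original date column is dropped.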

dplyr::filter(iris, Sepal.Length > 7)

Extract rows that meet logical criteria.

dplyr::distinct(iris)

Remove duplicate rows.

dplyr::sample_frac(iris, 0.5, replace = TRUE)

Randomly select fraction of rows.

dplyr::sample_n(iris, 10, replace = TRUE)

Randomly select n rows.

dplyr::slice(iris, 10:15)

Select rows by position.

dplyr::top_n(storms, 2, date)

Select the top n entries ranked by a column (by group if grouped data).
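The row-subsetting helpers above can be sketched on iris. The result names are hypothetical, and set.seed() is added only to make the random draws repeatable. Recent dplyr versions mark sample_n(), sample_frac(), and top_n() as superseded by slice_sample() and slice_max(), but the older functions are assumed to remain available.

```r
library(dplyr)

set.seed(1)  # make the random samples repeatable

unique_rows <- distinct(iris)                        # drop duplicate rows
by_position <- slice(iris, 10:15)                    # rows 10 to 15
half_sample <- sample_frac(iris, 0.5, replace = TRUE)  # 50% of rows
ten_rows    <- sample_n(iris, 10, replace = TRUE)    # exactly 10 rows

# Rows holding the two largest Sepal.Length values; ties are all kept,
# so more than two rows can come back
top_two <- top_n(iris, 2, Sepal.Length)

nrow(by_position)
nrow(half_sample)
nrow(top_two)
```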

Logic in R - ?Comparison, ?base::Logic

<     Less than                    !=                  Not equal to
>     Greater than                 %in%                Group membership
==    Equal to                     is.na               Is NA
<=    Less than or equal to        !is.na              Is not NA
>=    Greater than or equal to     &,|,!,xor,any,all   Boolean operators
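These operators are what filter() evaluates row by row. A short sketch using the built-in iris and airquality data sets; the result names are hypothetical.

```r
library(dplyr)

# Comma-separated conditions are combined with AND
big_two <- filter(iris,
                  Species %in% c("setosa", "virginica"),
                  Sepal.Length >= 7)

# | means OR
either <- filter(iris, Sepal.Length > 7 | Petal.Width == 0.1)

# Keep only rows where Ozone is not missing
no_missing <- filter(airquality, !is.na(Ozone))

nrow(big_two)
nrow(either)
nrow(no_missing)
```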

dplyr::select(iris, Sepal.Width, Petal.Length, Species)

Select columns by name or helper function.

Helper functions for select - ?select

select(iris, contains("."))

Select columns whose name contains a character string.

select(iris, ends_with("Length"))

Select columns whose name ends with a character string.

select(iris, everything())

Select every column.

select(iris, matches(".t."))

Select columns whose name matches a regular expression.

select(iris, num_range("x", 1:5))

Select columns named x1, x2, x3, x4, x5.

select(iris, one_of(c("Species", "Genus")))

Select columns whose names are in a group of names.

select(iris, starts_with("Sepal"))

Select columns whose name starts with a character string.

select(iris, Sepal.Length:Petal.Width)

Select all columns between Sepal.Length and Petal.Width (inclusive).

select(iris, -Species)

Select all columns except Species.
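To see what several of the helpers return, the sketch below collects the selected column names from iris. The result names are hypothetical. Note that contains() matches a literal string, while matches() treats its argument as a regular expression.

```r
library(dplyr)

sepal_cols  <- names(select(iris, starts_with("Sepal")))
length_cols <- names(select(iris, ends_with("Length")))
dot_cols    <- names(select(iris, contains(".")))   # literal "."
regex_cols  <- names(select(iris, matches(".t.")))  # "t" with a character on each side
range_cols  <- names(select(iris, Sepal.Length:Petal.Width))
no_species  <- names(select(iris, -Species))

sepal_cols
length_cols
regex_cols
```

On iris, the last four selections all return the four measurement columns, since Species is the only name without a period or an interior "t".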

Learn more with browseVignettes(package = c("dplyr", "tidyr")) • dplyr 0.4.0 • tidyr 0.2.0 • Updated: 1/15

devtools::install_github("rstudio/EDAWR") for data sets

Summarise Cases

Group Cases

Manipulate Cases

Manipulate Variables

dplyr::group_by(iris, Species)

Group data into rows with the same value of Species.

dplyr::ungroup(iris)

Remove grouping information from data frame.

iris %>% group_by(Species) %>% summarise(…)

Compute separate summary row for each group.

dplyr::summarise(iris, avg = mean(Sepal.Length))

Summarise data into single row of values.
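Grouping changes what summarise() returns: one row overall without groups, one row per group with them. A minimal sketch on iris; the result names are hypothetical.

```r
library(dplyr)

# Ungrouped: a single summary row
overall <- summarise(iris, avg = mean(Sepal.Length))

# Grouped: one summary row per Species
by_species <- iris %>%
  group_by(Species) %>%
  summarise(avg = mean(Sepal.Length))

# ungroup() strips the grouping so later verbs act on the whole table
grouped   <- group_by(iris, Species)
ungrouped <- ungroup(grouped)

overall
by_species
```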