The R Journal Vol. X/Y, Month, Year ISSN 2073-4859

stringr: modern, consistent string processing

Hadley Wickham

Abstract: String processing is not glamorous, but it is frequently used in data cleaning and preparation. The existing string functions in R are powerful, but not friendly. To remedy this, the stringr package provides string functions that are simpler and more consistent, and also fixes some functionality that R is missing compared to other programming languages.

Introduction

Strings are not glamorous, high-profile components of R, but they do play a big role in many data cleaning and preparation tasks. R provides a solid set of string operations, but because they have grown organically over time, they can be inconsistent and a little hard to learn. Additionally, they lag behind the string operations in other programming languages, so that some things that are easy to do in languages like Ruby or Python are rather hard to do in R. The stringr package aims to remedy these problems by providing a clean, modern interface to common string operations.

More concretely, stringr:

  • Processes factors and characters in the same way.
  • Gives functions consistent names and arguments.
  • Simplifies string operations by eliminating options that you don’t need 95% of the time (the other 5% of the time you can use the base functions).
  • Produces outputs that can easily be used as inputs. This includes ensuring that missing inputs result in missing outputs, and zero length inputs result in zero length outputs.
  • Completes R’s string handling functions with useful functions from other programming languages.

To meet these goals, stringr provides two basic families of functions:

  • basic string operations, and
  • pattern matching functions which use regular expressions to detect, locate, match, replace, extract, and split strings.

These are described in more detail in the following sections.
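
The input/output guarantees listed above can be checked directly. The following is a minimal sketch, assuming a reasonably recent stringr is installed; the sample values are made up for illustration, and exact behaviour may differ between package versions.

```r
library(stringr)

# A missing input gives a missing output.
len_missing <- str_length(NA_character_)
hit_missing <- str_detect(NA_character_, "a")

# A zero-length input gives a zero-length output.
len_empty <- str_length(character(0))
hit_empty <- str_detect(character(0), "a")

# A factor is measured by its labels, like the equivalent character vector.
fruit_factor <- factor(c("apple", "kiwi"))   # hypothetical sample data
len_factor   <- str_length(fruit_factor)
len_chars    <- str_length(c("apple", "kiwi"))
```

Because missing and empty inputs pass through unchanged in shape, the result of one call can be fed to the next without special-case checks.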

Basic string operations

There are four string functions that are closely related to their base R equivalents, but with a few enhancements:

  • str_c is equivalent to paste, but it uses the empty string ("") as the default separator and silently removes zero length arguments.
  • str_length is equivalent to nchar, but it preserves NA’s (rather than giving them length 2) and converts factors to characters (not integers).
  • str_sub is equivalent to substr, but it returns a zero length vector if any of its inputs are zero length, and otherwise expands each argument to match the longest. It also accepts negative positions, which are counted backwards from the last character. The end position defaults to -1, which corresponds to the last character.
  • str_sub<- is equivalent to substr<-, but like str_sub it understands negative indices, and replacement strings do not need to be the same length as the string they are replacing.
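
A short sketch of the functions in this list, assuming stringr is installed. The strings are invented for illustration. NULL is used as the dropped argument because the handling of other zero-length arguments in str_c may vary between stringr versions.

```r
library(stringr)

# str_c: the default separator is the empty string; NULL arguments are dropped.
joined <- str_c("data", NULL, "base")   # "database"

# str_length: a missing value stays missing.
len_na <- str_length(NA_character_)     # NA

# str_sub: negative positions count back from the last character,
# and the end position defaults to -1 (the last character).
word  <- "stringr"
tail3 <- str_sub(word, -3)              # "ngr"
inner <- str_sub(word, 2, -2)           # "tring"

# str_sub<-: the replacement may differ in length from the part it replaces.
x <- "stringr"
str_sub(x, 1, 6) <- "base"              # x is now "baser"
```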

Three functions add new functionality:

  • str_dup to duplicate the characters within a string.
  • str_trim to remove leading and trailing whitespace.
  • str_pad to pad a string with extra whitespace on the left, right, or both sides.
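
As a rough illustration of these three functions, assuming stringr is installed (the input strings are arbitrary examples):

```r
library(stringr)

# str_dup repeats the whole string a given number of times.
rule <- str_dup("-=", 3)                             # "-=-=-="

# str_trim strips whitespace from both ends, including the trailing newline.
cleaned <- str_trim("  hello world \n")              # "hello world"

# str_pad widens a string to a minimum width on the chosen side.
left  <- str_pad("42", width = 5, side = "left")     # "   42"
right <- str_pad("42", width = 5, side = "right")    # "42   "
both  <- str_pad("ab", width = 6, side = "both")     # "  ab  "
```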

Pattern matching

stringr provides pattern matching functions to detect, locate, extract, match, replace, and split strings:

  • str_detect detects the presence or absence of a pattern and returns a logical vector. Based on grepl.
  • str_locate locates the first position of a pattern and returns a numeric matrix with columns start and end. str_locate_all locates all matches, returning a list of numeric matrices. Based on regexpr and gregexpr.
  • str_extract extracts text corresponding to the first match, returning a character vector. str_extract_all extracts all matches and returns a list of character vectors.
  • str_match extracts capture groups formed by () from the first match. It returns a character matrix with one column for the complete match and one column for each group. str_match_all extracts capture groups from all matches and returns a list of character matrices.
  • str_replace replaces the first matched pattern and returns a character vector. str_replace_all replaces all matches. Based on sub and gsub.
  • str_split_fixed splits the string into a fixed number of pieces based on a pattern and returns a character matrix. str_split splits a string into a variable number of pieces and returns a list of character vectors.

Figure 1 shows how the simple (single match) form of each of these functions works.
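
The single-match forms can be tried on a small input. The sketch below is not the paper's figure. It assumes stringr is installed, and both the `contact` vector and the `phone` pattern are hypothetical examples.

```r
library(stringr)

contact <- c("home: 555-1234", "no number here")   # hypothetical sample data
phone   <- "([0-9]{3})-([0-9]{4})"                 # two capture groups

has_phone <- str_detect(contact, phone)       # logical vector
where     <- str_locate(contact, phone)       # matrix with columns start, end
number    <- str_extract(contact, phone)      # first match, NA if none
parts     <- str_match(contact, phone)        # full match plus one column per group
masked    <- str_replace(contact, phone, "XXX-XXXX")
halves    <- str_split_fixed(contact, ": ", 2)   # character matrix, 2 columns
```

Every result has one element or one row per input string, so the outputs line up with `contact` even when a string has no match.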

Arguments

Each pattern matching function has the same first two arguments, a character vector of strings to process and a single pattern (regular expression) to match. The replace functions have an additional argument specifying the replacement string, and the split functions have an argument to specify the number of pieces.

Unlike base string functions, stringr only offers limited control over the type of matching. The fixed() and ignore.case() functions modify the pattern to use fixed matching or to ignore case, but if you want to use perl-style regular expressions or to match on bytes instead of characters, you’re out of luck and you’ll have to use the base string functions. This is a deliberate choice made to simplify these functions. For example, while grepl has six arguments, str_detect only has two.
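
To illustrate how a pattern modifier changes matching, here is a minimal sketch using fixed(), assuming stringr is installed. The file names are hypothetical. ignore.case() is left out because its availability depends on the installed stringr version.

```r
library(stringr)

files <- c("report.csv", "reportXcsv")   # hypothetical file names

# As a regular expression, "." matches any single character,
# so both names match.
regex_hit <- str_detect(files, "report.csv")

# Wrapped in fixed(), the pattern is compared literally,
# so "." only matches a real dot.
fixed_hit <- str_detect(files, fixed("report.csv"))
```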

Regular expressions

To be able to use these functions effectively, you’ll need a good knowledge of regular expressions (Friedl, 1997), which this paper is not going to teach you. Some useful tools to get you started:

  • A good reference sheet^1
  • A tool that allows you to interactively test^2 what a regular expression will match
  • A tool to build a regular expression^3 from an input string

When writing regular expressions, I strongly recommend generating a list of positive (pattern should match) and negative (pattern shouldn’t match) test cases to ensure that you are matching the correct components.
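
One way to put this advice into practice is a pair of test vectors checked with str_detect. This is only a sketch, assuming stringr is installed; the pattern and the test cases are hypothetical.

```r
library(stringr)

phone <- "^[0-9]{3}-[0-9]{4}$"   # hypothetical pattern under test

should_match     <- c("555-1234", "000-0000")
should_not_match <- c("5551234", "555-12345", "abc-defg", "")

# Every positive case must match and no negative case may match.
positives_ok <- all(str_detect(should_match, phone))
negatives_ok <- !any(str_detect(should_not_match, phone))
stopifnot(positives_ok, negatives_ok)
```

Keeping both vectors next to the pattern makes it cheap to re-run the check whenever the pattern is edited.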

Functions that return lists

Many of the functions return a list of vectors or matrices. To work with each element of the list there are two strategies: iterate through a common set of indices, or use mapply to iterate through the vectors simultaneously. The first approach is usually easier to understand and is illustrated in Figure 2.
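
Both strategies can be sketched on the output of str_extract_all. This is not the paper's figure, only an illustration that assumes stringr is installed; the input strings and variable names are made up.

```r
library(stringr)

strings <- c("a1b22", "c333", "none")           # hypothetical inputs
digits  <- str_extract_all(strings, "[0-9]+")   # list, one element per string

# Strategy 1: iterate over a common set of indices.
counts <- integer(length(strings))
for (i in seq_along(strings)) {
  counts[i] <- length(digits[[i]])
}

# Strategy 2: let mapply walk the strings and their matches in parallel.
labels <- mapply(
  function(s, d) paste0(s, ": ", length(d)),
  strings, digits,
  USE.NAMES = FALSE
)
```

The index loop keeps each step visible, while mapply is more compact once the per-element function is settled.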

Conclusion

stringr provides an opinionated interface to strings in R. It makes string processing simpler by removing uncommon options, and by vigorously enforcing consistency across functions. I have also added new functions that I have found useful from Ruby, and over time, I hope users will suggest useful functions from other programming languages. I will continue to build on the included test suite to ensure that the package behaves as expected and remains bug free.

Bibliography

J. E. Friedl. Mastering Regular Expressions. O’Reilly, 1997. URL http://oreilly.com/catalog/

Hadley Wickham
Department of Statistics
Rice University
6100 Main St MS#
Houston TX 77005-
USA

^1 http://www.regular-expressions.info/reference.html
^2 http://gskinner.com/RegExr/
^3 http://www.txt2re.com