

This paper introduces the stringr package for R, which aims to simplify and modernize string processing in data cleaning and preparation tasks. The package provides a consistent interface to common string operations and to pattern matching functions that use regular expressions. It also offers functions to duplicate, trim, and pad strings. The examples include extracting phone numbers and replacing color names with their hex equivalents.


Hadley Wickham
Abstract
String processing is not glamorous, but it is frequently used in data cleaning and preparation. The existing string functions in R are powerful, but not friendly. To remedy this, the stringr package provides string functions that are simpler and more consistent, and also fixes some functionality that R is missing compared to other programming languages.
Strings are not glamorous, high-profile components of R, but they do play a big role in many data cleaning and preparation tasks. R provides a solid set of string operations, but because they have grown organically over time, they can be inconsistent and a little hard to learn. Additionally, they lag behind the string operations in other programming languages, so that some things that are easy to do in languages like Ruby or Python are rather hard to do in R. The stringr package aims to remedy these problems by providing a clean, modern interface to common string operations. More concretely, stringr:
There are three string functions that are closely related to their base R equivalents, but with a few enhancements:
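As a rough illustration of what such functions look like in use, the sketch below assumes the three functions are `str_c()`, `str_length()`, and `str_sub()` from the stringr package; the function names are not given in the text above, and the input values are made up for the example.

```r
library(stringr)

# str_c() joins strings, like paste(), but with an empty default separator
joined <- str_c("data", "-", "cleaning")

# str_length() counts characters, like nchar(), and keeps NA as NA
lens <- str_length(c("apple", NA, "kiwi"))

# str_sub() extracts substrings, like substr(); negative positions
# count backwards from the end of the string
head_part <- str_sub("stringr", 1, 6)
tail_part <- str_sub("stringr", -3, -1)

print(joined)     # "data-cleaning"
print(lens)       # 5 NA 4
print(head_part)  # "string"
print(tail_part)  # "ngr"
```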
Three functions add new functionality:
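A minimal sketch of duplicating, trimming, and padding strings, assuming the functions in question are `str_dup()`, `str_trim()`, and `str_pad()` as exported by the stringr package; the inputs are illustrative only.

```r
library(stringr)

# str_dup() repeats each string a given number of times
dup <- str_dup("ab", 3)

# str_trim() removes leading and trailing whitespace
trimmed <- str_trim("  padded  ")

# str_pad() pads a string to a fixed width; here on the left with zeros
padded <- str_pad("7", width = 3, side = "left", pad = "0")

print(dup)      # "ababab"
print(trimmed)  # "padded"
print(padded)   # "007"
```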
stringr provides pattern matching functions to detect, locate, extract, match, replace, and split strings:
Figure 1 shows how the simple (single match) form of each of these functions works.
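The single match forms can be sketched as follows. This is an illustration rather than a reproduction of the figure: the sample strings and the phone number pattern are assumptions made up for the example, and the function names are those exported by the stringr package.

```r
library(stringr)

strings <- c("call 555-1234 today", "no number here")
pattern <- "([0-9]{3})-([0-9]{4})"

# detect: does each string contain a match?
detected <- str_detect(strings, pattern)

# locate: start and end position of the first match (NA when there is none)
located <- str_locate(strings, pattern)

# extract: text of the first match
extracted <- str_extract(strings, pattern)

# match: the first match plus each parenthesised group, as a character matrix
matched <- str_match(strings, pattern)

# replace: substitute the first match
replaced <- str_replace(strings, pattern, "XXX-XXXX")

# split: break each string into pieces at the pattern; returns a list
pieces <- str_split("a,b,c", ",")

print(detected)   # TRUE FALSE
print(extracted)  # "555-1234" NA
print(replaced)   # "call XXX-XXXX today" "no number here"
```

Note that the functions returning positions or groups give a matrix with one row per input string, while `str_split()` gives a list with one element per input string.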
Each pattern matching function has the same first two arguments, a character vector of strings to process and a single pattern (regular expression) to match. The replace functions have an additional argument specifying the replacement string, and the split functions have an argument to specify the number of pieces.
Unlike base string functions, stringr only offers limited control over the type of matching. The fixed() and ignore.case() functions modify the pattern to use fixed matching or to ignore case, but if you want to use Perl-style regular expressions or to match on bytes instead of characters, you’re out of luck and you’ll have to use the base string functions. This is a deliberate choice made to simplify these functions. For example, while grepl has six arguments, str_detect only has two.
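A short sketch of how a pattern modifier changes matching, and of the extra argument on the split functions. One assumption to flag: case-insensitive matching is written here as `regex(pattern, ignore_case = TRUE)`, which is the spelling in current stringr releases; the `ignore.case()` helper named above belongs to the version of the package described in this paper. The input strings are made up.

```r
library(stringr)

x <- c("a.b", "aXb", "A.B")

# Default: the pattern is a regular expression, so "." matches any character
as_regex <- str_detect(x, "a.b")

# fixed(): the pattern is matched literally, so "." only matches a dot
as_fixed <- str_detect(x, fixed("a.b"))

# Case-insensitive regular expression (assumed modern spelling, see above)
no_case <- str_detect(x, regex("a.b", ignore_case = TRUE))

# Split functions take the number of pieces; str_split_fixed() returns a matrix
two_pieces <- str_split_fixed("a-b-c", "-", n = 2)

print(as_regex)    # TRUE TRUE FALSE
print(as_fixed)    # TRUE FALSE FALSE
print(no_case)     # TRUE TRUE TRUE
print(two_pieces)  # one row: "a" "b-c"
```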
To be able to use these functions effectively, you’ll need a good knowledge of regular expressions (Friedl, 1997), which this paper is not going to teach you. Some useful tools to get you started:
When writing regular expressions, I strongly recommend generating a list of positive (pattern should match) and negative (pattern shouldn’t match) test cases to ensure that you are matching the correct components.
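One way to put this advice into practice is to keep the test cases next to the pattern and check them with `stopifnot()`. The pattern and the test strings below are hypothetical, chosen only to show the idea.

```r
library(stringr)

# Hypothetical pattern: a seven-digit phone number written as 555-1234
pattern <- "^[0-9]{3}-[0-9]{4}$"

# Positive cases: the pattern should match every one of these
should_match <- c("555-1234", "000-0000")

# Negative cases: the pattern should match none of these
should_not_match <- c("5551234", "555-12345", "abc-defg", "")

positive_ok <- all(str_detect(should_match, pattern))
negative_ok <- !any(str_detect(should_not_match, pattern))

# Fails loudly if the pattern is too strict or too loose
stopifnot(positive_ok, negative_ok)
```

Rerunning these checks after every change to the pattern makes it easy to see which case a change has broken.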
Many of the functions return a list of vectors or matrices. To work with each element of the list there are two strategies: iterate through a common set of indices, or use mapply to iterate through the vectors simultaneously. The first approach is usually easier to understand and is illustrated in Figure 2.
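The two strategies can be compared side by side in a small sketch. It is not the example from the figure: the input strings and the digit pattern are assumptions, and `str_locate_all()` is used simply because it returns a list with one matrix per input string.

```r
library(stringr)

strings <- c("a1b22", "333")

# One start/end matrix per input string
locs <- str_locate_all(strings, "[0-9]+")

# Strategy 1: iterate through a common set of indices
by_index <- lapply(seq_along(strings), function(i) {
  loc <- locs[[i]]
  str_sub(strings[i], loc[, "start"], loc[, "end"])
})

# Strategy 2: use mapply to walk through both vectors simultaneously
by_mapply <- mapply(
  function(s, loc) str_sub(s, loc[, "start"], loc[, "end"]),
  strings, locs,
  SIMPLIFY = FALSE, USE.NAMES = FALSE
)

print(by_index)  # list("1" "22", "333")
```

Both strategies give the same result here; the index version keeps the current position `i` available, which helps when several parallel objects have to be updated in step.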
Conclusion
stringr provides an opinionated interface to strings in R. It makes string processing simpler by removing uncommon options, and by vigorously enforcing consistency across functions. I have also added new functions that I have found useful from Ruby, and over time, I hope users will suggest useful functions from other programming languages. I will continue to build on the included test suite to ensure that the package behaves as expected and remains bug free.
Bibliography
J. E. Friedl. Mastering Regular Expressions. O’Reilly, 1997.
Hadley Wickham
Department of Statistics
Rice University
6100 Main St MS#
Houston TX 77005-
USA
[email protected]
1. http://www.regular-expressions.info/reference.html
2. http://gskinner.com/RegExr/
3. http://www.txt2re.com