

This document from the University of San Francisco’s Department of Computer Science introduces regular expressions, a powerful tool for finding, replacing, and ignoring strings or lines that match specific patterns. It presents examples and methods for using regular expressions in Python, including escaping characters, finding sequences of characters, and using the re module. Students will learn how to use regular expressions to extract information from strings, such as phone numbers and addresses.


Chris Brooks, Department of Computer Science, University of San Francisco
Often, when you’re working with strings or files, you need to find, replace, or ignore strings or lines that match some sort of pattern.
  Find lines containing 'img src'
  Find strings that begin with a number
  Find strings that don’t contain a number
  Replace all instances of strings beginning with CS with
How can we do this in an automated way?
Python strings have find, count, replace, and split methods.
This works when you know the exact string you’re looking for.
Awkward when you have multiple occurrences, or many different matches.
Doesn’t deal well with case.
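As a rough illustration of those methods and their limits, here is a minimal sketch. The sample string and variable names are made up for this example and do not come from the slides.

```python
# Hypothetical sample string, made up for illustration.
s = "100 North Main Road, north wing"

position = s.find("North")          # index of the first exact match, -1 if absent
exact_count = s.count("North")      # only the exact-case spelling is counted
replaced = s.replace("Road", "Rd")  # replaces every exact occurrence
words = s.split()                   # splits on runs of whitespace

# Case has to be handled by hand, e.g. by lowercasing a copy first.
any_case_count = s.lower().count("north")

print(position, exact_count, any_case_count)
print(replaced)
print(words)
```

The exact-match count misses the lowercase "north", which is the kind of case problem the slide refers to.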
Replace all instances of 'ROAD' with 'RD':
  s.replace('ROAD', 'RD')
What about '111 BROAD ROAD'? The ROAD inside BROAD is replaced too.
  s[:-4] + s[-4:].replace('ROAD', 'RD')
Better, but brittle and hard to read.
Need to know that 'ROAD' is the last four chars.
  re.sub('ROAD$', 'RD', s)
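The three approaches can be compared side by side on the address from the slide. This is only a sketch; the variable names naive, sliced, and anchored are illustrative.

```python
import re

s = "111 BROAD ROAD"

# Plain replace rewrites every occurrence, including the one inside BROAD.
naive = s.replace("ROAD", "RD")

# Slicing limits the replacement to the last four characters.
sliced = s[:-4] + s[-4:].replace("ROAD", "RD")

# The $ anchor only lets ROAD match at the end of the string.
anchored = re.sub("ROAD$", "RD", s)

print(naive)     # 111 BRD RD
print(sliced)    # 111 BROAD RD
print(anchored)  # 111 BROAD RD
```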
$ says to only match occurrences at the end of a line.
What about cases where someone leaves off the street type? If the address ends in BROAD, ROAD$ still matches.
Or cases where there’s an apartment number after it? '111 BROAD ROAD Apt 3' doesn’t match.
  re.sub('\\bROAD\\b', 'RD', s)
'\b' indicates a word boundary; the backslash needs to be escaped.
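A small sketch comparing the end-anchored pattern with the word-boundary pattern. The address '111 BROAD' (street type left off) is an assumed example, and the constant names are illustrative.

```python
import re

END_ONLY = "ROAD$"          # ROAD anchored at the end of the string
WHOLE_WORD = "\\bROAD\\b"   # ROAD as a separate word; backslashes are doubled

# The second address is a made-up case with the street type left off.
addresses = ["111 BROAD ROAD", "111 BROAD", "111 BROAD ROAD Apt 3"]

end_only = [re.sub(END_ONLY, "RD", a) for a in addresses]
whole_word = [re.sub(WHOLE_WORD, "RD", a) for a in addresses]

print(end_only)    # ['111 BROAD RD', '111 BRD', '111 BROAD ROAD Apt 3']
print(whole_word)  # ['111 BROAD RD', '111 BROAD', '111 BROAD RD Apt 3']
```

With word boundaries, the ROAD inside BROAD is left alone and the apartment number no longer blocks the match.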
The '\' character needs to be escaped by putting another '\' in front of it.
Otherwise, Python treats it as the start of an escape sequence with the following character, e.g. '\t'.
When you need to work with lots of backslashes, you can use a raw string.
  s2 = r'\bROAD\b'
No escape sequences are evaluated in a raw string.
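To see the difference concretely, the following sketch builds the same pattern three ways. The names plain, escaped, and raw are illustrative only.

```python
import re

plain = "\bROAD\b"       # Python turns each \b into a backspace character
escaped = "\\bROAD\\b"   # doubled backslashes survive as single backslashes
raw = r"\bROAD\b"        # raw string: backslashes are left alone

print(len(plain), len(escaped), len(raw))  # 6 8 8
print(escaped == raw)                      # True

s = "111 BROAD ROAD"
with_plain = re.sub(plain, "RD", s)  # looks for literal backspaces, finds none
with_raw = re.sub(raw, "RD", s)      # word-boundary match

print(with_plain)  # 111 BROAD ROAD
print(with_raw)    # 111 BROAD RD
```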
Find addresses that begin with 1-3 4s.
  4 Main St, 44 Mulberry Lane, 423 First St., etc.
  pat = '^44?4?'
'^' indicates the beginning of the line.
'?' indicates 0 or 1 occurrence of the previous character.
re.match() returns a Match object if there is a match, None otherwise.
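A short sketch of re.match() with this pattern. The first three addresses are the ones listed above; the last two are made up to show a miss and an edge case.

```python
import re

pat = "^44?4?"

# The first three addresses come from the text; the last two are made up.
addresses = ["4 Main St", "44 Mulberry Lane", "423 First St.",
             "12 Elm St", "4444 Oak Ave"]

matched = {}
for address in addresses:
    m = re.match(pat, address)                    # Match object or None
    matched[address] = m.group() if m else None   # text that was matched

for address, prefix in matched.items():
    print(address, "->", prefix)
```

Note that the pattern only describes the start of the string, so '4444 Oak Ave' still matches (on its first three 4s); rejecting a fourth 4 would need more pattern after it.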
Now, let’s suppose we need to find sequences that begin with 1-3 4s, and can end with '01', '02', or 2 followed by 1-3 zeros.
Start with our previous pattern: '^44?4?'
Ending with '01' is '(01)?', with '02' is '(02)?'
'^44?4?(01|02)$' will match 401, 44402, 4402, etc.
Add 200?0? to parens: '^44?4?(01|02|200?0?)$'
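A minimal sketch of the combined pattern, assuming "2 followed by 1-3 zeros" is written as 200?0? inside the parentheses. The candidate strings beyond 401, 44402, and 4402 are made up for illustration.

```python
import re

# 1-3 fours, then 01, 02, or a 2 followed by 1-3 zeros, then end of string.
pat = "^44?4?(01|02|200?0?)$"

# 401, 44402 and 4402 come from the text; the rest are made-up test cases.
candidates = ["401", "44402", "4402", "4420", "4442000",
              "501", "44403", "420000"]

results = {c: bool(re.match(pat, c)) for c in candidates}

for candidate, ok in results.items():
    print(candidate, ok)
```

The last three candidates fail: 501 does not start with a 4, 03 is not an allowed ending, and 420000 has four zeros after the 2.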
We can also use the {n,m} syntax to match repetition.
^4{1,3} matches 1-3 4s at the beginning of the string.
We can rewrite the previous expression as:
  p = '^4{1,3}(0(1|2)|20{1,3})$'
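One way to gain confidence in the rewrite is to check that both spellings agree on a handful of inputs. This is a sketch only: the ?-based pattern is assumed to be '^44?4?(01|02|200?0?)$', and the sample strings are made up.

```python
import re

# Assumed ?-based spelling of the earlier pattern.
with_optionals = re.compile("^44?4?(01|02|200?0?)$")
# The same pattern using {n,m} repetition.
with_braces = re.compile("^4{1,3}(0(1|2)|20{1,3})$")

# Made-up samples covering matches and near misses.
samples = ["401", "4402", "44402", "420", "44200", "4442000",
           "444401", "420000", "403", "01", "4"]

agreement = [
    bool(with_optionals.match(s)) == bool(with_braces.match(s))
    for s in samples
]
accepted = [s for s in samples if with_braces.match(s)]

print(all(agreement))
print(accepted)
```

Agreement on a few samples is not a proof of equivalence, but it is a cheap sanity check when rewriting a pattern.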
'.' - matches any character (except a newline, by default)
[x-y] - match a range or set of characters.
  [abc] will match a or b or c
  [A-Z] will match A, B, C, ..., Z
Can also take the complement of a set - [^abc] will match any character except a, b, or c.
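A brief sketch of the dot and of character sets using re.findall(). The sample strings are made up for illustration.

```python
import re

# '.' matches any single character except a newline (by default).
dot = re.findall("c.t", "cat cot c\nt cut")

# A set matches any one of the listed characters.
in_set = re.findall("[abc]", "cabbage")

# A range matches any character between its endpoints.
in_range = re.findall("[A-Z]", "University of San Francisco")

# A leading ^ inside the brackets takes the complement of the set.
not_in_set = re.findall("[^abc]", "cabbage")

print(dot)         # ['cat', 'cot', 'cut']
print(in_set)      # ['c', 'a', 'b', 'b', 'a']
print(in_range)    # ['U', 'S', 'F']
print(not_in_set)  # ['g', 'e']
```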
'\b' - word boundary
'\d' - any digit
'\D' - anything that’s not a digit (same as [^0-9])
'\w' - equivalent to '[A-Za-z0-9_]'
'\s' - whitespace
'\S' - non-whitespace
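These shorthand classes combine naturally when pulling data out of text. The sketch below assumes a phone number format of three digits, three digits, and four digits separated by hyphens; that format and all sample strings are assumptions made for this example, not something the slides specify.

```python
import re

# Assumed phone format: ddd-ddd-dddd. Sample text is made up.
text = "Call 415-555-0123 or 415-555-0199 before 5pm"
phones = re.findall(r"\b\d{3}-\d{3}-\d{4}\b", text)

# \D removes everything that is not a digit.
digits_only = re.sub(r"\D", "", "415-555-0123")

# \w+ grabs runs of letters, digits and underscores.
words = re.findall(r"\w+", "Apt 3, Main_St")

# \s+ splits on any run of whitespace; \S+ grabs the non-whitespace runs.
fields = re.split(r"\s+", "111  BROAD\tROAD")
chunks = re.findall(r"\S+", "Apt 3, Main_St")

print(phones)       # ['415-555-0123', '415-555-0199']
print(digits_only)  # 4155550123
print(words)        # ['Apt', '3', 'Main_St']
print(fields)       # ['111', 'BROAD', 'ROAD']
print(chunks)       # ['Apt', '3,', 'Main_St']
```

Raw strings are used throughout so that the backslashes reach the re module unchanged.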
'?' - 0 or 1 occurrences
'+' - 1 or more occurrences
'*' - 0 or more occurrences
'{m,n}' - m to n occurrences
'{m}' - exactly m occurrences
'?' after another multiple-occurrence matcher - non-greedy, e.g. '<.*?>'
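The quantifiers, and the difference between greedy and non-greedy matching, can be sketched as follows. All sample strings are made up for illustration.

```python
import re

optional = re.findall("colou?r", "color colour")  # ? : 0 or 1 'u'
one_or_more = re.findall("go+", "g go goo")       # + : at least one 'o'
zero_or_more = re.findall("go*", "g go goo")      # * : any number of 'o's
exactly_two = re.findall("o{2}", "goo gooo")      # {m} : exactly two 'o's
one_to_two = re.findall("o{1,2}", "gooo")         # {m,n} : one or two 'o's

html = "<b>bold</b> and <i>italic</i>"
greedy = re.findall("<.*>", html)   # .* grabs as much as it can
lazy = re.findall("<.*?>", html)    # .*? stops at the first '>'

print(optional)      # ['color', 'colour']
print(one_or_more)   # ['go', 'goo']
print(zero_or_more)  # ['g', 'go', 'goo']
print(exactly_two)   # ['oo', 'oo']
print(one_to_two)    # ['oo', 'o']
print(greedy)        # ['<b>bold</b> and <i>italic</i>']
print(lazy)          # ['<b>', '</b>', '<i>', '</i>']
```

The greedy version swallows the whole line in one match, while the non-greedy version returns each tag separately.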