
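I. Python re: Regular Expressions
The re module is Python's standard-library interface to regular expressions: it provides functions for searching, matching, splitting, and replacing text by pattern.
**II. Pattern Syntax: The Complete Reference**
1. Literals and Metacharacters
1) Plain literals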
Most characters match themselves exactly.
import re
# Literal match: 'cat' matches exactly the string "cat"
print(re.search(r'cat', 'I have a cat'))   # match
print(re.search(r'cat', 'I have a CAT'))   # None (case-sensitive by default)
print(re.search(r'cat', 'concatenate'))    # match (found inside)
2) The 14 metacharacters
These characters have special meaning and must be escaped with \ to match literally:
. ^ $ * + ? { } [ ] \ | ( )
import re
# Matching a literal dot: it must be escaped
text = "price: $3.99"
print(re.search(r'3.99', text))    # matches "3.99", but would also match "3X99" (dot = any char)
print(re.search(r'3\.99', text))   # matches ONLY "3.99" (correct)
# Matching a literal backslash
print(re.search(r'C:\\Users', r'C:\Users'))   # matches C:\Users
Always use raw strings (r'pattern') for regex patterns. Without the r prefix, Python processes \n as a newline before the regex engine sees it. With r'\n', the regex engine receives the two literal characters \ and n and interprets them itself as "newline character".
2. The Dot (.): Any character
. matches any single character except a newline \n (unless re.DOTALL flag is set).
import re
# . matches exactly ONE character (any except \n)
print(re.findall(r'c.t', 'cat cut c t c\nt coot'))
# → ['cat', 'cut', 'c t']   ('c\nt' is skipped because dot does not match \n)
# Note: 'coot' is not matched because dot matches exactly 1 char
# With re.DOTALL, dot matches newline too
text = "first\nsecond"
print(re.search(r'first.second', text))              # None
print(re.search(r'first.second', text, re.DOTALL))   # match
3. Anchors: Position matchers
Anchors match positions, not characters.
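To make "positions, not characters" concrete, here is a minimal sketch (the sample text is made up for illustration). Substituting at an anchor inserts text without consuming anything, and the match object for an anchor alone is empty.

```python
import re

text = "first line\nsecond line"

# '^' with re.MULTILINE matches the empty position at the start of each line,
# so the replacement is inserted there and no character is consumed.
quoted = re.sub(r'^', '> ', text, flags=re.MULTILINE)
print(quoted)

# The match for an anchor alone is empty: start == end.
m = re.search(r'^', text)
print(m.span(), repr(m.group()))
```

Every line gains a "> " prefix, and the span printed for the bare anchor is (0, 0).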
1) ^ and $: Start and end of string/line
import re
text = "hello world"
print(re.search(r'^hello', text))   # match: 'hello' is at the start
print(re.search(r'^world', text))   # None: 'world' is NOT at the start
print(re.search(r'world$', text))   # match: 'world' is at the end
print(re.search(r'hello$', text))   # None
# With re.MULTILINE: ^ and $ match the start/end of EACH LINE
multiline = "line1\nline2\nline3"
print(re.findall(r'^\w+', multiline, re.MULTILINE))
# → ['line1', 'line2', 'line3']
print(re.findall(r'\w+$', multiline, re.MULTILINE))
# → ['line1', 'line2', 'line3']
2) \b and \B: Word boundaries
\b matches the boundary between a word character and a non-word character.
import re
# \b matches a word boundary, which prevents partial matches
print(re.findall(r'\bcat\b', 'cat cats concatenate scatter'))
# → ['cat']   (only the standalone word)
print(re.findall(r'cat', 'cat cats concatenate scatter'))
# → ['cat', 'cat', 'cat', 'cat']   (too many!)
# \B matches a NON-boundary position (inside a word)
print(re.findall(r'\Bcat\B', 'cat cats concatenate'))
# → ['cat']   (only the 'cat' inside 'concatenate')
3) \A, \Z: Absolute start/end of string
import re
# \A and \Z are NOT affected by re.MULTILINE; they always match the string start/end
text = "line1\nline2"
print(re.search(r'\Aline1', text))   # match: absolute start
print(re.search(r'\Aline2', text))   # None: line2 is NOT at the absolute start
print(re.search(r'line2\Z', text))   # match: absolute end
4. Character Classes [ ]
1) Basic character class
A character class matches one character that is any of the listed characters.
import re
# [aeiou] matches any single vowel
print(re.findall(r'[aeiou]', 'hello world'))
# → ['e', 'o', 'o']
# [a-z] matches any lowercase letter (range syntax)
print(re.findall(r'[a-z]+', 'Hello World 123'))
# → ['ello', 'orld']
# [A-Za-z0-9] matches any ASCII alphanumeric
print(re.findall(r'[A-Za-z0-9]+', 'foo_bar-123!'))
# → ['foo', 'bar', '123']
# [0-9] matches the ASCII digits (what \d matches under re.ASCII)
print(re.findall(r'[0-9]+', 'abc 123 def 456'))
# → ['123', '456']
2) Negated character class [^ ]
[^...] matches any single character NOT in the class.
import re
# [^aeiou\s] matches any character that is neither a vowel nor whitespace
print(re.findall(r'[^aeiou\s]+', 'hello world'))
# → ['h', 'll', 'w', 'rld']
# [^0-9] matches any non-digit character
print(re.findall(r'[^0-9]+', 'abc123def456'))
# → ['abc', 'def']
# Strip all non-alphanumeric characters
cleaned = re.sub(r'[^A-Za-z0-9]', '', 'Hello, World! 123')
print(cleaned)   # → HelloWorld123
3) Special sequences inside [ ]
import re
# Inside [], most metacharacters lose their special meaning
# - (dash) is literal if it is first, last, or escaped
print(re.findall(r'[-+*/]', '3+4-2*1/5'))   # → ['+', '-', '*', '/']
# ^ is literal unless it is the FIRST character
print(re.findall(r'[a^b]', 'a^b c'))        # → ['a', '^', 'b']   (literal ^)
# ] must be escaped or placed first
print(re.findall(r'[]a-z]', 'a]b'))         # → ['a', ']', 'b']
5. Predefined Character Classes
These are shorthand for common character sets. The bracket equivalents shown are exact for bytes patterns or under re.ASCII; for str patterns, \d, \w, and \s also match the corresponding Unicode characters by default.
| Shorthand | Equivalent | Meaning |
|---|---|---|
| \d | [0-9] | Any digit |
| \D | [^0-9] | Any non-digit |
| \w | [A-Za-z0-9_] | Word character |
| \W | [^A-Za-z0-9_] | Non-word character |
| \s | [ \t\n\r\f\v] | Whitespace |
| \S | [^ \t\n\r\f\v] | Non-whitespace |
import re
text = "Hello, World! 42 items at $3.99 each.\n"
print(re.findall(r'\d+', text))   # → ['42', '3', '99']
print(re.findall(r'\w+', text))   # → ['Hello', 'World', '42', 'items', 'at', '3', '99', 'each']
print(re.findall(r'\s+', text))   # → [' ', ' ', ' ', ' ', ' ', ' ', '\n']
print(re.findall(r'\W+', text))   # → [', ', '! ', ' ', ' ', ' $', '.', ' ', '.\n']
# Combining \b with a counted \w: words of exactly 5 chars
print(re.findall(r'\b\w{5}\b', text))
# → ['Hello', 'World', 'items']
6. Quantifiers: Repetition
1) Basic quantifiers
| Quantifier | Meaning |
|---|---|
| * | 0 or more |
| + | 1 or more |
| ? | 0 or 1 (optional) |
| {n} | Exactly n times |
| {n,} | n or more times |
| {n,m} | Between n and m times (inclusive) |
import re
s = "colour color colouur"
print(re.findall(r'colou?r', s))       # ? → u is optional
# → ['colour', 'color']
print(re.findall(r'colou*r', s))       # * → 0 or more u's
# → ['colour', 'color', 'colouur']
print(re.findall(r'colou+r', s))       # + → 1 or more u's
# → ['colour', 'colouur']
print(re.findall(r'colou{2}r', s))     # exactly 2 u's
# → ['colouur']
print(re.findall(r'colou{1,2}r', s))   # 1 or 2 u's
# → ['colour', 'colouur']
# Phone number: exactly 10 digits
print(re.findall(r'\d{10}', '1234567890 12345'))
# → ['1234567890']
2) Greedy vs Non-greedy
By default, quantifiers are greedy: they match as much as possible. Adding ? after a quantifier makes it non-greedy (lazy): it matches as little as possible.
import re
html = "<b>bold</b> and <i>italic</i>"
# Greedy: .* expands as far right as possible
print(re.findall(r'<.*>', html))
# → ['<b>bold</b> and <i>italic</i>']   ← one huge match (too greedy)
# Non-greedy: .*? stops at the FIRST >
print(re.findall(r'<.*?>', html))
# → ['<b>', '</b>', '<i>', '</i>']   ← each tag separately
# Extracting content between tags
print(re.findall(r'<b>(.*?)</b>', html))
# → ['bold']
# More examples
text = '"first" and "second"'
print(re.findall(r'".*"', text))    # → ['"first" and "second"']   greedy
print(re.findall(r'".*?"', text))   # → ['"first"', '"second"']    non-greedy
| Pattern | Type | Matches |
|---|---|---|
| .* | Greedy | As many chars as possible |
| .*? | Non-greedy | As few chars as possible |
| .+ | Greedy | 1+ chars, maximum |
| .+? | Non-greedy | 1+ chars, minimum |
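A common alternative to a lazy quantifier is a negated character class that simply cannot run past the delimiter. The sketch below reuses the strings from the examples above and suggests that, for these inputs, both approaches give the same result. This is an illustration rather than a general equivalence: the two forms can differ when the delimiter is longer than one character.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

lazy = re.findall(r'<.*?>', html)        # lazy quantifier: stop at the first '>'
negated = re.findall(r'<[^>]*>', html)   # negated class: '>' can never be consumed
print(lazy == negated, negated)

quoted = '"first" and "second"'
print(re.findall(r'"[^"]*"', quoted))
```

The negated-class form also matches across newlines without re.DOTALL, because [^>] excludes only the listed character.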
7. Groups: Capturing and Non-capturing
1) Capturing group ( )
Groups serve two purposes: grouping for quantifiers, and capturing the matched text.
import re
# Grouping: (ab)+ repeats the whole "ab"
print(re.findall(r'(ab)+', 'ab abab ababab'))
# → ['ab', 'ab', 'ab']   (findall returns the group, which holds its last repetition)
# Capturing: extract the content inside ()
dates = "2024-01-15, 2023-12-31, 2025-06-01"
print(re.findall(r'(\d{4})-(\d{2})-(\d{2})', dates))
# → [('2024', '01', '15'), ('2023', '12', '31'), ('2025', '06', '01')]
#   ↑ each match returns a tuple of all captured groups
# .group() on a Match object
m = re.search(r'(\d{4})-(\d{2})-(\d{2})', '2024-01-15')
print(m.group(0))   # → 2024-01-15   (entire match)
print(m.group(1))   # → 2024         (group 1)
print(m.group(2))   # → 01           (group 2)
print(m.group(3))   # → 15           (group 3)
2) Named group (?P<name>...)
import re
# Named groups: access by name instead of index
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
m = re.search(pattern, '2024-01-15')
print(m.group('year'))    # → 2024
print(m.group('month'))   # → 01
print(m.group('day'))     # → 15
print(m.groupdict())      # → {'year': '2024', 'month': '01', 'day': '15'}
# Named groups in re.sub: backreference by name
result = re.sub(
    r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})',
    r'\g<day>/\g<month>/\g<year>',   # reorder: DD/MM/YYYY
    '2024-01-15')
print(result)   # → 15/01/2024
3) Non-capturing group (?:...)
When you need grouping for alternation or quantifiers but don't want the group in your results:
import re
# With (?:...): the units are grouped but not captured
print(re.findall(r'(\d+)(?:px|em|rem)', '12px 3em 100rem'))
# → ['12', '3', '100']   ← only numbers, units NOT captured
# With a capturing group: the units appear in the results too
print(re.findall(r'(\d+)(px|em|rem)', '12px 3em 100rem'))
# → [('12', 'px'), ('3', 'em'), ('100', 'rem')]   ← units captured too
# (?:...) for grouping under a quantifier
print(re.findall(r'(?:ha)+', 'hahaha haha ha h'))
# → ['hahaha', 'haha', 'ha']   ('ha' is repeated as a unit by +)
4) Backreferences \1 \2
Refer to a previously captured group within the same pattern.
import re
# Find repeated words
text = "the the quick brown fox fox jumps"
print(re.findall(r'\b(\w+)\s+\1\b', text))
# → ['the', 'fox']   (\1 refers back to group 1)
# Find doubled characters
print(re.findall(r'(.)\1', 'aabcddee'))
# → ['a', 'd', 'e']
# HTML tag matching: opening and closing tags must match
html = "<h1>Title</h1> <h2>Subtitle</h2>"
print(re.findall(r'<(\w+)>(.*?)</\1>', html))
# → [('h1', 'Title'), ('h2', 'Subtitle')]
# \1 ensures the closing tag matches the opening tag
8. Lookahead and Lookbehind: Zero-width assertions
1) Positive lookahead (?=...)
“Match X only if followed by Y” — Y is NOT included in the match.
import re
# Match a number only if followed by "px"
print(re.findall(r'\d+(?=px)', '12px 3em 100px 5rem'))
# → ['12', '100']   (px NOT included in results)
# Match a word only if followed by a colon
text = "name: Alice age: 30 city: NYC"
print(re.findall(r'\w+(?=:)', text))
# → ['name', 'age', 'city']
# Password validation: must contain a digit
def has_digit(pw):
    return bool(re.search(r'(?=.*\d)', pw))
print(has_digit("abc123"))   # → True
print(has_digit("abcdef"))   # → False
2) Negative lookahead (?!...)
“Match X only if NOT followed by Y”
import re
# Match a number only if NOT followed by "px".
# \b anchors the start of the number; (?!\d|px) rejects both "px" and a
# backtracked partial number such as the '1' in '12px'.
print(re.findall(r'\b\d+(?!\d|px)', '12px 3em 100px 5rem'))
# → ['3', '5']
# Match 'foo' not followed by 'bar'
print(re.findall(r'foo(?!bar)', 'foobar foobaz foo'))
# → ['foo', 'foo']   ('foobar' excluded, 'foobaz' and 'foo' included)
3) Positive lookbehind (?<=...)
“Match X only if preceded by Y” — Y is NOT included in the match.
import re
# Match digits only if preceded by '$'
prices = "items: $10, €20, £30, $50"
print(re.findall(r'(?<=\$)\d+', prices))
# → ['10', '50']   ($ NOT included in results)
# Match the word after a colon and a space
text = "name: Alice, city: NYC, age: 30"
print(re.findall(r'(?<=: )\w+', text))
# → ['Alice', 'NYC', '30']
4) Negative lookbehind (?<!...)
“Match X only if NOT preceded by Y”
import re
# Match digits NOT preceded by '$'
prices = "items: $10, 20, $50, 100"
print(re.findall(r'(?<!\$)\b\d+\b', prices))
# → ['20', '100']
# Match 'ning' not preceded by 'run'
words = "running winning planning"
print(re.findall(r'(?<!run)ning\b', words))
# → ['ning', 'ning']   (from 'winning' and 'planning'; 'running' is excluded)
5) Lookaround summary table
| Syntax | Name | Meaning |
|---|---|---|
| (?=Y) | Positive lookahead | Followed by Y |
| (?!Y) | Negative lookahead | NOT followed by Y |
| (?<=Y) | Positive lookbehind | Preceded by Y |
| (?<!Y) | Negative lookbehind | NOT preceded by Y |
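The four assertions combine naturally. As a hedged sketch, the hypothetical helper add_thousands_separators below (not part of the re module) uses a lookbehind, a lookahead, and a nested negative lookahead to find the empty positions inside a number where a comma belongs. It assumes plain integers; digits after a decimal point would be grouped as well.

```python
import re

def add_thousands_separators(text):
    # Match the EMPTY position that has a digit before it and a whole number
    # of 3-digit groups after it, ending where the digit run ends.
    return re.sub(r'(?<=\d)(?=(?:\d{3})+(?!\d))', ',', text)

result = add_thousands_separators("Total: 1234567 units, 999 spare, 1000 sold")
print(result)   # → Total: 1,234,567 units, 999 spare, 1,000 sold
```

Because every part of the pattern is zero-width, re.sub inserts commas without replacing any existing character.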
9. Alternation (|): OR operator
import re
# | matches either the left or the right pattern
print(re.findall(r'cat|dog|fish', 'I have a cat and a dog'))
# → ['cat', 'dog']
# With groups: (cat|dog) scopes the alternation
print(re.findall(r'(cat|dog)s?', 'cats dogs cat dog'))
# → ['cat', 'dog', 'cat', 'dog']
# Alternation of longer patterns
log = "ERROR: disk full WARNING: low memory INFO: started"
print(re.findall(r'ERROR|WARNING|INFO', log))
# → ['ERROR', 'WARNING', 'INFO']
# Order matters: at a given position, the first alternative that matches wins
print(re.search(r'cat|catch', 'I catch cats'))   # matches 'cat' (not 'catch'!)
print(re.search(r'catch|cat', 'I catch cats'))   # matches 'catch' ← longer alternative first
10. Flags: Modifying match behavior
1) All flags
| Flag (short) | Flag (long) | Effect |
|---|---|---|
| re.I | re.IGNORECASE | Case-insensitive matching |
| re.M | re.MULTILINE | ^/$ match at each line |
| re.S | re.DOTALL | . matches \n too |
| re.X | re.VERBOSE | Allow whitespace/comments in pattern |
| re.A | re.ASCII | \w \d \s \b match ASCII only |
| re.L | re.LOCALE | Locale-dependent matching (bytes patterns only) |
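re.ASCII is the flag whose effect is least obvious from the table, so here is a small sketch with made-up sample text. For str patterns, \w and \d are Unicode-aware by default, and re.ASCII narrows them to the ASCII ranges.

```python
import re

# "café naïve ٣ 7", written with escapes; U+0663 is ARABIC-INDIC DIGIT THREE
text = "caf\u00e9 na\u00efve \u0663 7"

words_unicode = re.findall(r'\w+', text)            # Unicode-aware (default)
words_ascii = re.findall(r'\w+', text, re.ASCII)    # ASCII-only
digits_unicode = re.findall(r'\d', text)
digits_ascii = re.findall(r'\d', text, re.ASCII)

print(words_unicode)
print(words_ascii)    # → ['caf', 'na', 've', '7']
print(digits_unicode)
print(digits_ascii)   # → ['7']
```

With re.ASCII the accented letters split the words and the Arabic-Indic digit is no longer a digit; without it, both are matched.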
2) re.IGNORECASE (re.I)
import re
print(re.findall(r'hello', 'Hello HELLO hello', re.I))
# → ['Hello', 'HELLO', 'hello']
# Case-insensitive match combined with word boundaries
print(re.findall(r'\bpython\b', 'Python PYTHON python', re.IGNORECASE))
# → ['Python', 'PYTHON', 'python']
3) re.MULTILINE (re.M)
import re
log = """ERROR: disk full
WARNING: low memory
ERROR: timeout
INFO: done"""
# Without re.M: ^ only matches at the start of the entire string
print(re.findall(r'^ERROR.*', log))
# → ['ERROR: disk full']
# With re.M: ^ matches at the start of EACH line
print(re.findall(r'^ERROR.*', log, re.M))
# → ['ERROR: disk full', 'ERROR: timeout']
4) re.DOTALL (re.S)
import re
html = "<div>\n <p>Hello</p>\n</div>"
# Without re.S: . does not match \n
print(re.search(r'<div>.*</div>', html))         # None
# With re.S: . matches everything including \n
print(re.search(r'<div>.*</div>', html, re.S))   # match
print(re.search(r'<div>.*?</div>', html, re.S).group())
# → the whole block from <div> through </div>, printed across three lines
5) re.VERBOSE (re.X): Readable complex patterns
import re
# Without re.X: hard to read
email_pattern_compact = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
# With re.X: add whitespace and comments freely
email_pattern_verbose = re.compile(r'''
    ^                    # start of string
    [a-zA-Z0-9._%+-]+    # local part (user name)
    @                    # @ symbol
    [a-zA-Z0-9.-]+       # domain name
    \.                   # literal dot
    [a-zA-Z]{2,}         # top-level domain (2+ letters)
    $                    # end of string
''', re.VERBOSE)
print(email_pattern_verbose.match('user@example.com'))   # match
print(email_pattern_verbose.match('bad@'))               # None
6) Combining flags
import re
# Combine flags with | (bitwise OR)
text = "Hello\nWorld"
print(re.findall(r'^.+$', text, re.M | re.I))
# re.M → ^ and $ per line
# re.I → case-insensitive
# → ['Hello', 'World']
# Inline flag at the start of the pattern: (?i) applies to the whole pattern
print(re.findall(r'(?i)hello', 'Hello HELLO hello'))
# → ['Hello', 'HELLO', 'hello']
# Inline flag scoped to part of the pattern: (?i:...)
print(re.findall(r'(?i:hello) world', 'HELLO world hello World'))
# → ['HELLO world']   (only 'hello' is case-insensitive, 'world' is not)
III. The re Module API: Complete Reference
The re module has two usage modes: ① module-level functions like re.search() (convenient for one-off use) and ② compiled Pattern objects via re.compile() (preferred when the same pattern is used repeatedly, since the pattern is compiled once and reused directly).
1. re.compile(): Pre-compile a pattern
import re
# compile() returns a Pattern object
pattern = re.compile(r'\d{4}-\d{2}-\d{2}', re.IGNORECASE)
# Call methods on the Pattern object (same names as the module-level functions)
print(pattern.search('date: 2024-01-15'))
print(pattern.findall('from 2024-01-01 to 2024-12-31'))
# → ['2024-01-01', '2024-12-31']
# Pattern attributes
print(pattern.pattern)   # → \d{4}-\d{2}-\d{2}
print(pattern.flags)     # → 34   (32 = re.UNICODE, implied for str patterns, + 2 = re.IGNORECASE)
print(pattern.groups)    # → 0    (no capturing groups)
Module-level functions such as re.search(pattern, string) use an internal cache of recently compiled patterns (512 entries in current CPython, an implementation detail). For hot loops, prefer an explicit re.compile() to guarantee no cache misses and to make intent clear.
2. re.search(): Find first match anywhere
Returns a Match object if found anywhere in the string, or None.
import re
text = "The price is $42.99 for 3 items"
m = re.search(r'\$(\d+\.\d{2})', text)
if m:
    print(m.group())    # → $42.99   (full match)
    print(m.group(1))   # → 42.99    (group 1, without the $)
    print(m.start())    # → 13       (start index)
    print(m.end())      # → 19       (end index)
    print(m.span())     # → (13, 19) (start, end)
    print(m.string)     # → "The price is $42.99 for 3 items" (original)
3. re.match(): Match at string start
Warning: re.match() only matches at the BEGINNING of the string — NOT the same as re.search()!
import re
# match() only succeeds if the pattern matches starting at position 0
print(re.match(r'\d+', '123 abc'))    # match: starts at position 0
print(re.match(r'\d+', 'abc 123'))    # None: 'abc' is not \d+
print(re.search(r'\d+', 'abc 123'))   # match: search finds it anywhere
# A leading ^ is redundant with match() (both restrict to the start)
print(re.match(r'hello', 'hello world'))   # match
print(re.match(r'hello', 'say hello'))     # None
# Practical: validate that a string is ENTIRELY a number
# Caveat: $ also matches just before a trailing newline, so "123\n" passes;
# use \Z or re.fullmatch() when that matters.
def is_integer(s):
    return bool(re.match(r'^\d+$', s))
print(is_integer("12345"))   # → True
print(is_integer("123a5"))   # → False
4. re.fullmatch(): Match entire string
Requires the pattern to match the complete string from start to end.
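One practical difference from anchoring with ^ and $ is worth seeing in code. In the sketch below (the input string is illustrative), $ accepts a trailing newline, while \Z and fullmatch() do not.

```python
import re

s = "12345\n"   # note the trailing newline

dollar = bool(re.match(r'^\d+$', s))      # $ also matches just before a final \n
strict = bool(re.match(r'\A\d+\Z', s))    # \Z matches only at the very end
full = bool(re.fullmatch(r'\d+', s))      # the \n is not covered by \d+

print(dollar, strict, full)   # → True False False
```

For validating user input, fullmatch() is therefore usually the safer default.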
import re
# fullmatch() behaves like match() with the pattern anchored at both ends
print(re.fullmatch(r'\d+', '12345'))    # match: the entire string is digits
print(re.fullmatch(r'\d+', '123abc'))   # None: not ALL digits
print(re.fullmatch(r'\d+', ' 123 '))    # None: spaces don't match \d
# Validate formats completely
ip_pattern = re.compile(r'(\d{1,3}\.){3}\d{1,3}')
zip_pattern = re.compile(r'\d{5}(-\d{4})?')
email_pat = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
tests = ['192.168.1.1', '12345', 'user@example.com', 'bad_input']
for t in tests:
    results = {
        'ip': bool(ip_pattern.fullmatch(t)),
        'zip': bool(zip_pattern.fullmatch(t)),
        'email': bool(email_pat.fullmatch(t)),
    }
    print(f"{t:<25} → {results}")
5. re.findall(): Find all matches
Returns a list of all non-overlapping matches.
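"Non-overlapping" means the scan resumes after the end of each match. A minimal sketch of the consequence follows, together with one common workaround: wrapping the pattern in a capturing group inside a lookahead, which consumes no characters.

```python
import re

plain = re.findall(r'aa', 'aaaa')             # scanning resumes after each match
overlapping = re.findall(r'(?=(aa))', 'aaaa')  # zero-width match at every position

print(plain)         # → ['aa', 'aa']
print(overlapping)   # → ['aa', 'aa', 'aa']
```

The lookahead version reports the occurrence starting at index 1 as well, which the plain pattern skips.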
import re
text = "2024-01-15, 2023-12-31, 2025-06-01"
# No groups → returns a list of strings
print(re.findall(r'\d{4}-\d{2}-\d{2}', text))
# → ['2024-01-15', '2023-12-31', '2025-06-01']
# One group → returns a list of that group's contents
print(re.findall(r'(\d{4})-\d{2}-\d{2}', text))
# → ['2024', '2023', '2025']   (only the year group)
# Multiple groups → returns a list of tuples
print(re.findall(r'(\d{4})-(\d{2})-(\d{2})', text))
# → [('2024', '01', '15'), ('2023', '12', '31'), ('2025', '06', '01')]
The return type of findall() changes based on groups: no groups → List[str] of whole matches, one group → List[str] of group contents, multiple groups → List[tuple]. This is a common source of bugs. Use finditer() for consistent Match objects.
6. re.finditer(): Iterator of Match objects
Returns an iterator of Match objects. More powerful than findall() because each Match has .start(), .end(), .group(), etc.
import re
text = "Alice scored 95, Bob scored 87, Carol scored 100"
for m in re.finditer(r'(\w+) scored (\d+)', text):
    name = m.group(1)
    score = int(m.group(2))
    print(f"{name}: {score} pts | span={m.span()}")
# → Alice: 95 pts | span=(0, 15)
# → Bob: 87 pts | span=(17, 30)
# → Carol: 100 pts | span=(32, 48)
# Collect all spans for highlighting
positions = [(m.start(), m.end()) for m in re.finditer(r'\d+', text)]
print(positions)   # → [(13, 15), (28, 30), (45, 48)]
7. re.sub(): Substitute matches
1) Basic substitution
import re
text = "Hello    World   Python"
# Replace each run of whitespace with a single space
result = re.sub(r'\s+', ' ', text)
print(result)   # → Hello World Python
# count parameter: replace only the first N occurrences
result = re.sub(r'\s+', ' ', text, count=1)
print(result)   # → Hello World   Python   (only the first run replaced)
2) Backreferences in replacement
import re
# \1, \2 refer to captured groups in the replacement string
# Reformat dates from YYYY-MM-DD to DD/MM/YYYY
dates = "Born: 2024-01-15, Died: 2099-12-31"
result = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', dates)
print(result)   # → Born: 15/01/2024, Died: 31/12/2099
# Wrap all numbers in <b> tags
result = re.sub(r'(\d+)', r'<b>\1</b>', 'I have 3 cats and 2 dogs')
print(result)   # → I have <b>3</b> cats and <b>2</b> dogs
# Named group backreference \g<name>
result = re.sub(
    r'(?P<last>\w+), (?P<first>\w+)',
    r'\g<first> \g<last>',
    'Smith, John')
print(result)   # → John Smith
3) Replacement function
Pass a callable as the replacement — it receives the Match object and returns the replacement string.
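A frequent use of a callable replacement is looking each match up in a dictionary. The sketch below is illustrative only: render and its {name} placeholder syntax are hypothetical, not part of the re module.

```python
import re

def render(template, values):
    # Hypothetical helper: replace each {name} placeholder with values[name];
    # placeholders with no entry in values are left untouched.
    def lookup(m):
        key = m.group(1)
        return str(values.get(key, m.group(0)))
    return re.sub(r'\{(\w+)\}', lookup, template)

message = render("Hi {name}, you have {count} new {thing}",
                 {"name": "Ada", "count": 3})
print(message)   # → Hi Ada, you have 3 new {thing}
```

Because the callable receives the Match object, it can fall back to m.group(0) and return the original text unchanged, which a plain replacement string cannot do conditionally.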
import re
# Convert all numbers to their double
def double(m):
    return str(int(m.group()) * 2)
result = re.sub(r'\d+', double, 'I have 3 cats and 10 dogs')
print(result)   # → I have 6 cats and 20 dogs
# Normalize different date formats to ISO 8601
MONTHS = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
          'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
def normalize_date(m):
    if m.group('month_name'):   # "Jan 15, 2024" form
        month = MONTHS[m.group('month_name')]
        day, year = int(m.group('day')), int(m.group('year'))
    else:                       # "3/1/2024" form, read as MM/DD/YYYY
        month = int(m.group('month_num'))
        day, year = int(m.group('day2')), int(m.group('year2'))
    return f"{year:04d}-{month:02d}-{day:02d}"
pattern = re.compile(r'''
    (?:(?P<month_name>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
       \s+(?P<day>\d{1,2}),\s+(?P<year>\d{4}))
  | (?:(?P<month_num>\d{1,2})/(?P<day2>\d{1,2})/(?P<year2>\d{4}))
''', re.VERBOSE)
text = "Meeting on Jan 15, 2024 and review on 3/1/2024"
result = pattern.sub(normalize_date, text)
print(result)   # → Meeting on 2024-01-15 and review on 2024-03-01
8. re.subn(): Substitute and count
Like re.sub() but returns a tuple (new_string, count).
import re
text = "foo bar foo baz foo"
result, n = re.subn(r'foo', 'qux', text)
print(result)   # → qux bar qux baz qux
print(n)        # → 3   (number of substitutions made)
# Useful for detecting whether any replacements occurred
text2 = "no matches here"
_, count = re.subn(r'foo', 'qux', text2)
if count == 0:
    print("No substitutions made")
9. re.split(): Split by pattern
import re
# Split on any non-alphanumeric sequence
text = "one,two;;three four\tfive"
print(re.split(r'[^a-zA-Z0-9]+', text))
# → ['one', 'two', 'three', 'four', 'five']
# Split on commas with optional surrounding whitespace
csv = "Alice , Bob,Carol , Dave"
print(re.split(r'\s*,\s*', csv))
# → ['Alice', 'Bob', 'Carol', 'Dave']
# maxsplit: split at most N times
print(re.split(r'\s+', 'a b c d e', maxsplit=2))
# → ['a', 'b', 'c d e']
# Capturing group: delimiters are INCLUDED in the result
text = "one+two-three*four"
print(re.split(r'([+\-*])', text))
# → ['one', '+', 'two', '-', 'three', '*', 'four']   ← operators kept
10. re.escape(): Escape special characters
Escapes every character that has special meaning in a regex, so that an arbitrary string can be used as a literal pattern. (Since Python 3.7 only such characters are escaped; older versions escaped all non-alphanumerics.)
import re
# When user input is used as part of a pattern, it MUST be escaped
user_input = "hello.world (test)"
safe_pattern = re.escape(user_input)
print(safe_pattern)   # → hello\.world\ \(test\)
# Safe search
text = "I said: hello.world (test) today"
m = re.search(re.escape(user_input), text)
print(bool(m))   # → True
# Dangerous without escape: . and () keep their special meaning
print(re.search(user_input, text))   # → None (the parentheses form a group instead of matching literally)
# Common use: build a pattern from a list of keywords
keywords = ['c++', 'c#', '.net', 'node.js']
pattern = '|'.join(re.escape(k) for k in keywords)
print(pattern)   # → c\+\+|c\#|\.net|node\.js
found = re.findall(pattern, 'I know c++ and .net and node.js', re.I)
print(found)   # → ['c++', '.net', 'node.js']
11. Match Object: Complete API
import re
text = "2024-01-15 is a Monday in New York"
m = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})', text)
# ── Accessing matched text ──
print(m.group())          # → 2024-01-15   (full match, same as group(0))
print(m.group(0))         # → 2024-01-15
print(m.group(1))         # → 2024         (group 1 by index)
print(m.group(2, 3))      # → ('01', '15') (multiple groups)
print(m.group('year'))    # → 2024         (group by name)
print(m.groupdict())      # → {'year': '2024', 'month': '01', 'day': '15'}
print(m.groups())         # → ('2024', '01', '15')   (all groups as a tuple)
print(m.groups(default='N/A'))   # same tuple; the default fills in non-participating groups
# ── Position information ──
print(m.start())          # → 0    (start of full match)
print(m.end())            # → 10   (end of full match)
print(m.span())           # → (0, 10)
print(m.start(1))         # → 0    (start of group 1)
print(m.end('month'))     # → 7    (end of named group)
print(m.span('day'))      # → (8, 10)
# ── Context ──
print(m.string)           # → full original string
print(m.re)               # → compiled pattern object
print(m.pos)              # → 0    (start position passed to search)
print(m.endpos)           # → 34   (end position passed to search)
print(m.lastindex)        # → 3    (index of last matched group)
print(m.lastgroup)        # → 'day' (name of last matched group)
# ── Expand: backreferences in a template string ──
print(m.expand(r'\g<day>/\g<month>/\g<year>'))
# → 15/01/2024
IV. Practical Patterns: Production-Ready Recipes
1. Validation Patterns
import re
patterns = {
    # Email (simplified; does not cover everything RFC 5321 allows)
    'email': re.compile(r'''
        ^[a-zA-Z0-9._%+\-]+    # local part
        @
        [a-zA-Z0-9.\-]+        # domain
        \.[a-zA-Z]{2,}$        # TLD (2+ chars)
    ''', re.VERBOSE),
    # Phone: +1 (555) 123-4567 / 555-123-4567 / 5551234567
    'phone_us': re.compile(
        r'^(\+1[-.\s]?)?'
        r'(\(?\d{3}\)?[-.\s]?)'
        r'\d{3}[-.\s]?\d{4}$'
    ),
    # IPv4 address
    'ipv4': re.compile(
        r'^((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}'
        r'(25[0-5]|2[0-4]\d|[01]?\d\d?)$'
    ),
    # URL (http/https)
    'url': re.compile(
        r'^https?://'
        r'(([a-zA-Z0-9\-]+\.)+[a-zA-Z]{2,})'
        r'(:\d+)?'
        r'(/[^\s]*)?$'
    ),
    # Date: YYYY-MM-DD (checks ranges, not days per month)
    'date_iso': re.compile(
        r'^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$'
    ),
    # Strong password: 8+ chars, upper, lower, digit, special
    'strong_password': re.compile(
        r'^(?=.*[a-z])'          # at least one lowercase
        r'(?=.*[A-Z])'           # at least one uppercase
        r'(?=.*\d)'              # at least one digit
        r'(?=.*[!@#$%^&*])'      # at least one special char
        r'.{8,}$'                # at least 8 chars total
    ),
    # Credit card (Visa/MC/Amex, digits only, no spaces or dashes)
    'credit_card': re.compile(
        r'^(?:4\d{12}(?:\d{3})?'   # Visa
        r'|5[1-5]\d{14}'           # MasterCard
        r'|3[47]\d{13})$'          # Amex
    ),
    # ZIP code (US)
    'zip_us': re.compile(r'^\d{5}(-\d{4})?$'),
    # Hex color
    'hex_color': re.compile(r'^#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})$'),
    # Semantic version: 1.2.3 or 1.2.3-alpha.1
    'semver': re.compile(
        r'^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)'
        r'(-[a-zA-Z0-9.\-]+)?(\+[a-zA-Z0-9.\-]+)?$'
    ),
}
# Test them
tests = {
    'email': ['user@example.com', 'bad@', 'no-at-sign'],
    'ipv4': ['192.168.1.1', '256.0.0.1', '10.0.0'],
    'date_iso': ['2024-01-15', '2024-13-01', '24-1-1'],
    'strong_password': ['Abc@1234', 'weakpass', 'NoSpecial1'],
    'hex_color': ['#FF5733', '#abc', '#GGGGGG'],
    'semver': ['1.2.3', '1.0.0-alpha.1', '1.2'],
}
for field, values in tests.items():
    pat = patterns[field]
    print(f"\n{field}:")
    for v in values:
        ok = '✅' if pat.fullmatch(v) else '❌'
        print(f"  {ok} {v!r}")
2. Extraction Patterns
import re
# ── Extract all URLs from text ──
def extract_urls(text):
    pattern = r'https?://[^\s<>"{}|\\^`\[\]]+'
    return re.findall(pattern, text)
html = 'Visit <a href="https://example.com/path?q=1">site</a> or http://other.org'
print(extract_urls(html))
# → ['https://example.com/path?q=1', 'http://other.org']
# ── Extract all emails ──
def extract_emails(text):
    pattern = r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}'
    return re.findall(pattern, text)
text = "Contact alice@example.com or bob.smith@company.co.uk for info"
print(extract_emails(text))
# → ['alice@example.com', 'bob.smith@company.co.uk']
# ── Parse log lines ──
def parse_log(line):
    pattern = re.compile(r'''
        (?P<ip>[\d.]+) \s+            # IP address
        \S+ \s+                       # ident
        \S+ \s+                       # auth user
        \[(?P<time>[^\]]+)\] \s+      # timestamp
        "(?P<method>\w+) \s+
        (?P<path>[^\s"]+) \s+
        \S+" \s+                      # HTTP version
        (?P<status>\d{3}) \s+         # status code
        (?P<size>\d+)                 # bytes
    ''', re.VERBOSE)
    m = pattern.match(line)
    return m.groupdict() if m else None
log_line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
print(parse_log(log_line))
# → {'ip': '127.0.0.1', 'time': '10/Oct/2000:13:55:36 -0700',
#    'method': 'GET', 'path': '/apache_pb.gif', 'status': '200', 'size': '2326'}
# ── Extract numbers with units ──
def extract_measurements(text):
    pattern = r'(\d+(?:\.\d+)?)\s*(px|em|rem|%|pt|vh|vw)'
    return [(float(v), u) for v, u in re.findall(pattern, text)]
css = "width: 100px; margin: 1.5em; font-size: 16px; height: 50vh"
print(extract_measurements(css))
# → [(100.0, 'px'), (1.5, 'em'), (16.0, 'px'), (50.0, 'vh')]
3. Cleaning and Normalization Patterns
import re
# ── Normalize whitespace ──
def normalize_whitespace(text):
    return re.sub(r'\s+', ' ', text).strip()
print(normalize_whitespace("  Hello   World \n\t Python  "))
# → Hello World Python
# ── Remove HTML tags ──
def strip_html(html):
    # Replace each tag with a space so that adjacent blocks do not run together
    clean = re.sub(r'<[^>]+>', ' ', html)
    return re.sub(r'\s+', ' ', clean).strip()
html = "<h1>Title</h1><p>Some <b>bold</b> and <em>italic</em> text.</p>"
print(strip_html(html))
# → Title Some bold and italic text.
# ── Slugify a string ──
def slugify(text):
    text = text.lower()
    text = re.sub(r'[^\w\s-]', '', text)   # remove punctuation (keep word chars, spaces, dashes)
    text = re.sub(r'[\s_]+', '-', text)    # spaces/underscores → dash
    text = re.sub(r'-+', '-', text)        # multiple dashes → one
    return text.strip('-')
print(slugify("Hello, World! This is Python 3.12"))
# → hello-world-this-is-python-312
# ── Camel case to snake case ──
def camel_to_snake(name):
    name = re.sub(r'([A-Z]+)([A-Z][a-z])', r'\1_\2', name)   # ABCDef → ABC_Def
    name = re.sub(r'([a-z\d])([A-Z])', r'\1_\2', name)       # fooBar → foo_Bar
    return name.lower()
print(camel_to_snake('camelCaseString'))    # → camel_case_string
print(camel_to_snake('parseHTMLContent'))   # → parse_html_content
print(camel_to_snake('MyClassName'))        # → my_class_name
# ── Mask sensitive data ──
def mask_credit_card(text):
    return re.sub(r'\b(\d{4})\d{8}(\d{4})\b', r'\1 **** **** \2', text)
def mask_email(text):
    return re.sub(r'(\w{2})\w+(@[^\s]+)', r'\1***\2', text)
print(mask_credit_card("Card: 4111111111111111"))
# → Card: 4111 **** **** 1111
print(mask_email("Email alice@example.com to bob@test.org"))
# → Email al***@example.com to bo***@test.org
4. Common Pitfalls
1) Catastrophic backtracking
import re, time
# DANGEROUS pattern: (a+)+ causes exponential backtracking
evil_pattern = r'^(a+)+$'
safe_pattern = r'^a+$'
test_string = 'a' * 25 + 'X'   # no match, which forces maximum backtracking
# Safe pattern: fast
t = time.time()
re.search(safe_pattern, test_string)
print(f"Safe: {time.time()-t:.6f}s")   # → on the order of microseconds
# Evil pattern: run time roughly doubles with every extra 'a'
# (DO NOT run with 'a' * 30 + 'X')
t = time.time()
re.search(evil_pattern, 'a' * 20 + 'X')
print(f"Evil: {time.time()-t:.6f}s")   # → much longer
# FIX: restructure the pattern so it has no nested quantifiers (as in safe_pattern).
# Python 3.11+ also supports atomic groups (?>...) and possessive quantifiers (a++);
# on older versions the third-party regex module provides them.
2) re.match() vs re.search() confusion
import re
# COMMON MISTAKE: using match() when search() is needed
data = "  123 some text"
# Incorrect: assuming match() searches anywhere
result = re.match(r'\d+', data)         # → None! (leading spaces)
# Correct
result = re.search(r'\d+', data)        # → match for '123'
# Or account for the leading whitespace explicitly
result = re.match(r'\s*(\d+)', data)    # → group(1) = '123'
3) findall() group return type surprise
import re
text = "2024-01 2024-02"
# Bug: adding a group changes the return type
print(re.findall(r'\d{4}-\d{2}', text))       # → ['2024-01', '2024-02']
print(re.findall(r'(\d{4})-\d{2}', text))     # → ['2024', '2024']   (years only!)
print(re.findall(r'(\d{4})-(\d{2})', text))   # → [('2024', '01'), ('2024', '02')]
# Fix: use a non-capturing group when you don't need the group value
print(re.findall(r'(?:\d{4})-(?:\d{2})', text))   # → ['2024-01', '2024-02']
4) Forgetting raw strings
import re
# WRONG: '\b' is interpreted by Python as the backspace character (ASCII 8)
print(re.findall('\bword\b', 'word in a sentence'))    # → []   WRONG
# CORRECT: raw string
print(re.findall(r'\bword\b', 'word in a sentence'))   # → ['word']
# FRAGILE: '\d' is not a valid Python string escape. It currently survives as
# backslash + d, but Python warns about it (DeprecationWarning, SyntaxWarning from 3.12).
print(re.findall('\d+', 'abc 123'))    # works, but do not rely on it
# CORRECT:
print(re.findall(r'\d+', 'abc 123'))   # → ['123']
V. Complete API Quick Reference
| Function / Method | Returns | Use when |
|---|---|---|
| re.compile(pat, flags) | Pattern | Pattern reused multiple times |
| re.search(pat, s) | Match or None | Find first match anywhere |
| re.match(pat, s) | Match or None | Match only at position 0 |
| re.fullmatch(pat, s) | Match or None | Pattern must cover entire string |
| re.findall(pat, s) | List[str or tuple] | All matches as a list |
| re.finditer(pat, s) | Iterator[Match] | All matches with position info |
| re.sub(pat, repl, s) | str | Replace matches |
| re.subn(pat, repl, s) | (str, int) | Replace + count substitutions |
| re.split(pat, s) | List[str] | Split string by pattern |
| re.escape(s) | str | Treat literal string as pattern |
| m.group(n) | str | Get captured group text |
| m.groups() | tuple | All groups as tuple |
| m.groupdict() | dict | Named groups as dict |
| m.start() / m.end() | int | Match position |
| m.span() | (int, int) | (start, end) tuple |
| m.expand(template) | str | Backreference expansion |
Master regex in four steps: ① know the 5 building blocks (literals, the dot ., classes [], quantifiers *+?{}, anchors ^$\b) → ② use groups () to capture and (?:) to group without capturing → ③ add lookaround (?=)(?!) for context-sensitive matching → ④ always use r'' raw strings, reach for non-greedy .*? when a greedy match overshoots, and use re.VERBOSE for complex patterns.
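As a closing illustration of how these pieces fit together, here is a small tokenizer sketch. The token names (NUMBER, NAME, OP, SKIP) and the tokenize helper are hypothetical, chosen for this example. It combines a re.VERBOSE pattern, named groups, alternation, Pattern.match() with a start position, and Match.lastgroup.

```python
import re

TOKEN_RE = re.compile(r'''
    (?P<NUMBER>\d+(?:\.\d+)?)   # integer or decimal
  | (?P<NAME>[A-Za-z_]\w*)      # identifier
  | (?P<OP>[-+*/=()])           # single-character operator
  | (?P<SKIP>\s+)               # whitespace, discarded below
''', re.VERBOSE)

def tokenize(source):
    tokens = []
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)   # match exactly at the current position
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        if m.lastgroup != 'SKIP':         # lastgroup names the alternative that matched
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

tokens = tokenize("total = price * 3.5 + tax")
print(tokens)
```

Each alternative contains exactly one named group, so m.lastgroup identifies the token type without any further checks.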