I. Python `re` — Regular Expressions#

A Regular Expression (正则表达式) is a pattern (模式) used to find and manipulate text. Think of it as a "super-powered search" that can match patterns, not just exact words. Python's built-in re module provides functions for searching, matching, splitting, and substituting text with regex.

II. Pattern Syntax — The Complete Reference #

Every regex pattern is built from three kinds of building blocks: Literals (字面量) that match themselves, Metacharacters (元字符) that have special meaning, and Quantifiers (量词) that control repetition. Together with groups, anchors, and lookarounds, these roughly 30 symbols cover the syntax used in everyday patterns.

1. Literals and Metacharacters (字面量与元字符)#

1) Plain literals (普通字面量)#

Most characters match themselves exactly.

import re

# Literal match: 'cat' matches exactly the string "cat"
print(re.search(r'cat', 'I have a cat'))        # match
print(re.search(r'cat', 'I have a CAT'))        # None  (case-sensitive by default)
print(re.search(r'cat', 'concatenate'))         # match (found inside)

2) The 14 metacharacters (14个元字符)#

These characters have special meaning and must be escaped with \ to match literally:

# The 14 metacharacters:  . ^ $ * + ? { } [ ] \ | ( )
import re

# Matching a literal dot: it must be escaped
text = "price: $3.99"

print(re.search(r'3.99',  text))    # matches "3.99", but would also match "3X99" (dot = any char!)
print(re.search(r'3\.99', text))    # matches ONLY "3.99"  ← correct

# Matching a literal backslash
print(re.search(r'C:\\Users', r'C:\Users'))   # matches C:\Users

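Note: Write regex patterns as raw strings, r'pattern'. In a normal string literal, Python processes escape sequences before the regex engine ever sees them: '\n' becomes a newline character and '\b' becomes a backspace, so '\bcat\b' no longer means "word boundary". With r'\n', the regex engine receives the two characters \ and n and interprets them itself as "newline character".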
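The difference is easiest to see with \b, which Python's string parser and the regex engine interpret differently. A minimal sketch (the sample text is made up for illustration):

```python
import re

text = "a cat sat"

non_raw = '\bcat\b'    # Python converts each \b to a backspace character (U+0008)
raw     = r'\bcat\b'   # backslashes reach the regex engine, which reads \b as a word boundary

print(len(non_raw), len(raw))     # → 5 7
print(re.search(non_raw, text))   # → None (the text contains no backspace characters)
print(re.search(raw, text))       # a Match object for 'cat'
```

The non-raw pattern is silently valid, which is why this mistake tends to show up as "my regex never matches" rather than as an error.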

2. The Dot `.` — Any character (任意字符)#

. matches any single character except a newline \n (unless re.DOTALL flag is set).

import re

# . matches exactly ONE character (any except \n)
print(re.findall(r'c.t', 'cat cut c t c\nt coot'))
# → ['cat', 'cut', 'c t']   ('c\nt' skipped: \n is not matched by dot)
# Note: 'coot' is not matched because dot matches exactly 1 char

# With re.DOTALL, dot matches newline too
text = "first\nsecond"
print(re.search(r'first.second',  text))             # None
print(re.search(r'first.second',  text, re.DOTALL))  # match

3. Anchors — Position matchers (锚点 — 位置匹配)#

Anchors match positions, not characters.

1) `^` and `$` — Start and end of string/line#

import re

text = "hello world"

print(re.search(r'^hello', text))   # match: 'hello' is at start
print(re.search(r'^world', text))   # None:  'world' is NOT at start
print(re.search(r'world$', text))   # match: 'world' is at end
print(re.search(r'hello$', text))   # None

# With re.MULTILINE: ^ and $ match start/end of EACH LINE
multiline = "line1\nline2\nline3"
print(re.findall(r'^\w+', multiline, re.MULTILINE))
# → ['line1', 'line2', 'line3']

print(re.findall(r'\w+$', multiline, re.MULTILINE))
# → ['line1', 'line2', 'line3']

2) `\b` and `\B` — Word boundaries (单词边界)#

\b matches the boundary between a word character and a non-word character.

import re

# \b matches a word boundary, which prevents partial matches
print(re.findall(r'\bcat\b', 'cat cats concatenate scatter'))
# → ['cat']   (only the standalone word)

print(re.findall(r'cat',     'cat cats concatenate scatter'))
# → ['cat', 'cat', 'cat', 'cat']  (too many!)

# \B matches NON-word boundary (inside a word)
print(re.findall(r'\Bcat\B', 'cat cats concatenate'))
# → ['cat']   (only the 'cat' inside 'concatenate')

3) `\A`, `\Z` — Absolute start/end of string (字符串绝对首尾)#

import re

# \A and \Z are NOT affected by re.MULTILINE: they always match string start/end
text = "line1\nline2"

print(re.search(r'\Aline1', text))   # match: absolute start
print(re.search(r'\Aline2', text))   # None:  line2 is NOT at absolute start
print(re.search(r'line2\Z', text))   # match: absolute end

4. Character Classes `[ ]` (字符类)#

1) Basic character class (基本字符类)#

A character class matches one character that is any of the listed characters.

import re

# [aeiou] matches any single vowel
print(re.findall(r'[aeiou]', 'hello world'))
# → ['e', 'o', 'o']

# [a-z] matches any lowercase letter (range syntax)
print(re.findall(r'[a-z]+', 'Hello World 123'))
# → ['ello', 'orld']

# [A-Za-z0-9] matches any ASCII alphanumeric
print(re.findall(r'[A-Za-z0-9]+', 'foo_bar-123!'))
# → ['foo', 'bar', '123']

# [0-9] matches the ASCII digits (the same set as \d under re.ASCII)
print(re.findall(r'[0-9]+', 'abc 123 def 456'))
# → ['123', '456']

2) Negated character class `[^ ]` (否定字符类)#

[^...] matches any character NOT in the class.

import re

# [^aeiou\s] matches any character that is neither a vowel nor whitespace
# (consonants here, but digits and punctuation would match as well)
print(re.findall(r'[^aeiou\s]+', 'hello world'))
# → ['h', 'll', 'w', 'rld']

# [^0-9] matches any non-digit character
print(re.findall(r'[^0-9]+', 'abc123def456'))
# → ['abc', 'def']

# Strip all non-alphanumeric characters
cleaned = re.sub(r'[^A-Za-z0-9]', '', 'Hello, World! 123')
print(cleaned)   # → HelloWorld123

3) Special sequences inside `[ ]`#

import re

# Inside [], most metacharacters lose special meaning
# - (dash) is literal if first, last, or escaped
print(re.findall(r'[-+*/]', '3+4-2*1/5'))  # → ['+', '-', '*', '/']

# ^ is literal unless it is the FIRST character
print(re.findall(r'[a^b]', 'a^b c'))       # → ['a', '^', 'b']  (literal ^)

# ] must be escaped or placed first
print(re.findall(r'[]a-z]', 'a]b'))        # → ['a', ']', 'b']

5. Predefined Character Classes (预定义字符类)#

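These are shorthand for common character sets. The equivalents listed below hold for ASCII text (bytes patterns, or str patterns compiled with re.ASCII); for ordinary str patterns, \d, \w, and \s also match Unicode digits, letters, and whitespace: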

Shorthand	Equivalent	Meaning
`\d`	`[0-9]`	Any digit (数字)
`\D`	`[^0-9]`	Any non-digit (非数字)
`\w`	`[A-Za-z0-9_]`	Word character (单词字符)
`\W`	`[^A-Za-z0-9_]`	Non-word character (非单词字符)
`\s`	`[ \t\n\r\f\v]`	Whitespace (空白字符)
`\S`	`[^ \t\n\r\f\v]`	Non-whitespace (非空白字符)

import re

text = "Hello, World! 42 items at $3.99 each.\n"

print(re.findall(r'\d+',  text))  # → ['42', '3', '99']
print(re.findall(r'\w+',  text))  # → ['Hello', 'World', '42', 'items', 'at', '3', '99', 'each']
print(re.findall(r'\s+',  text))  # → [' ', ' ', ' ', ' ', ' ', ' ', '\n']
print(re.findall(r'\W+',  text))  # → [', ', '! ', ' ', ' ', ' $', '.', ' ', '.\n']

# Combining: \b\w{5}\b matches whole words of a given length
print(re.findall(r'\b\w{5}\b', text))  # words of exactly 5 chars
# → ['Hello', 'World', 'items']
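The outputs above only contain ASCII characters, so they look the same in either mode. The following sketch shows where the default Unicode behavior and re.ASCII diverge; the sample string (an accented word, two Arabic-Indic digits, and "42") is made up for illustration and written with escapes so it survives any editor encoding:

```python
import re

# "café", the Arabic-Indic digits three and four, and ASCII "42"
text = "caf\u00e9 \u0663\u0664 42"

unicode_digits = re.findall(r'\d+', text)            # Unicode-aware (default for str)
ascii_digits   = re.findall(r'\d+', text, re.ASCII)  # restricted to [0-9]
unicode_words  = re.findall(r'\w+', text)
ascii_words    = re.findall(r'\w+', text, re.ASCII)

print(len(unicode_digits), ascii_digits)   # → 2 ['42']
print(len(unicode_words), ascii_words)     # → 3 ['caf', '42']
```

If a pattern is meant to validate something like a numeric ID, spelling out [0-9] or passing re.ASCII is usually the safer choice.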

6. Quantifiers — Repetition (量词 — 重复)#

1) Basic quantifiers (基本量词)#

Quantifier	Meaning
`*`	0 or more (零次或多次)
`+`	1 or more (一次或多次)
`?`	0 or 1 (零次或一次，可选)
`{n}`	Exactly n times (恰好n次)
`{n,}`	n or more times (n次或更多)
`{n,m}`	Between n and m times (n到m次)

import re

s = "colour   color   colouur"

print(re.findall(r'colou?r',    s))   # ? → u is optional
# → ['colour', 'color']

print(re.findall(r'colou*r',    s))   # * → 0 or more u's
# → ['colour', 'color', 'colouur']

print(re.findall(r'colou+r',    s))   # + → 1 or more u's
# → ['colour', 'colouur']

print(re.findall(r'colou{2}r',  s))   # exactly 2 u's
# → ['colouur']

print(re.findall(r'colou{1,2}r',s))   # 1 or 2 u's
# → ['colour', 'colouur']

# Phone number: exactly 10 digits
print(re.findall(r'\d{10}', '1234567890 12345'))
# → ['1234567890']

2) Greedy vs Non-greedy (贪婪 vs 非贪婪)#

By default, quantifiers are greedy (贪婪) — they match as much as possible. Adding ? makes them non-greedy (非贪婪/懒惰) — they match as little as possible.

import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* expands as far right as possible
print(re.findall(r'<.*>',  html))
# → ['<b>bold</b> and <i>italic</i>']   ← one huge match (too greedy)

# Non-greedy: .*? stops at the FIRST >
print(re.findall(r'<.*?>',  html))
# → ['<b>', '</b>', '<i>', '</i>']        ← each tag separately

# Extracting content between tags
print(re.findall(r'<b>(.*?)</b>', html))
# → ['bold']

# More examples
text = '"first" and "second"'
print(re.findall(r'".*"',  text))   # → ['"first" and "second"']  greedy
print(re.findall(r'".*?"', text))   # → ['"first"', '"second"']   non-greedy

Pattern	Type	Matches
`.*`	Greedy	As many chars as possible
`.*?`	Non-greedy	As few chars as possible
`.+`	Greedy	1+ chars, maximum
`.+?`	Non-greedy	1+ chars, minimum
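A common alternative to a lazy quantifier is a negated character class that simply cannot run past the delimiter. The sketch below compares the two on the same input; the multi-line tag is a made-up example:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

lazy    = re.findall(r'<.*?>',   html)   # lazy dot
negated = re.findall(r'<[^>]*>', html)   # "anything except >"
print(lazy == negated)   # → True

# Unlike the dot, a negated class also matches newlines, so no re.DOTALL is needed
multiline_tag = '<a\nhref="x">'
print(re.findall(r'<.*?>',   multiline_tag))   # → []
print(re.findall(r'<[^>]*>', multiline_tag))   # → ['<a\nhref="x">']
```

Which form reads better is a matter of taste; the negated class states the stopping condition explicitly, while the lazy form works for multi-character delimiters such as </b>.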

7. Groups — Capturing and Non-capturing (分组 — 捕获与非捕获)#

1) Capturing group `( )` (捕获组)#

Groups serve two purposes: grouping for quantifiers, and capturing the matched text.

import re

# Grouping: (ab)+ repeats the whole "ab"
print(re.findall(r'(ab)+', 'ab abab ababab'))
# → ['ab', 'ab', 'ab']  (returns last captured group)

# Capturing: extract the content inside ()
dates = "2024-01-15, 2023-12-31, 2025-06-01"
print(re.findall(r'(\d{4})-(\d{2})-(\d{2})', dates))
# → [('2024', '01', '15'), ('2023', '12', '31'), ('2025', '06', '01')]
#   ↑ each match returns a tuple of all captured groups

# .group() on a Match object
m = re.search(r'(\d{4})-(\d{2})-(\d{2})', '2024-01-15')
print(m.group(0))   # → 2024-01-15  (entire match)
print(m.group(1))   # → 2024        (group 1)
print(m.group(2))   # → 01          (group 2)
print(m.group(3))   # → 15          (group 3)

2) Named group `(?P<name>...)` (命名捕获组)#

import re

# Named groups: access by name instead of index
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
m = re.search(pattern, '2024-01-15')

print(m.group('year'))    # → 2024
print(m.group('month'))   # → 01
print(m.group('day'))     # → 15
print(m.groupdict())      # → {'year': '2024', 'month': '01', 'day': '15'}

# Named groups in re.sub: backreference by name
result = re.sub(
    r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})',
    r'\g<day>/\g<month>/\g<year>',    # reorder: DD/MM/YYYY
    '2024-01-15'
)
print(result)   # → 15/01/2024

3) Non-capturing group `(?:...)` (非捕获组)#

When you need grouping for quantifiers but don’t want the group in your results:

import re

# With (?:...): the unit alternation is grouped but not captured
print(re.findall(r'(\d+)(?:px|em|rem)', '12px 3em 100rem'))
# → ['12', '3', '100']   ← only numbers, units NOT captured ✅

# With a second capturing group: units appear in the results too
print(re.findall(r'(\d+)(px|em|rem)', '12px 3em 100rem'))
# → [('12', 'px'), ('3', 'em'), ('100', 'rem')]  ← units captured too

# (?:...) for grouping quantifiers
print(re.findall(r'(?:ha)+', 'hahaha haha ha h'))
# → ['hahaha', 'haha', 'ha']   (group 'ha' as a unit for +)

4) Backreferences `\1` `\2` (反向引用)#

Refer to a previously captured group within the same pattern.

import re

# Find repeated words
text = "the the quick brown fox fox jumps"
print(re.findall(r'\b(\w+)\s+\1\b', text))
# → ['the', 'fox']   (\1 refers back to group 1)

# Find doubled characters
print(re.findall(r'(.)\1', 'aabcddee'))
# → ['a', 'd', 'e']

# HTML tag matching: opening and closing tags must match
html = "<h1>Title</h1> <h2>Subtitle</h2>"
print(re.findall(r'<(\w+)>(.*?)</\1>', html))
# → [('h1', 'Title'), ('h2', 'Subtitle')]
#   \1 ensures the closing tag matches the opening tag
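Numbered backreferences get hard to follow once a pattern has several groups. Named groups can be referenced inside the same pattern with (?P=name); the sketch below repeats the two examples above in that style, with a deliberately mismatched h3/h4 pair added to the sample HTML:

```python
import re

html = "<h1>Title</h1> <h2>Subtitle</h2> <h3>Broken</h4>"

# (?P=tag) must match the same text that the group named 'tag' captured
pairs = re.findall(r'<(?P<tag>\w+)>(.*?)</(?P=tag)>', html)
print(pairs)   # → [('h1', 'Title'), ('h2', 'Subtitle')]

# Repeated words, using a named group and finditer
text = "the the quick brown fox fox jumps"
doubled = [m.group('word')
           for m in re.finditer(r'\b(?P<word>\w+)\s+(?P=word)\b', text)]
print(doubled)   # → ['the', 'fox']
```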

8. Lookahead and Lookbehind — Zero-width assertions (零宽断言)#

Lookaround (环视断言) matches a position based on what is around it, without consuming characters (不消耗字符). They are "zero-width" — the match position doesn't advance.

1) Positive lookahead `(?=...)` (正向先行断言)#

“Match X only if followed by Y” — Y is NOT included in the match.

import re

# Match a number only if followed by "px"
print(re.findall(r'\d+(?=px)', '12px 3em 100px 5rem'))
# → ['12', '100']   (px NOT included in results)

# Match word only if followed by a colon
text = "name: Alice age: 30 city: NYC"
print(re.findall(r'\w+(?=:)', text))
# → ['name', 'age', 'city']

# Password validation: must contain a digit
def has_digit(pw):
    return bool(re.search(r'(?=.*\d)', pw))

print(has_digit("abc123"))   # → True
print(has_digit("abcdef"))   # → False

2) Negative lookahead `(?!...)` (负向先行断言)#

“Match X only if NOT followed by Y”

import re

# Match a number only if NOT followed by "px".
# The extra \d in the lookahead stops the engine from backtracking
# to a shorter number, such as the '1' in '12px'.
print(re.findall(r'\d+(?!px|\d)', '12px 3em 100px 5rem'))
# → ['3', '5']

# Match 'foo' not followed by 'bar'
print(re.findall(r'foo(?!bar)', 'foobar foobaz foo'))
# → ['foo', 'foo']   ('foobar' excluded, 'foobaz' and 'foo' included)

3) Positive lookbehind `(?<=...)` (正向后行断言)#

“Match X only if preceded by Y” — Y is NOT included in the match.

import re

# Match digits only if preceded by '$'
prices = "items: $10, €20, £30, $50"
print(re.findall(r'(?<=\$)\d+', prices))
# → ['10', '50']   ($ NOT included in results)

# Match word after a colon and space
text = "name: Alice, city: NYC, age: 30"
print(re.findall(r'(?<=: )\w+', text))
# → ['Alice', 'NYC', '30']

4) Negative lookbehind `(?<!...)` (负向后行断言)#

“Match X only if NOT preceded by Y”

import re

# Match digits NOT preceded by '$'
prices = "items: $10, 20, $50, 100"
print(re.findall(r'(?<!\$)\b\d+\b', prices))
# → ['20', '100']

# Match 'ning' not preceded by 'run'
words = "running winning beginning"
print(re.findall(r'(?<!run)ning\b', words))
# → ['ning', 'ning']   (from 'winning' and 'beginning'; 'running' excluded)

5) Lookaround summary table (环视断言总结)#

Syntax	Name	Meaning
`(?=Y)`	Positive lookahead (正向先行)	Followed by Y
`(?!Y)`	Negative lookahead (负向先行)	NOT followed by Y
`(?<=Y)`	Positive lookbehind (正向后行)	Preceded by Y
`(?<!Y)`	Negative lookbehind (负向后行)	NOT preceded by Y
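One restriction the table does not show: in Python's re module (as of current CPython releases), a lookbehind must match text of a fixed length, so quantifiers such as * or + are rejected inside it. The sketch below, using a made-up price string, shows the error and the usual workaround of capturing the wanted part in a group instead:

```python
import re

prices = "USD 10, USD20, EUR 30"

try:
    re.compile(r'(?<=USD\s*)\d+')   # \s* makes the lookbehind variable-width
    lookbehind_ok = True
except re.error as exc:
    lookbehind_ok = False
    print("rejected:", exc)

# Workaround: match the prefix normally and capture only the part you need
amounts = re.findall(r'USD\s*(\d+)', prices)
print(amounts)   # → ['10', '20']
```

Lookaheads have no such restriction; (?=...) and (?!...) may contain arbitrary patterns.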

9. Alternation `|` — OR operator (或运算符)#

import re

# | matches either the left or right pattern
print(re.findall(r'cat|dog|fish', 'I have a cat and a dog'))
# → ['cat', 'dog']

# With groups: (cat|dog) scopes the alternation
print(re.findall(r'(cat|dog)s?', 'cats dogs cat dog'))
# → ['cat', 'dog', 'cat', 'dog']

# Alternation of longer patterns
log = "ERROR: disk full  WARNING: low memory  INFO: started"
print(re.findall(r'ERROR|WARNING|INFO', log))
# → ['ERROR', 'WARNING', 'INFO']

# Order matters: first match wins
print(re.search(r'cat|catch', 'I catch cats'))   # matches 'cat' (not 'catch'!)
print(re.search(r'catch|cat', 'I catch cats'))   # matches 'catch' ← correct order
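Alternation also has the lowest precedence of any regex operator, which matters as soon as anchors are involved: everything to the left of | forms one alternative and everything to the right forms the other. A short sketch with made-up test words:

```python
import re

loose  = r'^cat|dog$'        # parsed as (^cat) OR (dog$)
strict = r'^(?:cat|dog)$'    # anchors apply to both alternatives

print(bool(re.search(loose,  'catalog')))   # → True  ('cat' at the start is enough)
print(bool(re.search(loose,  'hotdog')))    # → True  ('dog' at the end is enough)
print(bool(re.search(strict, 'catalog')))   # → False
print(bool(re.search(strict, 'dog')))       # → True
```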

10. Flags — Modifying match behavior (标志位)#

1) All flags (所有标志位)#

Flag (short)	Flag (long)	Effect
`re.I`	`re.IGNORECASE`	Case-insensitive matching (忽略大小写)
`re.M`	`re.MULTILINE`	`^`/`$` match each line (多行模式)
`re.S`	`re.DOTALL`	`.` matches `\n` too (点号匹配换行)
`re.X`	`re.VERBOSE`	Allow whitespace/comments in pattern (详细模式)
`re.A`	`re.ASCII`	`\w \d \s` match ASCII only (ASCII模式)
`re.L`	`re.LOCALE`	Locale-dependent matching; bytes patterns only (本地化模式)

2) `re.IGNORECASE` (re.I)#

import re

print(re.findall(r'hello', 'Hello HELLO hello', re.I))
# → ['Hello', 'HELLO', 'hello']

# Case-insensitive word boundary
print(re.findall(r'\bpython\b', 'Python PYTHON python', re.IGNORECASE))
# → ['Python', 'PYTHON', 'python']

3) `re.MULTILINE` (re.M)#

import re

log = """ERROR: disk full
WARNING: low memory
ERROR: timeout
INFO: done"""

# Without re.M: ^ only matches start of entire string
print(re.findall(r'^ERROR.*',  log))
# → ['ERROR: disk full']

# With re.M: ^ matches start of EACH line
print(re.findall(r'^ERROR.*',  log, re.M))
# → ['ERROR: disk full', 'ERROR: timeout']

4) `re.DOTALL` (re.S)#

import re

html = "<div>\n  <p>Hello</p>\n</div>"

# Without re.S: . does not match \n
print(re.search(r'<div>.*</div>',  html))          # None

# With re.S: . matches everything including \n
print(re.search(r'<div>.*</div>',  html, re.S))    # match
print(re.search(r'<div>.*?</div>', html, re.S).group())
# → <div>\n  <p>Hello</p>\n</div>

5) `re.VERBOSE` (re.X) — Readable complex patterns (可读的复杂模式)#

import re

# Without re.X: hard to read
email_pattern_compact = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

# With re.X: add whitespace and comments freely
email_pattern_verbose = re.compile(r'''
    ^                       # start of string
    [a-zA-Z0-9._%+-]+       # local part (user name)
    @                       # @ symbol
    [a-zA-Z0-9.-]+          # domain name
    \.                      # literal dot
    [a-zA-Z]{2,}            # top-level domain (2+ letters)
    $                       # end of string
''', re.VERBOSE)

print(email_pattern_verbose.match('user@example.com'))   # match
print(email_pattern_verbose.match('bad@'))               # None
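The flip side of "whitespace is ignored" is that a space you actually want to match disappears too. In verbose mode a literal space (or #) has to be escaped or placed in a character class. A minimal sketch with a made-up "quantity items" pattern:

```python
import re

broken = re.compile(r'\d+ items', re.VERBOSE)   # the space is ignored: pattern is \d+items
fixed  = re.compile(r'''
    \d+      # quantity
    [ ]      # one literal space, written as a character class
    items
''', re.VERBOSE)

print(broken.search('42 items'))          # → None
print(broken.search('42items').group())   # → 42items
print(fixed.search('42 items').group())   # → 42 items
```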

6) Combining flags (组合标志位)#

import re

# Combine with | (bitwise OR)
text = "Hello\nWorld"
print(re.findall(r'^.+$', text, re.M | re.I))
# re.M → ^ and $ per line
# re.I → case-insensitive
# → ['Hello', 'World']

# Inline flag (?i) at the start of the pattern applies to the whole pattern
print(re.findall(r'(?i)hello', 'Hello HELLO hello'))
# → ['Hello', 'HELLO', 'hello']

# Inline flags for part of pattern
print(re.findall(r'(?i:hello) world', 'HELLO world hello World'))
# → ['HELLO world']   (only 'hello' is case-insensitive, 'world' is not)

III. The `re` Module API — Complete Reference#

The re module has two usage modes: ① module-level functions like re.search() (convenient for one-off use) and ② compiled Pattern objects via re.compile() (preferred when the same pattern is used repeatedly — avoids recompilation overhead).

1. `re.compile()` — Pre-compile a pattern (预编译模式)#

import re

# compile() returns a Pattern object
pattern = re.compile(r'\d{4}-\d{2}-\d{2}', re.IGNORECASE)

# Call methods on the Pattern object (same names as module-level functions)
print(pattern.search('date: 2024-01-15'))
print(pattern.findall('from 2024-01-01 to 2024-12-31'))
# → ['2024-01-01', '2024-12-31']

# Pattern attributes
print(pattern.pattern)    # → \d{4}-\d{2}-\d{2}
print(pattern.flags)      # → 34  (32 = re.UNICODE, the default for str patterns, + 2 = re.IGNORECASE)
print(pattern.groups)     # → 0   (no capturing groups)

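Note: Module-level functions like re.search(pattern, string) compile the pattern on first use and keep recently used patterns in an internal cache (512 entries in current CPython, which is an implementation detail). In hot loops, an explicit re.compile() still avoids the cache lookup on every call and makes the intent clear.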
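A typical way to apply this is to compile once at module level and reuse the Pattern object inside the function that runs in the loop. The sketch below assumes a hypothetical helper name (extract_dates) and made-up log lines:

```python
import re

# Compiled once at import time and reused on every call
DATE_RE = re.compile(r'\d{4}-\d{2}-\d{2}')

def extract_dates(lines):
    """Return every YYYY-MM-DD substring found in an iterable of lines."""
    return [m.group() for line in lines for m in DATE_RE.finditer(line)]

log_lines = [
    "2024-01-15 service started",
    "no date on this line",
    "window: 2024-02-01 to 2024-02-29",
]
print(extract_dates(log_lines))
# → ['2024-01-15', '2024-02-01', '2024-02-29']
```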

2. `re.search()` — Find first match anywhere (查找第一个匹配)#

Returns a Match object if found anywhere in the string, or None.

import re

text = "The price is $42.99 for 3 items"

m = re.search(r'\$(\d+\.\d{2})', text)
if m:
    print(m.group())    # → $42.99   (full match)
    print(m.group(1))   # → 42.99    (group 1, without the $)
    print(m.start())    # → 13       (start index)
    print(m.end())      # → 19       (end index)
    print(m.span())     # → (13, 19) (start, end)
    print(m.string)     # → "The price is $42.99 for 3 items"  (original)

3. `re.match()` — Match at string start (从字符串开头匹配)#

Warning: re.match() only matches at the BEGINNING of the string — NOT the same as re.search()!

import re

# match() only succeeds if the pattern matches starting at position 0
print(re.match(r'\d+', '123 abc'))    # match: starts at position 0
print(re.match(r'\d+', 'abc 123'))    # None:  'abc' is not \d+
print(re.search(r'\d+', 'abc 123'))   # match: search finds it anywhere

# match() with ^ is redundant (both restrict to start)
print(re.match(r'hello', 'hello world'))    # match
print(re.match(r'hello', 'say hello'))      # None

# Practical: validate that a string is ENTIRELY a number
# (caveat: $ also matches just before a trailing newline; fullmatch() is stricter)
def is_integer(s):
    return bool(re.match(r'^\d+$', s))

print(is_integer("12345"))    # → True
print(is_integer("123a5"))    # → False
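One subtlety with the ^...$ idiom: $ matches at the end of the string and also just before a single trailing newline, so input that still carries its line ending passes the check. The sketch below compares the three spellings on a made-up input:

```python
import re

s = "123\n"   # e.g. a line read from a file, newline still attached

print(bool(re.match(r'^\d+$', s)))     # → True  ($ matches just before the final \n)
print(bool(re.match(r'\d+\Z', s)))     # → False (\Z matches only at the very end)
print(bool(re.fullmatch(r'\d+', s)))   # → False
```

For validation, \Z or re.fullmatch() is therefore the stricter choice.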

4. `re.fullmatch()` — Match entire string (匹配整个字符串)#

Requires the pattern to match the complete string from start to end.

import re

# fullmatch() behaves like match() with \A and \Z anchors
# (stricter than ^...$, because $ also matches before a trailing newline)
print(re.fullmatch(r'\d+', '12345'))     # match: entire string is digits
print(re.fullmatch(r'\d+', '123abc'))    # None:  not ALL digits
print(re.fullmatch(r'\d+', '  123  '))   # None:  spaces don't match \d

# Validate formats completely
ip_pattern  = re.compile(r'(\d{1,3}\.){3}\d{1,3}')
zip_pattern = re.compile(r'\d{5}(-\d{4})?')
email_pat   = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

tests = ['192.168.1.1', '12345', 'user@example.com', 'bad_input']
for t in tests:
    results = {
        'ip':    bool(ip_pattern.fullmatch(t)),
        'zip':   bool(zip_pattern.fullmatch(t)),
        'email': bool(email_pat.fullmatch(t)),
    }
    print(f"{t:<25} → {results}")

5. `re.findall()` — Find all matches (查找所有匹配)#

Returns a list of all non-overlapping matches.

import re

text = "2024-01-15, 2023-12-31, 2025-06-01"

# No groups → returns list of strings
print(re.findall(r'\d{4}-\d{2}-\d{2}', text))
# → ['2024-01-15', '2023-12-31', '2025-06-01']

# One group → returns list of group contents
print(re.findall(r'(\d{4})-\d{2}-\d{2}', text))
# → ['2024', '2023', '2025']   (only the year group)

# Multiple groups → returns list of tuples
print(re.findall(r'(\d{4})-(\d{2})-(\d{2})', text))
# → [('2024', '01', '15'), ('2023', '12', '31'), ('2025', '06', '01')]

Note: The return type of findall() changes based on groups: no groups → List[str], one group → List[str], multiple groups → List[tuple]. This is a common source of bugs. Use finditer() for consistent Match objects.
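One way to sidestep the changing return type is a small wrapper around finditer() that always returns the full match text. The helper name full_matches is made up for this sketch:

```python
import re

def full_matches(pattern, text):
    """Return the full text of every match, however many groups the pattern has."""
    return [m.group(0) for m in re.finditer(pattern, text)]

text = "2024-01-15, 2023-12-31"
for pat in (r'\d{4}-\d{2}-\d{2}',          # no groups
            r'(\d{4})-\d{2}-\d{2}',        # one group
            r'(\d{4})-(\d{2})-(\d{2})'):   # three groups
    print(full_matches(pat, text))
# each line → ['2024-01-15', '2023-12-31']
```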

6. `re.finditer()` — Iterator of Match objects (匹配对象迭代器)#

Returns an iterator of Match objects. More powerful than findall() because each Match has .start(), .end(), .group(), etc.

import re

text = "Alice scored 95, Bob scored 87, Carol scored 100"

for m in re.finditer(r'(\w+) scored (\d+)', text):
    name  = m.group(1)
    score = int(m.group(2))
    print(f"{name}: {score} pts  | span={m.span()}")
# → Alice: 95 pts  | span=(0, 15)
# → Bob: 87 pts  | span=(17, 30)
# → Carol: 100 pts  | span=(32, 48)

# Collect all spans for highlighting
positions = [(m.start(), m.end()) for m in re.finditer(r'\d+', text)]
print(positions)   # → [(13, 15), (28, 30), (45, 48)]

7. `re.sub()` — Substitute matches (替换匹配)#

1) Basic substitution (基本替换)#

import re

text = "Hello   World   Python"

# Replace multiple spaces with single space
result = re.sub(r'\s+', ' ', text)
print(result)   # → Hello World Python

# count parameter: replace only first N occurrences
result = re.sub(r'\s+', ' ', text, count=1)
print(result)   # → Hello World   Python  (only first replaced)

2) Backreferences in replacement (替换中的反向引用)#

import re

# \1, \2 refer to captured groups in the replacement string
# Reformat date from YYYY-MM-DD to DD/MM/YYYY
dates = "Born: 2024-01-15, Died: 2099-12-31"
result = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', dates)
print(result)   # → Born: 15/01/2024, Died: 31/12/2099

# Wrap all numbers in <b> tags
result = re.sub(r'(\d+)', r'<b>\1</b>', 'I have 3 cats and 2 dogs')
print(result)   # → I have <b>3</b> cats and <b>2</b> dogs

# Named group backreference \g<name>
result = re.sub(
    r'(?P<last>\w+), (?P<first>\w+)',
    r'\g<first> \g<last>',
    'Smith, John'
)
print(result)   # → John Smith

3) Replacement function (替换函数)#

Pass a callable as the replacement — it receives the Match object and returns the replacement string.

import re

# Convert all numbers to their double
def double(m):
    return str(int(m.group()) * 2)

result = re.sub(r'\d+', double, 'I have 3 cats and 10 dogs')
print(result)   # → I have 6 cats and 20 dogs

# Normalize different date formats to ISO 8601
MONTHS = {'Jan':1,'Feb':2,'Mar':3,'Apr':4,'May':5,'Jun':6,
          'Jul':7,'Aug':8,'Sep':9,'Oct':10,'Nov':11,'Dec':12}

date_pattern = re.compile(r'''
    (?:(?P<month_name>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
       \s+(?P<day>\d{1,2}),\s+(?P<year>\d{4}))                       # e.g. Jan 15, 2024
    |
    (?:(?P<month_num>\d{1,2})/(?P<day2>\d{1,2})/(?P<year2>\d{4}))    # e.g. 2/3/2024, month first
''', re.VERBOSE)

def normalize_date(m):
    # Groups that belong to the alternative that did not match are None
    if m.group('month_name'):
        month = MONTHS[m.group('month_name')]
        day, year = m.group('day'), m.group('year')
    else:
        month = int(m.group('month_num'))
        day, year = m.group('day2'), m.group('year2')
    return f"{int(year):04d}-{month:02d}-{int(day):02d}"

text = "Meeting on Jan 15, 2024, follow-up on 2/3/2024"
print(date_pattern.sub(normalize_date, text))
# → Meeting on 2024-01-15, follow-up on 2024-02-03

8. `re.subn()` — Substitute and count (替换并计数)#

Like re.sub() but returns a tuple (new_string, count).

import re

text = "foo bar foo baz foo"
result, n = re.subn(r'foo', 'qux', text)
print(result)   # → qux bar qux baz qux
print(n)        # → 3  (number of substitutions made)

# Useful for detecting if any replacements occurred
text2 = "no matches here"
_, count = re.subn(r'foo', 'qux', text2)
if count == 0:
    print("No substitutions made")

9. `re.split()` — Split by pattern (按模式分割)#

import re

# Split on any non-alphanumeric sequence
text = "one,two;;three   four\tfive"
print(re.split(r'[^a-zA-Z0-9]+', text))
# → ['one', 'two', 'three', 'four', 'five']

# Split on commas with optional surrounding whitespace
csv = "Alice , Bob,Carol ,  Dave"
print(re.split(r'\s*,\s*', csv))
# → ['Alice', 'Bob', 'Carol', 'Dave']

# maxsplit: only split N times
print(re.split(r'\s+', 'a b c d e', maxsplit=2))
# → ['a', 'b', 'c d e']

# Capturing group: delimiters are INCLUDED in the result
text = "one+two-three*four"
print(re.split(r'([+\-*])', text))
# → ['one', '+', 'two', '-', 'three', '*', 'four']  ← operators kept
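When a delimiter sits at the start or end of the string, or two delimiters are adjacent, re.split() returns empty strings in those positions. A short sketch with a made-up input, plus the usual filter:

```python
import re

parts = re.split(r',', ',a,,b,')
print(parts)                       # → ['', 'a', '', 'b', '']

# Drop the empty fields if they carry no meaning for your data
non_empty = [p for p in parts if p]
print(non_empty)                   # → ['a', 'b']
```

Whether to filter depends on the data: in CSV-like input an empty field is often meaningful and should be kept.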

10. `re.escape()` — Escape special characters (转义特殊字符)#

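Escapes the regex special characters in a string so that arbitrary text (such as user input) can be used as a literal pattern. Since Python 3.7 only characters that have a special meaning in a regex are escaped; earlier versions escaped every non-alphanumeric character.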

import re

# When user input is used as part of a pattern, it MUST be escaped
user_input = "hello.world (test)"
safe_pattern = re.escape(user_input)
print(safe_pattern)   # → hello\.world\ \(test\)

# Safe search
text = "I said: hello.world (test) today"
m = re.search(re.escape(user_input), text)
print(bool(m))   # → True

# Dangerous without escape:
print(re.search(user_input, text))   # . and () have special meaning!

# Common use: build a pattern from a list of keywords
keywords = ['c++', 'c#', '.net', 'node.js']
pattern  = '|'.join(re.escape(k) for k in keywords)
print(pattern)   # → c\+\+|c\#|\.net|node\.js

found = re.findall(pattern, 'I know c++ and .net and node.js', re.I)
print(found)   # → ['c++', '.net', 'node.js']
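Because alternation takes the first alternative that matches, a keyword list where one entry is a prefix of another needs to be ordered longest-first before joining. The keyword list below is a made-up example to show the effect:

```python
import re

keywords = ['c', 'c++', 'c#']   # hypothetical list: 'c' is a prefix of the others
text = 'c++ and c# and c'

naive = '|'.join(re.escape(k) for k in keywords)
print(re.findall(naive, text))      # → ['c', 'c', 'c']

# Longest keywords first, so 'c++' and 'c#' get a chance before plain 'c'
by_length = sorted(keywords, key=len, reverse=True)
ordered = '|'.join(re.escape(k) for k in by_length)
print(re.findall(ordered, text))    # → ['c++', 'c#', 'c']
```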

11. Match Object — Complete API (匹配对象完整API)#

import re

text = "2024-01-15 is a Monday in New York"
m    = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})', text)

# ── Accessing matched text ─────────────────────────────────
print(m.group())           # → 2024-01-15  (full match, same as group(0))
print(m.group(0))          # → 2024-01-15
print(m.group(1))          # → 2024         (group 1 by index)
print(m.group(2, 3))       # → ('01', '15') (multiple groups)
print(m.group('year'))     # → 2024         (group by name)
print(m.groupdict())       # → {'year': '2024', 'month': '01', 'day': '15'}
print(m.groups())          # → ('2024', '01', '15')  (all groups as tuple)
print(m.groups(default='N/A'))  # → ('2024', '01', '15')  (default only fills non-participating groups)

# ── Position information ───────────────────────────────────
print(m.start())           # → 0    (start of full match)
print(m.end())             # → 10   (end of full match)
print(m.span())            # → (0, 10)
print(m.start(1))          # → 0    (start of group 1)
print(m.end('month'))      # → 7    (end of named group)
print(m.span('day'))       # → (8, 10)

# ── Context ────────────────────────────────────────────────
print(m.string)            # → full original string
print(m.re)                # → compiled pattern object
print(m.pos)               # → 0    (start position passed to search)
print(m.endpos)            # → 34   (end position passed to search)
print(m.lastindex)         # → 3    (index of last matched group)
print(m.lastgroup)         # → day  (name of last matched group)

# ── Expand: backreferences in a template string ────────────
print(m.expand(r'\g<day>/\g<month>/\g<year>'))
# → 15/01/2024
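In the example above every group takes part in the match, so the default argument of groups() has no visible effect. The sketch below uses a made-up version pattern with an optional group to show what a non-participating group looks like:

```python
import re

# The minor version (group 2) is optional and absent from this input
m = re.search(r'(\d+)(?:\.(\d+))?', 'version 7')

print(m.groups())              # → ('7', None)
print(m.groups(default='0'))   # → ('7', '0')
print(m.group(2))              # → None
print(m.span(2))               # → (-1, -1)
```

Code that calls int(m.group(2)) or slices with m.start(2) should check for None or -1 first.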

IV. Practical Patterns — Production-Ready Recipes (生产级常用模式)#

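This section collects ready-to-use patterns for the most common real-world tasks. Each pattern is annotated, and most are exercised by the test loop at the end of the listing. Treat them as pragmatic approximations rather than complete validators.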

1. Validation Patterns (验证模式)#

import re

patterns = {

    # Email (simplified heuristic, not a full RFC 5321 validator)
    'email': re.compile(r'''
        ^[a-zA-Z0-9._%+\-]+     # local part
        @
        [a-zA-Z0-9.\-]+          # domain
        \.[a-zA-Z]{2,}$          # TLD (2+ chars)
    ''', re.VERBOSE),

    # Phone: +1 (555) 123-4567 / 555-123-4567 / 5551234567
    'phone_us': re.compile(
        r'^(\+1[-.\s]?)?'
        r'(\(?\d{3}\)?[-.\s]?)'
        r'\d{3}[-.\s]?\d{4}$'
    ),

    # IPv4 address
    'ipv4': re.compile(
        r'^((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}'
        r'(25[0-5]|2[0-4]\d|[01]?\d\d?)$'
    ),

    # URL (http/https)
    'url': re.compile(
        r'^https?://'
        r'(([a-zA-Z0-9\-]+\.)+[a-zA-Z]{2,})'
        r'(:\d+)?'
        r'(/[^\s]*)?$'
    ),

    # Date: YYYY-MM-DD
    'date_iso': re.compile(
        r'^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$'
    ),

    # Strong password: 8+ chars, upper, lower, digit, special
    'strong_password': re.compile(
        r'^(?=.*[a-z])'          # at least one lowercase
        r'(?=.*[A-Z])'           # at least one uppercase
        r'(?=.*\d)'              # at least one digit
        r'(?=.*[!@#$%^&*])'     # at least one special char
        r'.{8,}$'                # at least 8 chars total
    ),

    # Credit card (Visa/MC/Amex; digits only, no spaces or dashes)
    'credit_card': re.compile(
        r'^(?:4\d{12}(?:\d{3})?'     # Visa
        r'|5[1-5]\d{14}'             # MasterCard
        r'|3[47]\d{13})$'            # Amex
    ),

    # ZIP code (US)
    'zip_us': re.compile(r'^\d{5}(-\d{4})?$'),

    # Hex color
    'hex_color': re.compile(r'^#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})$'),

    # Semantic version: 1.2.3 or 1.2.3-alpha.1
    'semver': re.compile(
        r'^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)'
        r'(-[a-zA-Z0-9.\-]+)?(\+[a-zA-Z0-9.\-]+)?$'
    ),
}

# Test them
tests = {
    'email':           ['user@example.com', 'bad@', 'no-at-sign'],
    'ipv4':            ['192.168.1.1', '256.0.0.1', '10.0.0'],
    'date_iso':        ['2024-01-15', '2024-13-01', '24-1-1'],
    'strong_password': ['Abc@1234', 'weakpass', 'NoSpecial1'],
    'hex_color':       ['#FF5733', '#abc', '#GGGGGG'],
    'semver':          ['1.2.3', '1.0.0-alpha.1', '1.2'],
}

for field, values in tests.items():
    pat = patterns[field]
    print(f"\n{field}:")
    for v in values:
        ok = '✅' if pat.fullmatch(v) else '❌'
        print(f"  {ok} {v!r}")
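Patterns like date_iso check the shape of the input, not its meaning: 2024-02-30 has the right shape but is not a real date. A common approach is to let the regex reject malformed input cheaply and then hand the value to a real parser. The sketch below assumes a hypothetical helper name (is_real_date) and uses datetime.strptime from the standard library:

```python
import re
from datetime import datetime

DATE_ISO = re.compile(r'\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])')

def is_real_date(s):
    """True only if s looks like YYYY-MM-DD and names an existing calendar day."""
    if not DATE_ISO.fullmatch(s):
        return False            # wrong shape, no need to parse
    try:
        datetime.strptime(s, '%Y-%m-%d')
    except ValueError:
        return False            # right shape, impossible date
    return True

for candidate in ['2024-02-29', '2023-02-29', '2024-02-30', '24-1-1']:
    print(candidate, is_real_date(candidate))
# → 2024-02-29 True
# → 2023-02-29 False
# → 2024-02-30 False
# → 24-1-1 False
```

The same split applies to the other validators: the IPv4 and credit-card patterns check format, while range or checksum rules are usually easier to express in code.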

2. Extraction Patterns (提取模式)

import re

# ── Extract all URLs from text ──────────────────────────────
def extract_urls(text):
    pattern = r'https?://[^\s<>"{}|\\^`\[\]]+'
    return re.findall(pattern, text)

html = 'Visit <a href="https://example.com/path?q=1">site</a> or http://other.org'
print(extract_urls(html))
# → ['https://example.com/path?q=1', 'http://other.org']


# ── Extract all emails ──────────────────────────────────────
def extract_emails(text):
    pattern = r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}'
    return re.findall(pattern, text)

text = "Contact alice@example.com or bob.smith@company.co.uk for info"
print(extract_emails(text))
# → ['alice@example.com', 'bob.smith@company.co.uk']


# ── Parse log lines ─────────────────────────────────────────
def parse_log(line):
    pattern = re.compile(r'''
        (?P<ip>[\d.]+)          \s+   # IP address
        \S+                     \s+   # ident
        \S+                     \s+   # auth user
        \[(?P<time>[^\]]+)\]    \s+   # timestamp
        "(?P<method>\w+)        \s+
         (?P<path>[^\s"]+)      \s+
         \S+"                   \s+   # HTTP version
        (?P<status>\d{3})       \s+   # status code
        (?P<size>\d+)                 # bytes
    ''', re.VERBOSE)
    m = pattern.match(line)
    return m.groupdict() if m else None

log_line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
print(parse_log(log_line))
# → {'ip': '127.0.0.1', 'time': '10/Oct/2000:13:55:36 -0700',
#    'method': 'GET', 'path': '/apache_pb.gif', 'status': '200', 'size': '2326'}


# ── Extract numbers with units ──────────────────────────────
def extract_measurements(text):
    pattern = r'(\d+(?:\.\d+)?)\s*(px|em|rem|%|pt|vh|vw)'
    return [(float(v), u) for v, u in re.findall(pattern, text)]

css = "width: 100px; margin: 1.5em; font-size: 16px; height: 50vh"
print(extract_measurements(css))
# → [(100.0, 'px'), (1.5, 'em'), (16.0, 'px'), (50.0, 'vh')]
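
One limitation of the URL pattern above is that it keeps trailing punctuation when a URL ends a sentence or sits inside parentheses. A possible workaround, sketched here under the assumption that a URL never legitimately ends in one of `.,;:!?)`; `URL_RE` and `extract_urls_trimmed` are hypothetical names introduced for this example:

```python
import re

# Same character class as extract_urls(), compiled once
URL_RE = re.compile(r'https?://[^\s<>"{}|\\^`\[\]]+')

def extract_urls_trimmed(text):
    """Find URLs, then strip sentence punctuation from the right end of each."""
    return [m.group().rstrip('.,;:!?)') for m in URL_RE.finditer(text)]

sample = 'Docs: https://example.com/a?q=1. Also (http://other.org), thanks!'
print(URL_RE.findall(sample))
# → ['https://example.com/a?q=1.', 'http://other.org),']
print(extract_urls_trimmed(sample))
# → ['https://example.com/a?q=1', 'http://other.org']
```

The trade-off is that a URL which really does end in `)` would be shortened, so this is a heuristic rather than a parser.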

3. Cleaning and Normalization Patterns (清理与标准化模式)

import re

# ── Normalize whitespace ────────────────────────────────────
def normalize_whitespace(text):
    return re.sub(r'\s+', ' ', text).strip()

print(normalize_whitespace("  Hello   World  \n\t  Python  "))
# → Hello World Python


# ── Remove HTML tags ────────────────────────────────────────
def strip_html(html):
    clean = re.sub(r'<[^>]+>', ' ', html)       # tag → space, so adjacent blocks do not fuse
    return re.sub(r'\s+', ' ', clean).strip()   # collapse the extra spaces

html = "<h1>Title</h1><p>Some <b>bold</b> and <em>italic</em> text.</p>"
print(strip_html(html))
# → Title Some bold and italic text.


# ── Slugify a string ────────────────────────────────────────
def slugify(text):
    text = text.lower()
    text = re.sub(r'[^\w\s-]', '',  text)   # drop punctuation (keep word chars, spaces, dashes)
    text = re.sub(r'[\s_]+',   '-', text)   # spaces/underscores → dash
    text = re.sub(r'-+',       '-', text)   # multiple dashes → one
    return text.strip('-')

print(slugify("Hello, World! This is Python 3.12"))
# → hello-world-this-is-python-312


# ── Camel case to snake case ────────────────────────────────
def camel_to_snake(name):
    name = re.sub(r'([A-Z]+)([A-Z][a-z])', r'\1_\2', name)  # ABCDef → ABC_Def
    name = re.sub(r'([a-z\d])([A-Z])',      r'\1_\2', name)  # fooBar → foo_Bar
    return name.lower()

print(camel_to_snake('camelCaseString'))    # → camel_case_string
print(camel_to_snake('parseHTMLContent'))   # → parse_html_content
print(camel_to_snake('MyClassName'))        # → my_class_name


# ── Mask sensitive data ─────────────────────────────────────
def mask_credit_card(text):
    # Matches 16 contiguous digits only; keeps the first and last four
    return re.sub(r'\b(\d{4})\d{8}(\d{4})\b', r'\1 **** **** \2', text)

def mask_email(text):
    return re.sub(r'(\w{2})\w+(@[^\s]+)', r'\1***\2', text)

print(mask_credit_card("Card: 4111111111111111"))
# → Card: 4111 **** **** 1111
print(mask_email("Email alice@example.com to bob@test.org"))
# → Email al***@example.com to bo***@test.org
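
`re.sub()` also accepts a function as the replacement, which helps when the replacement depends on the matched text. A rough sketch of a masking variant that hides exactly as many characters as it removes; `EMAIL_RE`, `mask_local_part`, and the `keep` parameter are assumptions made for this example:

```python
import re

# Local part (letters, digits, dot, plus, dash) followed by a dotted domain
EMAIL_RE = re.compile(r'\b([\w.+-]+)@([\w-]+(?:\.[\w-]+)+)')

def mask_local_part(text, keep=2):
    """Keep the first `keep` characters of each local part, star out the rest."""
    def repl(m):
        local, domain = m.group(1), m.group(2)
        hidden = max(len(local) - keep, 0)
        return f"{local[:keep]}{'*' * hidden}@{domain}"
    return EMAIL_RE.sub(repl, text)

print(mask_local_part("Email alice@example.com to bob.smith@test.org"))
# → Email al***@example.com to bo*******@test.org
```

The callback receives a Match object for every hit, so any of the `m.group()` accessors can be used to build the replacement.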

4. Common Pitfalls (常见陷阱)

1) Catastrophic backtracking (灾难性回溯)

import re, time

# ⚠️ DANGEROUS pattern: (a+)+ causes exponential backtracking
evil_pattern   = r'^(a+)+$'
safe_pattern   = r'^a+$'

test_string = 'a' * 20 + 'X'  # no match, which forces maximum backtracking

# Safe pattern: fast
t = time.perf_counter()
re.search(safe_pattern, test_string)
print(f"Safe:  {time.perf_counter()-t:.6f}s")   # → microseconds

# Evil pattern: time roughly doubles with every extra 'a'
# (DO NOT run with 'a' * 30 + 'X')
t = time.perf_counter()
re.search(evil_pattern, test_string)
print(f"Evil:  {time.perf_counter()-t:.6f}s")   # → much longer

# FIX: restructure the pattern so nested quantifiers cannot match the same text in several ways.
# Python 3.11+ also supports atomic groups (?>...) and possessive quantifiers (a++, a*+) in re;
# there is no re.POSSESSIVE flag. On older versions, the third-party regex module offers both.
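
To make the suggested fixes concrete, here is a small sketch that times three rewrites of the dangerous pattern on an input the original `(a+)+` form should not be run against. It assumes Python 3.11 or newer for the atomic-group and possessive forms and skips them on older interpreters; the `candidates` and `results` dictionaries are just scaffolding for the example:

```python
import re
import sys
import time

subject = 'a' * 30 + 'X'   # far too slow for r'^(a+)+$'

candidates = {'restructured': r'^a+$'}
if sys.version_info >= (3, 11):
    # Both forms forbid giving back characters once a+ has matched
    candidates['atomic group'] = r'^(?>a+)+$'
    candidates['possessive'] = r'^(?:a++)+$'

results = {}
for label, pat in candidates.items():
    start = time.perf_counter()
    results[label] = re.search(pat, subject)
    elapsed = time.perf_counter() - start
    print(f"{label:<13} match={results[label]!r}  {elapsed:.6f}s")
```

Every variant reports `match=None` quickly, because none of them can retry the same run of `a` characters in a different grouping.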

2) `re.match()` vs `re.search()` confusion

import re

# COMMON MISTAKE: using match() when search() is needed
data = "  123 some text"

# Incorrect: assuming match() searches anywhere (it only tries position 0)
result = re.match(r'\d+', data)   # → None!  (leading spaces)

# Correct
result = re.search(r'\d+', data)  # → Match object; result.group() == '123'

# Or keep match() and allow for the leading whitespace in the pattern
result = re.match(r'\s*(\d+)', data)  # → result.group(1) == '123'
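
The difference is easiest to see by running the same pattern through `match()`, `search()`, and `fullmatch()` side by side. A short sketch; the `rows` dictionary is only there to collect the results:

```python
import re

pat = re.compile(r'\d+')

rows = {}
for s in ['123', '123abc', '  123']:
    # bool(...) is True when a Match object came back, False for None
    rows[s] = (bool(pat.match(s)), bool(pat.search(s)), bool(pat.fullmatch(s)))
    print(f"{s!r:10} match={rows[s][0]}  search={rows[s][1]}  fullmatch={rows[s][2]}")

# '123'     → all three succeed
# '123abc'  → match and search succeed, fullmatch fails (trailing 'abc')
# '  123'   → only search succeeds (digits do not start at position 0)
```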

3) `findall()` group return type surprise

import re

text = "2024-01 2024-02"

# Bug: adding a group changes return type
print(re.findall(r'\d{4}-\d{2}',       text))  # → ['2024-01', '2024-02']
print(re.findall(r'(\d{4})-\d{2}',     text))  # → ['2024', '2024']  (years only!)
print(re.findall(r'(\d{4})-(\d{2})',   text))  # → [('2024', '01'), ('2024', '02')]

# Fix: use non-capturing group when you don't need the group value
print(re.findall(r'(?:\d{4})-(?:\d{2})', text))  # → ['2024-01', '2024-02']
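
When both the whole match and the individual groups are needed, `re.finditer()` avoids the surprise, since each Match object exposes everything at once. A minimal sketch; `rows` is a throwaway list used for illustration:

```python
import re

text = "2024-01 2024-02"

rows = []
for m in re.finditer(r'(\d{4})-(\d{2})', text):
    # group(0) is the full match; group(1) and group(2) are the captures
    rows.append((m.group(0), m.group(1), m.group(2), m.span()))

print(rows)
# → [('2024-01', '2024', '01', (0, 7)), ('2024-02', '2024', '02', (8, 15))]
```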

4) Forgetting raw strings

import re

# WRONG: \b interpreted by Python as backspace character (ASCII 8)
print(re.findall('\bword\b', 'word in a sentence'))   # → []  WRONG

# CORRECT: raw string
print(re.findall(r'\bword\b', 'word in a sentence'))  # → ['word']

# FRAGILE: '\d' is not a valid Python escape, so the backslash survives and the regex works,
# but Python emits a DeprecationWarning (3.6+) or SyntaxWarning (3.12+) for it
print(re.findall('\d+', 'abc 123'))   # → ['123'], with a warning
# CORRECT:
print(re.findall(r'\d+', 'abc 123'))  # → ['123']
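
A quick way to see what the regex engine actually receives is to compare the two string literals directly. A small sketch; the variable names `plain` and `raw` are illustrative:

```python
import re

plain = '\bword\b'    # Python turns each \b into one backspace character (\x08)
raw   = r'\bword\b'   # backslash and 'b' stay as two separate characters

print(len(plain), len(raw))                      # → 6 8
print(plain == '\x08word\x08')                   # → True
print(re.findall(plain, 'word in a sentence'))   # → []
print(re.findall(raw,   'word in a sentence'))   # → ['word']
```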

V. Complete API Quick Reference (完整API速查表)

| Function / Method | Returns | Use when |
| --- | --- | --- |
| `re.compile(pat, flags)` | `Pattern` | Pattern reused multiple times |
| `re.search(pat, s)` | `Match` or `None` | Find first match anywhere |
| `re.match(pat, s)` | `Match` or `None` | Match only at the start of the string |
| `re.fullmatch(pat, s)` | `Match` or `None` | Pattern must cover entire string |
| `re.findall(pat, s)` | list of `str` or `tuple` | All matches as a list (groups change the element type) |
| `re.finditer(pat, s)` | iterator of `Match` | All matches with position info |
| `re.sub(pat, repl, s)` | `str` | Replace matches |
| `re.subn(pat, repl, s)` | `(str, int)` | Replace + count substitutions |
| `re.split(pat, s)` | list of `str` | Split string by pattern |
| `re.escape(s)` | `str` | Escape metacharacters so a literal string can be used in a pattern |
| `m.group(n)` | `str` (or `None` if the group did not participate) | Get captured group text |
| `m.groups()` | `tuple` | All groups as tuple |
| `m.groupdict()` | `dict` | Named groups as dict |
| `m.start() / m.end()` | `int` | Match position |
| `m.span()` | `(int, int)` | (start, end) tuple |
| `m.expand(template)` | `str` | Expand `\1` / `\g<name>` backreferences in a template |
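
A few of the less common entries in the table are easier to remember after seeing them run once. The following sketch exercises `subn`, `split`, `escape`, `expand`, `groupdict`, and `span`; the sample strings are made up for illustration:

```python
import re

print(re.subn(r'\d', '#', 'a1b22'))        # → ('a#b##', 3)
print(re.split(r'\s*,\s*', 'a , b,c'))     # → ['a', 'b', 'c']
print(re.escape('3.99'))                   # → 3\.99

match = re.search(r'(?P<y>\d{4})-(?P<m>\d{2})', 'on 2024-01')
print(match.groupdict())                   # → {'y': '2024', 'm': '01'}
print(match.span())                        # → (3, 10)
print(match.expand(r'\g<m>/\g<y>'))        # → 01/2024
```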

💡 One-line Takeaway
Master regex in four steps: ① know the 5 building blocks (literals, ., classes [], quantifiers *+?{}, anchors ^$\b) → ② use groups () to capture, (?:) to group without capturing → ③ add lookaround (?=)(?!) for context-sensitive matching → ④ always use r'' raw strings, prefer non-greedy .*?, and re.VERBOSE for complex patterns.
