Text processing with regular expressions

Text processing with regular expressions#

In this lecture we will discuss extracting information from strings/pieces of text. By information we mean here a substring which satisfies certain specifications, e.g. it represents a number or valid email address. We will see that this task can sometimes get quickly out of hand - capturing all the corner cases will lead to code which maybe hard to read and maintain. This is where regular expressions will enter the stage to save the day.

String methods#

Recall the previously seen methods of str objects

str.find
str.index
str.count
str.replace

which are suitable for working with specifications given by fixed characters. For example, a simple re-implementation of os.path.splitext could read

import os


def splitext(path):
    """Naive implementation of split extension in os.path"""
    if path.count(".") == 0:
        return None
    # Unpack the list as list[:-1], list[-1]
    *noext, ext = path.split(".")
    return ("".join(noext), f".{ext}")


print(splitext("foo_bar"), os.path.splitext("foo_bar"))
print(splitext("foo.bar"), os.path.splitext("foo.bar"))
print(splitext("foo.bar.txt"), os.path.splitext("foo.bar.txt"))

None ('foo_bar', '')
('foo', '.bar') ('foo', '.bar')
('foobar', '.txt') ('foo.bar', '.txt')

What is a number?#

What if the specifications are not as easy to define? In the following we will attempt to write a fuction which returns a list of numbers contained in a string. Before finding many we shold be able to find just one.

# Our first definition motivated by integers


def only_digits(string):
    if set(string) <= {"0", "1", "2", "3", "4", "5", "6", "7", "8", "9"}:
        return True
    return False


print(only_digits("123a"))
print(only_digits("123"))

False
True

But what about negative numbers

def is_integer(string):
    if len(string) == 1:
        return only_digits(string)
    maybe_sign, rest = string[0], string[1:]
    if maybe_sign == "-" and only_digits(rest):
        return True

    return only_digits(string)


print(is_integer("-123"))

True

Now what about floats, 0.234 and scientific formating 0.234E-04. We could probably handle all these after a while but here we will settle for a different approach: a number string is that string which can be converted to a number

def is_number(string):
    """It is a number if it can be converted"""
    try:
        float(string)
        return True
    except ValueError:
        return False


print(is_number(-3.43e-02))

True

So far we have considered strings that were either number or not. But what if the strings are more involved. For example let us consider some string encoding wind speed and temperature ‘wind27temperature-32’. We don’t expect

is_number('wind27temperature32')

to work but we could start building a solution using is_number in the fundations.

is_number("wind27temperature32")

False

def find_numbers(string):
    numbers = []
    first, n = 0, len(string)
    while first < n:
        maybe = string[first : first + 1]
        # If we have a number, grow it as much as possible
        if is_number(maybe):
            last = first
            while last < n and is_number(string[first : last + 1]):
                maybe = string[first : last + 1]
                last += 1
            numbers.append(maybe)
            first = last
        else:
            first += 1

    return numbers


print(find_numbers("wind27temperature32"))
print(find_numbers("wind27temperature42.23423"))

['27', '32']
['27', '42.23423']

So this looks promising but what about negetive temperatures and that scientific formatting again?

print(find_numbers("wind27temperature-42.234E-23"))

['27', '42.234', '23']

Not quite what we want … .The problem above was matching on the smallest strings so e.g. leading minus sign - is ignore because it is not a number on its own and we don’t include 42.234E-23 because “42.234E” is False according to is_number. We are back to the drawing board.

def get_substrings(string):
    """Generated substring of string order by descending ize"""
    substrings = []
    n = len(string)
    for subl in range(n - 1, 0, -1):
        for i in range(n + 1 - subl):
            substrings.append(string[i : i + subl])
    return substrings


def find_numbers1(string):
    """Match from largest"""
    if not string:
        return []

    if is_number(string):
        return [string]

    numbers = []
    for substring in get_substrings(string):
        # print(substring, is_number(substring))
        # For a match we clip the string and as for matches in the neighbors
        if is_number(substring):
            numbers.append(substring.strip())
            # print('->', substring, string[:string.index(substring)], string[string.index(substring)+len(substring):])
            numbers.extend(find_numbers1(string[: string.index(substring)]))
            numbers.extend(
                find_numbers1(string[string.index(substring) + len(substring) :])
            )
            break
    return numbers


print(find_numbers1("wind27temperature-42.234E-23"))
print(find_numbers1("wind27temperature-42.234E-23 wind27temperature-42.234e123"))

['-42.234E-23', '27']
['-42.234E-23', '27', '-42.234e123', '27']

This seems to work but (while it may have been fun to write this code) you should have a feeling that There must be a better way.

Question: 1. Can you come of with a different/better implementation of a number extractor?

Regular expressions#

sequence of characters that defines a search pattern
the pattern is parsed/interpretted in a special way (cf. programming language)
they are not unique to Python, see RegEx 101

You have seen regexp when working with your file system

# Here is an example
%ls *.ipynb
# rm *

regular-expressions.ipynb

In Python the functionality for RegEx is provided in the re module.

Here we will demonstrate the basics of regex syntax using the top-level function re.search and return to more of the module’s functions later

import re
re.search(pattern:str, string:str, [flags]) -> re.Match or None

Match specifies the part of string where pattern was matched

import re

print(re.search("tractor", "attractor"))  # 'tractor' in 'Attractor'.lower()
print(re.search("tractor", "Attractor"))
print(
    re.search("tractor", "ATTractor", re.IGNORECASE)
)  # re.IGNORECASE is our first flag

<re.Match object; span=(2, 9), match='tractor'>
<re.Match object; span=(2, 9), match='tractor'>
<re.Match object; span=(2, 9), match='Tractor'>

pattern, string = "tractor", "Attractor"
match = re.search(pattern, string)
print(string[match.start() : match.end()])

tractor

And so we see that “simple words” specify a pattern that simply matches itself. However, we could already do that with str methods. To specify more intesting patterns RegEx use special syntax.

Metacharacters#

The following characters have a special meaning in defining the pattern

. ^ $ * + ? { } [ ] \ | ( )

When one wants to match them literally they need to be escaped

print(re.search("^", "2^6"))  # We have a match ?!
print(re.search(r"\^", "2^6"))  # Here as expected

<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(1, 2), match='^'>

Square brackets delineate a character class. Characters can be listed individually …

print(re.search("[ea]", "long lean"))  #  Match at e or a
print(re.search("[oea]", "long lean"))  #  Why did we stop here
print(re.search("[ea]", "long last lean"))

<re.Match object; span=(6, 7), match='e'>
<re.Match object; span=(1, 2), match='o'>
<re.Match object; span=(6, 7), match='a'>

… or given as ranges, which can be combined, …

print(re.search("[a-z]", "life"))
print(re.search("[a-zA-Z0-9]", "____7up"))  # Walk though here

<re.Match object; span=(0, 1), match='l'>
<re.Match object; span=(4, 5), match='7'>

and complemented

print(re.search("[^a]", "abc"))
print(re.search("[^a]", "bc"))
print(re.search("[^a-z]", "life"))
print(re.search("[^a-z]", "up7"))

<re.Match object; span=(1, 2), match='b'>
<re.Match object; span=(0, 1), match='b'>
None
<re.Match object; span=(2, 3), match='7'>

Note that metacharacters (except backslash) are not active inside a character class. For example we have seen a match in re.search('^', '2^6') because ^ means (normally) the anchor for start of a string.

Backslash \ is another another metacharacter. We recall that \ has a special meaning in Python strings, e.g. ‘\n’, ‘\t’ are not simply strings of 2 characters but represent carriage return/newline and tab respecively. To supress this beviour we will use raw string r"string"

(len("\n"), len(r"\n"))

(1, 2)

Speical sequences beginning with \

\d : matches a single character that is a digit, [0-9]
\w : matches a word character, [a-zA-Z0-9_] (Note the underscore)
\s : matches a whitespace character, [ \t\n\r\f\v]

And their negations

\D : matches a single character that is a not digit, [^0-9]
\W : matches a non-word character, [^a-zA-Z0-9_]
\S : matches a non-whitespace character, [^ \t\n\r\f\v]

Note that these preserve their meaning inside character classes

# NOTE: we want to match . literally |
# Digit or digit or . or digit       v
print(re.search(r"[\d\d\.\d]", "asasd 938323.23"))  # Walk through
#                                      |
#                                      V
print(re.search(r"[\d\d\.\d]", "asasd ..938323.23"))  # Walk through
# Pattern is digit followed by digit, dot and digit
#                                      <-->
print(re.search(r"\d\d\.\d", "asasd 938323.23"))  # Walk through

<re.Match object; span=(6, 7), match='9'>
<re.Match object; span=(6, 7), match='.'>
<re.Match object; span=(10, 14), match='23.2'>

Other special characters

. matches any character except a newline
| the OR operator

print(re.search(".", "\n"))
print(re.search(".", " "))
print(re.search(".", " vv"))

None
<re.Match object; span=(0, 1), match=' '>
<re.Match object; span=(0, 1), match=' '>

print(re.search("a|b", "xerox"))
print(re.search("a|b", "vibe"))

None
<re.Match object; span=(2, 3), match='b'>

Anchors#

Characters for specifing position or RE within the string

^ match at the beginning of the string (but re.MULTILINE)
\A matches only at the start of the string
$ match at the end of string (but re.MULTILINE)
\Z matches only at the start of the string
\b word boundary
\B not word boundary

print(re.search("^a", "ball"))
print(re.search("^a", "ball\nall", re.MULTILINE))
print(re.search("a", "ball"))

None
<re.Match object; span=(5, 6), match='a'>
<re.Match object; span=(1, 2), match='a'>

# NOTE: '\b' is the bell character hence r'...'
print(re.search(r"\ball\b", "ballistic"))
print(re.search(r"\ball\b", "all"))
print(re.search(r"\ball\b", "allabama"))
print(re.search(r"\ball\B", "allabama"))

None
<re.Match object; span=(0, 3), match='all'>
None
<re.Match object; span=(0, 3), match='all'>

Repeating patterns#

Quantifiers

The characters *, +, ? and {} are reserved as quantifiers.

For example,

* 0 or more occurances of previous character/RE
+ 1 or more occurances or previous character/RE
? 0 or 1 occurances of previous character/RE
{n} exactly n occurances of previous character/RE
{n,} at least n occurances
{n, m} n to m (included) occurances

Question:

What is the equivalent of {0, }
What is the equivalent of {1, }

print(re.search("i*", "team"))
print(re.search("i+", "team"))
print(re.search("i+", "team spirit"))
print(re.search(r"\B(i{1})\B", "spxirit"))

<re.Match object; span=(0, 0), match=''>
None
<re.Match object; span=(7, 8), match='i'>
<re.Match object; span=(3, 4), match='i'>

# With classes
print(re.search("[aeiou]+", "team"))
print(re.search("[aeuio]+", "lynx"))
print(re.search(r"[\d]+", "SR 71"))

<re.Match object; span=(1, 3), match='ea'>
None
<re.Match object; span=(3, 5), match='71'>

Greedines

With operators *, + the resulting action is to consume as much of the pattern as possible/will match with the longest string they can find. That is

# Pattern strarting with < followed by any number of non-newline charancters followed by >
print(re.search("<.*>", "<html><head><title>Title</title>"))
#                        |start                         |stop     in greedy way

<re.Match object; span=(0, 32), match='<html><head><title>Title</title>'>

print(re.search("<.*?>", "<html><head><title>Title</title>"))
#                        |start|stop     in non-greedy way stop at first match

<re.Match object; span=(0, 6), match='<html>'>

Grouping and capturing#

Placing parts of the regex inside () will group that part of the regex together. Let’s look for groups of digits

# Say you have a phone number of the for XXX-YYYY
match = re.search(r"(\d{3})-(\d{4})", "123-3333")

print(match.groups())
print(match.group(0))  # Entire match
print(match.group(1))  # Get the group by order (staring from 1)
print(match.group(2))

('123', '3333')
123-3333
123
3333

We can reffer back to captured groups to build up patterns

# The first match shold be repeated
print(re.search(r"(\d{3})-(\d{4})-\1", "123-3333-444"))
print(re.search(r"(\d{3})-(\d{4})-\1", "123-3333-123"))

None
<re.Match object; span=(0, 12), match='123-3333-123'>

With many groups it might be convenient to name them, define (?P<name>), refer (?P=name)

match = re.search(r"(?P<area>\d{3})-(?P<extension>\d{4})", "123-3333")
print(match.group("area"), match.group("extension"))

match = re.search(r"(?P<area>\d{3})-(?P<extension>\d{4})-(?P=area)", "123-3333-123")
print(match)
print(match.group("area"))

123 3333
<re.Match object; span=(0, 12), match='123-3333-123'>
123

Sometimes we might not want to capture/skip the group, (?:...)

match = re.search(r"(?P<area>\d{3})-(?:\d{4})-(?P=area)", "123-3333-123")
print(match)
print(match.groups())

<re.Match object; span=(0, 12), match='123-3333-123'>
('123',)

Lookahead and Lookbehind assertions and conditional expressions#

(?=<lookahead_regex>) asserts that what follows the regex parser’s current position must match <lookahead_regex>

# Match pattern of 3 digits if they are followed by 4 letter string
# pattern = '\d{3}-(?=[a-zA-Z]{4})'
pattern = r"\d{3}-[a-zA-Z]{4}"
print(re.search(pattern, "123-4568"))

match = re.search(pattern, "123-miro")
print(match)
print(match.groups())

None
<re.Match object; span=(0, 8), match='123-miro'>
()

# We can again name groups
match = re.search(r"(\d{3})-(?=(?P<name>[a-zA-Z]{4}))", "123-miro")
print(match)
match.group(1, "name")

<re.Match object; span=(0, 4), match='123-'>

('123', 'miro')

(?<=<lookbehind_regex>) asserts that what precedes the regex parser’s current position must match <lookbehind_regex>

# Match the 3 digits if they are preceed by string of 4 letters
pattern = r"(?<=(?P<name>^[a-zA-Z]{4})):(?P<phone>\d{3})"
match = re.search(pattern, "miroslav:123")
print(match)  # Note that we are getting the match on the digits

match = re.search(pattern, "miro:123")
print(match)
match.group("name", "phone")

None
<re.Match object; span=(4, 8), match=':123'>

('miro', '123')

(?(id/name)<yes-pattern>|<no-pattern>) If the group id exists the match will use <yes pattern> otherwise <no pattern> is used.

import re

# If there is word phone in the string capture it with group 1 and then the rest of the pattern must be space, digits or word
# missing.
pattern = r"(phone)?(?(1)\s\d+|missing)"

print(re.search(pattern, "phone 123"))
print(re.search(pattern, "phone XYZ"))
print(re.search(pattern, "phone missing"))

<re.Match object; span=(0, 9), match='phone 123'>
None
<re.Match object; span=(6, 13), match='missing'>

Verbose patterns#

Regular expression can become long and convoluted in which case it comes in handy to be able to document them. To this end we can use the re.VERBOSE flag. Suppose we want to parse phone numbers of the form (area code) 4digits 4digits

pattern = r"\((\d{3})\)\D*(\d{4})\D*(\d{4})"
print(re.search(pattern, "800-5555-1234"))
print(re.search(pattern, "(800)-5555-1234"))

None
<re.Match object; span=(0, 15), match='(800)-5555-1234'>

And now a documented version

pattern = r"""
\(         # Area code begins with (
(\d{3})    # is followed by 3 digits
\)         # and close area code
\D*        # There can be some no digits spaces until the next group
(\d{4})    # of four digits
\D*        # Again some separators
(\d{4})    # and finally a 4 digit group
"""
match = re.search(pattern, "(800)- - 5555--1234", re.VERBOSE)
print(match)
print(match.groups)

<re.Match object; span=(0, 19), match='(800)- - 5555--1234'>
<built-in method groups of re.Match object at 0x110ad2f00>

Functions of the `re` module#

re.search stopped once we found a first match. All matches can be found with re.findall

string = "a=1 aa=2 b c aaa=3"
print(re.findall(r"a=\d+", string))
# Contrast with the following patterns where we introduce 2 groups
print(re.findall(r"(a)=(\d+)", string))

['a=1', 'a=2', 'a=3']
[('a', '1'), ('a', '2'), ('a', '3')]

re.split lets us split the string where the pattern matched. In the following example we consider numbers whose digits form repeating patterns. These patterns are our split points.

import sympy as sp

r = sp.Rational(1, 7)
n = str(sp.N(r, 80))
print(n)  # See that they repeat

0.14285714285714285714285714285714285714285714285714285714285714285714285714285714

pattern = r"(\d+?)\1"
re.search(pattern, n).groups()

re.split(pattern, n)

['0.',
 '142857',
 '',
 '142857',
 '',
 '142857',
 '',
 '142857',
 '',
 '142857',
 '',
 '142857',
 '14285714']

re.sub will perform substitution for the matched substring. A cool feature is that we can use the groups to build the replacent string. In the following we assume string in the form DAY MONTH YEAR encoded respectively as 3-letter word and strings of 2 and 4 digits. We are only after the year and month in the replaced string.

pattern = r"([A-Z]{3}) (?:\d{2}) (\d{4})"
string = "today is MON 12 2022 and tomorrow will be"
re.search(pattern, string)


re.sub(pattern, r"replaced: \2 \1", string)

'today is replaced: 2022 MON and tomorrow will be'

So far, we have used top-level functions. A building block of these is a pattern which can be complied by re.compile. Citing the re documentation: Under the hood, these functions simply create a pattern object for you and call the appropriate method on it. They also store the compiled object in a cache, so future calls using the same RE won’t need to parse the pattern again and again.

Should you use these module-level functions, or should you get the pattern and call its methods yourself? If you’re accessing a regex within a loop, pre-compiling it will save a few function calls. Outside of loops, there’s not much difference thanks to the internal cache.

Question: 1. Verify the above by profiling.

Case studies with RegEx#

Here are some practice examples. The solutions are not unique (as there can be different patterns that match the answer). I also do not claim the solutions are 100% correct.

Validating an ip4 address#

The address takes the form X.Y.Z.W where any of X, Y, Z, W represent numbers 0-255. Note that 012 is valid number as is 001.

# https://projects.lukehaas.me/regexhub/


def is_valid(string):
    pattern = r"^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
    # The key observation here is representation of numbers 0-255
    # We have 25[0-5]
    #         2[0-4][0-9]
    #         1[0-9][0-9]
    #         0?[0-9][0-9]?
    return re.match(pattern, string) is not None

print(is_valid("0.0.0.0"))
print(is_valid("255.255.255.0"))
print(is_valid("255.255.255.256"))
print(is_valid("255.55.05.0"))

True
True
False
True

Validating nordic email address#

Suppose we only allow for alpha-numerics caracters in the name and domain name. For valid address we want extract a username and country

def email_user(address):
    pattern = r"^(?P<user>[a-zA-Z0-9]+)@(?:[a-zA-Z0-9]+)\.(?P<country>(?:no)|(?:se)|(?:fi)|(?:is)|(?:dk))$"
    match = re.match(pattern, address)
    if match is None:
        return ""
    return match.group("user"), match.group("country")

print(email_user("john@doe.com"))
print(email_user("john@doe_.no"))
print(email_user("john@doe.no"))

('john', 'no')

Finding numbers#

We come back to our original problem of finding bumbers. The number encoding we want to handle

starts with and optional +- sign
there are digits with an optional . followd by more digits
afterwards we can have an optional exponentiation ‘eE’ followed by optional +- and integers specifying the exponent

We consider the regexp as follows

def find_numbers_re(string):
    pattern = r"""
    [+-]?             # Opional sign
    \d+               # One or more digits
    (?:\.\d+)?        # followed optionally by . with more digits
    (?:[eE][+-]?\d+)? # Optional exponentiation: eE with optional sign and integer exponent
    """
    pattern = re.compile(pattern, re.VERBOSE)
    return re.findall(pattern, string)

print(find_numbers_re("12"), find_numbers1("12"))
print(find_numbers_re("12.3E+3"), find_numbers1("12.3E+3"))
print(find_numbers_re("aaa"), find_numbers1("aaa"))
print(find_numbers_re("12.334E-04"), find_numbers1("12.334E-04"))

['12'] ['12']
['12.3E+3'] ['12.3E+3']
[] []
['12.334E-04'] ['12.334E-04']

Perhaps the interesting question now that we have two implementations is whether they yield the same answers and what is their speed.

from secrets import token_urlsafe

string = token_urlsafe(100)
result_nore = find_numbers1(string)
result_re = find_numbers_re(string)
print(set(result_nore) == set(result_re))

True

%%timeit
result_nore = find_numbers1(string)

15.9 ms ± 89 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
result_nore = find_numbers_re(string)

5.75 µs ± 73.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Repeated words#

It is common when writing text to produce typos of the form the the or a a. We would like to write a function that fixes strings plagued with these errors. For starters assume that the typos occur only in the sentence interior.

def fix_repeated_words(string):
    # Here the pattern will only match 2 repeats
    pattern = r"\b(\w+)\s+\1"
    match = re.search(pattern, string)
    if match is None:
        return string
    # We deal with N repeats 2 at a time; substitute and reduce to N-1 case by
    # replacing the matched substring with the group 1, i.e. the repeated word
    return fix_repeated_words(re.sub(pattern, r"\1", string))

print(fix_repeated_words("xx them them yy"))
print(fix_repeated_words("xx the the the yy"))
print(fix_repeated_words("xx the a a the the yy"))
print(fix_repeated_words("xx the the a a yy"))

xx them yy
xx the yy
xx the a the yy
xx the a yy

Question:

Can you handle ‘the the the’ case without iterations?
Can you extend the fix_repeated_words to handle typos at the beginning of sentence “The the” -> “The”

Algebraic chess notation#

Moves in a chess game can be described using the following notation.

1.Nf3 Nf6 2.c4 g6 3.Nc3 Bg7 4.d4 O-O 5.Bf4 d5 6.Qb3 dxc4 7.Qxc4 c6 8.e4 Nbd7 9.Rd1 Nb6 10.Qc5 Bg4 11.Bg5 Na4 12.Qa3 Nxc3 13.bxc3 Nxe4 14.Bxe7 Qb6 15.Bc4 Nxc3 16.Bc5 Rfe8+ 17.Kf1 Be6 18.Bxb6 Bxc4+ 19.Kg1 Ne2+ 20.Kf1 Nxd4+ 21.Kg1 Ne2+ 22.Kf1 Nc3+ 23.Kg1 axb6 24.Qb4 Ra4 25.Qxb6 Nxd1 26.h3 Rxa2 27.Kh2 Nxf2 28.Re1 Rxe1 29.Qd8+ Bf8 30.Nxe1 Bd5 31.Nf3 Ne4 32.Qb8 b5 33.h4 h5 34.Ne5 Kg7 35.Kg1 Bc5+ 36.Kf1 Ng3+ 37.Ke1 Bb4+ 38.Kd1 Bb3+ 39.Kc1 Ne2+ 40.Kb1 Nc3+ 41.Kc1 Rc2# 0-1

Briefly, we have (move number.) followed by encoding of the move of the white and black players. Here we would like to implement regexp that matches these meoves. To do this let’s look at the moves and build the functionality gradually. To monitor the progress we will use to progressbar from tqdm package.

from tqdm.notebook import tqdm

game = """
1.Nf3 Nf6 2.c4 g6 3.Nc3 Bg7 4.d4 O-O 5.Bf4 d5 6.Qb3 dxc4 7.Qxc4 c6 8.e4 Nbd7 9.Rd1 Nb6 10.Qc5 Bg4 11.Bg5 Na4 12.Qa3 Nxc3 13.bxc3 Nxe4 14.Bxe7 Qb6 15.Bc4 Nxc3 16.Bc5 Rfe8+ 17.Kf1 Be6 18.Bxb6 Bxc4+ 19.Kg1 Ne2+ 20.Kf1 Nxd4+ 21.Kg1 Ne2+ 22.Kf1 Nc3+ 23.Kg1 axb6 24.Qb4 Ra4 25.Qxb6 Nxd1 26.h3 Rxa2 27.Kh2 Nxf2 28.Re1 Rxe1 29.Qd8+ Bf8 30.Nxe1 Bd5 31.Nf3 Ne4 32.Qb8 b5 33.h4 h5 34.Ne5 Kg7 35.Kg1 Bc5+ 36.Kf1 Ng3+ 37.Ke1 Bb4+ 38.Kd1 Bb3+ 39.Kc1 Ne2+ 40.Kb1 Nc3+ 41.Kc1 Rc2# 0-1
"""
moves = sum([row.split() for row in filter(bool, re.split(r"\d+\.", game.strip()))], []);

Nf3 means k(N)ight moved to f3. Other pieces are (K)ing, (Q)ueen, (R)ook and (B)ishop

pattern = "[KQRBN][a-h][1-8]"

for move in tqdm(moves):
    if re.match(pattern, move) is None:
        print(move)
        break

c4

We see that c4 is a valid notation - it means that a pawn has moved. For our regexp it means that the piece pattern [KQRBN] is optional.

pattern = "[KQRBN]?[a-h][1-8]"

for move in tqdm(moves):
    if re.match(pattern, move) is None:
        print(move)
        break

O-O

Castling has a special notation.

pattern = "([KQRBN]?[a-h][1-8])|(O-O)"

for move in tqdm(moves):
    if re.match(pattern, move) is None:
        print(move)
        break

dxc4

If the piece captures, this is indicated by ‘x’ in front of the position where the capture happened. The other new thing that we see here is dealing with ambiguity - more pawns could make a capture at the position so we need to indicate which one it was.

pattern = "([KQRBN]?[a-h]?x?[a-h][1-8])|(O-O)"

for move in tqdm(moves):
    if re.match(pattern, move) is None:
        print(move)
        break

0-1

Almost there; just need to the score.

pattern = "([KQRBN]?[a-h]?x?[a-h][1-8])|(O-O)|(0-1)|(1-0)"

for move in tqdm(moves):
    if re.match(pattern, move) is None:
        print(move)
        break

If you look into the game in more detail you will see that for moves like Nc3+ Rc2# our regexp matches but we are not doing anything about the check + or checkmate #.

pattern = "^([KQRBN]?[a-h]?x?[a-h][1-8])$|^(O-O)$|^(0-1)$|^(1-0)$"

for move in tqdm(moves):
    if re.match(pattern, move) is None:
        print(move)
        break

Rfe8+

pattern = "^[KQRBN]?[a-h]?x?[a-h][1-8][+#]?$|^(O-O)$|^(0-1)$|^(1-0)$"

for move in tqdm(moves):
    if re.match(pattern, move) is None:
        print(move)
        break

Question: 1. There are moves we have not covered like pawn promotion, different castlings or notation for tied games. Can you add them?

Now that we have a valid move let us extract some information from it. Here we would like to get a piece that moved and where it went.

string = "Bhg7"
pattern = (
    "^(?P<piece>[KQRBN]?[a-h]?)x?(?P<position>[a-h][1-8])[+#]?$|^(O-O)$|^(0-1)$|^(1-0)$"
)
re.search(pattern, string).groupdict()

{'piece': 'Bh', 'position': 'g7'}

Question:

Write a game narrator that parser the algebraic chess notation and return a text transcription of the game as chess_narrate('1.Nf3 Nf6 2.c4 g6 3.d4 O-O 4.Qb3 dxc4 5.Qxc4 ...') returns

White knight moves to f3.
Black knight moves to f6.
White pawn moves to c4.
...
Black castles.
...
White queen captures at c4.

For fun, look in the PyAudio library and make your narrator talk!

Looking at Poker hands#

Our final example covers another game that Magnus Carlsen likes. A valid hand has 5 cards, each card is (A)ce, 2-9, (T)en, (J)ack, (Q)ueen, (K)ing.

def valid_hand(string):
    """Chrck received poker cards"""
    # Any of the allowed ranks is repeated 5 times
    pattern = "^[ATJQK2-9]{5}"
    # NOTE: we are using re.match here and not seach -> ^
    return re.match(pattern, string) is not None


print(valid_hand("A"))
print(valid_hand("AJ9K8"))

False
True

Let us complicate things and also introduce suits (h)earts, (c)lubs, (s)pades, (d)iamonds. The card encoding is now RankSuite

def valid_hand(string):
    """Check received poker cards"""
    # A card is repeated 5 times where card is 1 occurange of rank and 1 of suite
    pattern = "^(([ATJQK2-9]{1})([hcds]{1})){5}"
    # NOTE: we are using re.match here and not seach -> ^
    return re.match(pattern, string) is not None


print(valid_hand("AcJc9sKd8"))
print(valid_hand("AcJc9sKd8h"))
print(valid_hand("AcJc9sKdhh"))

False
True
False

Question: 1. Can you extend the function to validate that the hand comes from a regular deck, e.g. do not allow 5 cards of the same kind/rank?

Moving on, assuming the hands are only valid. Let as classify them. For starters, we might want to see if the hand is a flush.

def flush(string):
    """All cards have the same suit"""
    # Capture the suit                                            Here refer to it
    pattern = r"^(?:[ATJQK2-9]{1})([hcds]{1})(?:(?:[ATJQK2-9]{1})\1){4}"
    return re.match(pattern, string) is not None

print(flush("AcAc"))
print(flush("AcJc9cTc8c"))
print(flush("AcJc9cTc8s"))
print(flush("KdJc9cTc8s"))

False
True
False
False

For other hands it might be useful to sort cards. As an example, to check for 4 of a kind we could then see if our hand, once sorted, matches XXXY or YXXXX. Here X and Y are ranks. Let’s build the sorting functionality first.

# We are on purpose using lots of (too much?) regexp functionality in the implementation


def sort_hand(string):
    """Sort according to rank"""
    # We fist want to break the hand into individual cards
    card = "([ATJQK2-9][hcds])"
    card_pattern = re.compile(f"{card}" * 5)

    match = card_pattern.search(string)
    cards = match.groups()

    # For sorting we want to extract the rank of the card. The order
    # of the card (string) will then be determined by a lookup table
    table = {rank: i for i, rank in enumerate("23456789TJQKA", 2)}
    # To get the card we grab the first charcter   [First overkill I think]
    rank_pattern = re.compile("^.")
    key = lambda card: table[rank_pattern.match(card).group()]

    sorted_cards = sorted(cards, key=key)
    # We could now just join the sorted cards to build a new string but
    # to use more regex let us try to build a substitute string and use
    # re.sub. The substitute string should look like r'\i0\i1\i2\i3\i4'
    # where ij are indices

    indices = [cards.index(card) + 1 for card in sorted_cards]
    sub = "".join([rf"\{i}" for i in indices])
    # Finally
    sorted_hand = re.sub(card_pattern, sub, string)

    return sorted_hand

Let’s give it a try and use it to check if the hand has 4 cards of the same kind.

sort_hand("AcJd2c5cKs")

'2c5cJdKsAc'

def four_of_kind(hand):
    """Four hand with four of a kind return the kind otherwise empty string"""
    sorted_hand = sort_hand(hand)
    # NOTE: at this point we dont care about the suits. So we match
    # on the rank and if we want to see if the first card repeats 3 times
    first_pattern = r"([ATJQK2-9]).\1.\1.\1.(?:[ATJQK2-9].)"
    match = re.search(first_pattern, sorted_hand)
    if match is not None:
        return match.group(1)
    # If there is no match we skip the first (lower) card and
    # see about the 4 of a kind at the tail
    last_pattern = r"(?:[ATJQK2-9].)([ATJQK2-9]).\1.\1.\1"
    match = re.search(last_pattern, sorted_hand)
    if match is not None:
        return match.group(1)
    # No luck
    return ""

print(four_of_kind("AcJd2c5cKs"))
print(four_of_kind("AcAdAcAcKs"))
print(four_of_kind("7c7d7c7cKs"))
print(four_of_kind("2cKdKcKcKs"))
print(four_of_kind("7c7d7c7c7s"))

A
7
K
7

Question:

Can you use the OR pattern in the implementation of four_of_kind?
Can you write functions that evaluate remaining hands?
On top of functions you wrote could you add a function that generates hands for N players from a regular deck of cards and assigns a winner based on dealt hands?
Now you can run the simulator for many games - how do your statistics for winning games compare with Wiki?

Additional Resources#

Videos from Simon Funke from previous editions of the course Regex 1, Regex 2, Regex 3,Regex 4

RegEx golf by Google’s Peter Norvig.
Nice tutorial on RegEx in Python

Text processing with regular expressions

Contents

Text processing with regular expressions#

String methods#

What is a number?#

Regular expressions#

Metacharacters#

Anchors#

Repeating patterns#

Grouping and capturing#

Lookahead and Lookbehind assertions and conditional expressions#

Verbose patterns#

Functions of the re module#

Case studies with RegEx#

Validating an ip4 address#

Validating nordic email address#

Finding numbers#

Repeated words#

Algebraic chess notation#

Looking at Poker hands#

Additional Resources#

Functions of the `re` module#