Scraping the web with BeautifulSoup#

We are going to extract information from websites using the requests and Beautiful Soup (beautifulsoup4) packages.

Installation#

With conda, you can install the required dependencies with:

conda install beautifulsoup4 requests

or

python3 -m pip install beautifulsoup4 requests

Basic usage of BeautifulSoup#

First, we import the BeautifulSoup class:

from bs4 import BeautifulSoup

We load the HTML source file from disk and pass its contents to the BeautifulSoup constructor.

with open("list.html") as f:
    html = f.read()
    document = BeautifulSoup(html, "html.parser")
print(html)
<!doctype html>
<html>
  <head>
    <title>Sample HTML document</title>
  </head>
  <body>
    <h2>An Unordered HTML List</h2>

    <ul id="unordered_list" style="color: #f0e">
      <li>Coffee</li>
      <li>Tea</li>
      <li>Milk</li>
    </ul>

    <h2>An Ordered HTML List</h2>

    <ol id="ordered_list" style="color: rgb(20, 200, 100)">
      <li>First</li>
      <li>Second</li>
      <li>Third</li>
    </ol>
  </body>
</html>
from IPython.display import HTML

HTML(html)
Sample HTML document

An Unordered HTML List

  • Coffee
  • Tea
  • Milk

An Ordered HTML List

  1. First
  2. Second
  3. Third

Finding tags by name#

The document object now contains the full parsed HTML document. We can find the first occurrence of a tag with a specific name using the find function. Let’s find the first unordered list tag:

ulist = document.find("ul")

The result includes the matched tag together with all tags nested inside it:

ulist
<ul id="unordered_list" style="color: #f0e">
<li>Coffee</li>
<li>Tea</li>
<li>Milk</li>
</ul>

The find_all function returns all tags that match the given tag name. We can use it to get a list of all list items:

items = ulist.find_all("li")
items
[<li>Coffee</li>, <li>Tea</li>, <li>Milk</li>]
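One detail is worth knowing when using these two functions together: find returns None when nothing matches, while find_all returns an empty list. A minimal sketch, using a small inline snippet (made up for illustration) instead of list.html:

```python
from bs4 import BeautifulSoup

# Hypothetical inline snippet standing in for list.html
snippet = "<ul><li>Coffee</li><li>Tea</li></ul>"
soup = BeautifulSoup(snippet, "html.parser")

first_item = soup.find("li")          # first matching tag
all_items = soup.find_all("li")       # every matching tag
missing = soup.find("table")          # no match: None
missing_all = soup.find_all("table")  # no match: empty list

print(first_item)
print(len(all_items))
print(missing, missing_all)
```

Checking the result of find against None before using it avoids an AttributeError when a page does not contain the expected tag.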

Finally, we can loop over all items and extract their content with the get_text function:

for item in items:
    print(repr(item.get_text()))
'Coffee'
'Tea'
'Milk'

Because whitespace is not meaningful in HTML, it is often useful to strip it when you are getting the content of a tag. You can do this with strip=True:

for item in items:
    print(repr(item.get_text(strip=True)))
'Coffee'
'Tea'
'Milk'
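The items in list.html have no surrounding whitespace, so both loops print the same values. The effect of strip=True is easier to see on markup that spans several lines. The snippet below is a made-up example; it also shows the separator argument of get_text, which helps when a tag contains nested tags:

```python
from bs4 import BeautifulSoup

# Made-up markup with line breaks and indentation inside the tags
snippet = """
<ul>
  <li>
    Coffee
  </li>
  <li>Green <b>tea</b></li>
</ul>
"""
soup = BeautifulSoup(snippet, "html.parser")
coffee, tea = soup.find_all("li")

raw = coffee.get_text()              # keeps newlines and indentation
clean = coffee.get_text(strip=True)  # surrounding whitespace removed

# strip=True strips every text fragment before joining them,
# so the space between "Green" and "tea" is lost ...
joined = tea.get_text(strip=True)
# ... unless a separator is passed as the first argument
spaced = tea.get_text(" ", strip=True)

print(repr(raw), repr(clean), repr(joined), repr(spaced))
```

The joining behaviour explains results such as '% ofworld' that appear when header cells containing nested tags are read with strip=True.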

Note that find_all is recursive by default. This means that we can call it on the full document to get the items of both the ordered and unordered lists:

document.find_all("li")
[<li>Coffee</li>,
 <li>Tea</li>,
 <li>Milk</li>,
 <li>First</li>,
 <li>Second</li>,
 <li>Third</li>]
document.find_all("li", recursive=False)
[]
ulist.find_all("li", recursive=False)
[<li>Coffee</li>, <li>Tea</li>, <li>Milk</li>]

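With recursive=False, only the direct children of a tag are searched. The document has no li tags as direct children, so that call returns an empty list, while ulist contains its items directly. A recursive search, the default, finds all li tags anywhere in the tree: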

document.find_all("li")

Finding tags by attributes#

Sometimes the easiest way to find a tag is by one of its attributes. In our example, both lists have an id attribute that uniquely identifies them. We can also use the find* methods to search by attribute:

document.find(attrs={"id": "unordered_list"})
<ul id="unordered_list" style="color: #f0e">
<li>Coffee</li>
<li>Tea</li>
<li>Milk</li>
</ul>
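The same tag can be reached in a few equivalent ways. The sketch below uses a shortened inline copy of the sample document (an assumption, so that it runs without list.html) and compares the attrs dictionary, the id keyword argument, and a CSS selector passed to select_one:

```python
from bs4 import BeautifulSoup

# Shortened, made-up version of the sample document
snippet = """
<ul id="unordered_list"><li>Coffee</li></ul>
<ol id="ordered_list"><li>First</li></ol>
"""
soup = BeautifulSoup(snippet, "html.parser")

by_attrs = soup.find(attrs={"id": "unordered_list"})
by_keyword = soup.find(id="unordered_list")       # keyword argument form
by_selector = soup.select_one("#unordered_list")  # CSS selector form

# Each call should return the ul tag
print(by_attrs.name, by_keyword.name, by_selector.name)
```

The attrs form is the most general one, since it also works for attribute names that are not valid Python keyword arguments, such as data-sort-value.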

Accessing attributes#

The ul tag also contains a style attribute. Any bs4 tag behaves like a dictionary with attribute names as keys and attribute values as values:

ulist.attrs
{'id': 'unordered_list', 'style': 'color: #f0e'}
ulist["style"]
'color: #f0e'
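Two details of this dictionary-like interface are worth a short sketch: get returns None for a missing attribute instead of raising KeyError, and multi-valued attributes such as class come back as a list. The markup and the style-splitting logic below are illustrative assumptions, not part of the sample file:

```python
from bs4 import BeautifulSoup

# Made-up tag with a class attribute and two style declarations
snippet = '<ul id="unordered_list" class="plain compact" style="color: #f0e; margin: 0"></ul>'
tag = BeautifulSoup(snippet, "html.parser").find("ul")

tag_id = tag["id"]              # single-valued attribute: a string
classes = tag["class"]          # multi-valued attribute: a list of strings
missing_title = tag.get("title")  # absent attribute: None, no KeyError

# Split the style string into a dictionary of CSS properties
style = {}
for declaration in tag["style"].split(";"):
    if ":" in declaration:
        name, value = declaration.split(":", 1)
        style[name.strip()] = value.strip()

print(tag_id, classes, missing_title, style)
```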

Downloading a table from Wikipedia#

We aim to get a list of countries sorted by their population size: https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population

First, let’s import the required modules:

import re

import dateutil.parser
import requests
from bs4 import BeautifulSoup

This time, we load the HTML directly from a website using the requests module:

url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"

r = requests.get(url)
url
'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'

The web server returns a status code to indicate whether the request was successful. We use that status code to check that the page was loaded successfully:

assert r.status_code == 200
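An assert works well in a notebook, but it says little about what went wrong. As an alternative sketch, requests can raise a descriptive exception itself through raise_for_status. The function name fetch_html and the 10-second timeout are choices made here for illustration, and the hand-built response object only serves to show the behaviour without network access:

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response.text


# Usage (needs network access):
# html = fetch_html(url)

# Offline demonstration with a hand-built response object
fake_response = requests.Response()
fake_response.status_code = 404
caught = None
try:
    fake_response.raise_for_status()
except requests.HTTPError as error:
    caught = error
    print("request failed:", error)
```

Passing a timeout is a common precaution, because requests otherwise waits indefinitely for a server that does not answer.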

Next, we extract the HTML source and instantiate BeautifulSoup:

html = r.text
document = BeautifulSoup(html, "html.parser")

By looking at the document, we can see that we are interested in the first table with the class wikitable, so we use find:

table = document.find("table", class_="wikitable")
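Note that class_="wikitable" also matches tables that carry additional classes, such as class="wikitable sortable", because Beautiful Soup compares the value against each individual class of a tag. A small sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up markup with three tables carrying different class attributes
snippet = """
<table class="infobox"><tr><td>skipped</td></tr></table>
<table class="wikitable sortable"><tr><td>first match</td></tr></table>
<table class="wikitable"><tr><td>second match</td></tr></table>
"""
soup = BeautifulSoup(snippet, "html.parser")

first_match = soup.find("table", class_="wikitable")
all_matches = soup.find_all("table", class_="wikitable")

print(first_match["class"])
print(len(all_matches))
```

The keyword is spelled class_ with a trailing underscore because class is a reserved word in Python.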

If you are not familiar with HTML tables, read this example first: https://www.w3schools.com/html/tryit.asp?filename=tryhtml_table_intro
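As a quick offline illustration of how an HTML table is structured: a table consists of tr rows, which in turn contain th header cells and td data cells. The tiny table below and all variable names are made up for this sketch:

```python
from bs4 import BeautifulSoup

# Made-up table: one header row followed by two data rows
snippet = """
<table>
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>Atlantis</td><td>1,000</td></tr>
  <tr><td>El Dorado</td><td>2,500</td></tr>
</table>
"""
mini_table = BeautifulSoup(snippet, "html.parser").find("table")

# One list of cell texts per row; a list of names matches th and td alike
mini_rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in mini_table.find_all("tr")
]
mini_header, *mini_body = mini_rows

print(mini_header)
print(mini_body)
```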

print(str(table)[:1024])
<table class="wikitable sortable" style="text-align:right">
<tbody><tr class="is-sticky">
<th></th>
<th style="width:17em"><a href="/wiki/List_of_sovereign_states" title="List of sovereign states">Country</a> / <a href="/wiki/Dependent_territory" title="Dependent territory">Dependency</a></th>
<th>Population</th>
<th style="width:2em">% of<br/>world</th>
<th>Date</th>
<th><span class="nowrap">Source (official or from</span><br/>the <a href="/wiki/United_Nations" title="United Nations">United Nations</a>)</th>
<th class="unsortable">
</th></tr>
<tr>
<th>–
</th>
<td style="text-align:left"><b>World</b>
</td>
<td><b>8,063,588,000</b></td>
<td><b>100%</b></td>
<td><b><span data-sort-value="000000002023-10-04-0000" style="white-space:nowrap">4 Oct 2023</span></b>
</td>
<td style="text-align:left"><b>UN projection</b><sup class="reference" id="cite_ref-unpop_4-0"><a href="#cite_note-unpop-4">[3]</a></sup></td>
<td>
</td></tr>
<tr>
<th>1
</th>
<td style="text-align:left"><span class="flagicon"><span class="mw-image-

At this point, it is a good idea to programmatically check that the table contains the correct header:

header = " ".join([th.get_text(strip=True) for th in table.find_all("th")])
assert "Population" in header
header
' Country/Dependency Population % ofworld Date Source (official or fromtheUnited Nations)  – 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 – 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 – 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 – 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 – 149 150 151 152 153 154 155 156 157 158 159 160 161 162 – 163 164 165 – 166 167 168 169 170 171 – 172 – 173 – – 174 – 175 176 177 – – – 178 179 180 – 181 – 182 183 184 – – 185 – 186 – – – – – – – 187 – – 188 189 190 – 191 – – – 192 – – 193 – 194 – – – – – – – – – 195 – –'

Exercise#

Extract the information from the table:

  • get the rows

  • find column names

  • get sensible data from each cell

  • parse numbers/dates where they show up

rows = table.find_all("tr")
rows[0]
<tr class="is-sticky">
<th></th>
<th style="width:17em"><a href="/wiki/List_of_sovereign_states" title="List of sovereign states">Country</a> / <a href="/wiki/Dependent_territory" title="Dependent territory">Dependency</a></th>
<th>Population</th>
<th style="width:2em">% of<br/>world</th>
<th>Date</th>
<th><span class="nowrap">Source (official or from</span><br/>the <a href="/wiki/United_Nations" title="United Nations">United Nations</a>)</th>
<th class="unsortable">
</th></tr>
column_names = [th.get_text(strip=True) for th in rows[0].find_all("th")]
column_names
['',
 'Country/Dependency',
 'Population',
 '% ofworld',
 'Date',
 'Source (official or fromtheUnited Nations)',
 '']
last_rank = 0
for row in rows[1:]:
    cells = row.find_all(["th", "td"])
    if not cells:
        continue
    cells_text = [cell.get_text(strip=True) for cell in cells]
    # Unpack the six known columns; any trailing cells end up in comment
    rank, country, population, percentage, updated_at, source, *comment = cells_text
    if rank.isdigit():
        rank = int(rank)
        last_rank = rank
    else:
        # Unranked rows ("–") reuse the rank of the last ranked row
        rank = last_rank
    population = int(population.replace(",", ""))
    # Convert a percentage string such as "17.5%" to a fraction
    percentage = float(re.findall(r"[\d\.]+", percentage)[0]) / 100
    updated_at = dateutil.parser.parse(updated_at).date()

    print(rank, country, f"{population:.2e}", f"{percentage:.1%}", updated_at)
0 World 8.06e+09 100.0% 2023-10-04
1 China 1.41e+09 17.5% 2022-12-31
2 India 1.39e+09 17.3% 2023-03-01
3 United States 3.35e+08 4.2% 2023-10-04
4 Indonesia 2.79e+08 3.5% 2023-07-01
5 Pakistan 2.41e+08 3.0% 2023-03-01
6 Nigeria 2.17e+08 2.7% 2022-03-21
7 Brazil 2.03e+08 2.5% 2022-08-01
8 Bangladesh 1.70e+08 2.1% 2022-06-14
9 Russia 1.46e+08 1.8% 2023-01-01
10 Mexico 1.29e+08 1.6% 2023-06-30
11 Japan 1.24e+08 1.5% 2023-09-01
12 Philippines 1.11e+08 1.4% 2023-10-04
13 Ethiopia 1.07e+08 1.3% 2023-07-01
14 Egypt 1.05e+08 1.3% 2023-10-04
15 Vietnam 1.00e+08 1.2% 2023-04-04
16 DR Congo 9.54e+07 1.2% 2019-07-01
17 Iran 8.53e+07 1.1% 2023-10-04
18 Turkey 8.53e+07 1.1% 2022-12-31
19 Germany 8.45e+07 1.0% 2023-06-30
20 Thailand 6.83e+07 0.8% 2021-07-01
21 France 6.82e+07 0.8% 2023-09-01
22 United Kingdom 6.70e+07 0.8% 2021-06-30
23 Tanzania 6.17e+07 0.8% 2022-08-23
24 South Africa 6.06e+07 0.8% 2022-07-01
25 Italy 5.88e+07 0.7% 2023-07-31
26 Myanmar 5.58e+07 0.7% 2022-07-01
27 Colombia 5.22e+07 0.6% 2023-06-30
28 Kenya 5.15e+07 0.6% 2023-01-01
29 South Korea 5.14e+07 0.6% 2022-12-31
30 Spain 4.83e+07 0.6% 2023-07-01
31 Argentina 4.67e+07 0.6% 2023-07-01
32 Algeria 4.54e+07 0.6% 2022-01-01
33 Iraq 4.33e+07 0.5% 2023-07-01
34 Uganda 4.29e+07 0.5% 2021-07-01
35 Sudan 4.20e+07 0.5% 2018-07-01
36 Ukraine 4.11e+07 0.5% 2022-02-01
37 Canada 4.04e+07 0.5% 2023-10-04
38 Poland 3.77e+07 0.5% 2023-07-31
39 Morocco 3.71e+07 0.5% 2023-10-04
40 Uzbekistan 3.64e+07 0.5% 2023-07-01
41 Afghanistan 3.43e+07 0.4% 2023-01-01
42 Peru 3.34e+07 0.4% 2022-07-01
43 Malaysia 3.34e+07 0.4% 2023-06-30
44 Angola 3.31e+07 0.4% 2022-06-30
45 Mozambique 3.24e+07 0.4% 2022-07-01
46 Saudi Arabia 3.22e+07 0.4% 2022-05-10
47 Yemen 3.19e+07 0.4% 2022-07-01
48 Ghana 3.08e+07 0.4% 2021-06-27
49 Ivory Coast 2.94e+07 0.4% 2021-12-14
50 Nepal 2.92e+07 0.4% 2021-11-25
51 Venezuela 2.83e+07 0.4% 2019-06-30
52 Cameroon 2.81e+07 0.3% 2023-07-01
53 Madagascar 2.69e+07 0.3% 2021-07-01
54 Australia 2.68e+07 0.3% 2023-10-04
55 North Korea 2.57e+07 0.3% 2021-07-01
56 Niger 2.54e+07 0.3% 2023-07-01
56 Taiwan 2.34e+07 0.3% 2023-08-31
57 Syria 2.29e+07 0.3% 2021-07-01
58 Mali 2.24e+07 0.3% 2022-06-15
59 Burkina Faso 2.22e+07 0.3% 2022-07-01
60 Sri Lanka 2.20e+07 0.3% 2023-07-01
61 Malawi 2.15e+07 0.3% 2022-07-01
62 Chile 2.00e+07 0.2% 2023-06-30
63 Kazakhstan 1.99e+07 0.2% 2023-09-01
64 Zambia 1.96e+07 0.2% 2022-09-14
65 Romania 1.91e+07 0.2% 2023-01-01
66 Ecuador 1.84e+07 0.2% 2023-10-04
67 Senegal 1.83e+07 0.2% 2023-07-01
68 Somalia 1.81e+07 0.2% 2023-07-01
69 Netherlands 1.79e+07 0.2% 2023-10-04
70 Guatemala 1.76e+07 0.2% 2023-07-01
71 Chad 1.74e+07 0.2% 2022-07-01
72 Cambodia 1.71e+07 0.2% 2023-07-01
73 Zimbabwe 1.52e+07 0.2% 2022-04-20
74 Guinea 1.33e+07 0.2% 2022-07-01
75 South Sudan 1.32e+07 0.2% 2020-07-01
76 Rwanda 1.32e+07 0.2% 2022-08-15
77 Burundi 1.28e+07 0.2% 2022-07-01
78 Benin 1.26e+07 0.2% 2023-07-01
79 Bolivia 1.20e+07 0.1% 2022-07-01
80 Tunisia 1.19e+07 0.1% 2023-01-01
81 Papua New Guinea 1.18e+07 0.1% 2021-07-01
82 Belgium 1.18e+07 0.1% 2023-08-01
83 Haiti 1.17e+07 0.1% 2020-07-01
84 Jordan 1.15e+07 0.1% 2023-10-04
85 Cuba 1.11e+07 0.1% 2022-12-31
86 Czech Republic 1.09e+07 0.1% 2023-06-30
87 Sweden 1.05e+07 0.1% 2023-08-01
88 Dominican Republic 1.05e+07 0.1% 2021-07-01
89 Greece 1.05e+07 0.1% 2021-10-22
90 Portugal 1.05e+07 0.1% 2022-12-31
91 Azerbaijan 1.02e+07 0.1% 2023-07-01
92 Tajikistan 1.01e+07 0.1% 2023-01-01
93 Israel 9.80e+06 0.1% 2023-07-31
94 Honduras 9.75e+06 0.1% 2023-07-01
95 Hungary 9.60e+06 0.1% 2023-01-01
96 United Arab Emirates 9.28e+06 0.1% 2020-12-31
97 Belarus 9.20e+06 0.1% 2023-01-01
98 Austria 9.13e+06 0.1% 2023-07-01
99 Switzerland 8.90e+06 0.1% 2023-06-30
100 Sierra Leone 8.49e+06 0.1% 2022-07-01
101 Togo 8.10e+06 0.1% 2022-11-08
101 Hong Kong(China) 7.50e+06 0.1% 2023-06-30
102 Laos 7.44e+06 0.1% 2022-07-01
103 Kyrgyzstan 7.10e+06 0.1% 2023-03-01
104 Turkmenistan 7.06e+06 0.1% 2022-12-17
105 Libya 6.93e+06 0.1% 2020-01-01
106 El Salvador 6.88e+06 0.1% 2022-07-01
107 Nicaragua 6.73e+06 0.1% 2022-06-30
108 Serbia 6.65e+06 0.1% 2022-10-31
109 Bulgaria 6.45e+06 0.1% 2022-12-31
110 Paraguay 6.11e+06 0.1% 2022-11-10
111 Congo 6.11e+06 0.1% 2023-07-01
112 Denmark 5.94e+06 0.1% 2023-07-01
113 Singapore 5.92e+06 0.1% 2023-06-30
114 Central African Republic 5.63e+06 0.1% 2020-07-01
115 Finland 5.56e+06 0.1% 2023-08-31
116 Norway 5.51e+06 0.1% 2023-06-30
117 Lebanon 5.49e+06 0.1% 2021-07-01
118 Palestine 5.48e+06 0.1% 2023-01-01
119 Slovakia 5.43e+06 0.1% 2023-06-30
120 Ireland 5.28e+06 0.1% 2023-04-01
121 Costa Rica 5.26e+06 0.1% 2023-06-30
122 New Zealand 5.22e+06 0.1% 2023-06-30
123 Oman 5.11e+06 0.1% 2023-08-31
124 Kuwait 4.67e+06 0.1% 2020-12-31
125 Liberia 4.66e+06 0.1% 2021-07-01
126 Mauritania 4.48e+06 0.1% 2023-07-01
127 Panama 4.34e+06 0.1% 2021-07-01
128 Croatia 3.86e+06 0.1% 2022-07-01
129 Eritrea 3.75e+06 0.1% 2023-07-01
130 Georgia 3.74e+06 0.1% 2023-01-01
131 Uruguay 3.57e+06 0.0% 2023-06-30
132 Mongolia 3.46e+06 0.0% 2022-12-31
133 Bosnia and Herzegovina 3.28e+06 0.0% 2022-07-01
133 Puerto Rico(US) 3.22e+06 0.0% 2022-07-01
134 Armenia 2.98e+06 0.0% 2023-01-01
135 Lithuania 2.87e+06 0.0% 2023-09-01
136 Jamaica 2.83e+06 0.0% 2019-07-01
137 Albania 2.76e+06 0.0% 2023-01-01
138 Qatar 2.66e+06 0.0% 2023-06-30
139 Namibia 2.64e+06 0.0% 2023-07-01
140 Moldova 2.51e+06 0.0% 2023-01-01
141 Gambia 2.42e+06 0.0% 2022-07-01
142 Botswana 2.41e+06 0.0% 2021-07-01
143 Lesotho 2.31e+06 0.0% 2023-07-01
144 Gabon 2.23e+06 0.0% 2021-07-01
145 Slovenia 2.12e+06 0.0% 2023-04-01
146 Latvia 1.88e+06 0.0% 2023-08-01
147 North Macedonia 1.83e+06 0.0% 2021-11-01
148 Guinea-Bissau 1.78e+06 0.0% 2023-07-01
148 Kosovo 1.77e+06 0.0% 2021-12-31
149 Bahrain 1.58e+06 0.0% 2023-07-01
150 Equatorial Guinea 1.56e+06 0.0% 2022-07-01
151 Estonia 1.37e+06 0.0% 2023-01-01
152 Trinidad and Tobago 1.37e+06 0.0% 2022-06-30
153 East Timor 1.35e+06 0.0% 2023-07-01
154 Mauritius 1.26e+06 0.0% 2023-06-30
155 Eswatini 1.22e+06 0.0% 2023-07-01
156 Djibouti 1.00e+06 0.0% 2022-07-01
157 Cyprus 9.18e+05 0.0% 2021-10-01
158 Fiji 8.93e+05 0.0% 2021-07-01
159 Bhutan 7.70e+05 0.0% 2023-10-04
160 Comoros 7.58e+05 0.0% 2017-12-15
161 Guyana 7.44e+05 0.0% 2019-07-01
162 Solomon Islands 7.35e+05 0.0% 2023-07-01
162 Macau(China) 6.79e+05 0.0% 2023-06-30
163 Luxembourg 6.61e+05 0.0% 2023-01-01
164 Montenegro 6.17e+05 0.0% 2023-01-01
165 Suriname 6.16e+05 0.0% 2021-07-01
165 Western Sahara 5.87e+05 0.0% 2023-07-01
166 Malta 5.20e+05 0.0% 2021-11-21
167 Cape Verde 4.91e+05 0.0% 2021-06-16
168 Brunei 4.45e+05 0.0% 2022-07-01
169 Belize 4.41e+05 0.0% 2022-07-01
170 Bahamas 3.97e+05 0.0% 2022-07-01
171 Iceland 3.94e+05 0.0% 2023-07-01
171 Northern Cyprus 3.83e+05 0.0% 2020-12-31
172 Maldives 3.83e+05 0.0% 2022-09-13
172 Transnistria 3.61e+05 0.0% 2022-12-31
173 Vanuatu 3.01e+05 0.0% 2021-07-01
173 French Polynesia(France) 2.80e+05 0.0% 2021-07-01
173 New Caledonia(France) 2.69e+05 0.0% 2023-01-01
174 Barbados 2.68e+05 0.0% 2022-12-31
174 Abkhazia 2.45e+05 0.0% 2020-01-01
175 São Tomé and Príncipe 2.15e+05 0.0% 2021-07-01
176 Samoa 2.06e+05 0.0% 2021-11-06
177 Saint Lucia 1.79e+05 0.0% 2018-07-01
177 Guam(US) 1.54e+05 0.0% 2020-04-01
177 Curacao(Netherlands) 1.49e+05 0.0% 2023-01-01
177 Artsakh 1.49e+05 0.0% 2019-10-01
178 Kiribati 1.21e+05 0.0% 2021-07-01
179 Grenada 1.13e+05 0.0% 2019-07-01
180 Saint Vincent and the Grenadines 1.11e+05 0.0% 2022-07-01
180 Aruba(Netherlands) 1.07e+05 0.0% 2022-09-30
181 Micronesia 1.06e+05 0.0% 2021-07-01
181 Jersey(UK) 1.03e+05 0.0% 2021-03-21
182 Antigua and Barbuda 1.01e+05 0.0% 2022-01-01
183 Seychelles 1.00e+05 0.0% 2022-04-22
184 Tonga 1.00e+05 0.0% 2022-01-01
184 US Virgin Islands(US) 8.71e+04 0.0% 2020-04-01
184 Isle of Man(UK) 8.41e+04 0.0% 2021-05-30
185 Andorra 8.35e+04 0.0% 2023-06-30
185 Cayman Islands(UK) 7.11e+04 0.0% 2020-09-30
186 Dominica 6.74e+04 0.0% 2017-12-31
186 Guernsey(UK) 6.42e+04 0.0% 2022-09-30
186 Bermuda(UK) 6.41e+04 0.0% 2021-07-01
186 Greenland(Denmark) 5.69e+04 0.0% 2023-07-01
186 South Ossetia 5.65e+04 0.0% 2021-12-31
186 Faroe Islands(Denmark) 5.47e+04 0.0% 2023-08-01
186 American Samoa(US) 4.97e+04 0.0% 2020-04-01
186 Northern Mariana Islands(US) 4.73e+04 0.0% 2020-04-01
187 Saint Kitts and Nevis 4.72e+04 0.0% 2011-05-15
187 Turks and Caicos Islands(UK) 4.61e+04 0.0% 2021-07-01
187 Sint Maarten(Netherlands) 4.29e+04 0.0% 2023-01-01
188 Marshall Islands 4.24e+04 0.0% 2021-09-30
189 Liechtenstein 3.97e+04 0.0% 2022-12-31
190 Monaco 3.90e+04 0.0% 2022-12-31
190 Gibraltar(UK) 3.40e+04 0.0% 2016-07-01
191 San Marino 3.39e+04 0.0% 2023-07-31
191 Saint Martin(France) 3.24e+04 0.0% 2020-01-01
191 British Virgin Islands(UK) 3.15e+04 0.0% 2023-07-01
191 Åland(Finland) 3.06e+04 0.0% 2023-08-31
192 Palau 1.67e+04 0.0% 2021-07-01
192 Anguilla(UK) 1.57e+04 0.0% 2021-12-31
192 Cook Islands 1.50e+04 0.0% 2021-07-01
193 Nauru 1.18e+04 0.0% 2021-07-01
193 Wallis and Futuna(France) 1.14e+04 0.0% 2021-01-01
194 Tuvalu 1.07e+04 0.0% 2021-07-01
194 Saint Barthélemy(France) 1.06e+04 0.0% 2020-01-01
194 Saint Pierre and Miquelon(France) 6.09e+03 0.0% 2020-01-01
194 Saint Helena, Ascension and Tristan da Cunha(UK) 5.65e+03 0.0% 2021-07-01
194 Montserrat(UK) 4.43e+03 0.0% 2022-07-01
194 Falkland Islands(UK) 3.66e+03 0.0% 2021-10-10
194 Norfolk Island(Australia) 2.19e+03 0.0% 2021-01-01
194 Christmas Island(Australia) 1.69e+03 0.0% 2021-01-01
194 Tokelau(NZ) 1.65e+03 0.0% 2019-01-01
194 Niue 1.55e+03 0.0% 2021-07-01
195 Vatican City 7.64e+02 0.0% 2023-06-26
195 Cocos (Keeling) Islands(Australia) 5.93e+02 0.0% 2020-06-30
195 Pitcairn Islands(UK) 4.70e+01 0.0% 2021-07-01
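The cell-parsing steps in the loop above can also be factored into small functions, which makes them easy to test on their own. The sketch below is one possible way to do this. The function names are invented here, and it uses datetime.strptime from the standard library instead of dateutil, which assumes dates in the exact form "4 Oct 2023" with English month abbreviations:

```python
import re
from datetime import date, datetime


def parse_population(text):
    """Convert a string such as '8,063,588,000' to an int."""
    return int(text.replace(",", ""))


def parse_percentage(text):
    """Convert a string such as '17.5%' to a fraction."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group()) / 100


def parse_date(text):
    """Convert a string such as '4 Oct 2023' to a datetime.date."""
    return datetime.strptime(text, "%d %b %Y").date()


print(parse_population("8,063,588,000"))
print(parse_percentage("17.5%"))
print(parse_date("4 Oct 2023"))
```

Raising ValueError for a cell without a number makes unexpected table content visible early instead of failing later with an IndexError.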

Attention: Beautiful Soup does not execute JavaScript. This means that the code you see in your browser inspector might look a bit different from the original HTML source code.

Another example of downloading a Wikipedia table#

Let’s consider another table in a Wikipedia page. This page has a lot more tables, so one challenge will be to pick the right table:

https://en.wikipedia.org/wiki/Serena_Williams

We are interested in extracting these two tables:

Target Wikipedia tables

Exercise:

Find the tables on a page by locating their headings and using .find_next()

We begin by downloading the webpage and instantiating the BeautifulSoup object:

r = requests.get("https://en.wikipedia.org/wiki/Serena_Williams")
document = BeautifulSoup(r.text, "html.parser")

This page contains a lot of tables without specific attributes that would make it easy to find our table of interest. Furthermore, the same column headings are used in multiple tables, which makes it difficult to identify a table by its headings alone:

len(document.find_all("table"))
75

Therefore, we choose another strategy.

First, we find the tag with class mw-headline whose string content starts with Singles. Then we find the next table using heading_element.find_next(...):

document.find_all(class_="mw-headline", string=re.compile("^Singles"))
[<span class="mw-headline" id="Singles:_33_(23–10)">Singles: 33 (23–10)</span>]
# find returns only the first match, as a single tag rather than a list
singles_heading = document.find(class_="mw-headline", string=re.compile("^Singles"))
singles_heading
<span class="mw-headline" id="Singles:_33_(23–10)">Singles: 33 (23–10)</span>
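To see the heading-then-find_next pattern in isolation, here is a small offline sketch. The markup is invented for illustration: two tables share the same column headings and differ only in the heading that precedes them.

```python
import re

from bs4 import BeautifulSoup

# Made-up page fragment with two look-alike tables
snippet = """
<h2><span class="mw-headline">Singles: 2 (1-1)</span></h2>
<p>Some text between the heading and the table.</p>
<table><tr><th>Result</th><th>Year</th></tr><tr><td>Win</td><td>1999</td></tr></table>
<h2><span class="mw-headline">Doubles: 1 (1-0)</span></h2>
<table><tr><th>Result</th><th>Year</th></tr><tr><td>Win</td><td>2000</td></tr></table>
"""
soup = BeautifulSoup(snippet, "html.parser")

doubles_heading = soup.find(class_="mw-headline", string=re.compile("^Doubles"))
# find_next searches forward in document order, not only among children
doubles_table = doubles_heading.find_next("table")
doubles_year = doubles_table.find_all("td")[1].get_text(strip=True)

print(doubles_year)
```

Because find_next walks forward from the heading, the paragraph between a heading and its table does not get in the way.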
singles_heading.find_next("table")
<table class="sortable wikitable">
<tbody><tr>
<th>Result
</th>
<th>Year
</th>
<th>Tournament
</th>
<th>Surface
</th>
<th>Opponents
</th>
<th class="unsortable">Score
</th></tr>
<tr style="background:#ccf;">
<td style="background:#98fb98;">Win</td>
<td><a href="/wiki/1999_US_Open_%E2%80%93_Women%27s_singles" title="1999 US Open – Women's singles">1999</a></td>
<td><a href="/wiki/US_Open_(tennis)" title="US Open (tennis)">US Open</a></td>
<td><a class="mw-redirect" href="/wiki/Hard_court" title="Hard court">Hard</a></td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/Switzerland" title="Switzerland"><img alt="Switzerland" class="mw-file-element" data-file-height="512" data-file-width="512" decoding="async" height="16" src="//upload.wikimedia.org/wikipedia/commons/thumb/0/08/Flag_of_Switzerland_%28Pantone%29.svg/16px-Flag_of_Switzerland_%28Pantone%29.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/0/08/Flag_of_Switzerland_%28Pantone%29.svg/24px-Flag_of_Switzerland_%28Pantone%29.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/0/08/Flag_of_Switzerland_%28Pantone%29.svg/32px-Flag_of_Switzerland_%28Pantone%29.svg.png 2x" width="16"/></a></span></span> <a href="/wiki/Martina_Hingis" title="Martina Hingis">Martina Hingis</a></td>
<td>6–3, 7–6<sup>(7–4)</sup>
</td></tr>
<tr style="background:#ccf;">
<td style="background:#ffa07a;">Loss</td>
<td><a href="/wiki/2001_US_Open_%E2%80%93_Women%27s_singles" title="2001 US Open – Women's singles">2001</a></td>
<td>US Open</td>
<td>Hard</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/United_States" title="United States"><img alt="United States" class="mw-file-element" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" width="23"/></a></span></span> <a href="/wiki/Venus_Williams" title="Venus Williams">Venus Williams</a></td>
<td>2–6, 4–6
</td></tr>
<tr bgcolor="#ebc2af" style="border: 2px solid blue">
<td style="background:#98fb98;">Win</td>
<td><a href="/wiki/2002_French_Open_%E2%80%93_Women%27s_singles" title="2002 French Open – Women's singles">2002</a></td>
<td>French Open</td>
<td><a href="/wiki/Clay_court" title="Clay court">Clay</a></td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/United_States" title="United States"><img alt="United States" class="mw-file-element" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" width="23"/></a></span></span> Venus Williams</td>
<td>7–5, 6–3
</td></tr>
<tr bgcolor="#cfc" style="border: 2px solid blue">
<td style="background:#98fb98;">Win</td>
<td><a href="/wiki/2002_Wimbledon_Championships_%E2%80%93_Women%27s_singles" title="2002 Wimbledon Championships – Women's singles">2002</a></td>
<td><a class="mw-redirect" href="/wiki/The_Championships,_Wimbledon" title="The Championships, Wimbledon">Wimbledon</a></td>
<td><a href="/wiki/Grass_court" title="Grass court">Grass</a></td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/United_States" title="United States"><img alt="United States" class="mw-file-element" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" width="23"/></a></span></span> Venus Williams</td>
<td>7–6<sup>(7–4)</sup>, 6–3
</td></tr>
<tr bgcolor="#ccf" style="border: 2px solid blue">
<td style="background:#98fb98;">Win</td>
<td><a href="/wiki/2002_US_Open_%E2%80%93_Women%27s_singles" title="2002 US Open – Women's singles">2002</a></td>
<td>US Open <small>(2)</small></td>
<td>Hard</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/United_States" title="United States"><img alt="United States" class="mw-file-element" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" width="23"/></a></span></span> Venus Williams</td>
<td>6–4, 6–3
</td></tr>
<tr bgcolor="#ffc" style="border: 2px solid blue">
<td style="background:#98fb98;">Win</td>
<td><a href="/wiki/2003_Australian_Open_%E2%80%93_Women%27s_singles" title="2003 Australian Open – Women's singles">2003</a></td>
<td>Australian Open</td>
<td>Hard</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/United_States" title="United States"><img alt="United States" class="mw-file-element" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" width="23"/></a></span></span> Venus Williams</td>
<td>7–6<sup>(7–4)</sup>, 3–6, 6–4
</td></tr>
<tr style="background:#cfc;">
<td style="background:#98fb98;">Win</td>
<td><a href="/wiki/2003_Wimbledon_Championships_%E2%80%93_Women%27s_singles" title="2003 Wimbledon Championships – Women's singles">2003</a></td>
<td>Wimbledon <small>(2)</small></td>
<td>Grass</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/United_States" title="United States"><img alt="United States" class="mw-file-element" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" width="23"/></a></span></span> Venus Williams</td>
<td>4–6, 6–4, 6–2
</td></tr>
<tr style="background:#cfc;">
<td style="background:#ffa07a;">Loss</td>
<td><a href="/wiki/2004_Wimbledon_Championships_%E2%80%93_Women%27s_singles" title="2004 Wimbledon Championships – Women's singles">2004</a></td>
<td>Wimbledon</td>
<td>Grass</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/Russia" title="Russia"><img alt="Russia" class="mw-file-element" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/23px-Flag_of_Russia.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/35px-Flag_of_Russia.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/45px-Flag_of_Russia.svg.png 2x" width="23"/></a></span></span> <a href="/wiki/Maria_Sharapova" title="Maria Sharapova">Maria Sharapova</a></td>
<td>1–6, 4–6
</td></tr>
<tr style="background:#ffc;">
<td style="background:#98fb98;">Win</td>
<td><a href="/wiki/2005_Australian_Open_%E2%80%93_Women%27s_singles" title="2005 Australian Open – Women's singles">2005</a></td>
<td>Australian Open <small>(2)</small></td>
<td>Hard</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/United_States" title="United States"><img alt="United States" class="mw-file-element" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" width="23"/></a></span></span> <a href="/wiki/Lindsay_Davenport" title="Lindsay Davenport">Lindsay Davenport</a></td>
<td>2–6, 6–3, 6–0
</td></tr>
<tr style="background:#ffc;">
<td style="background:#98fb98;">Win</td>
<td><a href="/wiki/2007_Australian_Open_%E2%80%93_Women%27s_singles" title="2007 Australian Open – Women's singles">2007</a></td>
<td>Australian Open <small>(3)</small></td>
<td>Hard</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/Russia" title="Russia"><img alt="Russia" class="mw-file-element" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/23px-Flag_of_Russia.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/35px-Flag_of_Russia.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/45px-Flag_of_Russia.svg.png 2x" width="23"/></a></span></span> Maria Sharapova</td>
<td>6–1, 6–2
</td></tr>
<tr style="background:#cfc;">
<td style="background:#ffa07a;">Loss</td>
<td><a href="/wiki/2008_Wimbledon_Championships_%E2%80%93_Women%27s_singles" title="2008 Wimbledon Championships – Women's singles">2008</a></td>
<td>Wimbledon</td>
<td>Grass</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/United_States" title="United States"><img alt="United States" class="mw-file-element" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" width="23"/></a></span></span> Venus Williams</td>
<td>5–7, 4–6
</td></tr>
<tr style="background:#ccf;">
<td style="background:#98fb98;">Win</td>
<td><a href="/wiki/2008_US_Open_%E2%80%93_Women%27s_singles" title="2008 US Open – Women's singles">2008</a></td>
<td>US Open <small>(3)</small></td>
<td>Hard</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/Serbia" title="Serbia"><img alt="Serbia" class="mw-file-element" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/0/00/Flag_of_Serbia_%282004%E2%80%932010%29.svg/23px-Flag_of_Serbia_%282004%E2%80%932010%29.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/0/00/Flag_of_Serbia_%282004%E2%80%932010%29.svg/35px-Flag_of_Serbia_%282004%E2%80%932010%29.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/0/00/Flag_of_Serbia_%282004%E2%80%932010%29.svg/45px-Flag_of_Serbia_%282004%E2%80%932010%29.svg.png 2x" width="23"/></a></span></span> <a href="/wiki/Jelena_Jankovi%C4%87" title="Jelena Janković">Jelena Janković</a></td>
<td>6–4, 7–5
</td></tr>
<tr style="background:#ffc;">
<td style="background:#98fb98;">Win</td>
<td><a href="/wiki/2009_Australian_Open_%E2%80%93_Women%27s_singles" title="2009 Australian Open – Women's singles">2009</a></td>
<td>Australian Open <small>(4)</small></td>
<td>Hard</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/Russia" title="Russia"><img alt="Russia" class="mw-file-element" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/23px-Flag_of_Russia.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/35px-Flag_of_Russia.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/45px-Flag_of_Russia.svg.png 2x" width="23"/></a></span></span> <a href="/wiki/Dinara_Safina" title="Dinara Safina">Dinara Safina</a></td>
<td>6–0, 6–3
</td></tr>
<tr style="background:#cfc;">
<td style="background:#98fb98;">Win</td>
<td><a href="/wiki/2009_Wimbledon_Championships_%E2%80%93_Women%27s_singles" title="2009 Wimbledon Championships – Women's singles">2009</a></td>
<td>Wimbledon <small>(3)</small></td>
<td>Grass</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/United_States" title="United States"><img alt="United States" class="mw-file-element" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" width="23"/></a></span></span> Venus Williams</td>
<td>7–6<sup>(7–3)</sup>, 6–2
</td></tr>
<tr style="background:#ffc;">
<td style="background:#98fb98;">Win</td>
<td><a href="/wiki/2010_Australian_Open_%E2%80%93_Women%27s_singles" title="2010 Australian Open – Women's singles">2010</a></td>
<td>Australian Open <small>(5)</small></td>
<td>Hard</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/Belgium" title="Belgium"><img alt="Belgium" class="mw-file-element" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/9/92/Flag_of_Belgium_%28civil%29.svg/23px-Flag_of_Belgium_%28civil%29.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/9/92/Flag_of_Belgium_%28civil%29.svg/35px-Flag_of_Belgium_%28civil%29.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/9/92/Flag_of_Belgium_%28civil%29.svg/45px-Flag_of_Belgium_%28civil%29.svg.png 2x" width="23"/></a></span></span> <a href="/wiki/Justine_Henin" title="Justine Henin">Justine Henin</a></td>
<td>6–4, 3–6, 6–2
</td></tr>
<tr style="background:#cfc;">
<td style="background:#98fb98;">Win</td>
<td><a href="/wiki/2010_Wimbledon_Championships_%E2%80%93_Women%27s_singles" title="2010 Wimbledon Championships – Women's singles">2010</a></td>
<td>Wimbledon <small>(4)</small></td>
<td>Grass</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/Russia" title="Russia"><img alt="Russia" class="mw-file-element" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/23px-Flag_of_Russia.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/35px-Flag_of_Russia.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/45px-Flag_of_Russia.svg.png 2x" width="23"/></a></span></span> <a href="/wiki/Vera_Zvonareva" title="Vera Zvonareva">Vera Zvonareva</a></td>
<td>6–3, 6–2
</td></tr>
<tr style="background:#ccf;">
<td style="background:#ffa07a;">Loss</td>
<td><a href="/wiki/2011_US_Open_%E2%80%93_Women%27s_singles" title="2011 US Open – Women's singles">2011</a></td>
<td>US Open</td>
<td>Hard</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/Australia" title="Australia"><img alt="Australia" class="mw-file-element" data-file-height="640" data-file-width="1280" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/88/Flag_of_Australia_%28converted%29.svg/23px-Flag_of_Australia_%28converted%29.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/88/Flag_of_Australia_%28converted%29.svg/35px-Flag_of_Australia_%28converted%29.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/8/88/Flag_of_Australia_%28converted%29.svg/46px-Flag_of_Australia_%28converted%29.svg.png 2x" width="23"/></a></span></span> <a href="/wiki/Samantha_Stosur" title="Samantha Stosur">Samantha Stosur</a></td>
<td>2–6, 3–6
</td></tr>
<tr style="background:#cfc;">
<td style="background:#98fb98;">Win</td>
<td><a href="/wiki/2012_Wimbledon_Championships_%E2%80%93_Women%27s_singles" title="2012 Wimbledon Championships – Women's singles">2012</a></td>
<td>Wimbledon <small>(5)</small></td>
<td>Grass</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/Poland" title="Poland"><img alt="Poland" class="mw-file-element" data-file-height="800" data-file-width="1280" decoding="async" height="14" src="//upload.wikimedia.org/wikipedia/en/thumb/1/12/Flag_of_Poland.svg/23px-Flag_of_Poland.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/1/12/Flag_of_Poland.svg/35px-Flag_of_Poland.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/1/12/Flag_of_Poland.svg/46px-Flag_of_Poland.svg.png 2x" width="23"/></a></span></span> <a href="/wiki/Agnieszka_Radwa%C5%84ska" title="Agnieszka Radwańska">Agnieszka Radwańska</a></td>
<td>6–1, 5–7, 6–2
</td></tr>
<tr style="background:#ccf;">
<td style="background:#98fb98;">Win</td>
<td><a href="/wiki/2012_US_Open_%E2%80%93_Women%27s_singles" title="2012 US Open – Women's singles">2012</a></td>
<td>US Open <small>(4)</small></td>
<td>Hard</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/Belarus" title="Belarus"><img alt="Belarus" class="mw-file-element" data-file-height="600" data-file-width="1200" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/85/Flag_of_Belarus.svg/23px-Flag_of_Belarus.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/85/Flag_of_Belarus.svg/35px-Flag_of_Belarus.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/8/85/Flag_of_Belarus.svg/46px-Flag_of_Belarus.svg.png 2x" width="23"/></a></span></span> <a href="/wiki/Victoria_Azarenka" title="Victoria Azarenka">Victoria Azarenka</a></td>
<td>6–2, 2–6, 7–5
</td></tr>
<tr style="background:#ebc2af;">
<td style="background:#98fb98;">Win</td>
<td><a href="/wiki/2013_French_Open_%E2%80%93_Women%27s_singles" title="2013 French Open – Women's singles">2013</a></td>
<td>French Open <small>(2)</small></td>
<td>Clay</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/Russia" title="Russia"><img alt="Russia" class="mw-file-element" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/23px-Flag_of_Russia.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/35px-Flag_of_Russia.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/45px-Flag_of_Russia.svg.png 2x" width="23"/></a></span></span> Maria Sharapova</td>
<td>6–4, 6–4
</td></tr>
<tr style="background:#ccf;">
<td style="background:#98fb98;">Win</td>
<td><a href="/wiki/2013_US_Open_%E2%80%93_Women%27s_singles" title="2013 US Open – Women's singles">2013</a></td>
<td>US Open <small>(5)</small></td>
<td>Hard</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/Belarus" title="Belarus"><img alt="Belarus" class="mw-file-element" data-file-height="600" data-file-width="1200" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/85/Flag_of_Belarus.svg/23px-Flag_of_Belarus.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/85/Flag_of_Belarus.svg/35px-Flag_of_Belarus.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/8/85/Flag_of_Belarus.svg/46px-Flag_of_Belarus.svg.png 2x" width="23"/></a></span></span> Victoria Azarenka</td>
<td>7–5, 6–7<sup>(6–8)</sup>, 6–1
</td></tr>
<tr bgcolor="#ccf" style="border: 2px solid blue">
<td style="background:#98fb98;">Win</td>
<td><a href="/wiki/2014_US_Open_%E2%80%93_Women%27s_singles" title="2014 US Open – Women's singles">2014</a></td>
<td>US Open <small>(6)</small></td>
<td>Hard</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/Denmark" title="Denmark"><img alt="Denmark" class="mw-file-element" data-file-height="387" data-file-width="512" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/9/9c/Flag_of_Denmark.svg/20px-Flag_of_Denmark.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/9/9c/Flag_of_Denmark.svg/31px-Flag_of_Denmark.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/9/9c/Flag_of_Denmark.svg/40px-Flag_of_Denmark.svg.png 2x" width="20"/></a></span></span> <a href="/wiki/Caroline_Wozniacki" title="Caroline Wozniacki">Caroline Wozniacki</a></td>
<td>6–3, 6–3
</td></tr>
<tr bgcolor="#ffc" style="border: 2px solid blue">
<td style="background:#98fb98;">Win</td>
<td><a href="/wiki/2015_Australian_Open_%E2%80%93_Women%27s_singles" title="2015 Australian Open – Women's singles">2015</a></td>
<td>Australian Open <small>(6)</small></td>
<td>Hard</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/Russia" title="Russia"><img alt="Russia" class="mw-file-element" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/23px-Flag_of_Russia.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/35px-Flag_of_Russia.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/45px-Flag_of_Russia.svg.png 2x" width="23"/></a></span></span> Maria Sharapova</td>
<td>6–3, 7–6<sup>(7–5)</sup>
</td></tr>
<tr bgcolor="#ebc2af" style="border: 2px solid blue">
<td style="background:#98fb98;">Win</td>
<td><a href="/wiki/2015_French_Open_%E2%80%93_Women%27s_singles" title="2015 French Open – Women's singles">2015</a></td>
<td>French Open <small>(3)</small></td>
<td>Clay</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/Czech_Republic" title="Czech Republic"><img alt="Czech Republic" class="mw-file-element" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/cb/Flag_of_the_Czech_Republic.svg/23px-Flag_of_the_Czech_Republic.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/c/cb/Flag_of_the_Czech_Republic.svg/35px-Flag_of_the_Czech_Republic.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/cb/Flag_of_the_Czech_Republic.svg/45px-Flag_of_the_Czech_Republic.svg.png 2x" width="23"/></a></span></span> <a href="/wiki/Lucie_%C5%A0af%C3%A1%C5%99ov%C3%A1" title="Lucie Šafářová">Lucie Šafářová</a></td>
<td>6–3, 6–7<sup>(2–7)</sup>, 6–2
</td></tr>
<tr bgcolor="#cfc" style="border: 2px solid blue">
<td style="background:#98fb98;">Win</td>
<td><a href="/wiki/2015_Wimbledon_Championships_%E2%80%93_Women%27s_singles" title="2015 Wimbledon Championships – Women's singles">2015</a></td>
<td>Wimbledon <small>(6)</small></td>
<td>Grass</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/Spain" title="Spain"><img alt="Spain" class="mw-file-element" data-file-height="500" data-file-width="750" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/en/thumb/9/9a/Flag_of_Spain.svg/23px-Flag_of_Spain.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/9/9a/Flag_of_Spain.svg/35px-Flag_of_Spain.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/9/9a/Flag_of_Spain.svg/45px-Flag_of_Spain.svg.png 2x" width="23"/></a></span></span> <a href="/wiki/Garbi%C3%B1e_Muguruza" title="Garbiñe Muguruza">Garbiñe Muguruza</a></td>
<td>6–4, 6–4
</td></tr>
<tr style="background:#ffc;">
<td style="background:#ffa07a;">Loss</td>
<td><a href="/wiki/2016_Australian_Open_%E2%80%93_Women%27s_singles" title="2016 Australian Open – Women's singles">2016</a></td>
<td>Australian Open</td>
<td>Hard</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/Germany" title="Germany"><img alt="Germany" class="mw-file-element" data-file-height="600" data-file-width="1000" decoding="async" height="14" src="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/23px-Flag_of_Germany.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/35px-Flag_of_Germany.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/46px-Flag_of_Germany.svg.png 2x" width="23"/></a></span></span> <a href="/wiki/Angelique_Kerber" title="Angelique Kerber">Angelique Kerber</a></td>
<td>4–6, 6–3, 4–6
</td></tr>
<tr style="background:#ebc2af;">
<td style="background:#ffa07a;">Loss</td>
<td><a href="/wiki/2016_French_Open_%E2%80%93_Women%27s_singles" title="2016 French Open – Women's singles">2016</a></td>
<td>French Open</td>
<td>Clay</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/Spain" title="Spain"><img alt="Spain" class="mw-file-element" data-file-height="500" data-file-width="750" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/en/thumb/9/9a/Flag_of_Spain.svg/23px-Flag_of_Spain.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/9/9a/Flag_of_Spain.svg/35px-Flag_of_Spain.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/9/9a/Flag_of_Spain.svg/45px-Flag_of_Spain.svg.png 2x" width="23"/></a></span></span> Garbiñe Muguruza</td>
<td>5–7, 4–6
</td></tr>
<tr style="background:#cfc;">
<td style="background:#98fb98;">Win</td>
<td><a href="/wiki/2016_Wimbledon_Championships_%E2%80%93_Women%27s_singles" title="2016 Wimbledon Championships – Women's singles">2016</a></td>
<td>Wimbledon <small>(7)</small></td>
<td>Grass</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/Germany" title="Germany"><img alt="Germany" class="mw-file-element" data-file-height="600" data-file-width="1000" decoding="async" height="14" src="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/23px-Flag_of_Germany.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/35px-Flag_of_Germany.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/46px-Flag_of_Germany.svg.png 2x" width="23"/></a></span></span> Angelique Kerber</td>
<td>7–5, 6–3
</td></tr>
<tr style="background:#ffc;">
<td style="background:#98fb98;">Win</td>
<td><a href="/wiki/2017_Australian_Open" title="2017 Australian Open">2017</a></td>
<td>Australian Open <small>(7)</small></td>
<td>Hard</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/United_States" title="United States"><img alt="United States" class="mw-file-element" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" width="23"/></a></span></span> Venus Williams</td>
<td>6–4, 6–4
</td></tr>
<tr style="background:#cfc;">
<td style="background:#ffa07a;">Loss</td>
<td><a href="/wiki/2018_Wimbledon_Championships_%E2%80%93_Women%27s_singles" title="2018 Wimbledon Championships – Women's singles">2018</a></td>
<td>Wimbledon</td>
<td>Grass</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/Germany" title="Germany"><img alt="Germany" class="mw-file-element" data-file-height="600" data-file-width="1000" decoding="async" height="14" src="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/23px-Flag_of_Germany.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/35px-Flag_of_Germany.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/46px-Flag_of_Germany.svg.png 2x" width="23"/></a></span></span> Angelique Kerber</td>
<td>3–6, 3–6
</td></tr>
<tr style="background:#ccf;">
<td style="background:#ffa07a;">Loss</td>
<td><a href="/wiki/2018_US_Open_%E2%80%93_Women%27s_singles" title="2018 US Open – Women's singles">2018</a></td>
<td>US Open</td>
<td>Hard</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/Japan" title="Japan"><img alt="Japan" class="mw-file-element" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/en/thumb/9/9e/Flag_of_Japan.svg/23px-Flag_of_Japan.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/9/9e/Flag_of_Japan.svg/35px-Flag_of_Japan.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/9/9e/Flag_of_Japan.svg/45px-Flag_of_Japan.svg.png 2x" width="23"/></a></span></span> <a href="/wiki/Naomi_Osaka" title="Naomi Osaka">Naomi Osaka</a></td>
<td><a href="/wiki/2018_US_Open_%E2%80%93_Women%27s_singles_final" title="2018 US Open – Women's singles final">2–6, 4–6</a>
</td></tr>
<tr style="background:#cfc;">
<td style="background:#ffa07a;">Loss</td>
<td><a href="/wiki/2019_Wimbledon_Championships_%E2%80%93_Women%27s_singles" title="2019 Wimbledon Championships – Women's singles">2019</a></td>
<td>Wimbledon</td>
<td>Grass</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/Romania" title="Romania"><img alt="Romania" class="mw-file-element" data-file-height="400" data-file-width="600" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/7/73/Flag_of_Romania.svg/23px-Flag_of_Romania.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/7/73/Flag_of_Romania.svg/35px-Flag_of_Romania.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/7/73/Flag_of_Romania.svg/45px-Flag_of_Romania.svg.png 2x" width="23"/></a></span></span> <a href="/wiki/Simona_Halep" title="Simona Halep">Simona Halep</a></td>
<td>2–6, 2–6
</td></tr>
<tr style="background:#ccf;">
<td style="background:#ffa07a;">Loss</td>
<td><a href="/wiki/2019_US_Open_%E2%80%93_Women%27s_singles" title="2019 US Open – Women's singles">2019</a></td>
<td>US Open</td>
<td>Hard</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/wiki/Canada" title="Canada"><img alt="Canada" class="mw-file-element" data-file-height="600" data-file-width="1200" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Flag_of_Canada_%28Pantone%29.svg/23px-Flag_of_Canada_%28Pantone%29.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Flag_of_Canada_%28Pantone%29.svg/35px-Flag_of_Canada_%28Pantone%29.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Flag_of_Canada_%28Pantone%29.svg/46px-Flag_of_Canada_%28Pantone%29.svg.png 2x" width="23"/></a></span></span> <a href="/wiki/Bianca_Andreescu" title="Bianca Andreescu">Bianca Andreescu</a></td>
<td>3–6, 5–7
</td></tr></tbody></table>

The tables we are interested in are the first two result tables, “Singles” and “Women’s doubles”. We write a small helper function that returns the first table following a heading that matches a given pattern:

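def find_table_with_heading(document, heading_pat):
    """Return the first table that follows a heading matching heading_pat."""
    # on this page, section titles carry the class "mw-headline"
    heading_element = document.find(class_="mw-headline", string=heading_pat)
    # find_next searches forward in document order, not only among children
    table = heading_element.find_next("table")
    return table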
singles_table = find_table_with_heading(document, re.compile("^Singles"))
# print headers
headings = singles_table.find_all("th")
[th.get_text(strip=True) for th in headings]
['Result', 'Year', 'Tournament', 'Surface', 'Opponents', 'Score']
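
The helper assumes that the heading exists. If `find` returns `None`, the next line fails with an `AttributeError` that does not say what went wrong. Below is a minimal sketch of a more defensive variant. The name `find_table_after_heading` and the tiny inline HTML are assumptions made for illustration; they only mimic the structure of the Wikipedia markup.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical miniature page, loosely mimicking the Wikipedia markup
sample_html = """
<h3><span class="mw-headline">Singles: 2 finals</span></h3>
<table><tr><th>Result</th><th>Year</th></tr><tr><td>Win</td><td>1999</td></tr></table>
<h3><span class="mw-headline">Women's doubles: 1 final</span></h3>
<table><tr><th>Result</th><th>Partner</th></tr></table>
"""


def find_table_after_heading(document, heading_pat):
    """Return the first <table> after a matching heading, with explicit errors."""
    heading = document.find(class_="mw-headline", string=heading_pat)
    if heading is None:
        raise LookupError(f"no heading matches {heading_pat.pattern!r}")
    table = heading.find_next("table")
    if table is None:
        raise LookupError(f"no table follows heading {heading.get_text()!r}")
    return table


sample_doc = BeautifulSoup(sample_html, "html.parser")
sample_table = find_table_after_heading(sample_doc, re.compile(r"^Women's doubles"))
sample_headers = [th.get_text(strip=True) for th in sample_table.find_all("th")]
print(sample_headers)
```

Raising a `LookupError` with the pattern in the message makes it easier to notice when a page layout has changed and the scraper silently stopped matching.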

Next, we find the table after the heading “Women’s doubles”:

doubles_table = find_table_with_heading(document, re.compile(r"^Women's doubles"))
# print headers
headings = doubles_table.find_all("th")
[th.get_text(strip=True) for th in headings]
['Result', 'Year', 'Tournament', 'Surface', 'Partner', 'Opponents', 'Score']

Exercise:#

  • Iterate through the rows

  • Convert the year to an integer (or a date)

  • Strip the note, e.g. ‘(12)’, from the tournament name, so the same tournament always has the same string

  • Load the data into a pandas DataFrame (more on pandas in a later lecture)

re.sub?
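data = []
for row in singles_table.find_all("tr"):
    cells = row.find_all("td")
    if not cells:
        # the header row contains only <th> cells, so skip it
        continue
    values = [cell.get_text(strip=True) for cell in cells]
    # convert the year to an integer
    values[1] = int(values[1])
    # drop the running count, e.g. "(3)", from the tournament name
    values[2] = re.sub(r"\s*\(.+\)", "", values[2])
    print(values)
    data.append(values)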
['Win', 1999, 'US Open', 'Hard', 'Martina Hingis', '6–3, 7–6(7–4)']
['Loss', 2001, 'US Open', 'Hard', 'Venus Williams', '2–6, 4–6']
['Win', 2002, 'French Open', 'Clay', 'Venus Williams', '7–5, 6–3']
['Win', 2002, 'Wimbledon', 'Grass', 'Venus Williams', '7–6(7–4), 6–3']
['Win', 2002, 'US Open', 'Hard', 'Venus Williams', '6–4, 6–3']
['Win', 2003, 'Australian Open', 'Hard', 'Venus Williams', '7–6(7–4), 3–6, 6–4']
['Win', 2003, 'Wimbledon', 'Grass', 'Venus Williams', '4–6, 6–4, 6–2']
['Loss', 2004, 'Wimbledon', 'Grass', 'Maria Sharapova', '1–6, 4–6']
['Win', 2005, 'Australian Open', 'Hard', 'Lindsay Davenport', '2–6, 6–3, 6–0']
['Win', 2007, 'Australian Open', 'Hard', 'Maria Sharapova', '6–1, 6–2']
['Loss', 2008, 'Wimbledon', 'Grass', 'Venus Williams', '5–7, 4–6']
['Win', 2008, 'US Open', 'Hard', 'Jelena Janković', '6–4, 7–5']
['Win', 2009, 'Australian Open', 'Hard', 'Dinara Safina', '6–0, 6–3']
['Win', 2009, 'Wimbledon', 'Grass', 'Venus Williams', '7–6(7–3), 6–2']
['Win', 2010, 'Australian Open', 'Hard', 'Justine Henin', '6–4, 3–6, 6–2']
['Win', 2010, 'Wimbledon', 'Grass', 'Vera Zvonareva', '6–3, 6–2']
['Loss', 2011, 'US Open', 'Hard', 'Samantha Stosur', '2–6, 3–6']
['Win', 2012, 'Wimbledon', 'Grass', 'Agnieszka Radwańska', '6–1, 5–7, 6–2']
['Win', 2012, 'US Open', 'Hard', 'Victoria Azarenka', '6–2, 2–6, 7–5']
['Win', 2013, 'French Open', 'Clay', 'Maria Sharapova', '6–4, 6–4']
['Win', 2013, 'US Open', 'Hard', 'Victoria Azarenka', '7–5, 6–7(6–8), 6–1']
['Win', 2014, 'US Open', 'Hard', 'Caroline Wozniacki', '6–3, 6–3']
['Win', 2015, 'Australian Open', 'Hard', 'Maria Sharapova', '6–3, 7–6(7–5)']
['Win', 2015, 'French Open', 'Clay', 'Lucie Šafářová', '6–3, 6–7(2–7), 6–2']
['Win', 2015, 'Wimbledon', 'Grass', 'Garbiñe Muguruza', '6–4, 6–4']
['Loss', 2016, 'Australian Open', 'Hard', 'Angelique Kerber', '4–6, 6–3, 4–6']
['Loss', 2016, 'French Open', 'Clay', 'Garbiñe Muguruza', '5–7, 4–6']
['Win', 2016, 'Wimbledon', 'Grass', 'Angelique Kerber', '7–5, 6–3']
['Win', 2017, 'Australian Open', 'Hard', 'Venus Williams', '6–4, 6–4']
['Loss', 2018, 'Wimbledon', 'Grass', 'Angelique Kerber', '3–6, 3–6']
['Loss', 2018, 'US Open', 'Hard', 'Naomi Osaka', '2–6, 4–6']
['Loss', 2019, 'Wimbledon', 'Grass', 'Simona Halep', '2–6, 2–6']
['Loss', 2019, 'US Open', 'Hard', 'Bianca Andreescu', '3–6, 5–7']
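
It can help to try the regular expression from the loop on a few strings in isolation. The sketch below compares it with a stricter alternative that only removes a trailing note made of digits. The stricter pattern and the last example name are assumptions made for illustration; neither appears in the scraped table.

```python
import re

# Pattern used in the loop above: greedy, removes everything
# from the first "(" to the last ")"
LOOSE = re.compile(r"\s*\(.+\)")
# Hypothetical stricter alternative: only a trailing "(digits)" note
STRICT = re.compile(r"\s*\(\d+\)$")

# The last name is made up to show where the two patterns differ
names = ["US Open(3)", "Australian Open (7)", "Wimbledon", "Cup (indoor) (2)"]

loose_cleaned = [LOOSE.sub("", name) for name in names]
strict_cleaned = [STRICT.sub("", name) for name in names]

for name, loose, strict in zip(names, loose_cleaned, strict_cleaned):
    print(repr(name), "->", repr(loose), "|", repr(strict))
```

For the tournament names in this table both patterns give the same result. They only differ when a name contains parentheses that are not a trailing count.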

When data is in this form, we can convert it into a DataFrame with pandas.

You’ll learn more about pandas next week.

import pandas as pd

headings = [th.get_text(strip=True) for th in singles_table.find_all("th")]
df = pd.DataFrame(data, columns=headings)
df
Result Year Tournament Surface Opponents Score
0 Win 1999 US Open Hard Martina Hingis 6–3, 7–6(7–4)
1 Loss 2001 US Open Hard Venus Williams 2–6, 4–6
2 Win 2002 French Open Clay Venus Williams 7–5, 6–3
3 Win 2002 Wimbledon Grass Venus Williams 7–6(7–4), 6–3
4 Win 2002 US Open Hard Venus Williams 6–4, 6–3
5 Win 2003 Australian Open Hard Venus Williams 7–6(7–4), 3–6, 6–4
6 Win 2003 Wimbledon Grass Venus Williams 4–6, 6–4, 6–2
7 Loss 2004 Wimbledon Grass Maria Sharapova 1–6, 4–6
8 Win 2005 Australian Open Hard Lindsay Davenport 2–6, 6–3, 6–0
9 Win 2007 Australian Open Hard Maria Sharapova 6–1, 6–2
10 Loss 2008 Wimbledon Grass Venus Williams 5–7, 4–6
11 Win 2008 US Open Hard Jelena Janković 6–4, 7–5
12 Win 2009 Australian Open Hard Dinara Safina 6–0, 6–3
13 Win 2009 Wimbledon Grass Venus Williams 7–6(7–3), 6–2
14 Win 2010 Australian Open Hard Justine Henin 6–4, 3–6, 6–2
15 Win 2010 Wimbledon Grass Vera Zvonareva 6–3, 6–2
16 Loss 2011 US Open Hard Samantha Stosur 2–6, 3–6
17 Win 2012 Wimbledon Grass Agnieszka Radwańska 6–1, 5–7, 6–2
18 Win 2012 US Open Hard Victoria Azarenka 6–2, 2–6, 7–5
19 Win 2013 French Open Clay Maria Sharapova 6–4, 6–4
20 Win 2013 US Open Hard Victoria Azarenka 7–5, 6–7(6–8), 6–1
21 Win 2014 US Open Hard Caroline Wozniacki 6–3, 6–3
22 Win 2015 Australian Open Hard Maria Sharapova 6–3, 7–6(7–5)
23 Win 2015 French Open Clay Lucie Šafářová 6–3, 6–7(2–7), 6–2
24 Win 2015 Wimbledon Grass Garbiñe Muguruza 6–4, 6–4
25 Loss 2016 Australian Open Hard Angelique Kerber 4–6, 6–3, 4–6
26 Loss 2016 French Open Clay Garbiñe Muguruza 5–7, 4–6
27 Win 2016 Wimbledon Grass Angelique Kerber 7–5, 6–3
28 Win 2017 Australian Open Hard Venus Williams 6–4, 6–4
29 Loss 2018 Wimbledon Grass Angelique Kerber 3–6, 3–6
30 Loss 2018 US Open Hard Naomi Osaka 2–6, 4–6
31 Loss 2019 Wimbledon Grass Simona Halep 2–6, 2–6
32 Loss 2019 US Open Hard Bianca Andreescu 3–6, 5–7
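
A scraped table can contain rows with merged or missing cells, in which case the row length no longer matches the header. The following sketch, using the first three rows from above, checks the lengths before building the DataFrame. The check itself is an addition for illustration and is not part of the original solution.

```python
import pandas as pd

columns = ["Result", "Year", "Tournament", "Surface", "Opponents", "Score"]
# First three rows of the scraped data
rows = [
    ["Win", 1999, "US Open", "Hard", "Martina Hingis", "6–3, 7–6(7–4)"],
    ["Loss", 2001, "US Open", "Hard", "Venus Williams", "2–6, 4–6"],
    ["Win", 2002, "French Open", "Clay", "Venus Williams", "7–5, 6–3"],
]

# Guard against rows whose length does not match the header
bad_rows = [row for row in rows if len(row) != len(columns)]
if bad_rows:
    raise ValueError(f"{len(bad_rows)} row(s) do not match the header: {bad_rows}")

small_df = pd.DataFrame(rows, columns=columns)
# Because we converted the year with int(), pandas stores it as an integer column
print(small_df.dtypes)
```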

With pandas, we can filter this data, group it, and plot interesting relationships.

The pandas value_counts method counts how often each value occurs, and groupby splits the rows into groups that we can aggregate, e.g. to count the wins and losses per year:

df.Result.value_counts()
Result
Win     23
Loss    10
Name: count, dtype: int64
results_by_year = df.groupby(["Year", "Result"]).Tournament.count().unstack().fillna(0)
results_by_year
Result Loss Win
Year
1999 0.0 1.0
2001 1.0 0.0
2002 0.0 3.0
2003 0.0 2.0
2004 1.0 0.0
2005 0.0 1.0
2007 0.0 1.0
2008 1.0 1.0
2009 0.0 2.0
2010 0.0 2.0
2011 1.0 0.0
2012 0.0 2.0
2013 0.0 2.0
2014 0.0 1.0
2015 0.0 3.0
2016 2.0 1.0
2017 0.0 1.0
2018 2.0 0.0
2019 2.0 0.0
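
The counts above are floats because `unstack` first inserts `NaN` for missing combinations and `fillna(0)` then replaces them. As a sketch of two alternatives that keep integer counts, we can pass `fill_value=0` to `unstack`, or build the same table with `pd.crosstab`. The small frame below uses a subset of the year/result pairs from the data above.

```python
import pandas as pd

# Subset of the (Year, Result) pairs from the singles table
sample = pd.DataFrame(
    {
        "Year": [1999, 2001, 2002, 2002, 2002, 2008, 2008],
        "Result": ["Win", "Loss", "Win", "Win", "Win", "Loss", "Win"],
    }
)

# size() counts rows per group; fill_value=0 avoids the NaN/float detour
counts = sample.groupby(["Year", "Result"]).size().unstack(fill_value=0)

# crosstab builds the same frequency table directly
table = pd.crosstab(sample["Year"], sample["Result"])

print(counts)
```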

We can now plot these counts:

results_by_year.plot(kind="bar", grid=False)
<Axes: xlabel='Year'>
[Figure: bar chart of wins and losses per year]

Is there any significance to the court? Each tournament in this table is played on a single surface, so we count the results per tournament:

results_by_tournament = df.groupby(["Tournament", "Result"]).Tournament.count().unstack()
results_by_tournament
Result Loss Win
Tournament
Australian Open 1 7
French Open 1 3
US Open 4 6
Wimbledon 4 7
results_by_tournament.plot(kind="bar")
<Axes: xlabel='Tournament'>
[Figure: bar chart of wins and losses per tournament]
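
Raw counts are hard to compare when the number of finals differs between groups. One possible way to normalize them, sketched here with the counts from the table above, is the fraction of finals won. With so few matches per group, these fractions are only descriptive.

```python
import pandas as pd

# Win/loss counts per tournament, copied from the table above
counts = pd.DataFrame(
    {"Loss": [1, 1, 4, 4], "Win": [7, 3, 6, 7]},
    index=pd.Index(
        ["Australian Open", "French Open", "US Open", "Wimbledon"],
        name="Tournament",
    ),
)

# Fraction of finals won in each tournament
win_fraction = counts["Win"] / (counts["Win"] + counts["Loss"])
print(win_fraction.round(2).sort_values(ascending=False))
```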

We can even filter the data, e.g. to select opponents whom Williams faced at least twice:

results_by_op = df.groupby(["Opponents", "Result"]).Tournament.count().unstack()
results_by_op
Result Loss Win
Opponents
Agnieszka Radwańska NaN 1.0
Angelique Kerber 2.0 1.0
Bianca Andreescu 1.0 NaN
Caroline Wozniacki NaN 1.0
Dinara Safina NaN 1.0
Garbiñe Muguruza 1.0 1.0
Jelena Janković NaN 1.0
Justine Henin NaN 1.0
Lindsay Davenport NaN 1.0
Lucie Šafářová NaN 1.0
Maria Sharapova 1.0 3.0
Martina Hingis NaN 1.0
Naomi Osaka 1.0 NaN
Samantha Stosur 1.0 NaN
Simona Halep 1.0 NaN
Venus Williams 2.0 7.0
Vera Zvonareva NaN 1.0
Victoria Azarenka NaN 2.0
# replace missing counts with 0 so that wins and losses can be added;
# the sum then lets us exclude opponents met only once
results_by_op = results_by_op.fillna(0)
results_by_op
Result Loss Win
Opponents
Agnieszka Radwańska 0.0 1.0
Angelique Kerber 2.0 1.0
Bianca Andreescu 1.0 0.0
Caroline Wozniacki 0.0 1.0
Dinara Safina 0.0 1.0
Garbiñe Muguruza 1.0 1.0
Jelena Janković 0.0 1.0
Justine Henin 0.0 1.0
Lindsay Davenport 0.0 1.0
Lucie Šafářová 0.0 1.0
Maria Sharapova 1.0 3.0
Martina Hingis 0.0 1.0
Naomi Osaka 1.0 0.0
Samantha Stosur 1.0 0.0
Simona Halep 1.0 0.0
Venus Williams 2.0 7.0
Vera Zvonareva 0.0 1.0
Victoria Azarenka 0.0 2.0
(results_by_op.Win + results_by_op.Loss) > 1
Opponents
Agnieszka Radwańska    False
Angelique Kerber        True
Bianca Andreescu       False
Caroline Wozniacki     False
Dinara Safina          False
Garbiñe Muguruza        True
Jelena Janković        False
Justine Henin          False
Lindsay Davenport      False
Lucie Šafářová         False
Maria Sharapova         True
Martina Hingis         False
Naomi Osaka            False
Samantha Stosur        False
Simona Halep           False
Venus Williams          True
Vera Zvonareva         False
Victoria Azarenka       True
dtype: bool
multiple_meetings = results_by_op[(results_by_op.Win + results_by_op.Loss) > 1]
multiple_meetings.plot(kind="bar")
<Axes: xlabel='Opponents'>
[Figure: bar chart of wins and losses against opponents faced more than once]
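
The same filter can be written by summing across the columns, which keeps working if more result columns are added later. This is a small sketch using four rows copied from the table above; the variable names are chosen for illustration.

```python
import pandas as pd

# A few rows copied from the results_by_op table above
sample = pd.DataFrame(
    {"Loss": [2.0, 1.0, 1.0, 0.0], "Win": [1.0, 0.0, 3.0, 2.0]},
    index=pd.Index(
        [
            "Angelique Kerber",
            "Bianca Andreescu",
            "Maria Sharapova",
            "Victoria Azarenka",
        ],
        name="Opponents",
    ),
)

# Total number of finals against each opponent
meetings = sample.sum(axis=1)

# A boolean Series used as an index keeps only the rows where it is True
repeat_opponents = sample[meetings >= 2]
print(repeat_opponents)
```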

Exercise:#

Find images on the UiO page

  1. Go to https://no.wikipedia.org/wiki/Universitetet_i_Oslo

  2. Download the page with requests and parse it with BeautifulSoup

  3. Search for all images (using images = document.find_all('img')) and print them

  4. Keep only the images with the class "mw-file-element" (pass class_="mw-file-element" to find_all)

  5. Print the value of the “src” attribute for each image from step 4.

  6. See if you can display an image by pasting a result from step 5 into your web browser.

r = requests.get("https://no.wikipedia.org/wiki/Universitetet_i_Oslo")
html = r.text
print(html[:400])
<!DOCTYPE html>
<html class="client-nojs" lang="nb" dir="ltr">
<head>
<meta charset="UTF-8">
<title>Universitetet i Oslo – Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":[",\t."," \t,"],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","januar","februar","mars","april","mai","juni","jul
document = BeautifulSoup(html, "html.parser")
images = document.find_all("img", class_="mw-file-element")
len(images)
17
# print the first few matching tags
for image in images[:3]:
    print(image)
<img alt="Rediger på Wikidata" class="mw-file-element" data-file-height="20" data-file-width="20" decoding="async" height="10" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/8a/OOjs_UI_icon_edit-ltr-progressive.svg/10px-OOjs_UI_icon_edit-ltr-progressive.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/8a/OOjs_UI_icon_edit-ltr-progressive.svg/15px-OOjs_UI_icon_edit-ltr-progressive.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/8/8a/OOjs_UI_icon_edit-ltr-progressive.svg/20px-OOjs_UI_icon_edit-ltr-progressive.svg.png 2x" width="10"/>
<img alt="Rediger på Wikidata" class="mw-file-element" data-file-height="20" data-file-width="20" decoding="async" height="10" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/8a/OOjs_UI_icon_edit-ltr-progressive.svg/10px-OOjs_UI_icon_edit-ltr-progressive.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/8a/OOjs_UI_icon_edit-ltr-progressive.svg/15px-OOjs_UI_icon_edit-ltr-progressive.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/8/8a/OOjs_UI_icon_edit-ltr-progressive.svg/20px-OOjs_UI_icon_edit-ltr-progressive.svg.png 2x" width="10"/>
<img alt="Rediger på Wikidata" class="mw-file-element" data-file-height="20" data-file-width="20" decoding="async" height="10" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/8a/OOjs_UI_icon_edit-ltr-progressive.svg/10px-OOjs_UI_icon_edit-ltr-progressive.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/8a/OOjs_UI_icon_edit-ltr-progressive.svg/15px-OOjs_UI_icon_edit-ltr-progressive.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/8/8a/OOjs_UI_icon_edit-ltr-progressive.svg/20px-OOjs_UI_icon_edit-ltr-progressive.svg.png 2x" width="10"/>
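
To see how the class filter behaves without downloading a page, we can try it on a small fragment. The HTML below is made up for illustration, including the `example.org` addresses. It shows that `class_` matches a tag when any one of its classes matches, that a CSS selector gives the same result, and that `.get("src")` returns `None` where `image["src"]` would raise a `KeyError`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; file names and hosts are made up for illustration
snippet = """
<img class="mw-file-element" src="//upload.example.org/a.png" alt="A">
<img class="mw-logo-icon" src="/static/logo.svg" alt="logo">
<img class="mw-file-element thumb" src="//upload.example.org/b.png" alt="B">
<img alt="no class and no src">
"""

doc = BeautifulSoup(snippet, "html.parser")

# Keyword filter: matches tags that have the class among their classes
by_keyword = doc.find_all("img", class_="mw-file-element")
# Equivalent CSS selector
by_selector = doc.select("img.mw-file-element")

# .get() returns None instead of raising KeyError for a missing attribute
sources = [img.get("src") for img in by_keyword]
missing_src = doc.find_all("img")[-1].get("src")
print(sources)
```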
from IPython.display import HTML, display
for image in images:
    url = image["src"]
    if "://" in url:
        # already an absolute URL
        pass
    elif url.startswith("//"):
        # protocol-relative URL: add the scheme
        url = "https:" + url
    elif url.startswith("/"):
        # path relative to the site root: add scheme and host
        url = "https://no.wikipedia.org" + url
    else:
        # not an understood URL
        raise ValueError(f"I don't understand this url: {url}")
    # display the image inline without overwriting the page source stored in `html`
    display(HTML(f'<img src="{url}">'))
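
The manual case analysis above can also be delegated to the standard library. As a sketch, `urllib.parse.urljoin` resolves a possibly relative URL against the address of the page it was found on. Apart from the page address, the example paths below are made up for illustration.

```python
from urllib.parse import urljoin

# Address of the page the image tags were scraped from
PAGE_URL = "https://no.wikipedia.org/wiki/Universitetet_i_Oslo"

# Hypothetical src values covering the common cases
candidates = [
    "https://example.org/a.png",     # already absolute: returned unchanged
    "//upload.wikimedia.org/b.png",  # protocol-relative: scheme is taken from the page
    "/static/c.png",                 # root-relative: scheme and host are taken from the page
    "d.png",                         # relative to the directory of the current page
]

resolved = [urljoin(PAGE_URL, src) for src in candidates]
for src, url in zip(candidates, resolved):
    print(f"{src!r} -> {url!r}")
```

Unlike the loop above, `urljoin` also handles paths relative to the current page, so no `ValueError` branch is needed for that case.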