The Web, how does it work?!
The World-Wide Web (WWW) is a collection of standards for how most things on the Internet connect to each other:
URLs - how to specify what document to request from what server (https://uio-in3110.github.io/)
DNS - how to resolve names (uio-in3110.github.io) to machine addresses (185.199.111.153)
TCP/IP - the low-level protocol of sending bytes back and forth to machines
HTTP - the protocol for how to request documents (more later in the semester)
HTML - markup language for documents on the web (topic for today)
Browser - an application that ties all these things together
So I want to get the page at http://example.com/. That’s my URL.
It means I want to:
use protocol http
talk to the server at example.com
request the document at /
%%html
<iframe src="http://example.com/" style="width: 90%; height: 400px;"></iframe>
from urllib.parse import urlparse
url = "http://example.com/"
url_parsed = urlparse(url)
url_parsed
ParseResult(scheme='http', netloc='example.com', path='/', params='', query='', fragment='')
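A URL can carry more parts than the three used above. The following is a minimal sketch of how `urlparse` splits a fuller URL; the URL itself is hypothetical and was chosen only to exercise every component, and the `DEFAULT_PORTS` table is an illustrative helper, not part of `urllib`.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical URL, chosen only so that every component is present.
example = urlparse("https://example.com:8443/docs/page.html?lang=en&page=2#intro")

print(example.scheme)           # https
print(example.hostname)         # example.com
print(example.port)             # 8443
print(example.path)             # /docs/page.html
print(parse_qs(example.query))  # {'lang': ['en'], 'page': ['2']}
print(example.fragment)         # intro

# The port is optional. When it is absent, .port is None and a client
# falls back to the default for the scheme.
DEFAULT_PORTS = {"http": 80, "https": 443}  # illustrative helper table
plain = urlparse("http://example.com/")
port = plain.port or DEFAULT_PORTS[plain.scheme]
print(port)                     # 80
```

This is why Step 2 below can connect to port 80 even though the URL never mentions a port.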
Step 1. resolve the machine “example.com” so I can talk to it
import socket
hostname = url_parsed.hostname
ip_address = socket.gethostbyname(hostname)
ip_address
'93.184.216.34'
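`socket.gethostbyname` returns a single IPv4 address. A name can map to several addresses, including IPv6 ones, and `socket.getaddrinfo` reports all of them. The sketch below wraps it in a hypothetical helper, `resolve_all`; it looks up `localhost` so that it also runs without network access, but any hostname can be passed instead.

```python
import socket


def resolve_all(hostname, port=80):
    """Return the distinct IP addresses (IPv4 and IPv6) found for hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    addresses = []
    for family, socktype, proto, canonname, sockaddr in infos:
        ip = sockaddr[0]  # sockaddr is (ip, port) or (ip, port, flow, scope)
        if ip not in addresses:
            addresses.append(ip)
    return addresses


# "localhost" keeps the example offline; the result depends on the machine.
print(resolve_all("localhost"))
```

The addresses returned for a public name such as example.com can change over time, so the value shown above is only what the lookup returned when this notebook was run.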
Step 2. connect to the server using the ip address
s = socket.create_connection((ip_address, 80))
Step 3. send an “HTTP GET request” for the document /
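# The empty line before the closing quotes matters:
# a blank line tells the server that the request headers are complete.
message = f"""\
GET {url_parsed.path} HTTP/1.1
Host: {url_parsed.hostname}

"""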
message = message.replace("\n", "\r\n").encode("utf8")
message
b'GET / HTTP/1.1\r\nHost: example.com\r\n\r\n'
s.send(message)
37
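The request construction in Step 3 can be packaged as a small function. This is a sketch under a few assumptions: `build_get_request` is a hypothetical helper name, an empty path is treated as `/`, and a `Connection: close` header is added (it is not in the request above) to ask the server to close the connection once the response is sent.

```python
from urllib.parse import urlparse


def build_get_request(url):
    """Return the bytes of a minimal HTTP/1.1 GET request for url."""
    parts = urlparse(url)
    path = parts.path or "/"      # an empty path means the root document
    if parts.query:
        path += "?" + parts.query
    host = parts.hostname
    if parts.port:                # only name the port if the URL gave one
        host += f":{parts.port}"
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
        "",                       # blank line: end of the headers
        "",
    ]
    # Assumes an ASCII-only URL; other characters must be percent-encoded first.
    return "\r\n".join(lines).encode("ascii")


print(build_get_request("http://example.com/"))
# b'GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n'
```

Joining the lines with `\r\n` directly avoids the separate `replace("\n", "\r\n")` step.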
Step 4. receive the response
response_bytes = s.recv(65535)
response_bytes
b'HTTP/1.1 200 OK\r\nAccept-Ranges: bytes\r\nAge: 453618\r\nCache-Control: max-age=604800\r\nContent-Type: text/html; charset=UTF-8\r\nDate: Wed, 04 Oct 2023 10:24:29 GMT\r\nEtag: "3147526947"\r\nExpires: Wed, 11 Oct 2023 10:24:29 GMT\r\nLast-Modified: Thu, 17 Oct 2019 07:18:26 GMT\r\nServer: ECS (dcb/7EC9)\r\nVary: Accept-Encoding\r\nX-Cache: HIT\r\nContent-Length: 1256\r\n\r\n<!doctype html>\n<html>\n<head>\n <title>Example Domain</title>\n\n <meta charset="utf-8" />\n <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n <meta name="viewport" content="width=device-width, initial-scale=1" />\n <style type="text/css">\n body {\n background-color: #f0f0f2;\n margin: 0;\n padding: 0;\n font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n \n }\n div {\n width: 600px;\n margin: 5em auto;\n padding: 2em;\n background-color: #fdfdff;\n border-radius: 0.5em;\n box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n }\n a:link, a:visited {\n color: #38488f;\n text-decoration: none;\n }\n @media (max-width: 700px) {\n div {\n margin: 0 auto;\n width: auto;\n }\n }\n </style> \n</head>\n\n<body>\n<div>\n <h1>Example Domain</h1>\n <p>This domain is for use in illustrative examples in documents. You may use this\n domain in literature without prior coordination or asking for permission.</p>\n <p><a href="https://www.iana.org/domains/example">More information...</a></p>\n</div>\n</body>\n</html>\n'
response = response_bytes.decode("utf8")
print(response)
HTTP/1.1 200 OK
Accept-Ranges: bytes
Age: 453618
Cache-Control: max-age=604800
Content-Type: text/html; charset=UTF-8
Date: Wed, 04 Oct 2023 10:24:29 GMT
Etag: "3147526947"
Expires: Wed, 11 Oct 2023 10:24:29 GMT
Last-Modified: Thu, 17 Oct 2019 07:18:26 GMT
Server: ECS (dcb/7EC9)
Vary: Accept-Encoding
X-Cache: HIT
Content-Length: 1256

<!doctype html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8" />
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<style type="text/css">
body {
background-color: #f0f0f2;
margin: 0;
padding: 0;
font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
}
div {
width: 600px;
margin: 5em auto;
padding: 2em;
background-color: #fdfdff;
border-radius: 0.5em;
box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
}
a:link, a:visited {
color: #38488f;
text-decoration: none;
}
@media (max-width: 700px) {
div {
margin: 0 auto;
width: auto;
}
}
</style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
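A single `s.recv(65535)` happened to return the whole response here, but `recv` may return fewer bytes than the server sent, so larger pages can arrive in several pieces. One common approach is to keep reading until the peer closes the connection. The sketch below uses a hypothetical helper, `recv_all`, and demonstrates it offline on a connected pair of local sockets rather than on the connection to example.com.

```python
import socket


def recv_all(sock, chunk_size=1024):
    """Read from sock until the peer closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(chunk_size)
        if not chunk:             # empty bytes: the peer closed the connection
            break
        chunks.append(chunk)
    return b"".join(chunks)


# Offline demonstration: two local sockets that are connected to each other.
left, right = socket.socketpair()
left.sendall(b"x" * 3000)         # more than one chunk_size worth of data
left.close()                      # closing signals "no more data"
data = recv_all(right)
right.close()
print(len(data))                  # 3000
```

Reading until the connection closes only works if the server actually closes it, for instance when the request includes a `Connection: close` header, which the request above does not send. With HTTP/1.1 keep-alive the loop would wait for more data, and a client would instead read exactly as many body bytes as the `Content-Length` header announces.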
Step 5. extract the content from the message: the head (status line and headers) is separated from the content by a blank line
head, _, content = response.partition("\r\n\r\n")
print(content)
<!doctype html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8" />
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<style type="text/css">
body {
background-color: #f0f0f2;
margin: 0;
padding: 0;
font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
}
div {
width: 600px;
margin: 5em auto;
padding: 2em;
background-color: #fdfdff;
border-radius: 0.5em;
box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
}
a:link, a:visited {
color: #38488f;
text-decoration: none;
}
@media (max-width: 700px) {
div {
margin: 0 auto;
width: auto;
}
}
</style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
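The `head` part is discarded above, but it has structure too: a status line followed by `Name: value` header lines. A minimal sketch of parsing it follows; `parse_head` is a hypothetical helper, the sample head is a shortened copy of the response shown earlier, and repeated headers simply overwrite each other, which a real client would handle more carefully.

```python
def parse_head(head):
    """Split the head of an HTTP response into status parts and headers."""
    status_line, *header_lines = head.split("\r\n")
    version, status_code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lower case.
        headers[name.strip().lower()] = value.strip()
    return version, int(status_code), reason, headers


# Shortened copy of the head received in Step 4.
sample_head = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Content-Length: 1256"
)
version, status, reason, headers = parse_head(sample_head)
print(status, reason)             # 200 OK
print(headers["content-length"])  # 1256
```

The status code tells the client whether the request succeeded, and `Content-Type` tells it how to interpret the content, which is why Step 6 can treat the content as HTML.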
Step 6. interpret the content as HTML
from bs4 import BeautifulSoup
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning
page = BeautifulSoup(content, "html.parser")
page
<!DOCTYPE html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style type="text/css">
body {
background-color: #f0f0f2;
margin: 0;
padding: 0;
font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
}
div {
width: 600px;
margin: 5em auto;
padding: 2em;
background-color: #fdfdff;
border-radius: 0.5em;
box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
}
a:link, a:visited {
color: #38488f;
text-decoration: none;
}
@media (max-width: 700px) {
div {
margin: 0 auto;
width: auto;
}
}
</style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
Finally, do something with the page
new_url = page.find("a")["href"]
new_url
'https://www.iana.org/domains/example'
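A browser does the same thing for every link on the page and resolves relative links against the URL of the page itself. The sketch below shows that idea using only the standard library's `html.parser` in place of BeautifulSoup, so it runs without extra packages. `LinkCollector` is a hypothetical class name, and the `/about` link in the sample HTML is made up to show a relative link; it is not on the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # urljoin leaves absolute URLs alone and resolves relative ones
                self.links.append(urljoin(self.base_url, href))


# The first link is from the page above; the second one is made up.
sample_html = (
    '<p><a href="https://www.iana.org/domains/example">More information...</a> '
    '<a href="/about">About</a></p>'
)
collector = LinkCollector("http://example.com/")
collector.feed(sample_html)
print(collector.links)
# ['https://www.iana.org/domains/example', 'http://example.com/about']
```

Each collected URL could then go through the same steps again: parse, resolve, connect, request, receive.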
More on HTML next.