What is pandas good for?#

Working with (large) data sets and created automated data processes.

Pandas is extensively used to prepare data in data science (machine learning, data analytics, …)

Examples:

  • Import and export data into standard formats (CSV, Excel, Latex, ..).

  • Combine with Numpy for advanced computations or Matplotlib for visualisations.

  • Calculate statistics and answer questions about the data, like

    • What’s the average, median, max, or min of each column?

    • Does column A correlate with column B?

    • What does the distribution of data in column C look like?

  • Clean up data (e.g. fill out missing information and fix inconsistent formatting) and merge multiple data sets into one common dataset.

../../_images/pressure.png
import pandas as pd
import pylab as pl

First, a short recap of the video session

The two fundamental data-structures in pandas are Series and DataFrame:

s = pd.Series([1, 2, 3])
s
0    1
1    2
2    3
dtype: int64
s = pd.Series([1, 2, 3], index=["a", "b", "c"])
s
a    1
b    2
c    3
dtype: int64
dic = {"a": 1, "b": 2, "c": 3}
s = pd.Series(dic)
s
a    1
b    2
c    3
dtype: int64
s["a"]
1
dic = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}
s = pd.DataFrame(dic)
s
a b c
0 1 3 5
1 2 4 6
s["a"]
0    1
1    2
Name: a, dtype: int64
s["a"][0]
1
s.columns
Index(['a', 'b', 'c'], dtype='object')

Reading data from file#

Now assume that we have some pressure data obtained from a sensor, as shown below ../../_images/pressure.png

df = pd.read_csv("data/pressure.csv")
df
Unnamed: 0 t p
0 0 0.000 -1.077684
1 1 0.005 -0.933488
2 2 0.010 -0.956377
3 3 0.015 -0.963243
4 4 0.020 -0.864824
... ... ... ...
5995 5995 29.975 2.296034
5996 5996 29.980 2.312056
5997 5997 29.985 2.488295
5998 5998 29.990 2.570692
5999 5999 29.995 2.472273

6000 rows × 3 columns

t = df["t"]
p = df["p"]
p
0      -1.077684
1      -0.933488
2      -0.956377
3      -0.963243
4      -0.864824
          ...   
5995    2.296034
5996    2.312056
5997    2.488295
5998    2.570692
5999    2.472273
Name: p, Length: 6000, dtype: float64
pl.plot(t, p)
pl.show()
../../_images/0c5d7073fb30ac3bca8077e44c9dcbd77afca8cfbfe6a4513bf17556bdb50311.png

This way of extracting data from the DataFrame is useful for futher computations with t and p. For plotting purposes only, the DataFrame has its own plot-function:

df.plot()
pl.show()
../../_images/448f4e724a693567d8d4a0b37ea849161af3adec7cff68cc3bd2c0fb8902a2c1.png
df.plot("t", "p")
pl.show()
../../_images/b2bce240daca5de53c6185d693a156721e338eb3c4aa6b39b32830b8f1d88819.png

How to write data to csv#

t = pl.linspace(0, 2 * pl.pi, 200)
p = pl.sin(2 * pl.pi * t)
pl.plot(t, p)
pl.show()
../../_images/e5069a414db58ce860aa633f757709b29c172c9acfb24b268037a2483ea69ae5.png
data = pl.array([t, p])

When dealing with table data, you should always consider whether to use the .transpose() of a matrix

df = pd.DataFrame(data.transpose(), columns=["t", "p"])
df
t p
0 0.000000 0.000000
1 0.031574 0.197085
2 0.063148 0.386439
3 0.094721 0.560635
4 0.126295 0.712838
... ... ...
195 6.156890 0.833697
196 6.188464 0.926180
197 6.220038 0.982332
198 6.251612 0.999949
199 6.283185 0.978341

200 rows × 2 columns

df.to_csv("pressure_computed.csv")

Adding a column to the existing DataFrame:#

v = pl.cos(2 * pl.pi * t)
v
array([ 1.        ,  0.98038635,  0.92231478,  0.82806328,  0.70132909,
        0.54708365,  0.37137759,  0.18110338, -0.01627502, -0.213015  ,
       -0.40139897, -0.57403714, -0.72415738, -0.84587087, -0.93440313,
       -0.98628127, -0.99947025, -0.9734527 , -0.90924922, -0.80937835,
       -0.67775774, -0.51955052, -0.34096273, -0.14899989,  0.04880781,
        0.24470092,  0.43099506,  0.60038243,  0.74621842,  0.86278226,
        0.94550148,  0.99113122,  0.99788155,  0.96548768,  0.89522032,
        0.78983588,  0.6534683 ,  0.49146692,  0.31018662,  0.11673853,
       -0.0812889 , -0.27612758, -0.46013452, -0.62609162, -0.76748884,
       -0.87877953, -0.95559807, -0.99493106, -0.99523559, -0.95649971,
       -0.88024292, -0.76945657, -0.62848651, -0.46286261, -0.27908187,
       -0.08435349,  0.11368385,  0.30726168,  0.48878646,  0.65113746,
        0.7879461 ,  0.89384572,  0.96468219,  0.99767677,  0.99153518,
        0.94649833,  0.8643329 ,  0.74826202,  0.60283883,  0.4337679 ,
        0.24768142,  0.05187907, -0.14595836, -0.33807024, -0.51692053,
       -0.67549342, -0.80756852, -0.90796489, -0.97274423, -0.99936544,
       -0.98678423, -0.93549413, -0.84750712, -0.72627468, -0.57655245,
       -0.40421361, -0.21601856, -0.01934969,  0.17807823,  0.36852061,
        0.54450692,  0.69913369,  0.82633533,  0.92112205,  0.97977564,
        0.99999527,  0.98098778,  0.92349877,  0.8297834 ,  0.70351786,
        0.5496552 ,  0.37423105,  0.18412683, -0.0132002 , -0.21000942,
       -0.39858053, -0.5715164 , -0.72203322, -0.84422662, -0.93330329,
       -0.98576898, -0.9995656 , -0.97415196, -0.91052496, -0.81118052,
       -0.68001565, -0.52217559, -0.34385199, -0.15204001,  0.0457361 ,
        0.2417181 ,  0.42821815,  0.59792036,  0.74416776,  0.86122346,
        0.94449569,  0.99071789,  0.99807689,  0.96628403,  0.89658645,
        0.79171819,  0.65579296,  0.49414274,  0.31310862,  0.1197921 ,
       -0.07822354, -0.27317068, -0.45740207, -0.62369081, -0.76551384,
       -0.87730783, -0.95468738, -0.99461712, -0.99553071, -0.95739231,
       -0.88169799, -0.77141703, -0.63087545, -0.46558633, -0.28203351,
       -0.08741728,  0.1106281 ,  0.30433384,  0.48610138,  0.64880047,
        0.78604886,  0.89246268,  0.96386758,  0.99746256,  0.99192976,
        0.94748624,  0.86587537,  0.75029855,  0.60528953,  0.43653664,
        0.25065959,  0.05494983, -0.14291545, -0.33517455, -0.51428565,
       -0.67322271, -0.80575106, -0.90667196, -0.97202656, -0.99925118,
       -0.98727786, -0.93657629, -0.84913535, -0.72838512, -0.5790623 ,
       -0.40702443, -0.21902008, -0.02242417,  0.17505139,  0.36566015,
        0.54192504,  0.69693168,  0.82459956,  0.91992062,  0.97915568,
        0.99998109,  0.98157993,  0.92467404,  0.83149567,  0.70569997,
        0.55222156,  0.37708098,  0.18714853, -0.01012525, -0.20700185])
df["v"] = v
df
t p v
0 0.000000 0.000000 1.000000
1 0.031574 0.197085 0.980386
2 0.063148 0.386439 0.922315
3 0.094721 0.560635 0.828063
4 0.126295 0.712838 0.701329
... ... ... ...
195 6.156890 0.833697 0.552222
196 6.188464 0.926180 0.377081
197 6.220038 0.982332 0.187149
198 6.251612 0.999949 -0.010125
199 6.283185 0.978341 -0.207002

200 rows × 3 columns

It is possible to create an empty DataFrame and just add the columns whenever you like:

empty_df = pd.DataFrame()
empty_df["t"] = t
empty_df["p"] = p
empty_df["v"] = v
empty_df
t p v
0 0.000000 0.000000 1.000000
1 0.031574 0.197085 0.980386
2 0.063148 0.386439 0.922315
3 0.094721 0.560635 0.828063
4 0.126295 0.712838 0.701329
... ... ... ...
195 6.156890 0.833697 0.552222
196 6.188464 0.926180 0.377081
197 6.220038 0.982332 0.187149
198 6.251612 0.999949 -0.010125
199 6.283185 0.978341 -0.207002

200 rows × 3 columns

empty_df.plot("t", ["p", "v"])  # pl.plot(t,p,t,v) in matplotlib
<AxesSubplot:xlabel='t'>
../../_images/1e9b45072843482ddeacae040514d2129904c761e7b01e9e17bd21e461c72a8b.png

Exercise

  1. Create uniformly sampled time points between 0 and 30.

  2. Generate positional data in the xy-plane given by [0.4*t + cos(t), sin(t)]

  3. Create a DataFrame consisting of the three columns t, x and y

  4. plot x versus y using first the matplotlib plot function and then the DataFrame plot-method

Velocity-data can be computed by \(v_{x_i} = \frac{x_{i+1} - x_i}{t_{i+1} - t_i}\), \(v_{y_i} = \frac{y_{i+1} - y_i}{t_{i+1} - t_i}\) 5. Compute the velocity data for x and y and add those as columns in the DataFrame

A real world example. Oslo bysykkel data#

We go to https://oslobysykkel.no/apne-data/historisk (you can also get there by “oslo bysykkel data historisk” on google). We download the September data as CSV.

import zipfile
from pathlib import Path

import requests


def download_file(filename, url):
    path = Path(filename)
    if path.exists():
        return path
    print(f"Downloading {path}")
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with path.open("wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    return path


def download_trips(year, month):
    dest = Path("data") / f"oslo_bike_{year}_{month:02}.csv"
    return download_file(
        dest,
        f"https://data.urbansharing.com/oslobysykkel.no/trips/v1/{year}/{month:02}.csv",
    )


trip_csv = download_trips(2021, 9)
import pandas as pd
import pylab as pl

trips = pd.read_csv(trip_csv)
trips
started_at ended_at duration start_station_id start_station_name start_station_description start_station_latitude start_station_longitude end_station_id end_station_name end_station_description end_station_latitude end_station_longitude
0 2021-09-01 03:00:15.478000+00:00 2021-09-01 03:20:12.685000+00:00 1197 620 Bislettgata ved Sofies Gate 59.923774 10.734713 735 Oslo Hospital ved trikkestoppet 59.903213 10.767344
1 2021-09-01 03:03:04.080000+00:00 2021-09-01 03:06:10.732000+00:00 186 422 St. Hanshaugen langs Waldemar Thranes gate 59.923703 10.740542 499 Bjerregaards gate ovenfor Fredrikke Qvams gate 59.925488 10.746058
2 2021-09-01 03:16:41.288000+00:00 2021-09-01 03:25:20.740000+00:00 519 424 Birkelunden langs Seilduksgata 59.925611 10.760926 478 Jernbanetorget Europarådets plass 59.911901 10.749929
3 2021-09-01 03:21:55.708000+00:00 2021-09-01 03:28:20.138000+00:00 384 446 Bislett Stadion ved rundkjøringen 59.925471 10.731219 478 Jernbanetorget Europarådets plass 59.911901 10.749929
4 2021-09-01 03:26:16.090000+00:00 2021-09-01 03:30:30.133000+00:00 254 514 Sofienberggata ved Sars gate 59.921206 10.769989 542 Grünerhagen Nord ved Sofienberggata 59.922426 10.755427
... ... ... ... ... ... ... ... ... ... ... ... ... ...
190000 2021-09-30 22:56:22.073000+00:00 2021-09-30 23:11:50.892000+00:00 928 415 Sinsenveien ved Kongehellegata 59.929542 10.781053 537 St. Olavs gate ved Pilestredet 59.917968 10.738629
190001 2021-09-30 22:57:17.478000+00:00 2021-09-30 23:01:52.188000+00:00 274 460 Botanisk Hage sør langs Jens Bjelkes gate 59.915418 10.769330 475 Hausmanns bru langs Nylandsveien 59.914651 10.759872
190002 2021-09-30 22:57:33.599000+00:00 2021-09-30 23:04:41.205000+00:00 427 399 Uelands gate Ved Ulvetrappen (Ilatrappen) 59.929545 10.748986 622 Pilestredet 63 ved trikkestoppet 59.923883 10.731363
190003 2021-09-30 22:58:05.623000+00:00 2021-09-30 23:05:39.679000+00:00 454 465 Bjørvika under broen Nylandsveien 59.909006 10.756180 390 Saga Kino langs Olav Vs gate 59.914240 10.732771
190004 2021-09-30 22:58:36.872000+00:00 2021-09-30 23:03:06.333000+00:00 269 412 Jakob kirke langs Torggata 59.917866 10.754898 442 Vulkan ved Maridalsveien 59.922510 10.751010

190005 rows × 13 columns

We can work with the data using normal pylab (and numpy functions):

pl.hist(trips["duration"], range=[0, 1500])
(array([ 5658., 28023., 40349., 35711., 26288., 17542., 11016.,  6854.,
         4433.,  2907.]),
 array([   0.,  150.,  300.,  450.,  600.,  750.,  900., 1050., 1200.,
        1350., 1500.]),
 <BarContainer object of 10 artists>)
../../_images/524c1b0e9819646b01dcaf8a564a65cb8a485b7d01d93b91fd41b45364e09891.png

We can also use DataFrame built-in functions:

trips.sort_values("duration")
started_at ended_at duration start_station_id start_station_name start_station_description start_station_latitude start_station_longitude end_station_id end_station_name end_station_description end_station_latitude end_station_longitude
54120 2021-09-09 06:29:40.293000+00:00 2021-09-09 06:30:42.103000+00:00 61 381 Grønlands torg ved Tøyenbekken 59.912520 10.762240 381 Grønlands torg ved Tøyenbekken 59.912520 10.762240
150033 2021-09-23 16:57:04.305000+00:00 2021-09-23 16:58:05.488000+00:00 61 421 Alexander Kiellands Plass langs Maridalsveien 59.928067 10.751203 421 Alexander Kiellands Plass langs Maridalsveien 59.928067 10.751203
53582 2021-09-09 05:53:12.313000+00:00 2021-09-09 05:54:14.047000+00:00 61 623 7 Juni Plassen langs Henrik Ibsens gate 59.915060 10.731272 623 7 Juni Plassen langs Henrik Ibsens gate 59.915060 10.731272
70038 2021-09-11 09:08:26.043000+00:00 2021-09-11 09:09:27.337000+00:00 61 2304 Hedmarksgata ved Jordal Amfi 59.911784 10.783884 2304 Hedmarksgata ved Jordal Amfi 59.911784 10.783884
54483 2021-09-09 06:53:24.342000+00:00 2021-09-09 06:54:25.653000+00:00 61 474 Blindern studentparkering rett ved Blindern Studenterhjem 59.940874 10.720779 474 Blindern studentparkering rett ved Blindern Studenterhjem 59.940874 10.720779
... ... ... ... ... ... ... ... ... ... ... ... ... ...
52827 2021-09-08 21:45:08.196000+00:00 2021-09-09 05:06:05.736000+00:00 26457 1755 Aker Brygge ved trikkestopp 59.911184 10.730035 1755 Aker Brygge ved trikkestopp 59.911184 10.730035
92728 2021-09-14 22:37:30.359000+00:00 2021-09-15 06:03:14.857000+00:00 26744 525 Myraløkka Øst ved Bentsenbrua 59.937205 10.760581 597 Fredensborg ved rundkjøringen 59.920995 10.750358
94955 2021-09-15 07:51:00.736000+00:00 2021-09-15 15:25:43.368000+00:00 27282 468 Skillebekk langs Drammensveien 59.912793 10.710103 390 Saga Kino langs Olav Vs gate 59.914240 10.732771
8884 2021-09-02 07:15:39.199000+00:00 2021-09-02 15:05:55.621000+00:00 28216 615 Munkedamsveien ved Haakon VIIs gate 59.913523 10.730106 580 Georg Morgenstiernes hus ved Moltke Moes vei 59.939026 10.723003
125087 2021-09-20 06:48:08.055000+00:00 2021-09-20 16:12:09.094000+00:00 33841 506 Botanisk Hage vest ved Blytts gate 59.920128 10.768875 569 Botanisk hage sør-vest ved Sars' gate 59.917835 10.766374

190005 rows × 13 columns

trips["start_station_latitude"]
0         59.923774
1         59.923703
2         59.925611
3         59.925471
4         59.921206
            ...    
190000    59.929542
190001    59.915418
190002    59.929545
190003    59.909006
190004    59.917866
Name: start_station_latitude, Length: 190005, dtype: float64

Exercise#

  1. Make a scatter-plot showing the position (longitude, latitude) of stations in Oslo. It is OK to plot a station several times. Use matplotlib or the built-in DataFrame.plot.scatter

  2. (Bonus) Make a scatter-plot with different size of the cirles, and let the size be dependent on how popular a station is (i.e. how many trips were started at the given station)

pl.scatter(trips["start_station_longitude"], trips["start_station_latitude"])
pl.show()
../../_images/b630642239cfb74471421c60fff0fa517d4ddb7a1ce40755acfd00c9c7e9556c.png

Let’s see if we can find information about how popular the different start stations are

trips["start_station_id"]
0         620
1         422
2         424
3         446
4         514
         ... 
190000    415
190001    460
190002    399
190003    465
190004    412
Name: start_station_id, Length: 190005, dtype: int64

Let’s first try the numpy-way:

stations = pl.unique(trips["start_station_id"])
stations
array([ 377,  378,  380,  381,  382,  383,  384,  385,  387,  388,  389,
        390,  391,  392,  393,  394,  396,  397,  398,  399,  400,  401,
        402,  403,  404,  405,  406,  407,  408,  409,  410,  411,  412,
        413,  414,  415,  416,  417,  418,  420,  421,  422,  423,  424,
        425,  426,  427,  428,  429,  430,  431,  432,  433,  434,  435,
        436,  437,  438,  439,  440,  441,  442,  443,  444,  445,  446,
        447,  448,  449,  450,  451,  452,  453,  454,  455,  456,  457,
        458,  459,  460,  461,  462,  463,  464,  465,  466,  468,  469,
        470,  471,  472,  473,  474,  475,  476,  477,  478,  479,  480,
        481,  482,  483,  484,  486,  487,  488,  489,  491,  493,  494,
        495,  496,  497,  498,  499,  500,  501,  502,  503,  505,  506,
        507,  508,  509,  511,  512,  513,  514,  516,  518,  519,  521,
        522,  523,  524,  525,  526,  527,  529,  530,  531,  532,  533,
        534,  535,  536,  537,  540,  541,  542,  543,  545,  547,  548,
        549,  550,  551,  552,  553,  554,  555,  556,  557,  558,  559,
        560,  561,  562,  563,  564,  565,  567,  568,  569,  570,  572,
        573,  574,  575,  577,  578,  579,  580,  581,  582,  583,  584,
        585,  586,  587,  588,  589,  590,  591,  592,  593,  594,  595,
        596,  597,  598,  599,  600,  601,  602,  603,  605,  606,  607,
        608,  609,  610,  611,  612,  613,  614,  615,  616,  617,  618,
        619,  620,  621,  622,  623,  624,  625,  626,  627,  735,  737,
        738,  739,  742,  744,  746,  748,  787,  970, 1009, 1023, 1101,
       1755, 1919, 2270, 2280, 2304, 2305, 2306, 2307, 2308, 2309, 2315])
stations[0] == trips[
    "start_station_id"
]  # find out if trips started at the given station
0         False
1         False
2         False
3         False
4         False
          ...  
190000    False
190001    False
190002    False
190003    False
190004    False
Name: start_station_id, Length: 190005, dtype: bool
(
    stations[0] == trips["start_station_id"]
).sum()  # sum all trips that started at the given station
680

Now we generalize the line above to create a list of number of trips for each station

number_of_trips = [
    (stations[i] == trips["start_station_id"]).sum() for i in range(len(stations))
]
number_of_trips
[680,
 562,
 1097,
 927,
 603,
 1115,
 1745,
 1450,
 585,
 628,
 400,
 1134,
 1431,
 606,
 822,
 994,
 1391,
 1592,
 2308,
 777,
 981,
 490,
 889,
 818,
 772,
 365,
 855,
 1020,
 2189,
 573,
 1004,
 724,
 1217,
 1548,
 684,
 316,
 578,
 674,
 427,
 950,
 2831,
 690,
 1278,
 1411,
 430,
 1050,
 597,
 285,
 376,
 629,
 813,
 417,
 613,
 855,
 814,
 838,
 936,
 1340,
 823,
 1355,
 217,
 1188,
 1495,
 1739,
 200,
 2038,
 1386,
 622,
 515,
 908,
 585,
 659,
 666,
 96,
 743,
 665,
 675,
 836,
 536,
 1572,
 369,
 1045,
 788,
 1606,
 1082,
 281,
 964,
 658,
 582,
 324,
 512,
 567,
 578,
 590,
 993,
 644,
 1632,
 1344,
 1947,
 454,
 243,
 548,
 765,
 588,
 784,
 703,
 1717,
 461,
 1437,
 677,
 671,
 755,
 565,
 211,
 1532,
 728,
 433,
 1046,
 1213,
 565,
 599,
 1257,
 380,
 313,
 1053,
 914,
 332,
 881,
 442,
 568,
 1086,
 1320,
 533,
 466,
 337,
 814,
 896,
 479,
 428,
 794,
 302,
 269,
 350,
 556,
 790,
 282,
 1179,
 769,
 377,
 848,
 582,
 323,
 418,
 508,
 904,
 197,
 2199,
 478,
 850,
 477,
 478,
 507,
 1356,
 837,
 494,
 74,
 716,
 835,
 785,
 1116,
 132,
 633,
 528,
 469,
 179,
 585,
 67,
 623,
 801,
 344,
 781,
 1339,
 1155,
 894,
 1440,
 851,
 630,
 628,
 622,
 634,
 168,
 355,
 1147,
 81,
 267,
 207,
 426,
 143,
 498,
 934,
 1487,
 493,
 719,
 179,
 159,
 1096,
 301,
 215,
 2132,
 680,
 477,
 563,
 826,
 467,
 480,
 797,
 349,
 298,
 769,
 617,
 547,
 2056,
 460,
 561,
 931,
 704,
 721,
 274,
 126,
 1026,
 1126,
 311,
 929,
 532,
 767,
 193,
 417,
 760,
 573,
 117,
 869,
 153,
 1203,
 88,
 331,
 403,
 425,
 576,
 170,
 604,
 342,
 214,
 838]

Now let’s try some pandas: For the only purpose of counting trips per station we may use .value_counts()

number_of_trips_pandas = trips["start_station_id"].value_counts()
number_of_trips_pandas
421     2831
398     2308
551     2199
408     2189
607     2132
        ... 
454       96
1919      88
591       81
560       74
573       67
Name: start_station_id, Length: 253, dtype: int64

Now let’s say in our case we want all the information we can get about the start station, not only the number of trips. To group the data by start_station_id and count, while still extracting other relevant data for the start station we can use groupby()

station_data = trips.groupby(
    [
        "start_station_id",
        "start_station_name",
        "start_station_description",
        "start_station_latitude",
        "start_station_longitude",
    ]
).count()
station_data
started_at ended_at duration end_station_id end_station_name end_station_description end_station_latitude end_station_longitude
start_station_id start_station_name start_station_description start_station_latitude start_station_longitude
377 Tøyenparken ved Caltexløkka 59.915667 10.777566 680 680 680 680 680 680 680 680
378 Colosseum Kino langs Fridtjof Nansens vei 59.929843 10.711285 562 562 562 562 562 562 562 562
380 Bentsebrugata rett over busstoppet 59.939230 10.759170 1097 1097 1097 1097 1097 1097 1097 1097
381 Grønlands torg ved Tøyenbekken 59.912520 10.762240 927 927 927 927 927 927 927 927
382 Stensgata ved trikkestoppet 59.929586 10.732839 603 603 603 603 603 603 603 603
... ... ... ... ... ... ... ... ... ... ... ... ...
2306 Økern Portal ved Dag Hammarskjölds vei 59.930972 10.801830 170 170 170 170 170 170 170 170
2307 Domus Athletica ved Vestgrensa Studentby 59.946219 10.724626 604 604 604 604 604 604 604 604
2308 Gunerius motsatt side av Torggata fra Gunerius bygget 59.914638 10.753428 342 342 342 342 342 342 342 342
2309 Ulven Torg ved ulvenveien 59.924960 10.812061 214 214 214 214 214 214 214 214
2315 Rostockgata ved Operagata 59.906890 10.760307 838 838 838 838 838 838 838 838

254 rows × 8 columns

station_data = station_data.reset_index()
station_data
start_station_id start_station_name start_station_description start_station_latitude start_station_longitude started_at ended_at duration end_station_id end_station_name end_station_description end_station_latitude end_station_longitude
0 377 Tøyenparken ved Caltexløkka 59.915667 10.777566 680 680 680 680 680 680 680 680
1 378 Colosseum Kino langs Fridtjof Nansens vei 59.929843 10.711285 562 562 562 562 562 562 562 562
2 380 Bentsebrugata rett over busstoppet 59.939230 10.759170 1097 1097 1097 1097 1097 1097 1097 1097
3 381 Grønlands torg ved Tøyenbekken 59.912520 10.762240 927 927 927 927 927 927 927 927
4 382 Stensgata ved trikkestoppet 59.929586 10.732839 603 603 603 603 603 603 603 603
... ... ... ... ... ... ... ... ... ... ... ... ... ...
249 2306 Økern Portal ved Dag Hammarskjölds vei 59.930972 10.801830 170 170 170 170 170 170 170 170
250 2307 Domus Athletica ved Vestgrensa Studentby 59.946219 10.724626 604 604 604 604 604 604 604 604
251 2308 Gunerius motsatt side av Torggata fra Gunerius bygget 59.914638 10.753428 342 342 342 342 342 342 342 342
252 2309 Ulven Torg ved ulvenveien 59.924960 10.812061 214 214 214 214 214 214 214 214
253 2315 Rostockgata ved Operagata 59.906890 10.760307 838 838 838 838 838 838 838 838

254 rows × 13 columns

station_data = station_data.drop(columns=station_data.columns[-7:])
station_data
start_station_id start_station_name start_station_description start_station_latitude start_station_longitude started_at
0 377 Tøyenparken ved Caltexløkka 59.915667 10.777566 680
1 378 Colosseum Kino langs Fridtjof Nansens vei 59.929843 10.711285 562
2 380 Bentsebrugata rett over busstoppet 59.939230 10.759170 1097
3 381 Grønlands torg ved Tøyenbekken 59.912520 10.762240 927
4 382 Stensgata ved trikkestoppet 59.929586 10.732839 603
... ... ... ... ... ... ...
249 2306 Økern Portal ved Dag Hammarskjölds vei 59.930972 10.801830 170
250 2307 Domus Athletica ved Vestgrensa Studentby 59.946219 10.724626 604
251 2308 Gunerius motsatt side av Torggata fra Gunerius bygget 59.914638 10.753428 342
252 2309 Ulven Torg ved ulvenveien 59.924960 10.812061 214
253 2315 Rostockgata ved Operagata 59.906890 10.760307 838

254 rows × 6 columns

station_data = station_data.rename(columns={"started_at": "started_trips"})
station_data = station_data.set_index("start_station_id")
station_data
start_station_name start_station_description start_station_latitude start_station_longitude started_trips
start_station_id
377 Tøyenparken ved Caltexløkka 59.915667 10.777566 680
378 Colosseum Kino langs Fridtjof Nansens vei 59.929843 10.711285 562
380 Bentsebrugata rett over busstoppet 59.939230 10.759170 1097
381 Grønlands torg ved Tøyenbekken 59.912520 10.762240 927
382 Stensgata ved trikkestoppet 59.929586 10.732839 603
... ... ... ... ... ...
2306 Økern Portal ved Dag Hammarskjölds vei 59.930972 10.801830 170
2307 Domus Athletica ved Vestgrensa Studentby 59.946219 10.724626 604
2308 Gunerius motsatt side av Torggata fra Gunerius bygget 59.914638 10.753428 342
2309 Ulven Torg ved ulvenveien 59.924960 10.812061 214
2315 Rostockgata ved Operagata 59.906890 10.760307 838

254 rows × 5 columns

station_data.sort_values("started_trips", ascending=False)
start_station_name start_station_description start_station_latitude start_station_longitude started_trips
start_station_id
421 Alexander Kiellands Plass langs Maridalsveien 59.928067 10.751203 2831
398 Ringnes Park ved Sannergata 59.928434 10.759430 2308
551 Olaf Ryes plass langs Sofienberggata 59.922425 10.758182 2198
408 Tøyen skole forsiden av skolebygget 59.914943 10.773977 2189
607 Marcus Thranes gate ved Akerselva 59.932772 10.758595 2132
... ... ... ... ... ...
454 Furulund langs Vækerøveien 59.919810 10.651118 96
1919 Kværnerveien Ved Kværnerveien 5 59.905911 10.778592 88
591 Grenseveien ved Togbru 59.924645 10.781727 81
560 Gaustad T-bane langs Slemdalsveien 59.945955 10.710392 74
573 Tordenskiolds gate ved Rådhusgata 59.911776 10.735113 67

254 rows × 5 columns

ended_trips = trips["end_station_id"].value_counts()
ended_trips
421     2818
551     2702
443     2676
489     2644
480     2479
        ... 
591       80
1919      71
498       60
560       51
601       43
Name: end_station_id, Length: 253, dtype: int64
station_data["ended_trips"] = ended_trips
station_data.sort_values("started_trips", ascending=False)
start_station_name start_station_description start_station_latitude start_station_longitude started_trips ended_trips
start_station_id
421 Alexander Kiellands Plass langs Maridalsveien 59.928067 10.751203 2831 2818
398 Ringnes Park ved Sannergata 59.928434 10.759430 2308 2308
551 Olaf Ryes plass langs Sofienberggata 59.922425 10.758182 2198 2702
408 Tøyen skole forsiden av skolebygget 59.914943 10.773977 2189 2183
607 Marcus Thranes gate ved Akerselva 59.932772 10.758595 2132 1607
... ... ... ... ... ... ...
454 Furulund langs Vækerøveien 59.919810 10.651118 96 85
1919 Kværnerveien Ved Kværnerveien 5 59.905911 10.778592 88 71
591 Grenseveien ved Togbru 59.924645 10.781727 81 80
560 Gaustad T-bane langs Slemdalsveien 59.945955 10.710392 74 51
573 Tordenskiolds gate ved Rådhusgata 59.911776 10.735113 67 124

254 rows × 6 columns

Plotting on a map with ipyleaflet and HTML#

We saw that the scatterplot could be used to plot stations on a map:

station_data.plot.scatter("start_station_longitude", "start_station_latitude")
<AxesSubplot:xlabel='start_station_longitude', ylabel='start_station_latitude'>
../../_images/8ac4e976f443b783d3bd4bc28cd7b6efd782651621daf649a18c0bcd5f7c9a86.png

We now have tools to plot the most popular bike stations as bigger circles

station_data.plot.scatter(
    "start_station_longitude", "start_station_latitude", s="started_trips"
)
<AxesSubplot:xlabel='start_station_longitude', ylabel='start_station_latitude'>
../../_images/6a0ed90454fcc98117974b472a40c78505d652b9d87a475e392110032bb6db49.png

ipywidgets/HTML and ipyleaflet are useful tools to visualize data on maps#

from ipyleaflet import Circle, Map, Marker, Polyline, basemap_to_tiles, basemaps
from ipywidgets import HTML
oslo_center = (
    59.9127,
    10.7461,
)  # NB ipyleaflet uses Lat-Long (i.e. y,x, when specifying coordinates)
oslo_map = Map(center=oslo_center, zoom=13)
oslo_map
oslo_map.save(
    "data/raw_oslo_map.html"
)  # if interactive view is not possible inline try to open this in your browser

We can add different layers to our map with a marker function. The function is written such that for a given row in the dataframe (i.e. a given station), we add one marker to the map

def add_markers(row):
    center = row["start_station_latitude"], row["start_station_longitude"]
    marker = Circle(
        location=center, radius=int(0.04 * row["started_trips"]), color="green"
    )
    oslo_map.add_layer(marker)
station_data.apply(add_markers, axis=1)
start_station_id
377     None
378     None
380     None
381     None
382     None
        ... 
2306    None
2307    None
2308    None
2309    None
2315    None
Length: 254, dtype: object

Exercise#

Note: If you have issues with installing ipyleaflet or ipywidget with pip, just use pl.scatter() or pl.plot()

  1. Create the DataFrame station_data as described in the lecture

  2. Make a similar plot of the Oslo map with the most popular end-stations as red circles

  3. Add the following line as the last line in your add_markers-function: marker.popup = HTML(f”{row[‘start_station_name’]} Trips started: {row[‘started_trips’]}”) . You can also add newlines within the string with the HTML command for newline

  4. Try to make an Oslo map showing both started trips and ended trips in the same map

  5. Make a map showing which stations are most popular going from Stensgata

ls data
FremontBridge.csv
bike-counter-locations-oslo-municipality.csv
car-counter-locations-oslo-municipality.csv
eustat_area.tsv
eustat_population.tsv
nor_population2022.csv
oslo_bike_2021_09.csv
oslo_bike_september_2022.csv
pressure.csv
pressure.png
raw_oslo_map.html
state-abbrevs.csv
state-areas.csv
state-population.csv
used_car_sales.csv
%reset
import pandas as pd
import pylab as pl

trip_csv = "data/oslo_bike_2021_09.csv"
trips = pd.read_csv(trip_csv)
station_data = trips.groupby(
    [
        "start_station_id",
        "start_station_longitude",
        "start_station_latitude",
        "start_station_name",
    ]
).count()
station_data = station_data.reset_index()
station_data = station_data.drop(columns=station_data.columns[-7:])
station_data = station_data.rename(columns={"started_at": "started_trips"})
station_data = station_data.set_index("start_station_id")
station_data["ended_trips"] = trips["end_station_id"].value_counts()

from ipyleaflet import Circle, Map, Marker, Polyline, basemap_to_tiles, basemaps
from ipywidgets import HTML

oslo_center = (
    59.9127,
    10.7461,
)  # NB ipyleaflet uses Lat-Long (i.e. y,x, when specifying coordinates)
oslo_map = Map(center=oslo_center, zoom=13)


def add_markers(row):
    center = row["start_station_latitude"], row["start_station_longitude"]
    marker = Circle(
        location=center, radius=int(0.04 * row["started_trips"]), color="green"
    )
    marker2 = Circle(
        location=center, radius=int(0.04 * row["ended_trips"]), color="red"
    )
    oslo_map.add_layer(marker)
    oslo_map.add_layer(marker2)
    marker.popup = HTML(
        f"{row['start_station_name']} <br> Trips started: {row['started_trips']}"
    )


station_data.apply(add_markers, axis=1)
oslo_map