Brief tour of Altair


Useful links:

altair tutorial: https://altair-viz.github.io/altair-tutorial
altair docs: https://altair-viz.github.io

The goal of this section is to teach you the core concepts required to create a basic Altair chart; namely:

Data, Marks, and Encodings: the three core pieces of an Altair chart
Encoding Types: Q (quantitative), N (nominal), O (ordinal), T (temporal), which drive the visual representation of the encodings
Binning and Aggregation: which let you control aspects of the data representation within Altair.

With a good understanding of these core pieces, you will be well on your way to making a variety of charts in Altair.

python3 -m pip install altair vega_datasets

We’ll start by importing Altair:

import altair as alt

A Basic Altair Chart#

The essential elements of an Altair chart are the data, the mark, and the encoding.

The format by which these are specified will look something like this:

alt.Chart(data).mark_point().encode(
    encoding_1='column_1',
    encoding_2='column_2',
    # etc.
)

Let’s take a look at these pieces, one at a time.

The Data#

Data in Altair is built around the pandas DataFrame. For this section, we’ll use the cars dataset that we saw before, which we can load using the vega_datasets package:

pip install vega-datasets

from vega_datasets import data

cars = data.cars()

cars

	Name	Miles_per_Gallon	Cylinders	Displacement	Horsepower	Weight_in_lbs	Acceleration	Year	Origin
0	chevrolet chevelle malibu	18.0	8	307.0	130.0	3504	12.0	1970-01-01	USA
1	buick skylark 320	15.0	8	350.0	165.0	3693	11.5	1970-01-01	USA
2	plymouth satellite	18.0	8	318.0	150.0	3436	11.0	1970-01-01	USA
3	amc rebel sst	16.0	8	304.0	150.0	3433	12.0	1970-01-01	USA
4	ford torino	17.0	8	302.0	140.0	3449	10.5	1970-01-01	USA
...	...	...	...	...	...	...	...	...	...
401	ford mustang gl	27.0	4	140.0	86.0	2790	15.6	1982-01-01	USA
402	vw pickup	44.0	4	97.0	52.0	2130	24.6	1982-01-01	Europe
403	dodge rampage	32.0	4	135.0	84.0	2295	11.6	1982-01-01	USA
404	ford ranger	28.0	4	120.0	79.0	2625	18.6	1982-01-01	USA
405	chevy s-10	31.0	4	119.0	82.0	2720	19.4	1982-01-01	USA

406 rows × 9 columns

Data in Altair is expected to be in a tidy format; in other words:

each row is an observation
each column is a variable

See Altair’s Data Documentation for more information.
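As a small illustration of what tidy means in practice, the sketch below reshapes a hypothetical wide table (one column per city; the names and numbers are made up for this example) into the long form described above, using pandas' melt method.

```python
import pandas as pd

# Hypothetical "wide" table: one temperature column per city.
wide = pd.DataFrame(
    {
        "day": [1, 2],
        "Oslo": [2.9, 3.3],
        "Bergen": [7.1, 6.8],
    }
)

# Reshape so that each row is one observation (day, city, temperature)
# and each column is one variable.
tidy = wide.melt(id_vars="day", var_name="city", value_name="temperature")
print(tidy)
```

In the tidy version, city is an ordinary column, so it can be mapped to an encoding such as color.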

The Chart object#

With the data defined, you can instantiate Altair’s fundamental object, the Chart. At its core, a Chart is an object that knows how to emit a JSON dictionary representing the data and visualization encodings, which can be sent to the notebook and rendered by the Vega-Lite JavaScript library. Let’s take a look at what this JSON representation looks like, using only the first row of the data:

cars1 = cars.iloc[:1]
cars1

	Name	Miles_per_Gallon	Cylinders	Displacement	Horsepower	Weight_in_lbs	Acceleration	Year	Origin
0	chevrolet chevelle malibu	18.0	8	307.0	130.0	3504	12.0	1970-01-01	USA

alt.Chart(cars1).mark_point().to_dict()

{'config': {'view': {'continuousWidth': 300, 'continuousHeight': 300}},
 'data': {'name': 'data-36a712fbaefa4d20aa0b32e160cfd83a'},
 'mark': {'type': 'point'},
 '$schema': 'https://vega.github.io/schema/vega-lite/v5.8.0.json',
 'datasets': {'data-36a712fbaefa4d20aa0b32e160cfd83a': [{'Name': 'chevrolet chevelle malibu',
    'Miles_per_Gallon': 18.0,
    'Cylinders': 8,
    'Displacement': 307.0,
    'Horsepower': 130.0,
    'Weight_in_lbs': 3504,
    'Acceleration': 12.0,
    'Year': '1970-01-01T00:00:00',
    'Origin': 'USA'}]}}

At this point the chart includes a JSON-formatted representation of the dataframe, what type of mark to use, along with some metadata that is included in every chart output.

The Mark#

We can decide what sort of mark we would like to use to represent our data. For example, we can choose the point mark to represent each row of the data as a point on the plot:

alt.Chart(cars).mark_point()

The result is a visualization with one point per row in the data, though it is not particularly interesting: all the points are stacked right on top of each other!

It is useful to again examine the JSON output here:

alt.Chart(cars1).mark_point().to_dict()

{'config': {'view': {'continuousWidth': 300, 'continuousHeight': 300}},
 'data': {'name': 'data-36a712fbaefa4d20aa0b32e160cfd83a'},
 'mark': {'type': 'point'},
 '$schema': 'https://vega.github.io/schema/vega-lite/v5.8.0.json',
 'datasets': {'data-36a712fbaefa4d20aa0b32e160cfd83a': [{'Name': 'chevrolet chevelle malibu',
    'Miles_per_Gallon': 18.0,
    'Cylinders': 8,
    'Displacement': 307.0,
    'Horsepower': 130.0,
    'Weight_in_lbs': 3504,
    'Acceleration': 12.0,
    'Year': '1970-01-01T00:00:00',
    'Origin': 'USA'}]}}

Notice that now in addition to the data, the specification includes information about the mark type.

There are a number of available marks that you can use; some of the more common are the following:

mark_point()
mark_circle()
mark_square()
mark_line()
mark_area()
mark_bar()
mark_tick()

You can get a complete list of mark_* methods using Jupyter’s tab-completion feature: in any cell just type:

alt.Chart.mark_

followed by the tab key to see the available options.

Encodings#

The next step is to add visual encoding channels (or encodings for short) to the chart. An encoding channel specifies how a given data column should be mapped onto the visual properties of the visualization. Some of the more frequently used visual encodings are listed here:

x: x-axis value
y: y-axis value
color: color of the mark
opacity: transparency/opacity of the mark
shape: shape of the mark
size: size of the mark
row: row within a grid of facet plots
column: column within a grid of facet plots

For a complete list of these encodings, see the Encodings section of the documentation.

Visual encodings can be created with the encode() method of the Chart object. For example, we can start by mapping the y axis of the chart to the Origin column:

alt.Chart(cars).mark_point().encode(y="Origin")

The result is a one-dimensional visualization representing the values taken on by Origin, with the points in each category on top of each other. As above, we can view the JSON data generated for this visualization:

alt.Chart(cars1).mark_point().encode(x="Origin").to_dict()

{'config': {'view': {'continuousWidth': 300, 'continuousHeight': 300}},
 'data': {'name': 'data-36a712fbaefa4d20aa0b32e160cfd83a'},
 'mark': {'type': 'point'},
 'encoding': {'x': {'field': 'Origin', 'type': 'nominal'}},
 '$schema': 'https://vega.github.io/schema/vega-lite/v5.8.0.json',
 'datasets': {'data-36a712fbaefa4d20aa0b32e160cfd83a': [{'Name': 'chevrolet chevelle malibu',
    'Miles_per_Gallon': 18.0,
    'Cylinders': 8,
    'Displacement': 307.0,
    'Horsepower': 130.0,
    'Weight_in_lbs': 3504,
    'Acceleration': 12.0,
    'Year': '1970-01-01T00:00:00',
    'Origin': 'USA'}]}}

The result is the same as above with the addition of the 'encoding' key, which specifies the visualization channel (x in this dictionary, because the call just above passes x="Origin", whereas the chart itself used y), the name of the field (Origin), and the type of the variable (nominal). We’ll discuss these data types in a moment.

The visualization can be made more interesting by adding another channel to the encoding: let’s encode the Miles_per_Gallon as the x position:

alt.Chart(cars).mark_point().encode(
    x="Miles_per_Gallon",
    y="Origin",
)

You can add as many encodings as you wish, with each encoding mapped to a column in the data. For example, here we will color the points by Origin, and plot Miles_per_Gallon vs Year:

alt.Chart(cars).mark_point().encode(
    x="Year",
    y="Miles_per_Gallon",
    color="Origin",
)

Exercise: Exploring Data#

Now that you know the basics (data, encodings, marks), take some time and try making a few plots!

In particular, I’d suggest trying various combinations of the following:

Marks: mark_point(), mark_line(), mark_bar(), mark_text(), mark_rect()…
Data Columns: 'Acceleration', 'Cylinders', 'Displacement', 'Horsepower', 'Miles_per_Gallon', 'Name', 'Origin', 'Weight_in_lbs', 'Year'
Encodings: x, y, color, shape, row, column, opacity, text, tooltip…

Use various combinations of these options, and see what you can learn from the data! In particular, think about the following:

Which encodings go well with continuous, quantitative values?
Which encodings go well with discrete, categorical (i.e. nominal) values?

If you want a prompt, try to answer some specific questions:

Can you visualize how miles per gallon relates to properties such as displacement, horsepower, cylinders?
How much information can you fit in one chart with different encodings?

cars.dtypes

Name                        object
Miles_per_Gallon           float64
Cylinders                    int64
Displacement               float64
Horsepower                 float64
Weight_in_lbs                int64
Acceleration               float64
Year                datetime64[ns]
Origin                      object
dtype: object

alt.Chart(cars).mark_point().encode(
    x="Acceleration",
    y="Miles_per_Gallon",
    color="Cylinders:N",
    size="Weight_in_lbs",
    shape="Origin",
)

Encoding Types#

One of the central ideas of Altair is that the library will choose good defaults for your data type.

The basic data types supported by Altair are as follows:

Data Type	Code	Description
quantitative	Q	Numerical quantity (real-valued)
nominal	N	Name / Unordered categorical
ordinal	O	Ordered categorical
temporal	T	Date/time

When you specify data as a pandas dataframe, these types are automatically determined by Altair.

When you specify data as a URL, you must manually specify data types for each of your columns.

Let’s look at a simple plot containing three of the columns from the cars data:

alt.Chart(cars).mark_tick().encode(
    x="Miles_per_Gallon", y="Origin", color="Cylinders:O"
)

Questions:

what data type best goes with Miles_per_Gallon?
what data type best goes with Origin?
what data type best goes with Cylinders?

Let’s add the shorthands for each of these data types to our specification, using the one-letter codes above (for example, change "Miles_per_Gallon" to "Miles_per_Gallon:Q" to explicitly specify that it is a quantitative type):

alt.Chart(cars).mark_tick().encode(
    x="Miles_per_Gallon:Q",
    color="Origin:N",
    y="Cylinders:O",
)

cars[cars.Cylinders == 3]

	Name	Miles_per_Gallon	Cylinders	Displacement	Horsepower	Weight_in_lbs	Acceleration	Year	Origin
78	mazda rx2 coupe	19.0	3	70.0	97.0	2330	13.5	1972-01-01	Japan
118	maxda rx3	18.0	3	70.0	90.0	2124	13.5	1973-01-01	Japan
250	mazda rx-4	21.5	3	80.0	110.0	2720	13.5	1977-01-01	Japan
341	mazda rx-7 gs	23.7	3	70.0	100.0	2420	12.5	1980-01-01	Japan

Notice how if we change the data type for 'Cylinders' to ordinal the plot changes.

As you use Altair, it is useful to get into the habit of always specifying these types explicitly, because this is mandatory when working with data loaded from a file or a URL.

Exercise: Adding Explicit Types#

The following are a few simple charts made with the cars dataset. For each one, try to add explicit types to the encodings (e.g. change "Horsepower" to "Horsepower:Q") so that the plot doesn’t change.

Are there any plots that can be made better by changing the type?

alt.Chart(cars).mark_bar().encode(
    y="Origin:N",
    x="mean(Horsepower):Q",
)

alt.Chart(cars).mark_line().encode(
    x="Year:T",
    y="mean(Miles_per_Gallon):Q",
    color="Origin:N",
)

alt.Chart(cars).mark_bar().encode(
    y="Cylinders:O",
    x="count():Q",
    color="Origin:N",
)

alt.Chart(cars).mark_rect().encode(
    x="Cylinders:O",
    y="Origin:N",
    color="count():Q",
)
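The shorthands mean(...) and count() in the charts above ask for an aggregate computed once per group of the other encoded fields. A rough pandas analogue, on a made-up table that only borrows the column names of the cars data, might look like this:

```python
import pandas as pd

# Made-up rows with the same column names as the cars data.
df = pd.DataFrame(
    {
        "Origin": ["USA", "USA", "Japan", "Europe", "Japan"],
        "Cylinders": [8, 4, 4, 4, 3],
        "Horsepower": [150.0, 90.0, 70.0, 80.0, 100.0],
    }
)

# Roughly what x="mean(Horsepower):Q" with y="Origin:N" summarizes.
mean_hp = df.groupby("Origin")["Horsepower"].mean()

# Roughly what color="count():Q" over x="Cylinders:O", y="Origin:N" summarizes.
counts = df.groupby(["Cylinders", "Origin"]).size()

print(mean_hp)
print(counts)
```

In Altair the aggregation is written into the chart specification instead, so no separate pandas step is needed.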

Back to weather#

We can load our forecasting data and visualize it again, this time with Altair:

from forecasting import city_forecast

import pandas as pd

forecasts = pd.concat(
    city_forecast(city) for city in ("Oslo", "Bergen", "Tromsø", "Trondheim")
)
forecasts

	time	air_pressure_at_sea_level	air_temperature	cloud_area_fraction	relative_humidity	wind_from_direction	wind_speed	next_12_hours_symbol_code	next_1_hours_symbol_code	next_1_hours_precipitation_amount	next_6_hours_symbol_code	next_6_hours_precipitation_amount	city
0	2023-10-25 07:00:00+00:00	1015.5	2.9	25.4	78.9	47.1	4.2	partlycloudy	fair	0.0	partlycloudy	0.0	Oslo
1	2023-10-25 08:00:00+00:00	1015.4	3.3	16.8	76.8	45.6	4.8	partlycloudy	fair	0.0	partlycloudy	0.0	Oslo
2	2023-10-25 09:00:00+00:00	1015.2	3.9	11.6	73.1	47.7	4.6	partlycloudy	clearsky	0.0	partlycloudy	0.0	Oslo
3	2023-10-25 10:00:00+00:00	1015.1	4.6	30.4	70.9	49.8	4.5	partlycloudy	fair	0.0	partlycloudy	0.0	Oslo
4	2023-10-25 11:00:00+00:00	1014.4	5.3	59.3	68.0	46.6	5.0	partlycloudy	partlycloudy	0.0	partlycloudy	0.0	Oslo
...	...	...	...	...	...	...	...	...	...	...	...	...	...
80	2023-11-03 06:00:00+00:00	1001.9	3.6	100.0	70.9	128.2	4.2	cloudy	NaN	NaN	cloudy	0.0	Trondheim
81	2023-11-03 12:00:00+00:00	994.9	5.6	98.8	67.3	116.6	4.2	cloudy	NaN	NaN	cloudy	0.0	Trondheim
82	2023-11-03 18:00:00+00:00	997.6	4.5	95.7	74.1	129.1	4.2	cloudy	NaN	NaN	cloudy	0.0	Trondheim
83	2023-11-04 00:00:00+00:00	995.8	4.3	100.0	74.1	135.3	4.1	NaN	NaN	NaN	cloudy	0.0	Trondheim
84	2023-11-04 06:00:00+00:00	995.0	3.8	100.0	76.2	120.3	4.1	NaN	NaN	NaN	NaN	NaN	Trondheim

340 rows × 13 columns

Recall our columns and data types:

forecasts.dtypes

time                                 datetime64[ns, UTC]
air_pressure_at_sea_level                        float64
air_temperature                                  float64
cloud_area_fraction                              float64
relative_humidity                                float64
wind_from_direction                              float64
wind_speed                                       float64
next_12_hours_symbol_code                         object
next_1_hours_symbol_code                          object
next_1_hours_precipitation_amount                float64
next_6_hours_symbol_code                          object
next_6_hours_precipitation_amount                float64
city                                              object
dtype: object

alt.Chart(forecasts).mark_line().encode(
    x="time",
    y="air_temperature",
    color="city",
)

short_forecast = forecasts.dropna(subset=["next_1_hours_symbol_code"])
alt.Chart(short_forecast).mark_point().encode(
    x="time",
    y="air_temperature",
    color="city",
    shape="next_1_hours_symbol_code",
)

alt.Chart(forecasts).mark_bar().encode(
    x="next_6_hours_symbol_code",
    color="city",
    y="count()",
)

alt.Chart(forecasts).mark_bar().encode(
    x="next_6_hours_symbol_code",
    color="city",
    y="count()",
).facet(
    facet="city",
    columns=2,
)

alt.Chart(short_forecast).mark_bar().encode(
    x="time",
    y="next_1_hours_precipitation_amount",
    color="next_1_hours_symbol_code",
).facet(facet="city:N", columns=2)

alt.Chart(short_forecast).mark_point().encode(
    x="time",
    y="next_1_hours_precipitation_amount",
    color="next_1_hours_symbol_code",
    shape="next_1_hours_symbol_code",
).facet(facet="city:N", columns=2)

alt.Chart(forecasts).mark_point().encode(
    x="cloud_area_fraction",
    y="next_6_hours_symbol_code",
    shape="city",
    color="next_6_hours_precipitation_amount",
    # shape="next_1_hours_symbol_code",
)