Finding bottlenecks in your Python program#
“First make it work. Then make it right. Then make it fast.” - Kent Beck
Profiling allows us to measure resources used by sections of the program.#
Typical resources of interest are:

- Amount of wall-clock time used
- Amount of CPU time used
- Amount of RAM used

Other resources can be measured as well, such as:

- Disk I/O
- Network I/O
- …
Today, we only consider wall-clock/CPU time profiling.
Profiling techniques#
Start simple, switch to more complex techniques if needed!
Techniques for measuring wall-clock/CPU time, in increasing complexity:

1. Manual timing
2. The `timeit` module
3. The `cProfile` module
4. Line-by-line profiling
Case study: filling a grid with point values#
- Consider a rectangular 2D grid
- A NumPy array `a[i,j]` holds the values at the grid points
Implementation#
import numpy as np


class Grid2D:
    def __init__(self, xmin=0, xmax=1, dx=0.5, ymin=0, ymax=1, dy=0.5):
        self.xcoor = np.arange(xmin, xmax + dx, step=dx)
        self.ycoor = np.arange(ymin, ymax + dy, step=dy)

    def gridloop(self, f):
        lx = np.size(self.xcoor)
        ly = np.size(self.ycoor)
        a = np.zeros((lx, ly))

        for i in range(lx):
            x = self.xcoor[i]
            for j in range(ly):
                y = self.ycoor[j]
                a[i, j] = f(x, y)
        return a
Usage#
Create a new grid:
g = Grid2D(dx=0.001, dy=0.001)
Define function to evaluate:
def myfunc(x, y):
    return np.sin(x * y) + y
Computing grid values:
print("Computing values...")
a = g.gridloop(myfunc)
print("done")
Computing values...
done
Timing#
Use `time.perf_counter()` to measure the time spent in a code section:

import time

t0 = time.perf_counter()
# execute code here
t1 = time.perf_counter()
print(f"Runtime: {t1 - t0}")

Note: `time.perf_counter()` measures wall-clock time. Use `time.process_time()` to measure CPU time (`time.clock()` was removed in Python 3.8).
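The difference between the two clocks shows up whenever a program waits (sleep, I/O): wall-clock time keeps running while CPU time does not. A minimal sketch:

```python
import time

t0_wall = time.perf_counter()
t0_cpu = time.process_time()

time.sleep(0.2)  # waiting costs wall-clock time, but almost no CPU time

wall = time.perf_counter() - t0_wall
cpu = time.process_time() - t0_cpu
print(f"wall: {wall:.3f} s, CPU: {cpu:.3f} s")
```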
Timing guidelines:

- Put simple statements in a loop to increase measurement accuracy.
- Make sure the machine load is constant while timing.
- Run the tests several times and choose the smallest time.
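These guidelines can be bundled into a small helper function (a sketch; the name `best_of` is ours):

```python
import time


def best_of(func, repeats=5, loops=1000):
    """Time ``func`` in a loop, repeat the experiment, return the best per-call time."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        for _ in range(loops):
            func()
        best = min(best, (time.perf_counter() - t0) / loops)
    return best


print(f"sum(range(100)) takes {best_of(lambda: sum(range(100))):.2e} s per call")
```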
Timing of the case study#
The case study has two parts that could potentially be slow:

1. The initialisation `Grid2D(dx=0.001, dy=0.001)`
2. Calling the `g.gridloop(myfunc)` function

We time these two parts separately to figure out how much time is spent in each.
Timing the Grid2D initialisation#
import time
t0 = time.perf_counter()
g = Grid2D(dx=0.001, dy=0.001)
t1 = time.perf_counter()
print(f"CPU time: {t1 - t0:.4} s")
CPU time: 4.046e-05 s
Timing the `gridloop` function#
min_time = 1e9  # keep track of the shortest runtime
for i in range(1, 10):
    t0 = time.perf_counter()
    g.gridloop(myfunc)
    t1 = time.perf_counter()
    min_time = min(min_time, t1 - t0)
    print(f"Experiment {i}. CPU time: {t1 - t0:.4} s")
print(f"Minimum runtime: {min_time}")
Experiment 1. CPU time: 0.7527 s
Experiment 2. CPU time: 0.7456 s
Experiment 3. CPU time: 0.7475 s
Experiment 4. CPU time: 0.7349 s
Experiment 5. CPU time: 0.7436 s
Experiment 6. CPU time: 0.7315 s
Experiment 7. CPU time: 0.7516 s
Experiment 8. CPU time: 0.7355 s
Experiment 9. CPU time: 0.7382 s
Minimum runtime: 0.7315359170315787
⇒ The `gridloop` function is causing the slow execution!
The timeit module#
The `timeit` module provides a convenient way of measuring the runtime of small code snippets.
Usage:
import timeit
timeit.timeit(stmt="a+=1", setup="a=0")
0.019028542039450258
The statement is automatically wrapped in a for loop. By default it is executed 1,000,000 times, and `timeit` returns the accumulated runtime.
You can adjust the number of executions:
timeit.timeit(stmt="a+=1", setup="a=0", number=10000)
0.00017849996220320463
Use timeit.repeat if you would like to repeat the experiment multiple times:
timeit.repeat(stmt="a+=1", setup="a=0", number=10000, repeat=5)
[0.00018762500258162618,
0.00018729199655354023,
0.00018633296713232994,
0.00018083304166793823,
0.00018633296713232994]
Timing user-defined functions#
`timeit` runs in its own namespace, which means that your variables, functions, … are not available in `timeit` unless they are imported in the `setup` argument:
timeit.repeat(
    stmt="g.gridloop(myfunc)",
    setup="from __main__ import g, myfunc",
    repeat=5,
    number=1,
)
[0.7332040829933248,
0.733379082987085,
0.7288548749638721,
0.73128754098434,
0.7393817079719156]
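Since Python 3.5 you can instead pass `globals=globals()` to give `timeit` access to your current namespace, avoiding the import string (a sketch with a stand-in function):

```python
import timeit


def double(x):  # stand-in for a user-defined function such as myfunc
    return 2 * x


t = timeit.timeit(stmt="double(21)", globals=globals(), number=10_000)
print(f"{t:.4f} s for 10000 calls")
```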
Profiling modules with cProfile#
A profile is a set of statistics that describes how often and for how long various parts of the program executed.
`cProfile` is Python's main (deterministic) profiling module.
Two options to use cProfile#
As a script:
python -m cProfile script.py
As a module:
import cProfile
pr = cProfile.Profile()
res = pr.run("g.gridloop(myfunc)") # res contains the statistics
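Since Python 3.8, `cProfile.Profile` can also be used as a context manager, which avoids passing the code as a string (a sketch with a stand-in workload):

```python
import cProfile


def workload():  # stand-in for g.gridloop(myfunc)
    return sum(i * i for i in range(100_000))


with cProfile.Profile() as pr:
    workload()

pr.print_stats()
```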
Getting runtime statistics#
res.print_stats() # Print profiling statistics
1002014 function calls in 0.815 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.187 0.187 0.815 0.815 42104406.py:11(gridloop)
1002001 0.627 0.000 0.627 0.000 866067724.py:1(myfunc)
2 0.000 0.000 0.000 0.000 <__array_function__ internals>:2(size)
1 0.000 0.000 0.815 0.815 <string>:1(<module>)
2 0.000 0.000 0.000 0.000 fromnumeric.py:3165(_size_dispatcher)
2 0.000 0.000 0.000 0.000 fromnumeric.py:3169(size)
1 0.000 0.000 0.815 0.815 {built-in method builtins.exec}
2 0.000 0.000 0.000 0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
1 0.001 0.001 0.001 0.001 {built-in method numpy.zeros}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Meaning of the columns:

- ncalls: number of calls
- tottime: total time spent in the given function, excluding time spent in calls to sub-functions
- percall: tottime divided by ncalls
- cumtime: cumulative time spent in this function and all sub-functions
- percall: cumtime divided by ncalls
- filename:lineno(function): information about the function
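The default ordering (standard name) rarely puts the hotspots first; the `pstats` module can sort the statistics by any of these columns, e.g. by `tottime` (a sketch profiling a stand-in function):

```python
import cProfile
import io
import pstats


def slow_part():  # stand-in for the code under investigation
    return [i ** 2 for i in range(200_000)]


pr = cProfile.Profile()
pr.runcall(slow_part)

stream = io.StringIO()
pstats.Stats(pr, stream=stream).sort_stats("tottime").print_stats(3)  # top 3 by tottime
print(stream.getvalue())
```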
Back to our case-study: What have we learned so far?#
The biggest contributors to the total runtime are:

1. `gridloop`, which contributes one third of the total runtime.
2. The `myfunc` calls, which contribute two thirds of the total runtime.

`myfunc` is fairly straightforward:

def myfunc(x, y):
    return np.sin(x * y) + y

It might be difficult to improve.
What about `gridloop`?
Recall that `gridloop` was a function of the form#
def gridloop(self, f):
    lx = np.size(self.xcoor)
    ly = np.size(self.ycoor)
    a = np.zeros((lx, ly))

    for i in range(lx):
        x = self.xcoor[i]
        for j in range(ly):
            y = self.ycoor[j]
            a[i, j] = f(x, y)
    return a
It would be useful to see how much time is spent in each line!
Line by line profiling#
The `line_profiler` module measures the time spent in each line of a Python function.
Usage#
1. Install with `conda install line_profiler` (or `pip install line-profiler`).
2. "Decorate" the function of interest with `@profile`:

   @profile
   def gridloop(func):
       # ...

3. Run the line profiler with:

   kernprof -l -v grid2d_lineprofile.py
Demo#
from line_profiler import LineProfiler
profiler = LineProfiler()
profiler.add_function(g.gridloop)
profiler.runcall(g.gridloop, myfunc)
profiler.print_stats()
Timer unit: 1e-06 s
Total time: 1.50999 s
File: /var/folders/qr/3vxfnp1x2t1fw55dr288mphc0000gn/T/ipykernel_45895/42104406.py
Function: gridloop at line 11
Line # Hits Time Per Hit % Time Line Contents
==============================================================
11 def gridloop(self, f):
12 1 8.0 8.0 0.0 lx = np.size(self.xcoor)
13 1 2.0 2.0 0.0 ly = np.size(self.ycoor)
14 1 339.0 339.0 0.0 a = np.zeros((lx,ly))
15
16 1002 197.0 0.2 0.0 for i in range(lx):
17 1001 288.0 0.3 0.0 x = self.xcoor[i]
18 1003002 217804.0 0.2 14.4 for j in range(ly):
19 1002001 258869.0 0.3 17.1 y = self.ycoor[j]
20 1002001 1032486.0 1.0 68.4 a[i,j] = f(x, y)
21 1 0.0 0.0 0.0 return a
Conclusion: A significant amount of time is spent in loops and indexing. How can we improve this?
A vectorised Grid2D implementation#
class VectorisedGrid2D:
    def __init__(self, xmin=0, xmax=1, dx=0.5, ymin=0, ymax=1, dy=0.5):
        self.xcoor = np.arange(xmin, xmax + dx, step=dx)
        self.ycoor = np.arange(ymin, ymax + dy, step=dy)

    def gridloop(self, f):
        return f(self.xcoor[:, None], self.ycoor[None, :])  # vectorised grid evaluation
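The speedup comes from NumPy broadcasting: `xcoor[:, None]` has shape `(nx, 1)` and `ycoor[None, :]` has shape `(1, ny)`, so `f` is evaluated on the full `(nx, ny)` grid in a single call. A small shape check:

```python
import numpy as np

x = np.arange(3)  # shape (3,)
y = np.arange(4)  # shape (4,)

a = np.sin(x[:, None] * y[None, :]) + y[None, :]
print(a.shape)  # broadcasting produces a (3, 4) array
```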
Timing the vectorised Grid2D#
vg = VectorisedGrid2D(dx=0.001, dy=0.001)
min(
    timeit.repeat(
        stmt="vg.gridloop(myfunc)",
        setup="from __main__ import vg, myfunc",
        repeat=5,
        number=1,
    )
)
0.025603624992072582
g = Grid2D(dx=0.001, dy=0.001)
min(
    timeit.repeat(
        stmt="g.gridloop(myfunc)",
        setup="from __main__ import g, myfunc",
        repeat=5,
        number=1,
    )
)
0.725250874995254
Vectorisation yielded a nearly 30x speedup (0.73 s vs 0.026 s)!
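A speedup is only useful if the result is unchanged, so it is worth verifying that the loop and the vectorised evaluation agree (a self-contained sketch on a smaller grid):

```python
import numpy as np


def myfunc(x, y):
    return np.sin(x * y) + y


xcoor = np.arange(0, 1.01, step=0.01)
ycoor = np.arange(0, 1.01, step=0.01)

# explicit double loop, as in Grid2D.gridloop
a_loop = np.zeros((xcoor.size, ycoor.size))
for i, x in enumerate(xcoor):
    for j, y in enumerate(ycoor):
        a_loop[i, j] = myfunc(x, y)

# broadcast evaluation, as in VectorisedGrid2D.gridloop
a_vec = myfunc(xcoor[:, None], ycoor[None, :])

print(np.allclose(a_loop, a_vec))  # both implementations agree
```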