Profiling and Optimising#

Optimising is the process of making your code ‘better’ by some metric. Most often, this is total runtime (how long it takes), but can also mean things like reducing memory usage or any other way to measure the ‘cost’ of your code.

Profiling is a way of measuring your code, to figure out what part of it you may want to to optimize.

IPython provides some tools that can make it it a bit easier to profile and optimise your code sometimes.

For more information on IPython, see our IPython notes

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

%timeit#

The main IPython tool we are going to use here is %timeit, a magic that automates measuring how long it takes to run a snippet of code.

for N in (100, 500, 1000, 2000):
    print(f"Size: {N} x {N}")
    A = np.random.random((N, N))
    %timeit A.dot(A)
Size: 100 x 100
The slowest run took 28.70 times longer than the fastest. This could mean that an intermediate result is being cached.
2.08 ms ± 2.94 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Size: 500 x 500
2.42 ms ± 182 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Size: 1000 x 1000
19.5 ms ± 6.36 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Size: 2000 x 2000
96.8 ms ± 19.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Let’s look at what options %timeit can take.

%timeit?
Docstring:
Time execution of a Python statement or expression

Usage, in line mode:
  %timeit [-n<N> -r<R> [-t|-c] -q -p<P> -o] statement
or in cell mode:
  %%timeit [-n<N> -r<R> [-t|-c] -q -p<P> -o] setup_code
  code
  code...

Time execution of a Python statement or expression using the timeit
module.  This function can be used both as a line and cell magic:

- In line mode you can time a single-line statement (though multiple
  ones can be chained with using semicolons).

- In cell mode, the statement in the first line is used as setup code
  (executed but not timed) and the body of the cell is timed.  The cell
  body has access to any variables created in the setup code.

Options:
-n<N>: execute the given statement <N> times in a loop. If <N> is not
provided, <N> is determined so as to get sufficient accuracy.

-r<R>: number of repeats <R>, each consisting of <N> loops, and take the
best result.
Default: 7

-t: use time.time to measure the time, which is the default on Unix.
This function measures wall time.

-c: use time.clock to measure the time, which is the default on
Windows and measures wall time. On Unix, resource.getrusage is used
instead and returns the CPU user time.

-p<P>: use a precision of <P> digits to display the timing result.
Default: 3

-q: Quiet, do not print result.

-o: return a TimeitResult that can be stored in a variable to inspect
    the result in more details.

.. versionchanged:: 7.3
    User variables are no longer expanded,
    the magic line is always left unmodified.

Examples
--------
::

  In [1]: %timeit pass
  8.26 ns ± 0.12 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

  In [2]: u = None

  In [3]: %timeit u is None
  29.9 ns ± 0.643 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

  In [4]: %timeit -r 4 u == None

  In [5]: import time

  In [6]: %timeit -n1 time.sleep(2)

The times reported by %timeit will be slightly higher than those
reported by the timeit.py script when variables are accessed. This is
due to the fact that %timeit executes the statement in the namespace
of the shell, compared with timeit.py, which uses a single setup
statement to import function or create variables. Generally, the bias
does not matter as long as results from timeit.py are not mixed with
those from %timeit.
File:      ~/conda/lib/python3.11/site-packages/IPython/core/magics/execution.py

We can save the result in an object with %timeit -o, and specify to only run one group of 100 iterations.

A = np.random.random((100, 100))
tr = %timeit -o -n 1 -r 100 A.dot(A)
The slowest run took 101.85 times longer than the fastest. This could mean that an intermediate result is being cached.
1.49 ms ± 2.4 ms per loop (mean ± std. dev. of 100 runs, 1 loop each)
tr
<TimeitResult : 1.49 ms ± 2.4 ms per loop (mean ± std. dev. of 100 runs, 1 loop each)>
tr.best
0.00012358397361822426
tr.best, tr.worst
(0.00012358397361822426, 0.012586958007887006)
print(len(tr.all_runs))
tr.all_runs[:10]
100
[0.0032582079875282943,
 0.0023532090126536787,
 0.00310250002075918,
 0.00602666600025259,
 0.003167124988976866,
 0.0003759169776458293,
 0.012586958007887006,
 0.008417666977038607,
 0.007104332995368168,
 0.0002472500200383365]
plt.hist(np.array(tr.all_runs) * 1e6)
plt.xlabel("t (µs)");
../../_images/ee6ce087b40cafdd30035a28e71a6e6635b9478ff90f649f8fe381dda7f70eb1.png

Diffusing a wave#

Our task is to optimise a 1-D diffusion algorithm, using numpy and Cython.

Our input signal is a sawtooth wave:

\[ x_\mathrm{sawtooth}(t) = \frac{A}{2}-\frac {A}{\pi}\sum_{k=1}^{\infty}\frac {\sin (k t)}{k} \]
from scipy.signal import sawtooth

T = 8 * np.pi
t = np.linspace(0, T, 1024)
x = sawtooth(t)
plt.plot(t, x)

steps = 10_000
../../_images/ccfe97161fde814846398b4ab36043942932074382adc3bc649a0be94dc4a5a4.png

Exercise: turning math into code#

Hints:

  1. when you see a sum over a variable, that should make you think of a for loop

  2. x(t) can be an array, where t is an array of values, and x is the result of calling an

For example, to implement

\[ x(t) = sin(t) \]
x_t = np.sin(t)
plt.plot(t, x_t)
[<matplotlib.lines.Line2D at 0x147119050>]
../../_images/ef073dd96382ca49c298c369e1be53cda9d5db0f6f2b3051b086d4017ba8cdc9.png
\[ x_\mathrm{sawtooth}(t) = \frac{A}{2}-\frac {A}{\pi}\sum_{k=1}^{\infty}\frac {\sin (k t)}{k} \]
def my_sawtooth(t, A=1, n_freqs=100):
    # x = A/2 - ...
    x = np.ones(t.shape) * 0.5
    # A/π * sum(k=1, k=n_freqs) { sin(k t) / k }
    for k in range(1, n_freqs + 1):
        x -= np.sin(k * t) / (k * np.pi)
    return x * A


# test your sawtooth output
# it should look like similar to the above

n_list = [5, 10, 100, 1000]
fig, axes = plt.subplots(len(n_list), 1)

for ax, n_freqs in zip(axes, n_list):
    ax.plot(t, my_sawtooth(t, n_freqs=n_freqs))
    ax.set_title(f"n={n_freqs}")
../../_images/158d6c5125b7b0942a5ca2fcc00bea1832999cc305099f822666907024b6057e.png

We are going to diffuse the wave by evolving the heat equation:

\[ \frac{\delta x}{\delta t} = \alpha \frac{\delta^2 x}{\delta t^2} \]

Which we can discretize for our arrays:

(5)#\[\begin{align} x_{k}[i] =& \frac{1}{4} \left( x_{k-1}[i-1] + 2 x_{k-1}[i] + x_{k-1}[i+1] \right) & i \in [1, N-1] \\ x_{k}[0] =& x_{0}[0] \\ x_{k}[N] =& x_{0}[N] \\ \end{align}\]
points = [10, 100, 10, 100, 10]

plt.bar(range(len(points)), points)
plt.text(1, 101, "1/2", size=16, ha="center")
plt.text(0, 11, "1/4", size=16, ha="center")
plt.text(2, 11, "1/4", size=16, ha="center")
Text(2, 11, '1/4')
../../_images/adbf22320df9ecc8944b2191bca1593d7494c373c1befb2eb96acb782a03a798.png
import time

from IPython.display import clear_output

i_list = range(len(points))
for k in range(12):
    plt.bar(range(len(points)), points)
    plt.ylim(0, 100)
    plt.title(f"k = {k}")
    plt.show()
    next_points = points.copy()
    for i in range(1, len(points) - 1):
        next_points[i] = 0.25 * points[i - 1] + 0.5 * points[i] + 0.25 * points[i + 1]
    points = next_points
    time.sleep(1)
    clear_output(wait=True)
../../_images/37238ab9ec8f41ed9fc73a26377e4000e976ed87ba538297129c2b1a5f8aa7e3.png

Pure Python#

We’ll start with a pure Python implementation, to use as a reference.

Let’s implement our algorithm:

Advancing our blur one step looks like:

(6)#\[\begin{align} x_{k}[0] =& x_{0}[0] \\ x_{k}[N] =& x_{0}[N] \\ x_{k}[i] =& \frac{1}{4} \left( x_{k-1}[i-1] + 2 x_{k-1}[i] + x_{k-1}[i+1] \right) & i \in [1, N-1] \end{align}\]

and we keep repeating that to get

\[ x_k \in 1...steps \]
  • What variables do we need?

  • How many loops?

def blur_py(x0, steps=1024):
    """Advance a Gaussian blur `steps` steps"""
    x = 1 * x0  # copy
    x_k = np.empty_like(x)
    x_k[0] = x[0]
    x_k[-1] = x[-1]
    for k in range(steps):
        for i in range(1, len(x) - 1):
            x_k[i] = 0.25 * (x[i - 1] + 2 * x[i] + x[i + 1])
        x, x_k = x_k, x  # swap for next step
    return x
y = blur_py(x, steps)
plt.plot(t, x, "--")
plt.plot(t, y);
../../_images/3cf60ecc74017a5d654a9885344d4247315fee53c365a27609a43bbac51b4044.png

Now we can measure how long it takes to run evolve this system:

ref_run = %timeit -o y = blur_py(x, steps)
t_ref = ref_run.best

times = {"python": t_ref}
2.98 s ± 22.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So it takes a few seconds. We can also see how it changes with different times and resolutions.

Vectorizing with numpy#

We can vectorize the inner loop with a single numpy operation:

import numpy as np


def blur_np(x, steps=1024):
    x = 1 * x
    x_k = np.empty_like(x)
    x_k[0] = x[0]
    x_k[-1] = x[-1]
    for _ in range(steps):
        x_k[1:-1] = 0.25 * (x[0:-2] + 2 * x[1:-1] + x[2:])
        x, x_k = x_k, x
    return x
y = blur_np(x, steps)
plt.plot(t, x, "--")
plt.plot(t, y)
[<matplotlib.lines.Line2D at 0x1473ee1d0>]
../../_images/3cf60ecc74017a5d654a9885344d4247315fee53c365a27609a43bbac51b4044.png
np_r = %timeit -o blur_np(x, steps)
t_np = np_r.best
33.5 ms ± 196 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
times["numpy"] = t_np
def plot_times():
    ind = np.arange(len(times))
    plt.bar(ind, times.values(), log=True)
    plt.xticks(ind, times.keys(), rotation=30)
    plt.ylim(0.1 * min(times.values()), max(times.values()))


plot_times()
../../_images/b1317d8988764893e86522bc1afcaf6df6397296b783177da4816d20f7d36edc.png

So vectorizing the inner loop brings us from O(1 second) to O(10 milliseconds), an improvement of ~100x:

t_ref / t_np
89.18848313821238

Cython#

Cython provides an IPython extension, which defines a magic we can use to inline bits of Cython code in the notebook:

%load_ext Cython

# small fix to improve embedded Cython annotations in our lecture notes
%load_ext _cython_annotate_fix
%%cython

def csum(n):
    cs = 0
    for i in range(n):
        cs += i
    return cs
%timeit csum(5)
147 ns ± 0.831 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
%%cython?
Docstring:
::

  %cython [-a] [--annotate-fullc] [-+] [-3] [-2] [-f] [-c COMPILE_ARGS]
              [--link-args LINK_ARGS] [-l LIB] [-n NAME] [-L dir] [-I INCLUDE]
              [-S SRC] [--pgo] [--verbose]

Compile and import everything from a Cython code cell.

The contents of the cell are written to a `.pyx` file in the
directory `IPYTHONDIR/cython` using a filename with the hash of the
code. This file is then cythonized and compiled. The resulting module
is imported and all of its symbols are injected into the user's
namespace. The usage is similar to that of `%%cython_pyximport` but
you don't have to pass a module name::

    %%cython
    def f(x):
        return 2.0*x

To compile OpenMP codes, pass the required  `--compile-args`
and `--link-args`.  For example with gcc::

    %%cython --compile-args=-fopenmp --link-args=-fopenmp
    ...

To enable profile guided optimisation, pass the ``--pgo`` option.
Note that the cell itself needs to take care of establishing a suitable
profile when executed. This can be done by implementing the functions to
optimise, and then calling them directly in the same cell on some realistic
training data like this::

    %%cython --pgo
    def critical_function(data):
        for item in data:
            ...

    # execute function several times to build profile
    from somewhere import some_typical_data
    for _ in range(100):
        critical_function(some_typical_data)

In Python 3.5 and later, you can distinguish between the profile and
non-profile runs as follows::

    if "_pgo_" in __name__:
        ...  # execute critical code here

options:
  -a, --annotate        Produce a colorized HTML version of the source.
  --annotate-fullc      Produce a colorized HTML version of the source which
                        includes entire generated C/C++-code.
  -+, --cplus           Output a C++ rather than C file.
  -3                    Select Python 3 syntax.
  -2                    Select Python 2 syntax.
  -f, --force           Force the compilation of a new module, even if the
                        source has been previously compiled.
  -c COMPILE_ARGS, --compile-args COMPILE_ARGS
                        Extra flags to pass to compiler via the
                        `extra_compile_args` Extension flag (can be specified
                        multiple times).
  --link-args LINK_ARGS
                        Extra flags to pass to linker via the
                        `extra_link_args` Extension flag (can be specified
                        multiple times).
  -l LIB, --lib LIB     Add a library to link the extension against (can be
                        specified multiple times).
  -n NAME, --name NAME  Specify a name for the Cython module.
  -L dir                Add a path to the list of library directories (can be
                        specified multiple times).
  -I INCLUDE, --include INCLUDE
                        Add a path to the list of include directories (can be
                        specified multiple times).
  -S SRC, --src SRC     Add a path to the list of src files (can be specified
                        multiple times).
  --pgo                 Enable profile guided optimisation in the C compiler.
                        Compiles the cell twice and executes it in between to
                        generate a runtime profile.
  --verbose             Print debug information like generated .c/.cpp file
                        location and exact gcc/g++ command invoked.
File:      ~/conda/lib/python3.11/site-packages/Cython/Build/IpythonMagic.py

%%cython --annotate shows you annotations about the generated sourcecode. The key to writing Cython is to minimize the amount of Python calls in the generated code. In general: yellow = slow.

def psum(n):
    cs = 0
    for i in range(n):
        cs += i
    return cs
%%cython --annotate

def csum(n):
    cs = 0
    for i in range(n):
        cs += i
    return cs

Generated by Cython 3.0.2

Yellow lines hint at Python interaction.
Click on a line that starts with a "+" to see the C code that Cython generated for it.

 1: 
+2: def csum(n):
/* Python wrapper */
static PyObject *__pyx_pw_54_cython_magic_ffa75fecb2c2b5d887239288c8b3c0369a07c134_1csum(PyObject *__pyx_self, 
#if CYTHON_METH_FASTCALL
PyObject *const *__pyx_args, Py_ssize_t __pyx_nargs, PyObject *__pyx_kwds
#else
PyObject *__pyx_args, PyObject *__pyx_kwds
#endif
); /*proto*/
static PyMethodDef __pyx_mdef_54_cython_magic_ffa75fecb2c2b5d887239288c8b3c0369a07c134_1csum = {"csum", (PyCFunction)(void*)(__Pyx_PyCFunction_FastCallWithKeywords)__pyx_pw_54_cython_magic_ffa75fecb2c2b5d887239288c8b3c0369a07c134_1csum, __Pyx_METH_FASTCALL|METH_KEYWORDS, 0};
static PyObject *__pyx_pw_54_cython_magic_ffa75fecb2c2b5d887239288c8b3c0369a07c134_1csum(PyObject *__pyx_self, 
#if CYTHON_METH_FASTCALL
PyObject *const *__pyx_args, Py_ssize_t __pyx_nargs, PyObject *__pyx_kwds
#else
PyObject *__pyx_args, PyObject *__pyx_kwds
#endif
) {
  PyObject *__pyx_v_n = 0;
  #if !CYTHON_METH_FASTCALL
  CYTHON_UNUSED Py_ssize_t __pyx_nargs;
  #endif
  CYTHON_UNUSED PyObject *const *__pyx_kwvalues;
  PyObject *__pyx_r = 0;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("csum (wrapper)", 0);
  #if !CYTHON_METH_FASTCALL
  #if CYTHON_ASSUME_SAFE_MACROS
  __pyx_nargs = PyTuple_GET_SIZE(__pyx_args);
  #else
  __pyx_nargs = PyTuple_Size(__pyx_args);
  if (unlikely((__pyx_nargs < 0))) __PYX_ERR(0, 2, __pyx_L3_error)
  #endif
  #endif
  __pyx_kwvalues = __Pyx_KwValues_FASTCALL(__pyx_args, __pyx_nargs);
  {
    PyObject **__pyx_pyargnames[] = {&__pyx_n_s_n,0};
  PyObject* values[1] = {0};
    if (__pyx_kwds) {
      Py_ssize_t kw_args;
      switch (__pyx_nargs) {
        case  1: values[0] = __Pyx_Arg_FASTCALL(__pyx_args, 0);
        CYTHON_FALLTHROUGH;
        case  0: break;
        default: goto __pyx_L5_argtuple_error;
      }
      kw_args = __Pyx_NumKwargs_FASTCALL(__pyx_kwds);
      switch (__pyx_nargs) {
        case  0:
        if (likely((values[0] = __Pyx_GetKwValue_FASTCALL(__pyx_kwds, __pyx_kwvalues, __pyx_n_s_n)) != 0)) {
          (void)__Pyx_Arg_NewRef_FASTCALL(values[0]);
          kw_args--;
        }
        else if (unlikely(PyErr_Occurred())) __PYX_ERR(0, 2, __pyx_L3_error)
        else goto __pyx_L5_argtuple_error;
      }
      if (unlikely(kw_args > 0)) {
        const Py_ssize_t kwd_pos_args = __pyx_nargs;
        if (unlikely(__Pyx_ParseOptionalKeywords(__pyx_kwds, __pyx_kwvalues, __pyx_pyargnames, 0, values + 0, kwd_pos_args, "csum") < 0)) __PYX_ERR(0, 2, __pyx_L3_error)
      }
    } else if (unlikely(__pyx_nargs != 1)) {
      goto __pyx_L5_argtuple_error;
    } else {
      values[0] = __Pyx_Arg_FASTCALL(__pyx_args, 0);
    }
    __pyx_v_n = values[0];
  }
  goto __pyx_L4_argument_unpacking_done;
  __pyx_L5_argtuple_error:;
  __Pyx_RaiseArgtupleInvalid("csum", 1, 1, 1, __pyx_nargs); __PYX_ERR(0, 2, __pyx_L3_error)
  goto __pyx_L3_error;
  __pyx_L3_error:;
  {
    Py_ssize_t __pyx_temp;
    for (__pyx_temp=0; __pyx_temp < (Py_ssize_t)(sizeof(values)/sizeof(values[0])); ++__pyx_temp) {
      __Pyx_Arg_XDECREF_FASTCALL(values[__pyx_temp]);
    }
  }
  __Pyx_AddTraceback("_cython_magic_ffa75fecb2c2b5d887239288c8b3c0369a07c134.csum", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __Pyx_RefNannyFinishContext();
  return NULL;
  __pyx_L4_argument_unpacking_done:;
  __pyx_r = __pyx_pf_54_cython_magic_ffa75fecb2c2b5d887239288c8b3c0369a07c134_csum(__pyx_self, __pyx_v_n);
  int __pyx_lineno = 0;
  const char *__pyx_filename = NULL;
  int __pyx_clineno = 0;

  /* function exit code */
  {
    Py_ssize_t __pyx_temp;
    for (__pyx_temp=0; __pyx_temp < (Py_ssize_t)(sizeof(values)/sizeof(values[0])); ++__pyx_temp) {
      __Pyx_Arg_XDECREF_FASTCALL(values[__pyx_temp]);
    }
  }
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}

static PyObject *__pyx_pf_54_cython_magic_ffa75fecb2c2b5d887239288c8b3c0369a07c134_csum(CYTHON_UNUSED PyObject *__pyx_self, PyObject *__pyx_v_n) {
  PyObject *__pyx_v_cs = NULL;
  PyObject *__pyx_v_i = NULL;
  PyObject *__pyx_r = NULL;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("csum", 0);
/* … */
  /* function exit code */
  __pyx_L1_error:;
  __Pyx_XDECREF(__pyx_t_1);
  __Pyx_XDECREF(__pyx_t_2);
  __Pyx_AddTraceback("_cython_magic_ffa75fecb2c2b5d887239288c8b3c0369a07c134.csum", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __pyx_r = NULL;
  __pyx_L0:;
  __Pyx_XDECREF(__pyx_v_cs);
  __Pyx_XDECREF(__pyx_v_i);
  __Pyx_XGIVEREF(__pyx_r);
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}
/* … */
  __pyx_tuple_ = PyTuple_Pack(3, __pyx_n_s_n, __pyx_n_s_cs, __pyx_n_s_i); if (unlikely(!__pyx_tuple_)) __PYX_ERR(0, 2, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_tuple_);
  __Pyx_GIVEREF(__pyx_tuple_);
/* … */
  __pyx_t_2 = __Pyx_CyFunction_New(&__pyx_mdef_54_cython_magic_ffa75fecb2c2b5d887239288c8b3c0369a07c134_1csum, 0, __pyx_n_s_csum, NULL, __pyx_n_s_cython_magic_ffa75fecb2c2b5d887, __pyx_d, ((PyObject *)__pyx_codeobj__2)); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 2, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_csum, __pyx_t_2) < 0) __PYX_ERR(0, 2, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
  __pyx_t_2 = __Pyx_PyDict_NewPresized(0); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 2, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_test, __pyx_t_2) < 0) __PYX_ERR(0, 2, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
+3:     cs = 0
  __Pyx_INCREF(__pyx_int_0);
  __pyx_v_cs = __pyx_int_0;
+4:     for i in range(n):
  __pyx_t_1 = __Pyx_PyObject_CallOneArg(__pyx_builtin_range, __pyx_v_n); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 4, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  if (likely(PyList_CheckExact(__pyx_t_1)) || PyTuple_CheckExact(__pyx_t_1)) {
    __pyx_t_2 = __pyx_t_1; __Pyx_INCREF(__pyx_t_2); __pyx_t_3 = 0;
    __pyx_t_4 = NULL;
  } else {
    __pyx_t_3 = -1; __pyx_t_2 = PyObject_GetIter(__pyx_t_1); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 4, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_2);
    __pyx_t_4 = __Pyx_PyObject_GetIterNextFunc(__pyx_t_2); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 4, __pyx_L1_error)
  }
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
  for (;;) {
    if (likely(!__pyx_t_4)) {
      if (likely(PyList_CheckExact(__pyx_t_2))) {
        if (__pyx_t_3 >= PyList_GET_SIZE(__pyx_t_2)) break;
        #if CYTHON_ASSUME_SAFE_MACROS && !CYTHON_AVOID_BORROWED_REFS
        __pyx_t_1 = PyList_GET_ITEM(__pyx_t_2, __pyx_t_3); __Pyx_INCREF(__pyx_t_1); __pyx_t_3++; if (unlikely((0 < 0))) __PYX_ERR(0, 4, __pyx_L1_error)
        #else
        __pyx_t_1 = PySequence_ITEM(__pyx_t_2, __pyx_t_3); __pyx_t_3++; if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 4, __pyx_L1_error)
        __Pyx_GOTREF(__pyx_t_1);
        #endif
      } else {
        if (__pyx_t_3 >= PyTuple_GET_SIZE(__pyx_t_2)) break;
        #if CYTHON_ASSUME_SAFE_MACROS && !CYTHON_AVOID_BORROWED_REFS
        __pyx_t_1 = PyTuple_GET_ITEM(__pyx_t_2, __pyx_t_3); __Pyx_INCREF(__pyx_t_1); __pyx_t_3++; if (unlikely((0 < 0))) __PYX_ERR(0, 4, __pyx_L1_error)
        #else
        __pyx_t_1 = PySequence_ITEM(__pyx_t_2, __pyx_t_3); __pyx_t_3++; if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 4, __pyx_L1_error)
        __Pyx_GOTREF(__pyx_t_1);
        #endif
      }
    } else {
      __pyx_t_1 = __pyx_t_4(__pyx_t_2);
      if (unlikely(!__pyx_t_1)) {
        PyObject* exc_type = PyErr_Occurred();
        if (exc_type) {
          if (likely(__Pyx_PyErr_GivenExceptionMatches(exc_type, PyExc_StopIteration))) PyErr_Clear();
          else __PYX_ERR(0, 4, __pyx_L1_error)
        }
        break;
      }
      __Pyx_GOTREF(__pyx_t_1);
    }
    __Pyx_XDECREF_SET(__pyx_v_i, __pyx_t_1);
    __pyx_t_1 = 0;
/* … */
  }
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
+5:         cs += i
    __pyx_t_1 = PyNumber_InPlaceAdd(__pyx_v_cs, __pyx_v_i); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 5, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_1);
    __Pyx_DECREF_SET(__pyx_v_cs, __pyx_t_1);
    __pyx_t_1 = 0;
+6:     return cs
  __Pyx_XDECREF(__pyx_r);
  __Pyx_INCREF(__pyx_v_cs);
  __pyx_r = __pyx_v_cs;
  goto __pyx_L0;

Uh oh, that looks like a lot of yellow. We can reduce it by adding some type annotations:

%%cython --annotate

from cython import int

def csum2(n: int):
    i: int
    cs = 0
    for i in range(n):
        cs += i
    return cs

Generated by Cython 3.0.2

Yellow lines hint at Python interaction.
Click on a line that starts with a "+" to see the C code that Cython generated for it.

 1: 
+2: from cython import int
  __pyx_t_3 = __Pyx_PyDict_NewPresized(0); if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 2, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_test, __pyx_t_3) < 0) __PYX_ERR(0, 2, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
 3: 
+4: def csum2(n: int):
/* Python wrapper */
static PyObject *__pyx_pw_54_cython_magic_3b29ad3d8007e919ac9149407675ad03f92e5d16_1csum2(PyObject *__pyx_self, 
#if CYTHON_METH_FASTCALL
PyObject *const *__pyx_args, Py_ssize_t __pyx_nargs, PyObject *__pyx_kwds
#else
PyObject *__pyx_args, PyObject *__pyx_kwds
#endif
); /*proto*/
static PyMethodDef __pyx_mdef_54_cython_magic_3b29ad3d8007e919ac9149407675ad03f92e5d16_1csum2 = {"csum2", (PyCFunction)(void*)(__Pyx_PyCFunction_FastCallWithKeywords)__pyx_pw_54_cython_magic_3b29ad3d8007e919ac9149407675ad03f92e5d16_1csum2, __Pyx_METH_FASTCALL|METH_KEYWORDS, 0};
static PyObject *__pyx_pw_54_cython_magic_3b29ad3d8007e919ac9149407675ad03f92e5d16_1csum2(PyObject *__pyx_self, 
#if CYTHON_METH_FASTCALL
PyObject *const *__pyx_args, Py_ssize_t __pyx_nargs, PyObject *__pyx_kwds
#else
PyObject *__pyx_args, PyObject *__pyx_kwds
#endif
) {
  int __pyx_v_n;
  #if !CYTHON_METH_FASTCALL
  CYTHON_UNUSED Py_ssize_t __pyx_nargs;
  #endif
  CYTHON_UNUSED PyObject *const *__pyx_kwvalues;
  PyObject *__pyx_r = 0;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("csum2 (wrapper)", 0);
  #if !CYTHON_METH_FASTCALL
  #if CYTHON_ASSUME_SAFE_MACROS
  __pyx_nargs = PyTuple_GET_SIZE(__pyx_args);
  #else
  __pyx_nargs = PyTuple_Size(__pyx_args);
  if (unlikely((__pyx_nargs < 0))) __PYX_ERR(0, 4, __pyx_L3_error)
  #endif
  #endif
  __pyx_kwvalues = __Pyx_KwValues_FASTCALL(__pyx_args, __pyx_nargs);
  {
    PyObject **__pyx_pyargnames[] = {&__pyx_n_s_n,0};
  PyObject* values[1] = {0};
    if (__pyx_kwds) {
      Py_ssize_t kw_args;
      switch (__pyx_nargs) {
        case  1: values[0] = __Pyx_Arg_FASTCALL(__pyx_args, 0);
        CYTHON_FALLTHROUGH;
        case  0: break;
        default: goto __pyx_L5_argtuple_error;
      }
      kw_args = __Pyx_NumKwargs_FASTCALL(__pyx_kwds);
      switch (__pyx_nargs) {
        case  0:
        if (likely((values[0] = __Pyx_GetKwValue_FASTCALL(__pyx_kwds, __pyx_kwvalues, __pyx_n_s_n)) != 0)) {
          (void)__Pyx_Arg_NewRef_FASTCALL(values[0]);
          kw_args--;
        }
        else if (unlikely(PyErr_Occurred())) __PYX_ERR(0, 4, __pyx_L3_error)
        else goto __pyx_L5_argtuple_error;
      }
      if (unlikely(kw_args > 0)) {
        const Py_ssize_t kwd_pos_args = __pyx_nargs;
        if (unlikely(__Pyx_ParseOptionalKeywords(__pyx_kwds, __pyx_kwvalues, __pyx_pyargnames, 0, values + 0, kwd_pos_args, "csum2") < 0)) __PYX_ERR(0, 4, __pyx_L3_error)
      }
    } else if (unlikely(__pyx_nargs != 1)) {
      goto __pyx_L5_argtuple_error;
    } else {
      values[0] = __Pyx_Arg_FASTCALL(__pyx_args, 0);
    }
    __pyx_v_n = __Pyx_PyInt_As_int(values[0]); if (unlikely((__pyx_v_n == (int)-1) && PyErr_Occurred())) __PYX_ERR(0, 4, __pyx_L3_error)
  }
  goto __pyx_L4_argument_unpacking_done;
  __pyx_L5_argtuple_error:;
  __Pyx_RaiseArgtupleInvalid("csum2", 1, 1, 1, __pyx_nargs); __PYX_ERR(0, 4, __pyx_L3_error)
  goto __pyx_L3_error;
  __pyx_L3_error:;
  {
    Py_ssize_t __pyx_temp;
    for (__pyx_temp=0; __pyx_temp < (Py_ssize_t)(sizeof(values)/sizeof(values[0])); ++__pyx_temp) {
      __Pyx_Arg_XDECREF_FASTCALL(values[__pyx_temp]);
    }
  }
  __Pyx_AddTraceback("_cython_magic_3b29ad3d8007e919ac9149407675ad03f92e5d16.csum2", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __Pyx_RefNannyFinishContext();
  return NULL;
  __pyx_L4_argument_unpacking_done:;
  __pyx_r = __pyx_pf_54_cython_magic_3b29ad3d8007e919ac9149407675ad03f92e5d16_csum2(__pyx_self, __pyx_v_n);
  int __pyx_lineno = 0;
  const char *__pyx_filename = NULL;
  int __pyx_clineno = 0;

  /* function exit code */
  {
    Py_ssize_t __pyx_temp;
    for (__pyx_temp=0; __pyx_temp < (Py_ssize_t)(sizeof(values)/sizeof(values[0])); ++__pyx_temp) {
      __Pyx_Arg_XDECREF_FASTCALL(values[__pyx_temp]);
    }
  }
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}

static PyObject *__pyx_pf_54_cython_magic_3b29ad3d8007e919ac9149407675ad03f92e5d16_csum2(CYTHON_UNUSED PyObject *__pyx_self, int __pyx_v_n) {
  int __pyx_v_i;
  PyObject *__pyx_v_cs = NULL;
  PyObject *__pyx_r = NULL;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("csum2", 0);
/* … */
  /* function exit code */
  __pyx_L1_error:;
  __Pyx_XDECREF(__pyx_t_4);
  __Pyx_XDECREF(__pyx_t_5);
  __Pyx_AddTraceback("_cython_magic_3b29ad3d8007e919ac9149407675ad03f92e5d16.csum2", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __pyx_r = NULL;
  __pyx_L0:;
  __Pyx_XDECREF(__pyx_v_cs);
  __Pyx_XGIVEREF(__pyx_r);
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}
/* … */
  __pyx_tuple_ = PyTuple_Pack(3, __pyx_n_s_n, __pyx_n_s_i, __pyx_n_s_cs); if (unlikely(!__pyx_tuple_)) __PYX_ERR(0, 4, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_tuple_);
  __Pyx_GIVEREF(__pyx_tuple_);
/* … */
  __pyx_t_2 = __Pyx_PyDict_NewPresized(1); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 4, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  if (PyDict_SetItem(__pyx_t_2, __pyx_n_s_n, __pyx_n_s_int) < 0) __PYX_ERR(0, 4, __pyx_L1_error)
  __pyx_t_3 = __Pyx_CyFunction_New(&__pyx_mdef_54_cython_magic_3b29ad3d8007e919ac9149407675ad03f92e5d16_1csum2, 0, __pyx_n_s_csum2, NULL, __pyx_n_s_cython_magic_3b29ad3d8007e919ac, __pyx_d, ((PyObject *)__pyx_codeobj__2)); if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 4, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __Pyx_CyFunction_SetAnnotationsDict(__pyx_t_3, __pyx_t_2);
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_csum2, __pyx_t_3) < 0) __PYX_ERR(0, 4, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
 5:     i: int
+6:     cs = 0
  __Pyx_INCREF(__pyx_int_0);
  __pyx_v_cs = __pyx_int_0;
+7:     for i in range(n):
  __pyx_t_1 = __pyx_v_n;
  __pyx_t_2 = __pyx_t_1;
  for (__pyx_t_3 = 0; __pyx_t_3 < __pyx_t_2; __pyx_t_3+=1) {
    __pyx_v_i = __pyx_t_3;
+8:         cs += i
    __pyx_t_4 = __Pyx_PyInt_From_int(__pyx_v_i); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 8, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_4);
    __pyx_t_5 = PyNumber_InPlaceAdd(__pyx_v_cs, __pyx_t_4); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 8, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_5);
    __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
    __Pyx_DECREF_SET(__pyx_v_cs, __pyx_t_5);
    __pyx_t_5 = 0;
  }
+9:     return cs
  __Pyx_XDECREF(__pyx_r);
  __Pyx_INCREF(__pyx_v_cs);
  __pyx_r = __pyx_v_cs;
  goto __pyx_L0;

Almost there, but I still see yellow on the lines with cs:

%%cython --annotate
from cython import int

def csum3(n: int) -> int:
    i: int
    cs: int = 0
    for i in range(n):
        cs += i
    return cs

Generated by Cython 3.0.2

Yellow lines hint at Python interaction.
Click on a line that starts with a "+" to see the C code that Cython generated for it.

+1: from cython import int
  __pyx_t_3 = __Pyx_PyDict_NewPresized(0); if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 1, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_test, __pyx_t_3) < 0) __PYX_ERR(0, 1, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
 2: 
+3: def csum3(n: int) -> int:
/* Python wrapper */
static PyObject *__pyx_pw_54_cython_magic_05d58375b0f56a22bc6d33d2931bd7a4ec562543_1csum3(PyObject *__pyx_self, 
#if CYTHON_METH_FASTCALL
PyObject *const *__pyx_args, Py_ssize_t __pyx_nargs, PyObject *__pyx_kwds
#else
PyObject *__pyx_args, PyObject *__pyx_kwds
#endif
); /*proto*/
static PyMethodDef __pyx_mdef_54_cython_magic_05d58375b0f56a22bc6d33d2931bd7a4ec562543_1csum3 = {"csum3", (PyCFunction)(void*)(__Pyx_PyCFunction_FastCallWithKeywords)__pyx_pw_54_cython_magic_05d58375b0f56a22bc6d33d2931bd7a4ec562543_1csum3, __Pyx_METH_FASTCALL|METH_KEYWORDS, 0};
static PyObject *__pyx_pw_54_cython_magic_05d58375b0f56a22bc6d33d2931bd7a4ec562543_1csum3(PyObject *__pyx_self, 
#if CYTHON_METH_FASTCALL
PyObject *const *__pyx_args, Py_ssize_t __pyx_nargs, PyObject *__pyx_kwds
#else
PyObject *__pyx_args, PyObject *__pyx_kwds
#endif
) {
  int __pyx_v_n;
  #if !CYTHON_METH_FASTCALL
  CYTHON_UNUSED Py_ssize_t __pyx_nargs;
  #endif
  CYTHON_UNUSED PyObject *const *__pyx_kwvalues;
  PyObject *__pyx_r = 0;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("csum3 (wrapper)", 0);
  #if !CYTHON_METH_FASTCALL
  #if CYTHON_ASSUME_SAFE_MACROS
  __pyx_nargs = PyTuple_GET_SIZE(__pyx_args);
  #else
  __pyx_nargs = PyTuple_Size(__pyx_args);
  if (unlikely((__pyx_nargs < 0))) __PYX_ERR(0, 3, __pyx_L3_error)
  #endif
  #endif
  __pyx_kwvalues = __Pyx_KwValues_FASTCALL(__pyx_args, __pyx_nargs);
  {
    PyObject **__pyx_pyargnames[] = {&__pyx_n_s_n,0};
  PyObject* values[1] = {0};
    if (__pyx_kwds) {
      Py_ssize_t kw_args;
      switch (__pyx_nargs) {
        case  1: values[0] = __Pyx_Arg_FASTCALL(__pyx_args, 0);
        CYTHON_FALLTHROUGH;
        case  0: break;
        default: goto __pyx_L5_argtuple_error;
      }
      kw_args = __Pyx_NumKwargs_FASTCALL(__pyx_kwds);
      switch (__pyx_nargs) {
        case  0:
        if (likely((values[0] = __Pyx_GetKwValue_FASTCALL(__pyx_kwds, __pyx_kwvalues, __pyx_n_s_n)) != 0)) {
          (void)__Pyx_Arg_NewRef_FASTCALL(values[0]);
          kw_args--;
        }
        else if (unlikely(PyErr_Occurred())) __PYX_ERR(0, 3, __pyx_L3_error)
        else goto __pyx_L5_argtuple_error;
      }
      if (unlikely(kw_args > 0)) {
        const Py_ssize_t kwd_pos_args = __pyx_nargs;
        if (unlikely(__Pyx_ParseOptionalKeywords(__pyx_kwds, __pyx_kwvalues, __pyx_pyargnames, 0, values + 0, kwd_pos_args, "csum3") < 0)) __PYX_ERR(0, 3, __pyx_L3_error)
      }
    } else if (unlikely(__pyx_nargs != 1)) {
      goto __pyx_L5_argtuple_error;
    } else {
      values[0] = __Pyx_Arg_FASTCALL(__pyx_args, 0);
    }
    __pyx_v_n = __Pyx_PyInt_As_int(values[0]); if (unlikely((__pyx_v_n == (int)-1) && PyErr_Occurred())) __PYX_ERR(0, 3, __pyx_L3_error)
  }
  goto __pyx_L4_argument_unpacking_done;
  __pyx_L5_argtuple_error:;
  __Pyx_RaiseArgtupleInvalid("csum3", 1, 1, 1, __pyx_nargs); __PYX_ERR(0, 3, __pyx_L3_error)
  goto __pyx_L3_error;
  __pyx_L3_error:;
  {
    Py_ssize_t __pyx_temp;
    for (__pyx_temp=0; __pyx_temp < (Py_ssize_t)(sizeof(values)/sizeof(values[0])); ++__pyx_temp) {
      __Pyx_Arg_XDECREF_FASTCALL(values[__pyx_temp]);
    }
  }
  __Pyx_AddTraceback("_cython_magic_05d58375b0f56a22bc6d33d2931bd7a4ec562543.csum3", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __Pyx_RefNannyFinishContext();
  return NULL;
  __pyx_L4_argument_unpacking_done:;
  __pyx_r = __pyx_pf_54_cython_magic_05d58375b0f56a22bc6d33d2931bd7a4ec562543_csum3(__pyx_self, __pyx_v_n);
  int __pyx_lineno = 0;
  const char *__pyx_filename = NULL;
  int __pyx_clineno = 0;

  /* function exit code */
  {
    Py_ssize_t __pyx_temp;
    for (__pyx_temp=0; __pyx_temp < (Py_ssize_t)(sizeof(values)/sizeof(values[0])); ++__pyx_temp) {
      __Pyx_Arg_XDECREF_FASTCALL(values[__pyx_temp]);
    }
  }
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}

static PyObject *__pyx_pf_54_cython_magic_05d58375b0f56a22bc6d33d2931bd7a4ec562543_csum3(CYTHON_UNUSED PyObject *__pyx_self, int __pyx_v_n) {
  int __pyx_v_i;
  int __pyx_v_cs;
  PyObject *__pyx_r = NULL;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("csum3", 0);
/* … */
  /* function exit code */
  __pyx_L1_error:;
  __Pyx_XDECREF(__pyx_t_4);
  __Pyx_AddTraceback("_cython_magic_05d58375b0f56a22bc6d33d2931bd7a4ec562543.csum3", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __pyx_r = NULL;
  __pyx_L0:;
  __Pyx_XGIVEREF(__pyx_r);
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}
/* … */
  __pyx_tuple_ = PyTuple_Pack(3, __pyx_n_s_n, __pyx_n_s_i, __pyx_n_s_cs); if (unlikely(!__pyx_tuple_)) __PYX_ERR(0, 3, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_tuple_);
  __Pyx_GIVEREF(__pyx_tuple_);
/* … */
  __pyx_t_2 = __Pyx_PyDict_NewPresized(2); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 3, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  if (PyDict_SetItem(__pyx_t_2, __pyx_n_s_n, __pyx_n_s_int) < 0) __PYX_ERR(0, 3, __pyx_L1_error)
  if (PyDict_SetItem(__pyx_t_2, __pyx_n_s_return, __pyx_n_s_int) < 0) __PYX_ERR(0, 3, __pyx_L1_error)
  __pyx_t_3 = __Pyx_CyFunction_New(&__pyx_mdef_54_cython_magic_05d58375b0f56a22bc6d33d2931bd7a4ec562543_1csum3, 0, __pyx_n_s_csum3, NULL, __pyx_n_s_cython_magic_05d58375b0f56a22bc, __pyx_d, ((PyObject *)__pyx_codeobj__2)); if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 3, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __Pyx_CyFunction_SetAnnotationsDict(__pyx_t_3, __pyx_t_2);
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_csum3, __pyx_t_3) < 0) __PYX_ERR(0, 3, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
 4:     i: int
+5:     cs: int = 0
  __pyx_v_cs = 0;
+6:     for i in range(n):
  __pyx_t_1 = __pyx_v_n;
  __pyx_t_2 = __pyx_t_1;
  for (__pyx_t_3 = 0; __pyx_t_3 < __pyx_t_2; __pyx_t_3+=1) {
    __pyx_v_i = __pyx_t_3;
+7:         cs += i
    __pyx_v_cs = (__pyx_v_cs + __pyx_v_i);
  }
+8:     return cs
  __Pyx_XDECREF(__pyx_r);
  __pyx_t_4 = __Pyx_PyInt_From_int(__pyx_v_cs); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 8, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_4);
  __pyx_r = __pyx_t_4;
  __pyx_t_4 = 0;
  goto __pyx_L0;

Much better! Now there’s only Python when entering the function, which is about as good as we can do.

N = 1_000_000
print("psum ", end=" ")
tr_psum = %timeit -o psum(N)
print("csum ", end=" ")
tr_csum = %timeit -o csum(N)
print("csum2", end=" ")
tr_csum2 = %timeit -o csum2(N)
print("csum3", end=" ")
tr_csum3 = %timeit -o csum3(N)
psum  23.5 ms ± 245 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
csum  20.8 ms ± 681 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
csum2 20 ms ± 901 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
csum3 24.8 ns ± 0.195 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
from timing_utils import timing_table

timing_table(
    {
        "psum": tr_psum,
        "csum": tr_csum,
        "csum2": tr_csum2,
        "csum3": tr_csum3,
    }
)
implementationspeed
psum1.0 (normalized)
csum1.13x
csum21.18x
csum3946,558x

Blurring with Cython#

Now we can apply the same principles to writing a blur in Cython.

%%cython --annotate

import numpy as np


def blur_cython(x, steps=1024):
    x = 1 * x  # copy
    x_k = np.empty_like(x)
    x_k[0] = x[0]
    x_k[-1] = x[-1]
    for _ in range(steps):
        for i in range(1, len(x) - 1):
            x_k[i] = 0.25 * (x[i - 1] + 2 * x[i] + x[i + 1])
        x, x_k = x_k, x  # swap for next step
    return x

Generated by Cython 3.0.2

Yellow lines hint at Python interaction.
Click on a line that starts with a "+" to see the C code that Cython generated for it.

 01: 
+02: import numpy as np
  __pyx_t_2 = __Pyx_ImportDottedModule(__pyx_n_s_numpy, NULL); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 2, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_np, __pyx_t_2) < 0) __PYX_ERR(0, 2, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
/* … */
  __pyx_t_2 = __Pyx_PyDict_NewPresized(0); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 2, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_test, __pyx_t_2) < 0) __PYX_ERR(0, 2, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
 03: 
 04: 
+05: def blur_cython(x, steps=1024):
/* Python wrapper */
static PyObject *__pyx_pw_54_cython_magic_5e3a7c0b9a77ebd65e99e24b10f8f056789e4a47_1blur_cython(PyObject *__pyx_self, 
#if CYTHON_METH_FASTCALL
PyObject *const *__pyx_args, Py_ssize_t __pyx_nargs, PyObject *__pyx_kwds
#else
PyObject *__pyx_args, PyObject *__pyx_kwds
#endif
); /*proto*/
static PyMethodDef __pyx_mdef_54_cython_magic_5e3a7c0b9a77ebd65e99e24b10f8f056789e4a47_1blur_cython = {"blur_cython", (PyCFunction)(void*)(__Pyx_PyCFunction_FastCallWithKeywords)__pyx_pw_54_cython_magic_5e3a7c0b9a77ebd65e99e24b10f8f056789e4a47_1blur_cython, __Pyx_METH_FASTCALL|METH_KEYWORDS, 0};
static PyObject *__pyx_pw_54_cython_magic_5e3a7c0b9a77ebd65e99e24b10f8f056789e4a47_1blur_cython(PyObject *__pyx_self, 
#if CYTHON_METH_FASTCALL
PyObject *const *__pyx_args, Py_ssize_t __pyx_nargs, PyObject *__pyx_kwds
#else
PyObject *__pyx_args, PyObject *__pyx_kwds
#endif
) {
  PyObject *__pyx_v_x = 0;
  PyObject *__pyx_v_steps = 0;
  #if !CYTHON_METH_FASTCALL
  CYTHON_UNUSED Py_ssize_t __pyx_nargs;
  #endif
  CYTHON_UNUSED PyObject *const *__pyx_kwvalues;
  PyObject *__pyx_r = 0;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("blur_cython (wrapper)", 0);
  #if !CYTHON_METH_FASTCALL
  #if CYTHON_ASSUME_SAFE_MACROS
  __pyx_nargs = PyTuple_GET_SIZE(__pyx_args);
  #else
  __pyx_nargs = PyTuple_Size(__pyx_args);
  if (unlikely((__pyx_nargs < 0))) __PYX_ERR(0, 5, __pyx_L3_error)
  #endif
  #endif
  __pyx_kwvalues = __Pyx_KwValues_FASTCALL(__pyx_args, __pyx_nargs);
  {
    PyObject **__pyx_pyargnames[] = {&__pyx_n_s_x,&__pyx_n_s_steps,0};
  PyObject* values[2] = {0,0};
    values[1] = __Pyx_Arg_NewRef_FASTCALL(((PyObject *)((PyObject *)__pyx_int_1024)));
    if (__pyx_kwds) {
      Py_ssize_t kw_args;
      switch (__pyx_nargs) {
        case  2: values[1] = __Pyx_Arg_FASTCALL(__pyx_args, 1);
        CYTHON_FALLTHROUGH;
        case  1: values[0] = __Pyx_Arg_FASTCALL(__pyx_args, 0);
        CYTHON_FALLTHROUGH;
        case  0: break;
        default: goto __pyx_L5_argtuple_error;
      }
      kw_args = __Pyx_NumKwargs_FASTCALL(__pyx_kwds);
      switch (__pyx_nargs) {
        case  0:
        if (likely((values[0] = __Pyx_GetKwValue_FASTCALL(__pyx_kwds, __pyx_kwvalues, __pyx_n_s_x)) != 0)) {
          (void)__Pyx_Arg_NewRef_FASTCALL(values[0]);
          kw_args--;
        }
        else if (unlikely(PyErr_Occurred())) __PYX_ERR(0, 5, __pyx_L3_error)
        else goto __pyx_L5_argtuple_error;
        CYTHON_FALLTHROUGH;
        case  1:
        if (kw_args > 0) {
          PyObject* value = __Pyx_GetKwValue_FASTCALL(__pyx_kwds, __pyx_kwvalues, __pyx_n_s_steps);
          if (value) { values[1] = __Pyx_Arg_NewRef_FASTCALL(value); kw_args--; }
          else if (unlikely(PyErr_Occurred())) __PYX_ERR(0, 5, __pyx_L3_error)
        }
      }
      if (unlikely(kw_args > 0)) {
        const Py_ssize_t kwd_pos_args = __pyx_nargs;
        if (unlikely(__Pyx_ParseOptionalKeywords(__pyx_kwds, __pyx_kwvalues, __pyx_pyargnames, 0, values + 0, kwd_pos_args, "blur_cython") < 0)) __PYX_ERR(0, 5, __pyx_L3_error)
      }
    } else {
      switch (__pyx_nargs) {
        case  2: values[1] = __Pyx_Arg_FASTCALL(__pyx_args, 1);
        CYTHON_FALLTHROUGH;
        case  1: values[0] = __Pyx_Arg_FASTCALL(__pyx_args, 0);
        break;
        default: goto __pyx_L5_argtuple_error;
      }
    }
    __pyx_v_x = values[0];
    __pyx_v_steps = values[1];
  }
  goto __pyx_L4_argument_unpacking_done;
  __pyx_L5_argtuple_error:;
  __Pyx_RaiseArgtupleInvalid("blur_cython", 0, 1, 2, __pyx_nargs); __PYX_ERR(0, 5, __pyx_L3_error)
  goto __pyx_L3_error;
  __pyx_L3_error:;
  {
    Py_ssize_t __pyx_temp;
    for (__pyx_temp=0; __pyx_temp < (Py_ssize_t)(sizeof(values)/sizeof(values[0])); ++__pyx_temp) {
      __Pyx_Arg_XDECREF_FASTCALL(values[__pyx_temp]);
    }
  }
  __Pyx_AddTraceback("_cython_magic_5e3a7c0b9a77ebd65e99e24b10f8f056789e4a47.blur_cython", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __Pyx_RefNannyFinishContext();
  return NULL;
  __pyx_L4_argument_unpacking_done:;
  __pyx_r = __pyx_pf_54_cython_magic_5e3a7c0b9a77ebd65e99e24b10f8f056789e4a47_blur_cython(__pyx_self, __pyx_v_x, __pyx_v_steps);
  int __pyx_lineno = 0;
  const char *__pyx_filename = NULL;
  int __pyx_clineno = 0;

  /* function exit code */
  {
    Py_ssize_t __pyx_temp;
    for (__pyx_temp=0; __pyx_temp < (Py_ssize_t)(sizeof(values)/sizeof(values[0])); ++__pyx_temp) {
      __Pyx_Arg_XDECREF_FASTCALL(values[__pyx_temp]);
    }
  }
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}

static PyObject *__pyx_pf_54_cython_magic_5e3a7c0b9a77ebd65e99e24b10f8f056789e4a47_blur_cython(CYTHON_UNUSED PyObject *__pyx_self, PyObject *__pyx_v_x, PyObject *__pyx_v_steps) {
  PyObject *__pyx_v_x_k = NULL;
  CYTHON_UNUSED PyObject *__pyx_v__ = NULL;
  PyObject *__pyx_v_i = NULL;
  PyObject *__pyx_r = NULL;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("blur_cython", 0);
  __Pyx_INCREF(__pyx_v_x);
/* … */
  /* function exit code */
  __pyx_L1_error:;
  __Pyx_XDECREF(__pyx_t_1);
  __Pyx_XDECREF(__pyx_t_2);
  __Pyx_XDECREF(__pyx_t_3);
  __Pyx_XDECREF(__pyx_t_9);
  __Pyx_XDECREF(__pyx_t_10);
  __Pyx_AddTraceback("_cython_magic_5e3a7c0b9a77ebd65e99e24b10f8f056789e4a47.blur_cython", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __pyx_r = NULL;
  __pyx_L0:;
  __Pyx_XDECREF(__pyx_v_x_k);
  __Pyx_XDECREF(__pyx_v__);
  __Pyx_XDECREF(__pyx_v_i);
  __Pyx_XDECREF(__pyx_v_x);
  __Pyx_XGIVEREF(__pyx_r);
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}
/* … */
  __pyx_tuple__3 = PyTuple_Pack(5, __pyx_n_s_x, __pyx_n_s_steps, __pyx_n_s_x_k, __pyx_n_s__2, __pyx_n_s_i); if (unlikely(!__pyx_tuple__3)) __PYX_ERR(0, 5, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_tuple__3);
  __Pyx_GIVEREF(__pyx_tuple__3);
  __pyx_codeobj__4 = (PyObject*)__Pyx_PyCode_New(2, 0, 0, 5, 0, CO_OPTIMIZED|CO_NEWLOCALS, __pyx_empty_bytes, __pyx_empty_tuple, __pyx_empty_tuple, __pyx_tuple__3, __pyx_empty_tuple, __pyx_empty_tuple, __pyx_kp_s_Users_minrk_Library_Caches_ipyt, __pyx_n_s_blur_cython, 5, __pyx_empty_bytes); if (unlikely(!__pyx_codeobj__4)) __PYX_ERR(0, 5, __pyx_L1_error)
/* … */
  __pyx_t_2 = __Pyx_CyFunction_New(&__pyx_mdef_54_cython_magic_5e3a7c0b9a77ebd65e99e24b10f8f056789e4a47_1blur_cython, 0, __pyx_n_s_blur_cython, NULL, __pyx_n_s_cython_magic_5e3a7c0b9a77ebd65e, __pyx_d, ((PyObject *)__pyx_codeobj__4)); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 5, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  __Pyx_CyFunction_SetDefaultsTuple(__pyx_t_2, __pyx_tuple__5);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_blur_cython, __pyx_t_2) < 0) __PYX_ERR(0, 5, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
+06:     x = 1 * x  # copy
  __pyx_t_1 = __Pyx_PyInt_MultiplyCObj(__pyx_int_1, __pyx_v_x, 1, 0, 0); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 6, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  __Pyx_DECREF_SET(__pyx_v_x, __pyx_t_1);
  __pyx_t_1 = 0;
+07:     x_k = np.empty_like(x)
  __Pyx_GetModuleGlobalName(__pyx_t_2, __pyx_n_s_np); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 7, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  __pyx_t_3 = __Pyx_PyObject_GetAttrStr(__pyx_t_2, __pyx_n_s_empty_like); if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 7, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
  __pyx_t_2 = NULL;
  __pyx_t_4 = 0;
  #if CYTHON_UNPACK_METHODS
  if (unlikely(PyMethod_Check(__pyx_t_3))) {
    __pyx_t_2 = PyMethod_GET_SELF(__pyx_t_3);
    if (likely(__pyx_t_2)) {
      PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_3);
      __Pyx_INCREF(__pyx_t_2);
      __Pyx_INCREF(function);
      __Pyx_DECREF_SET(__pyx_t_3, function);
      __pyx_t_4 = 1;
    }
  }
  #endif
  {
    PyObject *__pyx_callargs[2] = {__pyx_t_2, __pyx_v_x};
    __pyx_t_1 = __Pyx_PyObject_FastCall(__pyx_t_3, __pyx_callargs+1-__pyx_t_4, 1+__pyx_t_4);
    __Pyx_XDECREF(__pyx_t_2); __pyx_t_2 = 0;
    if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 7, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_1);
    __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
  }
  __pyx_v_x_k = __pyx_t_1;
  __pyx_t_1 = 0;
+08:     x_k[0] = x[0]
  __pyx_t_1 = __Pyx_GetItemInt(__pyx_v_x, 0, long, 1, __Pyx_PyInt_From_long, 0, 0, 1); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 8, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  if (unlikely((__Pyx_SetItemInt(__pyx_v_x_k, 0, __pyx_t_1, long, 1, __Pyx_PyInt_From_long, 0, 0, 1) < 0))) __PYX_ERR(0, 8, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
+09:     x_k[-1] = x[-1]
  __pyx_t_1 = __Pyx_GetItemInt(__pyx_v_x, -1L, long, 1, __Pyx_PyInt_From_long, 0, 1, 1); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 9, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  if (unlikely((__Pyx_SetItemInt(__pyx_v_x_k, -1L, __pyx_t_1, long, 1, __Pyx_PyInt_From_long, 0, 1, 1) < 0))) __PYX_ERR(0, 9, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
+10:     for _ in range(steps):
  __pyx_t_1 = __Pyx_PyObject_CallOneArg(__pyx_builtin_range, __pyx_v_steps); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 10, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  if (likely(PyList_CheckExact(__pyx_t_1)) || PyTuple_CheckExact(__pyx_t_1)) {
    __pyx_t_3 = __pyx_t_1; __Pyx_INCREF(__pyx_t_3); __pyx_t_5 = 0;
    __pyx_t_6 = NULL;
  } else {
    __pyx_t_5 = -1; __pyx_t_3 = PyObject_GetIter(__pyx_t_1); if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 10, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_3);
    __pyx_t_6 = __Pyx_PyObject_GetIterNextFunc(__pyx_t_3); if (unlikely(!__pyx_t_6)) __PYX_ERR(0, 10, __pyx_L1_error)
  }
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
  for (;;) {
    if (likely(!__pyx_t_6)) {
      if (likely(PyList_CheckExact(__pyx_t_3))) {
        if (__pyx_t_5 >= PyList_GET_SIZE(__pyx_t_3)) break;
        #if CYTHON_ASSUME_SAFE_MACROS && !CYTHON_AVOID_BORROWED_REFS
        __pyx_t_1 = PyList_GET_ITEM(__pyx_t_3, __pyx_t_5); __Pyx_INCREF(__pyx_t_1); __pyx_t_5++; if (unlikely((0 < 0))) __PYX_ERR(0, 10, __pyx_L1_error)
        #else
        __pyx_t_1 = PySequence_ITEM(__pyx_t_3, __pyx_t_5); __pyx_t_5++; if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 10, __pyx_L1_error)
        __Pyx_GOTREF(__pyx_t_1);
        #endif
      } else {
        if (__pyx_t_5 >= PyTuple_GET_SIZE(__pyx_t_3)) break;
        #if CYTHON_ASSUME_SAFE_MACROS && !CYTHON_AVOID_BORROWED_REFS
        __pyx_t_1 = PyTuple_GET_ITEM(__pyx_t_3, __pyx_t_5); __Pyx_INCREF(__pyx_t_1); __pyx_t_5++; if (unlikely((0 < 0))) __PYX_ERR(0, 10, __pyx_L1_error)
        #else
        __pyx_t_1 = PySequence_ITEM(__pyx_t_3, __pyx_t_5); __pyx_t_5++; if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 10, __pyx_L1_error)
        __Pyx_GOTREF(__pyx_t_1);
        #endif
      }
    } else {
      __pyx_t_1 = __pyx_t_6(__pyx_t_3);
      if (unlikely(!__pyx_t_1)) {
        PyObject* exc_type = PyErr_Occurred();
        if (exc_type) {
          if (likely(__Pyx_PyErr_GivenExceptionMatches(exc_type, PyExc_StopIteration))) PyErr_Clear();
          else __PYX_ERR(0, 10, __pyx_L1_error)
        }
        break;
      }
      __Pyx_GOTREF(__pyx_t_1);
    }
    __Pyx_XDECREF_SET(__pyx_v__, __pyx_t_1);
    __pyx_t_1 = 0;
/* … */
  }
  __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
+11:         for i in range(1, len(x) - 1):
    __pyx_t_7 = PyObject_Length(__pyx_v_x); if (unlikely(__pyx_t_7 == ((Py_ssize_t)-1))) __PYX_ERR(0, 11, __pyx_L1_error)
    __pyx_t_1 = PyInt_FromSsize_t((__pyx_t_7 - 1)); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 11, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_1);
    __pyx_t_2 = PyTuple_New(2); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 11, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_2);
    __Pyx_INCREF(__pyx_int_1);
    __Pyx_GIVEREF(__pyx_int_1);
    if (__Pyx_PyTuple_SET_ITEM(__pyx_t_2, 0, __pyx_int_1)) __PYX_ERR(0, 11, __pyx_L1_error);
    __Pyx_GIVEREF(__pyx_t_1);
    if (__Pyx_PyTuple_SET_ITEM(__pyx_t_2, 1, __pyx_t_1)) __PYX_ERR(0, 11, __pyx_L1_error);
    __pyx_t_1 = 0;
    __pyx_t_1 = __Pyx_PyObject_Call(__pyx_builtin_range, __pyx_t_2, NULL); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 11, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_1);
    __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
    if (likely(PyList_CheckExact(__pyx_t_1)) || PyTuple_CheckExact(__pyx_t_1)) {
      __pyx_t_2 = __pyx_t_1; __Pyx_INCREF(__pyx_t_2); __pyx_t_7 = 0;
      __pyx_t_8 = NULL;
    } else {
      __pyx_t_7 = -1; __pyx_t_2 = PyObject_GetIter(__pyx_t_1); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 11, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_2);
      __pyx_t_8 = __Pyx_PyObject_GetIterNextFunc(__pyx_t_2); if (unlikely(!__pyx_t_8)) __PYX_ERR(0, 11, __pyx_L1_error)
    }
    __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
    for (;;) {
      if (likely(!__pyx_t_8)) {
        if (likely(PyList_CheckExact(__pyx_t_2))) {
          if (__pyx_t_7 >= PyList_GET_SIZE(__pyx_t_2)) break;
          #if CYTHON_ASSUME_SAFE_MACROS && !CYTHON_AVOID_BORROWED_REFS
          __pyx_t_1 = PyList_GET_ITEM(__pyx_t_2, __pyx_t_7); __Pyx_INCREF(__pyx_t_1); __pyx_t_7++; if (unlikely((0 < 0))) __PYX_ERR(0, 11, __pyx_L1_error)
          #else
          __pyx_t_1 = PySequence_ITEM(__pyx_t_2, __pyx_t_7); __pyx_t_7++; if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 11, __pyx_L1_error)
          __Pyx_GOTREF(__pyx_t_1);
          #endif
        } else {
          if (__pyx_t_7 >= PyTuple_GET_SIZE(__pyx_t_2)) break;
          #if CYTHON_ASSUME_SAFE_MACROS && !CYTHON_AVOID_BORROWED_REFS
          __pyx_t_1 = PyTuple_GET_ITEM(__pyx_t_2, __pyx_t_7); __Pyx_INCREF(__pyx_t_1); __pyx_t_7++; if (unlikely((0 < 0))) __PYX_ERR(0, 11, __pyx_L1_error)
          #else
          __pyx_t_1 = PySequence_ITEM(__pyx_t_2, __pyx_t_7); __pyx_t_7++; if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 11, __pyx_L1_error)
          __Pyx_GOTREF(__pyx_t_1);
          #endif
        }
      } else {
        __pyx_t_1 = __pyx_t_8(__pyx_t_2);
        if (unlikely(!__pyx_t_1)) {
          PyObject* exc_type = PyErr_Occurred();
          if (exc_type) {
            if (likely(__Pyx_PyErr_GivenExceptionMatches(exc_type, PyExc_StopIteration))) PyErr_Clear();
            else __PYX_ERR(0, 11, __pyx_L1_error)
          }
          break;
        }
        __Pyx_GOTREF(__pyx_t_1);
      }
      __Pyx_XDECREF_SET(__pyx_v_i, __pyx_t_1);
      __pyx_t_1 = 0;
/* … */
    }
    __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
+12:             x_k[i] = 0.25 * (x[i - 1] + 2 * x[i] + x[i + 1])
      __pyx_t_1 = __Pyx_PyInt_SubtractObjC(__pyx_v_i, __pyx_int_1, 1, 0, 0); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 12, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_1);
      __pyx_t_9 = __Pyx_PyObject_GetItem(__pyx_v_x, __pyx_t_1); if (unlikely(!__pyx_t_9)) __PYX_ERR(0, 12, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_9);
      __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
      __pyx_t_1 = __Pyx_PyObject_GetItem(__pyx_v_x, __pyx_v_i); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 12, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_1);
      __pyx_t_10 = __Pyx_PyInt_MultiplyCObj(__pyx_int_2, __pyx_t_1, 2, 0, 0); if (unlikely(!__pyx_t_10)) __PYX_ERR(0, 12, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_10);
      __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
      __pyx_t_1 = PyNumber_Add(__pyx_t_9, __pyx_t_10); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 12, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_1);
      __Pyx_DECREF(__pyx_t_9); __pyx_t_9 = 0;
      __Pyx_DECREF(__pyx_t_10); __pyx_t_10 = 0;
      __pyx_t_10 = __Pyx_PyInt_AddObjC(__pyx_v_i, __pyx_int_1, 1, 0, 0); if (unlikely(!__pyx_t_10)) __PYX_ERR(0, 12, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_10);
      __pyx_t_9 = __Pyx_PyObject_GetItem(__pyx_v_x, __pyx_t_10); if (unlikely(!__pyx_t_9)) __PYX_ERR(0, 12, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_9);
      __Pyx_DECREF(__pyx_t_10); __pyx_t_10 = 0;
      __pyx_t_10 = PyNumber_Add(__pyx_t_1, __pyx_t_9); if (unlikely(!__pyx_t_10)) __PYX_ERR(0, 12, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_10);
      __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
      __Pyx_DECREF(__pyx_t_9); __pyx_t_9 = 0;
      __pyx_t_9 = PyNumber_Multiply(__pyx_float_0_25, __pyx_t_10); if (unlikely(!__pyx_t_9)) __PYX_ERR(0, 12, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_9);
      __Pyx_DECREF(__pyx_t_10); __pyx_t_10 = 0;
      if (unlikely((PyObject_SetItem(__pyx_v_x_k, __pyx_v_i, __pyx_t_9) < 0))) __PYX_ERR(0, 12, __pyx_L1_error)
      __Pyx_DECREF(__pyx_t_9); __pyx_t_9 = 0;
+13:         x, x_k = x_k, x  # swap for next step
    __pyx_t_11 = __pyx_v_x_k;
    __pyx_t_12 = __pyx_v_x;
    __pyx_v_x = __pyx_t_11;
    __pyx_t_11 = 0;
    __pyx_v_x_k = __pyx_t_12;
    __pyx_t_12 = 0;
+14:     return x
  __Pyx_XDECREF(__pyx_r);
  __Pyx_INCREF(__pyx_v_x);
  __pyx_r = __pyx_v_x;
  goto __pyx_L0;
c1 = %timeit -o y = blur_cython(x, steps)
t_c1 = c1.best
times["cython (no hints)"] = t_c1
2.85 s ± 29.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
plot_times()
../../_images/53fc34d991b7d4752cc76fda62cf9c8fafee17161d7756631f116aba568e1411.png

Without annotations, we don’t get any real improvement over the pure Python version. We can note the types of the input arguments, to get some improvements:

%%cython --annotate

import cython as C
import numpy as np

def blur_cython2(x: C.double[:], steps: C.int=1024):
    x = x.copy()
    x_k = np.empty_like(x)
    x_k[0] = x[0]
    x_k[-1] = x[-1]
    i: C.size_t
    N: C.size_t = len(x)
    for _ in range(steps):
        for i in range(1, N-1):
            x_k[i] = .25 * ( x[i-1] + 2 * x[i] + x[i+1] )
        x, x_k = x_k, x # swap for next step
    return np.asarray(x)

Generated by Cython 3.0.2

Yellow lines hint at Python interaction.
Click on a line that starts with a "+" to see the C code that Cython generated for it.

 01: 
+02: import cython as C
  __pyx_t_5 = __Pyx_PyDict_NewPresized(0); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 2, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_5);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_test, __pyx_t_5) < 0) __PYX_ERR(0, 2, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
+03: import numpy as np
  __pyx_t_7 = __Pyx_ImportDottedModule(__pyx_n_s_numpy, NULL); if (unlikely(!__pyx_t_7)) __PYX_ERR(0, 3, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_7);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_np, __pyx_t_7) < 0) __PYX_ERR(0, 3, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
 04: 
+05: def blur_cython2(x: C.double[:], steps: C.int=1024):
/* Python wrapper */
static PyObject *__pyx_pw_54_cython_magic_77a0cbe9819d4eb3494c654665b559b8f2f401db_1blur_cython2(PyObject *__pyx_self, 
#if CYTHON_METH_FASTCALL
PyObject *const *__pyx_args, Py_ssize_t __pyx_nargs, PyObject *__pyx_kwds
#else
PyObject *__pyx_args, PyObject *__pyx_kwds
#endif
); /*proto*/
static PyMethodDef __pyx_mdef_54_cython_magic_77a0cbe9819d4eb3494c654665b559b8f2f401db_1blur_cython2 = {"blur_cython2", (PyCFunction)(void*)(__Pyx_PyCFunction_FastCallWithKeywords)__pyx_pw_54_cython_magic_77a0cbe9819d4eb3494c654665b559b8f2f401db_1blur_cython2, __Pyx_METH_FASTCALL|METH_KEYWORDS, 0};
static PyObject *__pyx_pw_54_cython_magic_77a0cbe9819d4eb3494c654665b559b8f2f401db_1blur_cython2(PyObject *__pyx_self, 
#if CYTHON_METH_FASTCALL
PyObject *const *__pyx_args, Py_ssize_t __pyx_nargs, PyObject *__pyx_kwds
#else
PyObject *__pyx_args, PyObject *__pyx_kwds
#endif
) {
  __Pyx_memviewslice __pyx_v_x = { 0, 0, { 0 }, { 0 }, { 0 } };
  int __pyx_v_steps;
  #if !CYTHON_METH_FASTCALL
  CYTHON_UNUSED Py_ssize_t __pyx_nargs;
  #endif
  CYTHON_UNUSED PyObject *const *__pyx_kwvalues;
  PyObject *__pyx_r = 0;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("blur_cython2 (wrapper)", 0);
  #if !CYTHON_METH_FASTCALL
  #if CYTHON_ASSUME_SAFE_MACROS
  __pyx_nargs = PyTuple_GET_SIZE(__pyx_args);
  #else
  __pyx_nargs = PyTuple_Size(__pyx_args);
  if (unlikely((__pyx_nargs < 0))) __PYX_ERR(0, 5, __pyx_L3_error)
  #endif
  #endif
  __pyx_kwvalues = __Pyx_KwValues_FASTCALL(__pyx_args, __pyx_nargs);
  {
    PyObject **__pyx_pyargnames[] = {&__pyx_n_s_x,&__pyx_n_s_steps,0};
  PyObject* values[2] = {0,0};
    if (__pyx_kwds) {
      Py_ssize_t kw_args;
      switch (__pyx_nargs) {
        case  2: values[1] = __Pyx_Arg_FASTCALL(__pyx_args, 1);
        CYTHON_FALLTHROUGH;
        case  1: values[0] = __Pyx_Arg_FASTCALL(__pyx_args, 0);
        CYTHON_FALLTHROUGH;
        case  0: break;
        default: goto __pyx_L5_argtuple_error;
      }
      kw_args = __Pyx_NumKwargs_FASTCALL(__pyx_kwds);
      switch (__pyx_nargs) {
        case  0:
        if (likely((values[0] = __Pyx_GetKwValue_FASTCALL(__pyx_kwds, __pyx_kwvalues, __pyx_n_s_x)) != 0)) {
          (void)__Pyx_Arg_NewRef_FASTCALL(values[0]);
          kw_args--;
        }
        else if (unlikely(PyErr_Occurred())) __PYX_ERR(0, 5, __pyx_L3_error)
        else goto __pyx_L5_argtuple_error;
        CYTHON_FALLTHROUGH;
        case  1:
        if (kw_args > 0) {
          PyObject* value = __Pyx_GetKwValue_FASTCALL(__pyx_kwds, __pyx_kwvalues, __pyx_n_s_steps);
          if (value) { values[1] = __Pyx_Arg_NewRef_FASTCALL(value); kw_args--; }
          else if (unlikely(PyErr_Occurred())) __PYX_ERR(0, 5, __pyx_L3_error)
        }
      }
      if (unlikely(kw_args > 0)) {
        const Py_ssize_t kwd_pos_args = __pyx_nargs;
        if (unlikely(__Pyx_ParseOptionalKeywords(__pyx_kwds, __pyx_kwvalues, __pyx_pyargnames, 0, values + 0, kwd_pos_args, "blur_cython2") < 0)) __PYX_ERR(0, 5, __pyx_L3_error)
      }
    } else {
      switch (__pyx_nargs) {
        case  2: values[1] = __Pyx_Arg_FASTCALL(__pyx_args, 1);
        CYTHON_FALLTHROUGH;
        case  1: values[0] = __Pyx_Arg_FASTCALL(__pyx_args, 0);
        break;
        default: goto __pyx_L5_argtuple_error;
      }
    }
    __pyx_v_x = __Pyx_PyObject_to_MemoryviewSlice_ds_double(values[0], PyBUF_WRITABLE); if (unlikely(!__pyx_v_x.memview)) __PYX_ERR(0, 5, __pyx_L3_error)
    if (values[1]) {
      __pyx_v_steps = __Pyx_PyInt_As_int(values[1]); if (unlikely((__pyx_v_steps == (int)-1) && PyErr_Occurred())) __PYX_ERR(0, 5, __pyx_L3_error)
    } else {
      __pyx_v_steps = ((int)((int)0x400));
    }
  }
  goto __pyx_L4_argument_unpacking_done;
  __pyx_L5_argtuple_error:;
  __Pyx_RaiseArgtupleInvalid("blur_cython2", 0, 1, 2, __pyx_nargs); __PYX_ERR(0, 5, __pyx_L3_error)
  goto __pyx_L3_error;
  __pyx_L3_error:;
  {
    Py_ssize_t __pyx_temp;
    for (__pyx_temp=0; __pyx_temp < (Py_ssize_t)(sizeof(values)/sizeof(values[0])); ++__pyx_temp) {
      __Pyx_Arg_XDECREF_FASTCALL(values[__pyx_temp]);
    }
  }
  __PYX_XCLEAR_MEMVIEW(&__pyx_v_x, 1);
  __Pyx_AddTraceback("_cython_magic_77a0cbe9819d4eb3494c654665b559b8f2f401db.blur_cython2", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __Pyx_RefNannyFinishContext();
  return NULL;
  __pyx_L4_argument_unpacking_done:;
  if (unlikely(((PyObject *)__pyx_v_x.memview) == Py_None)) {
    PyErr_Format(PyExc_TypeError, "Argument '%.200s' must not be None", "x"); __PYX_ERR(0, 5, __pyx_L1_error)
  }
  __pyx_r = __pyx_pf_54_cython_magic_77a0cbe9819d4eb3494c654665b559b8f2f401db_blur_cython2(__pyx_self, __pyx_v_x, __pyx_v_steps);
  int __pyx_lineno = 0;
  const char *__pyx_filename = NULL;
  int __pyx_clineno = 0;

  /* function exit code */
  goto __pyx_L0;
  __pyx_L1_error:;
  __pyx_r = NULL;
  __pyx_L0:;
  __PYX_XCLEAR_MEMVIEW(&__pyx_v_x, 1);
  {
    Py_ssize_t __pyx_temp;
    for (__pyx_temp=0; __pyx_temp < (Py_ssize_t)(sizeof(values)/sizeof(values[0])); ++__pyx_temp) {
      __Pyx_Arg_XDECREF_FASTCALL(values[__pyx_temp]);
    }
  }
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}

static PyObject *__pyx_pf_54_cython_magic_77a0cbe9819d4eb3494c654665b559b8f2f401db_blur_cython2(CYTHON_UNUSED PyObject *__pyx_self, __Pyx_memviewslice __pyx_v_x, int __pyx_v_steps) {
  PyObject *__pyx_v_x_k = NULL;
  size_t __pyx_v_i;
  size_t __pyx_v_N;
  CYTHON_UNUSED int __pyx_v__;
  PyObject *__pyx_r = NULL;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("blur_cython2", 0);
  __PYX_INC_MEMVIEW(&__pyx_v_x, 1);
/* … */
  /* function exit code */
  __pyx_L1_error:;
  __PYX_XCLEAR_MEMVIEW(&__pyx_t_1, 1);
  __Pyx_XDECREF(__pyx_t_2);
  __Pyx_XDECREF(__pyx_t_3);
  __Pyx_XDECREF(__pyx_t_4);
  __Pyx_XDECREF(__pyx_t_5);
  __PYX_XCLEAR_MEMVIEW(&__pyx_t_18, 1);
  __Pyx_AddTraceback("_cython_magic_77a0cbe9819d4eb3494c654665b559b8f2f401db.blur_cython2", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __pyx_r = NULL;
  __pyx_L0:;
  __Pyx_XDECREF(__pyx_v_x_k);
  __PYX_XCLEAR_MEMVIEW(&__pyx_v_x, 1);
  __Pyx_XGIVEREF(__pyx_r);
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}
/* … */
  __pyx_tuple__21 = PyTuple_Pack(6, __pyx_n_s_x, __pyx_n_s_steps, __pyx_n_s_x_k, __pyx_n_s_i, __pyx_n_s_N, __pyx_n_s__20); if (unlikely(!__pyx_tuple__21)) __PYX_ERR(0, 5, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_tuple__21);
  __Pyx_GIVEREF(__pyx_tuple__21);
/* … */
  __pyx_t_7 = __Pyx_PyInt_From_int(((int)0x400)); if (unlikely(!__pyx_t_7)) __PYX_ERR(0, 5, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_7);
  __pyx_t_4 = PyTuple_New(1); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 5, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_4);
  __Pyx_GIVEREF(__pyx_t_7);
  if (__Pyx_PyTuple_SET_ITEM(__pyx_t_4, 0, __pyx_t_7)) __PYX_ERR(0, 5, __pyx_L1_error);
  __pyx_t_7 = 0;
  __pyx_t_7 = __Pyx_PyDict_NewPresized(2); if (unlikely(!__pyx_t_7)) __PYX_ERR(0, 5, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_7);
  if (PyDict_SetItem(__pyx_t_7, __pyx_n_s_x, __pyx_kp_s_C_double) < 0) __PYX_ERR(0, 5, __pyx_L1_error)
  if (PyDict_SetItem(__pyx_t_7, __pyx_n_s_steps, __pyx_kp_s_C_int) < 0) __PYX_ERR(0, 5, __pyx_L1_error)
  __pyx_t_5 = __Pyx_CyFunction_New(&__pyx_mdef_54_cython_magic_77a0cbe9819d4eb3494c654665b559b8f2f401db_1blur_cython2, 0, __pyx_n_s_blur_cython2, NULL, __pyx_n_s_cython_magic_77a0cbe9819d4eb349, __pyx_d, ((PyObject *)__pyx_codeobj__22)); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 5, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_5);
  __Pyx_CyFunction_SetDefaultsTuple(__pyx_t_5, __pyx_t_4);
  __Pyx_CyFunction_SetAnnotationsDict(__pyx_t_5, __pyx_t_7);
  __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
  __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_blur_cython2, __pyx_t_5) < 0) __PYX_ERR(0, 5, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
+06:     x = x.copy()
  __pyx_t_1 = __pyx_memoryview_copy_slice_dc_double_c(__pyx_v_x); if (unlikely(!__pyx_t_1.memview)) __PYX_ERR(0, 6, __pyx_L1_error)
  __PYX_XCLEAR_MEMVIEW(&__pyx_v_x, 1);
  __pyx_v_x = __pyx_t_1;
  __pyx_t_1.memview = NULL;
  __pyx_t_1.data = NULL;
+07:     x_k = np.empty_like(x)
  __Pyx_GetModuleGlobalName(__pyx_t_3, __pyx_n_s_np); if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 7, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __pyx_t_4 = __Pyx_PyObject_GetAttrStr(__pyx_t_3, __pyx_n_s_empty_like); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 7, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_4);
  __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
  __pyx_t_3 = __pyx_memoryview_fromslice(__pyx_v_x, 1, (PyObject *(*)(char *)) __pyx_memview_get_double, (int (*)(char *, PyObject *)) __pyx_memview_set_double, 0);; if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 7, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __pyx_t_5 = NULL;
  __pyx_t_6 = 0;
  #if CYTHON_UNPACK_METHODS
  if (unlikely(PyMethod_Check(__pyx_t_4))) {
    __pyx_t_5 = PyMethod_GET_SELF(__pyx_t_4);
    if (likely(__pyx_t_5)) {
      PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_4);
      __Pyx_INCREF(__pyx_t_5);
      __Pyx_INCREF(function);
      __Pyx_DECREF_SET(__pyx_t_4, function);
      __pyx_t_6 = 1;
    }
  }
  #endif
  {
    PyObject *__pyx_callargs[2] = {__pyx_t_5, __pyx_t_3};
    __pyx_t_2 = __Pyx_PyObject_FastCall(__pyx_t_4, __pyx_callargs+1-__pyx_t_6, 1+__pyx_t_6);
    __Pyx_XDECREF(__pyx_t_5); __pyx_t_5 = 0;
    __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
    if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 7, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_2);
    __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
  }
  __pyx_v_x_k = __pyx_t_2;
  __pyx_t_2 = 0;
+08:     x_k[0] = x[0]
  __pyx_t_7 = 0;
  __pyx_t_6 = -1;
  if (__pyx_t_7 < 0) {
    __pyx_t_7 += __pyx_v_x.shape[0];
    if (unlikely(__pyx_t_7 < 0)) __pyx_t_6 = 0;
  } else if (unlikely(__pyx_t_7 >= __pyx_v_x.shape[0])) __pyx_t_6 = 0;
  if (unlikely(__pyx_t_6 != -1)) {
    __Pyx_RaiseBufferIndexError(__pyx_t_6);
    __PYX_ERR(0, 8, __pyx_L1_error)
  }
  __pyx_t_2 = PyFloat_FromDouble((*((double *) ( /* dim=0 */ (__pyx_v_x.data + __pyx_t_7 * __pyx_v_x.strides[0]) )))); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 8, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  if (unlikely((__Pyx_SetItemInt(__pyx_v_x_k, 0, __pyx_t_2, long, 1, __Pyx_PyInt_From_long, 0, 0, 1) < 0))) __PYX_ERR(0, 8, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
+09:     x_k[-1] = x[-1]
  __pyx_t_7 = -1L;
  __pyx_t_6 = -1;
  if (__pyx_t_7 < 0) {
    __pyx_t_7 += __pyx_v_x.shape[0];
    if (unlikely(__pyx_t_7 < 0)) __pyx_t_6 = 0;
  } else if (unlikely(__pyx_t_7 >= __pyx_v_x.shape[0])) __pyx_t_6 = 0;
  if (unlikely(__pyx_t_6 != -1)) {
    __Pyx_RaiseBufferIndexError(__pyx_t_6);
    __PYX_ERR(0, 9, __pyx_L1_error)
  }
  __pyx_t_2 = PyFloat_FromDouble((*((double *) ( /* dim=0 */ (__pyx_v_x.data + __pyx_t_7 * __pyx_v_x.strides[0]) )))); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 9, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  if (unlikely((__Pyx_SetItemInt(__pyx_v_x_k, -1L, __pyx_t_2, long, 1, __Pyx_PyInt_From_long, 0, 1, 1) < 0))) __PYX_ERR(0, 9, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
 10:     i: C.size_t
+11:     N: C.size_t = len(x)
  __pyx_t_8 = __Pyx_MemoryView_Len(__pyx_v_x); 
  __pyx_v_N = __pyx_t_8;
+12:     for _ in range(steps):
  __pyx_t_6 = __pyx_v_steps;
  __pyx_t_9 = __pyx_t_6;
  for (__pyx_t_10 = 0; __pyx_t_10 < __pyx_t_9; __pyx_t_10+=1) {
    __pyx_v__ = __pyx_t_10;
+13:         for i in range(1, N-1):
    __pyx_t_11 = (__pyx_v_N - 1);
    __pyx_t_12 = __pyx_t_11;
    for (__pyx_t_13 = 1; __pyx_t_13 < __pyx_t_12; __pyx_t_13+=1) {
      __pyx_v_i = __pyx_t_13;
+14:             x_k[i] = .25 * ( x[i-1] + 2 * x[i] + x[i+1] )
      __pyx_t_14 = (__pyx_v_i - 1);
      __pyx_t_15 = -1;
      if (unlikely(__pyx_t_14 >= (size_t)__pyx_v_x.shape[0])) __pyx_t_15 = 0;
      if (unlikely(__pyx_t_15 != -1)) {
        __Pyx_RaiseBufferIndexError(__pyx_t_15);
        __PYX_ERR(0, 14, __pyx_L1_error)
      }
      __pyx_t_16 = __pyx_v_i;
      __pyx_t_15 = -1;
      if (unlikely(__pyx_t_16 >= (size_t)__pyx_v_x.shape[0])) __pyx_t_15 = 0;
      if (unlikely(__pyx_t_15 != -1)) {
        __Pyx_RaiseBufferIndexError(__pyx_t_15);
        __PYX_ERR(0, 14, __pyx_L1_error)
      }
      __pyx_t_17 = (__pyx_v_i + 1);
      __pyx_t_15 = -1;
      if (unlikely(__pyx_t_17 >= (size_t)__pyx_v_x.shape[0])) __pyx_t_15 = 0;
      if (unlikely(__pyx_t_15 != -1)) {
        __Pyx_RaiseBufferIndexError(__pyx_t_15);
        __PYX_ERR(0, 14, __pyx_L1_error)
      }
      __pyx_t_2 = PyFloat_FromDouble((.25 * (((*((double *) ( /* dim=0 */ (__pyx_v_x.data + __pyx_t_14 * __pyx_v_x.strides[0]) ))) + (2.0 * (*((double *) ( /* dim=0 */ (__pyx_v_x.data + __pyx_t_16 * __pyx_v_x.strides[0]) ))))) + (*((double *) ( /* dim=0 */ (__pyx_v_x.data + __pyx_t_17 * __pyx_v_x.strides[0]) )))))); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 14, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_2);
      if (unlikely((__Pyx_SetItemInt(__pyx_v_x_k, __pyx_v_i, __pyx_t_2, size_t, 0, __Pyx_PyInt_FromSize_t, 0, 0, 1) < 0))) __PYX_ERR(0, 14, __pyx_L1_error)
      __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
    }
+15:         x, x_k = x_k, x # swap for next step
    __pyx_t_18 = __Pyx_PyObject_to_MemoryviewSlice_ds_double(__pyx_v_x_k, PyBUF_WRITABLE); if (unlikely(!__pyx_t_18.memview)) __PYX_ERR(0, 15, __pyx_L1_error)
    __pyx_t_2 = __pyx_memoryview_fromslice(__pyx_v_x, 1, (PyObject *(*)(char *)) __pyx_memview_get_double, (int (*)(char *, PyObject *)) __pyx_memview_set_double, 0);; if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 15, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_2);
    __PYX_XCLEAR_MEMVIEW(&__pyx_v_x, 1);
    __pyx_v_x = __pyx_t_18;
    __pyx_t_18.memview = NULL;
    __pyx_t_18.data = NULL;
    __Pyx_DECREF_SET(__pyx_v_x_k, __pyx_t_2);
    __pyx_t_2 = 0;
  }
+16:     return np.asarray(x)
  __Pyx_XDECREF(__pyx_r);
  __Pyx_GetModuleGlobalName(__pyx_t_4, __pyx_n_s_np); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 16, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_4);
  __pyx_t_3 = __Pyx_PyObject_GetAttrStr(__pyx_t_4, __pyx_n_s_asarray); if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 16, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
  __pyx_t_4 = __pyx_memoryview_fromslice(__pyx_v_x, 1, (PyObject *(*)(char *)) __pyx_memview_get_double, (int (*)(char *, PyObject *)) __pyx_memview_set_double, 0);; if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 16, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_4);
  __pyx_t_5 = NULL;
  __pyx_t_6 = 0;
  #if CYTHON_UNPACK_METHODS
  if (unlikely(PyMethod_Check(__pyx_t_3))) {
    __pyx_t_5 = PyMethod_GET_SELF(__pyx_t_3);
    if (likely(__pyx_t_5)) {
      PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_3);
      __Pyx_INCREF(__pyx_t_5);
      __Pyx_INCREF(function);
      __Pyx_DECREF_SET(__pyx_t_3, function);
      __pyx_t_6 = 1;
    }
  }
  #endif
  {
    PyObject *__pyx_callargs[2] = {__pyx_t_5, __pyx_t_4};
    __pyx_t_2 = __Pyx_PyObject_FastCall(__pyx_t_3, __pyx_callargs+1-__pyx_t_6, 1+__pyx_t_6);
    __Pyx_XDECREF(__pyx_t_5); __pyx_t_5 = 0;
    __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
    if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 16, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_2);
    __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
  }
  __pyx_r = __pyx_t_2;
  __pyx_t_2 = 0;
  goto __pyx_L0;
c2 = %timeit -o blur_cython2(x, steps)
t_c2 = c2.best
times["cython (loops)"] = t_c2
plot_times()
885 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
../../_images/147d8dbe31d0a05156f51b85dc370ef11cfe4c65a6f5388339264624551ebf07.png

Just by making sure the iteration variables are defined as integers, we can save about 80% of the time.

The biggest key to optimizing with Cython is getting that yellow out of your loops. The more deeply nested a bit of code is within a loop, the more often it is called, and the more value you can get out of making it fast. In Cython, fast means avoiding Python (getting rid of yellow). To get rid of Python calls, we need to tell Python about the numpy arrays x and y:

%%cython --annotate

import cython as C
import numpy as np


def blur_cython_typed(x_: C.double[::1], steps: C.int = 1024):
    i: C.size_t
    N: C.size_t = x_.shape[0]
    x: C.double[::1] = x_.copy()
    x_k: C.double[::1] = np.empty_like(x_)
    x_k[0] = x[0]
    x_k[N - 1] = x[N - 1]
    for _ in range(steps):
        for i in range(1, N - 1):
            x_k[i] = 0.25 * (x[i - 1] + 2 * x[i] + x[i + 1])
        x, x_k = x_k, x  # swap for next step
    return np.asarray(x)

Generated by Cython 3.0.2

Yellow lines hint at Python interaction.
Click on a line that starts with a "+" to see the C code that Cython generated for it.

 01: 
+02: import cython as C
  __pyx_t_5 = __Pyx_PyDict_NewPresized(0); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 2, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_5);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_test, __pyx_t_5) < 0) __PYX_ERR(0, 2, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
+03: import numpy as np
  __pyx_t_7 = __Pyx_ImportDottedModule(__pyx_n_s_numpy, NULL); if (unlikely(!__pyx_t_7)) __PYX_ERR(0, 3, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_7);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_np, __pyx_t_7) < 0) __PYX_ERR(0, 3, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
 04: 
 05: 
+06: def blur_cython_typed(x_: C.double[::1], steps: C.int = 1024):
/* Python wrapper */
static PyObject *__pyx_pw_54_cython_magic_5e8733fa9800c350e490ea003224a0ea9bb4060a_1blur_cython_typed(PyObject *__pyx_self, 
#if CYTHON_METH_FASTCALL
PyObject *const *__pyx_args, Py_ssize_t __pyx_nargs, PyObject *__pyx_kwds
#else
PyObject *__pyx_args, PyObject *__pyx_kwds
#endif
); /*proto*/
static PyMethodDef __pyx_mdef_54_cython_magic_5e8733fa9800c350e490ea003224a0ea9bb4060a_1blur_cython_typed = {"blur_cython_typed", (PyCFunction)(void*)(__Pyx_PyCFunction_FastCallWithKeywords)__pyx_pw_54_cython_magic_5e8733fa9800c350e490ea003224a0ea9bb4060a_1blur_cython_typed, __Pyx_METH_FASTCALL|METH_KEYWORDS, 0};
static PyObject *__pyx_pw_54_cython_magic_5e8733fa9800c350e490ea003224a0ea9bb4060a_1blur_cython_typed(PyObject *__pyx_self, 
#if CYTHON_METH_FASTCALL
PyObject *const *__pyx_args, Py_ssize_t __pyx_nargs, PyObject *__pyx_kwds
#else
PyObject *__pyx_args, PyObject *__pyx_kwds
#endif
) {
  __Pyx_memviewslice __pyx_v_x_ = { 0, 0, { 0 }, { 0 }, { 0 } };
  int __pyx_v_steps;
  #if !CYTHON_METH_FASTCALL
  CYTHON_UNUSED Py_ssize_t __pyx_nargs;
  #endif
  CYTHON_UNUSED PyObject *const *__pyx_kwvalues;
  PyObject *__pyx_r = 0;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("blur_cython_typed (wrapper)", 0);
  #if !CYTHON_METH_FASTCALL
  #if CYTHON_ASSUME_SAFE_MACROS
  __pyx_nargs = PyTuple_GET_SIZE(__pyx_args);
  #else
  __pyx_nargs = PyTuple_Size(__pyx_args);
  if (unlikely((__pyx_nargs < 0))) __PYX_ERR(0, 6, __pyx_L3_error)
  #endif
  #endif
  __pyx_kwvalues = __Pyx_KwValues_FASTCALL(__pyx_args, __pyx_nargs);
  {
    PyObject **__pyx_pyargnames[] = {&__pyx_n_s_x,&__pyx_n_s_steps,0};
  PyObject* values[2] = {0,0};
    if (__pyx_kwds) {
      Py_ssize_t kw_args;
      switch (__pyx_nargs) {
        case  2: values[1] = __Pyx_Arg_FASTCALL(__pyx_args, 1);
        CYTHON_FALLTHROUGH;
        case  1: values[0] = __Pyx_Arg_FASTCALL(__pyx_args, 0);
        CYTHON_FALLTHROUGH;
        case  0: break;
        default: goto __pyx_L5_argtuple_error;
      }
      kw_args = __Pyx_NumKwargs_FASTCALL(__pyx_kwds);
      switch (__pyx_nargs) {
        case  0:
        if (likely((values[0] = __Pyx_GetKwValue_FASTCALL(__pyx_kwds, __pyx_kwvalues, __pyx_n_s_x)) != 0)) {
          (void)__Pyx_Arg_NewRef_FASTCALL(values[0]);
          kw_args--;
        }
        else if (unlikely(PyErr_Occurred())) __PYX_ERR(0, 6, __pyx_L3_error)
        else goto __pyx_L5_argtuple_error;
        CYTHON_FALLTHROUGH;
        case  1:
        if (kw_args > 0) {
          PyObject* value = __Pyx_GetKwValue_FASTCALL(__pyx_kwds, __pyx_kwvalues, __pyx_n_s_steps);
          if (value) { values[1] = __Pyx_Arg_NewRef_FASTCALL(value); kw_args--; }
          else if (unlikely(PyErr_Occurred())) __PYX_ERR(0, 6, __pyx_L3_error)
        }
      }
      if (unlikely(kw_args > 0)) {
        const Py_ssize_t kwd_pos_args = __pyx_nargs;
        if (unlikely(__Pyx_ParseOptionalKeywords(__pyx_kwds, __pyx_kwvalues, __pyx_pyargnames, 0, values + 0, kwd_pos_args, "blur_cython_typed") < 0)) __PYX_ERR(0, 6, __pyx_L3_error)
      }
    } else {
      switch (__pyx_nargs) {
        case  2: values[1] = __Pyx_Arg_FASTCALL(__pyx_args, 1);
        CYTHON_FALLTHROUGH;
        case  1: values[0] = __Pyx_Arg_FASTCALL(__pyx_args, 0);
        break;
        default: goto __pyx_L5_argtuple_error;
      }
    }
    __pyx_v_x_ = __Pyx_PyObject_to_MemoryviewSlice_dc_double(values[0], PyBUF_WRITABLE); if (unlikely(!__pyx_v_x_.memview)) __PYX_ERR(0, 6, __pyx_L3_error)
    if (values[1]) {
      __pyx_v_steps = __Pyx_PyInt_As_int(values[1]); if (unlikely((__pyx_v_steps == (int)-1) && PyErr_Occurred())) __PYX_ERR(0, 6, __pyx_L3_error)
    } else {
      __pyx_v_steps = ((int)((int)0x400));
    }
  }
  goto __pyx_L4_argument_unpacking_done;
  __pyx_L5_argtuple_error:;
  __Pyx_RaiseArgtupleInvalid("blur_cython_typed", 0, 1, 2, __pyx_nargs); __PYX_ERR(0, 6, __pyx_L3_error)
  goto __pyx_L3_error;
  __pyx_L3_error:;
  {
    Py_ssize_t __pyx_temp;
    for (__pyx_temp=0; __pyx_temp < (Py_ssize_t)(sizeof(values)/sizeof(values[0])); ++__pyx_temp) {
      __Pyx_Arg_XDECREF_FASTCALL(values[__pyx_temp]);
    }
  }
  __PYX_XCLEAR_MEMVIEW(&__pyx_v_x_, 1);
  __Pyx_AddTraceback("_cython_magic_5e8733fa9800c350e490ea003224a0ea9bb4060a.blur_cython_typed", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __Pyx_RefNannyFinishContext();
  return NULL;
  __pyx_L4_argument_unpacking_done:;
  if (unlikely(((PyObject *)__pyx_v_x_.memview) == Py_None)) {
    PyErr_Format(PyExc_TypeError, "Argument '%.200s' must not be None", "x_"); __PYX_ERR(0, 6, __pyx_L1_error)
  }
  __pyx_r = __pyx_pf_54_cython_magic_5e8733fa9800c350e490ea003224a0ea9bb4060a_blur_cython_typed(__pyx_self, __pyx_v_x_, __pyx_v_steps);
  int __pyx_lineno = 0;
  const char *__pyx_filename = NULL;
  int __pyx_clineno = 0;

  /* function exit code */
  goto __pyx_L0;
  __pyx_L1_error:;
  __pyx_r = NULL;
  __pyx_L0:;
  __PYX_XCLEAR_MEMVIEW(&__pyx_v_x_, 1);
  {
    Py_ssize_t __pyx_temp;
    for (__pyx_temp=0; __pyx_temp < (Py_ssize_t)(sizeof(values)/sizeof(values[0])); ++__pyx_temp) {
      __Pyx_Arg_XDECREF_FASTCALL(values[__pyx_temp]);
    }
  }
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}

static PyObject *__pyx_pf_54_cython_magic_5e8733fa9800c350e490ea003224a0ea9bb4060a_blur_cython_typed(CYTHON_UNUSED PyObject *__pyx_self, __Pyx_memviewslice __pyx_v_x_, int __pyx_v_steps) {
  size_t __pyx_v_i;
  size_t __pyx_v_N;
  __Pyx_memviewslice __pyx_v_x = { 0, 0, { 0 }, { 0 }, { 0 } };
  __Pyx_memviewslice __pyx_v_x_k = { 0, 0, { 0 }, { 0 }, { 0 } };
  CYTHON_UNUSED int __pyx_v__;
  PyObject *__pyx_r = NULL;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("blur_cython_typed", 0);
/* … */
  /* function exit code */
  __pyx_L1_error:;
  __PYX_XCLEAR_MEMVIEW(&__pyx_t_1, 1);
  __Pyx_XDECREF(__pyx_t_2);
  __Pyx_XDECREF(__pyx_t_3);
  __Pyx_XDECREF(__pyx_t_4);
  __Pyx_XDECREF(__pyx_t_5);
  __PYX_XCLEAR_MEMVIEW(&__pyx_t_19, 1);
  __Pyx_AddTraceback("_cython_magic_5e8733fa9800c350e490ea003224a0ea9bb4060a.blur_cython_typed", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __pyx_r = NULL;
  __pyx_L0:;
  __PYX_XCLEAR_MEMVIEW(&__pyx_v_x, 1);
  __PYX_XCLEAR_MEMVIEW(&__pyx_v_x_k, 1);
  __Pyx_XGIVEREF(__pyx_r);
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}
/* … */
  __pyx_tuple__21 = PyTuple_Pack(7, __pyx_n_s_x, __pyx_n_s_steps, __pyx_n_s_i, __pyx_n_s_N, __pyx_n_s_x_2, __pyx_n_s_x_k, __pyx_n_s__20); if (unlikely(!__pyx_tuple__21)) __PYX_ERR(0, 6, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_tuple__21);
  __Pyx_GIVEREF(__pyx_tuple__21);
/* … */
  __pyx_t_7 = __Pyx_PyInt_From_int(((int)0x400)); if (unlikely(!__pyx_t_7)) __PYX_ERR(0, 6, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_7);
  __pyx_t_4 = PyTuple_New(1); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 6, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_4);
  __Pyx_GIVEREF(__pyx_t_7);
  if (__Pyx_PyTuple_SET_ITEM(__pyx_t_4, 0, __pyx_t_7)) __PYX_ERR(0, 6, __pyx_L1_error);
  __pyx_t_7 = 0;
  __pyx_t_7 = __Pyx_PyDict_NewPresized(2); if (unlikely(!__pyx_t_7)) __PYX_ERR(0, 6, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_7);
  if (PyDict_SetItem(__pyx_t_7, __pyx_n_s_x, __pyx_kp_s_C_double_1) < 0) __PYX_ERR(0, 6, __pyx_L1_error)
  if (PyDict_SetItem(__pyx_t_7, __pyx_n_s_steps, __pyx_kp_s_C_int) < 0) __PYX_ERR(0, 6, __pyx_L1_error)
  __pyx_t_5 = __Pyx_CyFunction_New(&__pyx_mdef_54_cython_magic_5e8733fa9800c350e490ea003224a0ea9bb4060a_1blur_cython_typed, 0, __pyx_n_s_blur_cython_typed, NULL, __pyx_n_s_cython_magic_5e8733fa9800c350e4, __pyx_d, ((PyObject *)__pyx_codeobj__22)); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 6, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_5);
  __Pyx_CyFunction_SetDefaultsTuple(__pyx_t_5, __pyx_t_4);
  __Pyx_CyFunction_SetAnnotationsDict(__pyx_t_5, __pyx_t_7);
  __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
  __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_blur_cython_typed, __pyx_t_5) < 0) __PYX_ERR(0, 6, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
 07:     i: C.size_t
+08:     N: C.size_t = x_.shape[0]
  __pyx_v_N = (__pyx_v_x_.shape[0]);
+09:     x: C.double[::1] = x_.copy()
  __pyx_t_1 = __pyx_memoryview_copy_slice_dc_double_c(__pyx_v_x_); if (unlikely(!__pyx_t_1.memview)) __PYX_ERR(0, 9, __pyx_L1_error)
  __pyx_v_x = __pyx_t_1;
  __pyx_t_1.memview = NULL;
  __pyx_t_1.data = NULL;
+10:     x_k: C.double[::1] = np.empty_like(x_)
  __Pyx_GetModuleGlobalName(__pyx_t_3, __pyx_n_s_np); if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 10, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __pyx_t_4 = __Pyx_PyObject_GetAttrStr(__pyx_t_3, __pyx_n_s_empty_like); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 10, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_4);
  __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
  __pyx_t_3 = __pyx_memoryview_fromslice(__pyx_v_x_, 1, (PyObject *(*)(char *)) __pyx_memview_get_double, (int (*)(char *, PyObject *)) __pyx_memview_set_double, 0);; if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 10, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __pyx_t_5 = NULL;
  __pyx_t_6 = 0;
  #if CYTHON_UNPACK_METHODS
  if (unlikely(PyMethod_Check(__pyx_t_4))) {
    __pyx_t_5 = PyMethod_GET_SELF(__pyx_t_4);
    if (likely(__pyx_t_5)) {
      PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_4);
      __Pyx_INCREF(__pyx_t_5);
      __Pyx_INCREF(function);
      __Pyx_DECREF_SET(__pyx_t_4, function);
      __pyx_t_6 = 1;
    }
  }
  #endif
  {
    PyObject *__pyx_callargs[2] = {__pyx_t_5, __pyx_t_3};
    __pyx_t_2 = __Pyx_PyObject_FastCall(__pyx_t_4, __pyx_callargs+1-__pyx_t_6, 1+__pyx_t_6);
    __Pyx_XDECREF(__pyx_t_5); __pyx_t_5 = 0;
    __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
    if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 10, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_2);
    __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
  }
  __pyx_t_1 = __Pyx_PyObject_to_MemoryviewSlice_dc_double(__pyx_t_2, PyBUF_WRITABLE); if (unlikely(!__pyx_t_1.memview)) __PYX_ERR(0, 10, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
  __pyx_v_x_k = __pyx_t_1;
  __pyx_t_1.memview = NULL;
  __pyx_t_1.data = NULL;
+11:     x_k[0] = x[0]
  __pyx_t_7 = 0;
  __pyx_t_6 = -1;
  if (__pyx_t_7 < 0) {
    __pyx_t_7 += __pyx_v_x.shape[0];
    if (unlikely(__pyx_t_7 < 0)) __pyx_t_6 = 0;
  } else if (unlikely(__pyx_t_7 >= __pyx_v_x.shape[0])) __pyx_t_6 = 0;
  if (unlikely(__pyx_t_6 != -1)) {
    __Pyx_RaiseBufferIndexError(__pyx_t_6);
    __PYX_ERR(0, 11, __pyx_L1_error)
  }
  __pyx_t_8 = 0;
  __pyx_t_6 = -1;
  if (__pyx_t_8 < 0) {
    __pyx_t_8 += __pyx_v_x_k.shape[0];
    if (unlikely(__pyx_t_8 < 0)) __pyx_t_6 = 0;
  } else if (unlikely(__pyx_t_8 >= __pyx_v_x_k.shape[0])) __pyx_t_6 = 0;
  if (unlikely(__pyx_t_6 != -1)) {
    __Pyx_RaiseBufferIndexError(__pyx_t_6);
    __PYX_ERR(0, 11, __pyx_L1_error)
  }
  *((double *) ( /* dim=0 */ ((char *) (((double *) __pyx_v_x_k.data) + __pyx_t_8)) )) = (*((double *) ( /* dim=0 */ ((char *) (((double *) __pyx_v_x.data) + __pyx_t_7)) )));
+12:     x_k[N - 1] = x[N - 1]
  __pyx_t_9 = (__pyx_v_N - 1);
  __pyx_t_6 = -1;
  if (unlikely(__pyx_t_9 >= (size_t)__pyx_v_x.shape[0])) __pyx_t_6 = 0;
  if (unlikely(__pyx_t_6 != -1)) {
    __Pyx_RaiseBufferIndexError(__pyx_t_6);
    __PYX_ERR(0, 12, __pyx_L1_error)
  }
  __pyx_t_10 = (__pyx_v_N - 1);
  __pyx_t_6 = -1;
  if (unlikely(__pyx_t_10 >= (size_t)__pyx_v_x_k.shape[0])) __pyx_t_6 = 0;
  if (unlikely(__pyx_t_6 != -1)) {
    __Pyx_RaiseBufferIndexError(__pyx_t_6);
    __PYX_ERR(0, 12, __pyx_L1_error)
  }
  *((double *) ( /* dim=0 */ ((char *) (((double *) __pyx_v_x_k.data) + __pyx_t_10)) )) = (*((double *) ( /* dim=0 */ ((char *) (((double *) __pyx_v_x.data) + __pyx_t_9)) )));
+13:     for _ in range(steps):
  __pyx_t_6 = __pyx_v_steps;
  __pyx_t_11 = __pyx_t_6;
  for (__pyx_t_12 = 0; __pyx_t_12 < __pyx_t_11; __pyx_t_12+=1) {
    __pyx_v__ = __pyx_t_12;
+14:         for i in range(1, N - 1):
    __pyx_t_9 = (__pyx_v_N - 1);
    __pyx_t_10 = __pyx_t_9;
    for (__pyx_t_13 = 1; __pyx_t_13 < __pyx_t_10; __pyx_t_13+=1) {
      __pyx_v_i = __pyx_t_13;
+15:             x_k[i] = 0.25 * (x[i - 1] + 2 * x[i] + x[i + 1])
      __pyx_t_14 = (__pyx_v_i - 1);
      __pyx_t_15 = -1;
      if (unlikely(__pyx_t_14 >= (size_t)__pyx_v_x.shape[0])) __pyx_t_15 = 0;
      if (unlikely(__pyx_t_15 != -1)) {
        __Pyx_RaiseBufferIndexError(__pyx_t_15);
        __PYX_ERR(0, 15, __pyx_L1_error)
      }
      __pyx_t_16 = __pyx_v_i;
      __pyx_t_15 = -1;
      if (unlikely(__pyx_t_16 >= (size_t)__pyx_v_x.shape[0])) __pyx_t_15 = 0;
      if (unlikely(__pyx_t_15 != -1)) {
        __Pyx_RaiseBufferIndexError(__pyx_t_15);
        __PYX_ERR(0, 15, __pyx_L1_error)
      }
      __pyx_t_17 = (__pyx_v_i + 1);
      __pyx_t_15 = -1;
      if (unlikely(__pyx_t_17 >= (size_t)__pyx_v_x.shape[0])) __pyx_t_15 = 0;
      if (unlikely(__pyx_t_15 != -1)) {
        __Pyx_RaiseBufferIndexError(__pyx_t_15);
        __PYX_ERR(0, 15, __pyx_L1_error)
      }
      __pyx_t_18 = __pyx_v_i;
      __pyx_t_15 = -1;
      if (unlikely(__pyx_t_18 >= (size_t)__pyx_v_x_k.shape[0])) __pyx_t_15 = 0;
      if (unlikely(__pyx_t_15 != -1)) {
        __Pyx_RaiseBufferIndexError(__pyx_t_15);
        __PYX_ERR(0, 15, __pyx_L1_error)
      }
      *((double *) ( /* dim=0 */ ((char *) (((double *) __pyx_v_x_k.data) + __pyx_t_18)) )) = (0.25 * (((*((double *) ( /* dim=0 */ ((char *) (((double *) __pyx_v_x.data) + __pyx_t_14)) ))) + (2.0 * (*((double *) ( /* dim=0 */ ((char *) (((double *) __pyx_v_x.data) + __pyx_t_16)) ))))) + (*((double *) ( /* dim=0 */ ((char *) (((double *) __pyx_v_x.data) + __pyx_t_17)) )))));
    }
+16:         x, x_k = x_k, x  # swap for next step
    __pyx_t_1 = __pyx_v_x_k;
    __PYX_INC_MEMVIEW(&__pyx_t_1, 1);
    __pyx_t_19 = __pyx_v_x;
    __PYX_INC_MEMVIEW(&__pyx_t_19, 1);
    __PYX_XCLEAR_MEMVIEW(&__pyx_v_x, 1);
    __pyx_v_x = __pyx_t_1;
    __pyx_t_1.memview = NULL;
    __pyx_t_1.data = NULL;
    __PYX_XCLEAR_MEMVIEW(&__pyx_v_x_k, 1);
    __pyx_v_x_k = __pyx_t_19;
    __pyx_t_19.memview = NULL;
    __pyx_t_19.data = NULL;
  }
+17:     return np.asarray(x)
  __Pyx_XDECREF(__pyx_r);
  __Pyx_GetModuleGlobalName(__pyx_t_4, __pyx_n_s_np); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 17, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_4);
  __pyx_t_3 = __Pyx_PyObject_GetAttrStr(__pyx_t_4, __pyx_n_s_asarray); if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 17, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
  __pyx_t_4 = __pyx_memoryview_fromslice(__pyx_v_x, 1, (PyObject *(*)(char *)) __pyx_memview_get_double, (int (*)(char *, PyObject *)) __pyx_memview_set_double, 0);; if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 17, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_4);
  __pyx_t_5 = NULL;
  __pyx_t_6 = 0;
  #if CYTHON_UNPACK_METHODS
  if (unlikely(PyMethod_Check(__pyx_t_3))) {
    __pyx_t_5 = PyMethod_GET_SELF(__pyx_t_3);
    if (likely(__pyx_t_5)) {
      PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_3);
      __Pyx_INCREF(__pyx_t_5);
      __Pyx_INCREF(function);
      __Pyx_DECREF_SET(__pyx_t_3, function);
      __pyx_t_6 = 1;
    }
  }
  #endif
  {
    PyObject *__pyx_callargs[2] = {__pyx_t_5, __pyx_t_4};
    __pyx_t_2 = __Pyx_PyObject_FastCall(__pyx_t_3, __pyx_callargs+1-__pyx_t_6, 1+__pyx_t_6);
    __Pyx_XDECREF(__pyx_t_5); __pyx_t_5 = 0;
    __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
    if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 17, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_2);
    __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
  }
  __pyx_r = __pyx_t_2;
  __pyx_t_2 = 0;
  goto __pyx_L0;
ct = %timeit -o y = blur_cython_typed(x, steps)
t_ct = ct.best

times["cython (types)"] = t_ct
plot_times()
1.92 ms ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
../../_images/f5f5a1e545690bdc550e25cb7fc8c6494e4a4150f62ae0a493d8fdff72373518.png

We can further optimize with Cython macros, which disable bounds checking and negative indexing, and avoiding the Python variable swaping by using indices into a single array:

x[len(x)]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[41], line 1
----> 1 x[len(x)]

IndexError: index 1024 is out of bounds for axis 0 with size 1024
%%cython --annotate

import cython as C
import numpy as np


@C.boundscheck(False)
@C.wraparound(False)
def blur_cython_optimized(x_: C.double[::1], steps: C.int = 1024) -> C.double[::1]:
    i: C.size_t
    N: C.size_t = x_.shape[0]
    x: C.double[::1] = x_.copy()
    x_k: C.double[::1] = np.empty_like(x_)
    x_k[0] = x[0]
    x_k[N - 1] = x[N - 1]
    for _ in range(steps):
        for i in range(1, N - 1):
            x_k[i] = 0.25 * (x[i - 1] + 2 * x[i] + x[i + 1])
        x, x_k = x_k, x  # swap for next step
    return np.asarray(x)

Generated by Cython 3.0.2

Yellow lines hint at Python interaction.
Click on a line that starts with a "+" to see the C code that Cython generated for it.

 01: 
+02: import cython as C
  __pyx_t_5 = __Pyx_PyDict_NewPresized(0); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 2, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_5);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_test, __pyx_t_5) < 0) __PYX_ERR(0, 2, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
+03: import numpy as np
  __pyx_t_7 = __Pyx_ImportDottedModule(__pyx_n_s_numpy, NULL); if (unlikely(!__pyx_t_7)) __PYX_ERR(0, 3, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_7);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_np, __pyx_t_7) < 0) __PYX_ERR(0, 3, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
 04: 
 05: 
+06: @C.boundscheck(False)
/* Python wrapper */
static PyObject *__pyx_pw_54_cython_magic_f838c940fd52286287b522261b5b14c55c752853_1blur_cython_optimized(PyObject *__pyx_self, 
#if CYTHON_METH_FASTCALL
PyObject *const *__pyx_args, Py_ssize_t __pyx_nargs, PyObject *__pyx_kwds
#else
PyObject *__pyx_args, PyObject *__pyx_kwds
#endif
); /*proto*/
static PyMethodDef __pyx_mdef_54_cython_magic_f838c940fd52286287b522261b5b14c55c752853_1blur_cython_optimized = {"blur_cython_optimized", (PyCFunction)(void*)(__Pyx_PyCFunction_FastCallWithKeywords)__pyx_pw_54_cython_magic_f838c940fd52286287b522261b5b14c55c752853_1blur_cython_optimized, __Pyx_METH_FASTCALL|METH_KEYWORDS, 0};
static PyObject *__pyx_pw_54_cython_magic_f838c940fd52286287b522261b5b14c55c752853_1blur_cython_optimized(PyObject *__pyx_self, 
#if CYTHON_METH_FASTCALL
PyObject *const *__pyx_args, Py_ssize_t __pyx_nargs, PyObject *__pyx_kwds
#else
PyObject *__pyx_args, PyObject *__pyx_kwds
#endif
) {
  __Pyx_memviewslice __pyx_v_x_ = { 0, 0, { 0 }, { 0 }, { 0 } };
  int __pyx_v_steps;
  #if !CYTHON_METH_FASTCALL
  CYTHON_UNUSED Py_ssize_t __pyx_nargs;
  #endif
  CYTHON_UNUSED PyObject *const *__pyx_kwvalues;
  PyObject *__pyx_r = 0;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("blur_cython_optimized (wrapper)", 0);
  #if !CYTHON_METH_FASTCALL
  #if CYTHON_ASSUME_SAFE_MACROS
  __pyx_nargs = PyTuple_GET_SIZE(__pyx_args);
  #else
  __pyx_nargs = PyTuple_Size(__pyx_args);
  if (unlikely((__pyx_nargs < 0))) __PYX_ERR(0, 6, __pyx_L3_error)
  #endif
  #endif
  __pyx_kwvalues = __Pyx_KwValues_FASTCALL(__pyx_args, __pyx_nargs);
  {
    PyObject **__pyx_pyargnames[] = {&__pyx_n_s_x,&__pyx_n_s_steps,0};
  PyObject* values[2] = {0,0};
    if (__pyx_kwds) {
      Py_ssize_t kw_args;
      switch (__pyx_nargs) {
        case  2: values[1] = __Pyx_Arg_FASTCALL(__pyx_args, 1);
        CYTHON_FALLTHROUGH;
        case  1: values[0] = __Pyx_Arg_FASTCALL(__pyx_args, 0);
        CYTHON_FALLTHROUGH;
        case  0: break;
        default: goto __pyx_L5_argtuple_error;
      }
      kw_args = __Pyx_NumKwargs_FASTCALL(__pyx_kwds);
      switch (__pyx_nargs) {
        case  0:
        if (likely((values[0] = __Pyx_GetKwValue_FASTCALL(__pyx_kwds, __pyx_kwvalues, __pyx_n_s_x)) != 0)) {
          (void)__Pyx_Arg_NewRef_FASTCALL(values[0]);
          kw_args--;
        }
        else if (unlikely(PyErr_Occurred())) __PYX_ERR(0, 6, __pyx_L3_error)
        else goto __pyx_L5_argtuple_error;
        CYTHON_FALLTHROUGH;
        case  1:
        if (kw_args > 0) {
          PyObject* value = __Pyx_GetKwValue_FASTCALL(__pyx_kwds, __pyx_kwvalues, __pyx_n_s_steps);
          if (value) { values[1] = __Pyx_Arg_NewRef_FASTCALL(value); kw_args--; }
          else if (unlikely(PyErr_Occurred())) __PYX_ERR(0, 6, __pyx_L3_error)
        }
      }
      if (unlikely(kw_args > 0)) {
        const Py_ssize_t kwd_pos_args = __pyx_nargs;
        if (unlikely(__Pyx_ParseOptionalKeywords(__pyx_kwds, __pyx_kwvalues, __pyx_pyargnames, 0, values + 0, kwd_pos_args, "blur_cython_optimized") < 0)) __PYX_ERR(0, 6, __pyx_L3_error)
      }
    } else {
      switch (__pyx_nargs) {
        case  2: values[1] = __Pyx_Arg_FASTCALL(__pyx_args, 1);
        CYTHON_FALLTHROUGH;
        case  1: values[0] = __Pyx_Arg_FASTCALL(__pyx_args, 0);
        break;
        default: goto __pyx_L5_argtuple_error;
      }
    }
    __pyx_v_x_ = __Pyx_PyObject_to_MemoryviewSlice_dc_double(values[0], PyBUF_WRITABLE); if (unlikely(!__pyx_v_x_.memview)) __PYX_ERR(0, 8, __pyx_L3_error)
    if (values[1]) {
      __pyx_v_steps = __Pyx_PyInt_As_int(values[1]); if (unlikely((__pyx_v_steps == (int)-1) && PyErr_Occurred())) __PYX_ERR(0, 8, __pyx_L3_error)
    } else {
      __pyx_v_steps = ((int)((int)0x400));
    }
  }
  goto __pyx_L4_argument_unpacking_done;
  __pyx_L5_argtuple_error:;
  __Pyx_RaiseArgtupleInvalid("blur_cython_optimized", 0, 1, 2, __pyx_nargs); __PYX_ERR(0, 6, __pyx_L3_error)
  goto __pyx_L3_error;
  __pyx_L3_error:;
  {
    Py_ssize_t __pyx_temp;
    for (__pyx_temp=0; __pyx_temp < (Py_ssize_t)(sizeof(values)/sizeof(values[0])); ++__pyx_temp) {
      __Pyx_Arg_XDECREF_FASTCALL(values[__pyx_temp]);
    }
  }
  __PYX_XCLEAR_MEMVIEW(&__pyx_v_x_, 1);
  __Pyx_AddTraceback("_cython_magic_f838c940fd52286287b522261b5b14c55c752853.blur_cython_optimized", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __Pyx_RefNannyFinishContext();
  return NULL;
  __pyx_L4_argument_unpacking_done:;
  if (unlikely(((PyObject *)__pyx_v_x_.memview) == Py_None)) {
    PyErr_Format(PyExc_TypeError, "Argument '%.200s' must not be None", "x_"); __PYX_ERR(0, 8, __pyx_L1_error)
  }
  __pyx_r = __pyx_pf_54_cython_magic_f838c940fd52286287b522261b5b14c55c752853_blur_cython_optimized(__pyx_self, __pyx_v_x_, __pyx_v_steps);
  int __pyx_lineno = 0;
  const char *__pyx_filename = NULL;
  int __pyx_clineno = 0;

  /* function exit code */
  goto __pyx_L0;
  __pyx_L1_error:;
  __pyx_r = NULL;
  __pyx_L0:;
  __PYX_XCLEAR_MEMVIEW(&__pyx_v_x_, 1);
  {
    Py_ssize_t __pyx_temp;
    for (__pyx_temp=0; __pyx_temp < (Py_ssize_t)(sizeof(values)/sizeof(values[0])); ++__pyx_temp) {
      __Pyx_Arg_XDECREF_FASTCALL(values[__pyx_temp]);
    }
  }
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}

static PyObject *__pyx_pf_54_cython_magic_f838c940fd52286287b522261b5b14c55c752853_blur_cython_optimized(CYTHON_UNUSED PyObject *__pyx_self, __Pyx_memviewslice __pyx_v_x_, int __pyx_v_steps) {
  size_t __pyx_v_i;
  size_t __pyx_v_N;
  __Pyx_memviewslice __pyx_v_x = { 0, 0, { 0 }, { 0 }, { 0 } };
  __Pyx_memviewslice __pyx_v_x_k = { 0, 0, { 0 }, { 0 }, { 0 } };
  CYTHON_UNUSED int __pyx_v__;
  PyObject *__pyx_r = NULL;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("blur_cython_optimized", 0);
/* … */
  /* function exit code */
  __pyx_L1_error:;
  __PYX_XCLEAR_MEMVIEW(&__pyx_t_1, 1);
  __Pyx_XDECREF(__pyx_t_2);
  __Pyx_XDECREF(__pyx_t_3);
  __Pyx_XDECREF(__pyx_t_4);
  __Pyx_XDECREF(__pyx_t_5);
  __PYX_XCLEAR_MEMVIEW(&__pyx_t_18, 1);
  __Pyx_AddTraceback("_cython_magic_f838c940fd52286287b522261b5b14c55c752853.blur_cython_optimized", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __pyx_r = NULL;
  __pyx_L0:;
  __PYX_XCLEAR_MEMVIEW(&__pyx_v_x, 1);
  __PYX_XCLEAR_MEMVIEW(&__pyx_v_x_k, 1);
  __Pyx_XGIVEREF(__pyx_r);
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}
/* … */
  __pyx_tuple__21 = PyTuple_Pack(7, __pyx_n_s_x, __pyx_n_s_steps, __pyx_n_s_i, __pyx_n_s_N, __pyx_n_s_x_2, __pyx_n_s_x_k, __pyx_n_s__20); if (unlikely(!__pyx_tuple__21)) __PYX_ERR(0, 6, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_tuple__21);
  __Pyx_GIVEREF(__pyx_tuple__21);
/* … */
  __pyx_t_4 = PyTuple_New(1); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 6, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_4);
  __Pyx_GIVEREF(__pyx_t_7);
  if (__Pyx_PyTuple_SET_ITEM(__pyx_t_4, 0, __pyx_t_7)) __PYX_ERR(0, 6, __pyx_L1_error);
  __pyx_t_7 = 0;
  __pyx_t_7 = __Pyx_PyDict_NewPresized(3); if (unlikely(!__pyx_t_7)) __PYX_ERR(0, 6, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_7);
  if (PyDict_SetItem(__pyx_t_7, __pyx_n_s_x, __pyx_kp_s_C_double_1) < 0) __PYX_ERR(0, 6, __pyx_L1_error)
  if (PyDict_SetItem(__pyx_t_7, __pyx_n_s_steps, __pyx_kp_s_C_int) < 0) __PYX_ERR(0, 6, __pyx_L1_error)
  if (PyDict_SetItem(__pyx_t_7, __pyx_n_s_return, __pyx_kp_s_C_double_1) < 0) __PYX_ERR(0, 6, __pyx_L1_error)
  __pyx_t_5 = __Pyx_CyFunction_New(&__pyx_mdef_54_cython_magic_f838c940fd52286287b522261b5b14c55c752853_1blur_cython_optimized, 0, __pyx_n_s_blur_cython_optimized, NULL, __pyx_n_s_cython_magic_f838c940fd52286287, __pyx_d, ((PyObject *)__pyx_codeobj__22)); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 6, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_5);
  __Pyx_CyFunction_SetDefaultsTuple(__pyx_t_5, __pyx_t_4);
  __Pyx_CyFunction_SetAnnotationsDict(__pyx_t_5, __pyx_t_7);
  __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
  __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_blur_cython_optimized, __pyx_t_5) < 0) __PYX_ERR(0, 6, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
 07: @C.wraparound(False)
+08: def blur_cython_optimized(x_: C.double[::1], steps: C.int = 1024) -> C.double[::1]:
  __pyx_t_7 = __Pyx_PyInt_From_int(((int)0x400)); if (unlikely(!__pyx_t_7)) __PYX_ERR(0, 8, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_7);
 09:     i: C.size_t
+10:     N: C.size_t = x_.shape[0]
  __pyx_v_N = (__pyx_v_x_.shape[0]);
+11:     x: C.double[::1] = x_.copy()
  __pyx_t_1 = __pyx_memoryview_copy_slice_dc_double_c(__pyx_v_x_); if (unlikely(!__pyx_t_1.memview)) __PYX_ERR(0, 11, __pyx_L1_error)
  __pyx_v_x = __pyx_t_1;
  __pyx_t_1.memview = NULL;
  __pyx_t_1.data = NULL;
+12:     x_k: C.double[::1] = np.empty_like(x_)
  __Pyx_GetModuleGlobalName(__pyx_t_3, __pyx_n_s_np); if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 12, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __pyx_t_4 = __Pyx_PyObject_GetAttrStr(__pyx_t_3, __pyx_n_s_empty_like); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 12, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_4);
  __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
  __pyx_t_3 = __pyx_memoryview_fromslice(__pyx_v_x_, 1, (PyObject *(*)(char *)) __pyx_memview_get_double, (int (*)(char *, PyObject *)) __pyx_memview_set_double, 0);; if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 12, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __pyx_t_5 = NULL;
  __pyx_t_6 = 0;
  #if CYTHON_UNPACK_METHODS
  if (unlikely(PyMethod_Check(__pyx_t_4))) {
    __pyx_t_5 = PyMethod_GET_SELF(__pyx_t_4);
    if (likely(__pyx_t_5)) {
      PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_4);
      __Pyx_INCREF(__pyx_t_5);
      __Pyx_INCREF(function);
      __Pyx_DECREF_SET(__pyx_t_4, function);
      __pyx_t_6 = 1;
    }
  }
  #endif
  {
    PyObject *__pyx_callargs[2] = {__pyx_t_5, __pyx_t_3};
    __pyx_t_2 = __Pyx_PyObject_FastCall(__pyx_t_4, __pyx_callargs+1-__pyx_t_6, 1+__pyx_t_6);
    __Pyx_XDECREF(__pyx_t_5); __pyx_t_5 = 0;
    __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
    if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 12, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_2);
    __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
  }
  __pyx_t_1 = __Pyx_PyObject_to_MemoryviewSlice_dc_double(__pyx_t_2, PyBUF_WRITABLE); if (unlikely(!__pyx_t_1.memview)) __PYX_ERR(0, 12, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
  __pyx_v_x_k = __pyx_t_1;
  __pyx_t_1.memview = NULL;
  __pyx_t_1.data = NULL;
+13:     x_k[0] = x[0]
  __pyx_t_7 = 0;
  __pyx_t_8 = 0;
  *((double *) ( /* dim=0 */ ((char *) (((double *) __pyx_v_x_k.data) + __pyx_t_8)) )) = (*((double *) ( /* dim=0 */ ((char *) (((double *) __pyx_v_x.data) + __pyx_t_7)) )));
+14:     x_k[N - 1] = x[N - 1]
  __pyx_t_9 = (__pyx_v_N - 1);
  __pyx_t_10 = (__pyx_v_N - 1);
  *((double *) ( /* dim=0 */ ((char *) (((double *) __pyx_v_x_k.data) + __pyx_t_10)) )) = (*((double *) ( /* dim=0 */ ((char *) (((double *) __pyx_v_x.data) + __pyx_t_9)) )));
+15:     for _ in range(steps):
  __pyx_t_6 = __pyx_v_steps;
  __pyx_t_11 = __pyx_t_6;
  for (__pyx_t_12 = 0; __pyx_t_12 < __pyx_t_11; __pyx_t_12+=1) {
    __pyx_v__ = __pyx_t_12;
+16:         for i in range(1, N - 1):
    __pyx_t_9 = (__pyx_v_N - 1);
    __pyx_t_10 = __pyx_t_9;
    for (__pyx_t_13 = 1; __pyx_t_13 < __pyx_t_10; __pyx_t_13+=1) {
      __pyx_v_i = __pyx_t_13;
+17:             x_k[i] = 0.25 * (x[i - 1] + 2 * x[i] + x[i + 1])
      __pyx_t_14 = (__pyx_v_i - 1);
      __pyx_t_15 = __pyx_v_i;
      __pyx_t_16 = (__pyx_v_i + 1);
      __pyx_t_17 = __pyx_v_i;
      *((double *) ( /* dim=0 */ ((char *) (((double *) __pyx_v_x_k.data) + __pyx_t_17)) )) = (0.25 * (((*((double *) ( /* dim=0 */ ((char *) (((double *) __pyx_v_x.data) + __pyx_t_14)) ))) + (2.0 * (*((double *) ( /* dim=0 */ ((char *) (((double *) __pyx_v_x.data) + __pyx_t_15)) ))))) + (*((double *) ( /* dim=0 */ ((char *) (((double *) __pyx_v_x.data) + __pyx_t_16)) )))));
    }
+18:         x, x_k = x_k, x  # swap for next step
    __pyx_t_1 = __pyx_v_x_k;
    __PYX_INC_MEMVIEW(&__pyx_t_1, 1);
    __pyx_t_18 = __pyx_v_x;
    __PYX_INC_MEMVIEW(&__pyx_t_18, 1);
    __PYX_XCLEAR_MEMVIEW(&__pyx_v_x, 1);
    __pyx_v_x = __pyx_t_1;
    __pyx_t_1.memview = NULL;
    __pyx_t_1.data = NULL;
    __PYX_XCLEAR_MEMVIEW(&__pyx_v_x_k, 1);
    __pyx_v_x_k = __pyx_t_18;
    __pyx_t_18.memview = NULL;
    __pyx_t_18.data = NULL;
  }
+19:     return np.asarray(x)
  __Pyx_XDECREF(__pyx_r);
  __Pyx_GetModuleGlobalName(__pyx_t_4, __pyx_n_s_np); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 19, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_4);
  __pyx_t_3 = __Pyx_PyObject_GetAttrStr(__pyx_t_4, __pyx_n_s_asarray); if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 19, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
  __pyx_t_4 = __pyx_memoryview_fromslice(__pyx_v_x, 1, (PyObject *(*)(char *)) __pyx_memview_get_double, (int (*)(char *, PyObject *)) __pyx_memview_set_double, 0);; if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 19, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_4);
  __pyx_t_5 = NULL;
  __pyx_t_6 = 0;
  #if CYTHON_UNPACK_METHODS
  if (unlikely(PyMethod_Check(__pyx_t_3))) {
    __pyx_t_5 = PyMethod_GET_SELF(__pyx_t_3);
    if (likely(__pyx_t_5)) {
      PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_3);
      __Pyx_INCREF(__pyx_t_5);
      __Pyx_INCREF(function);
      __Pyx_DECREF_SET(__pyx_t_3, function);
      __pyx_t_6 = 1;
    }
  }
  #endif
  {
    PyObject *__pyx_callargs[2] = {__pyx_t_5, __pyx_t_4};
    __pyx_t_2 = __Pyx_PyObject_FastCall(__pyx_t_3, __pyx_callargs+1-__pyx_t_6, 1+__pyx_t_6);
    __Pyx_XDECREF(__pyx_t_5); __pyx_t_5 = 0;
    __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
    if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 19, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_2);
    __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
  }
  __pyx_r = __pyx_t_2;
  __pyx_t_2 = 0;
  goto __pyx_L0;

Note how there is now zero yellow called in any of the loops, only in the initial copy of the input array.

copt = %timeit -o y = blur_cython_optimized(x, steps)
t_copt = copt.best
times["cython (optimized)"] = t_copt
plot_times()
1.84 ms ± 22.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
../../_images/21283172609c0381a52871a63f3c2208695014a689a85579e4aa195819ad5fb2.png
y = blur_cython_optimized(x, steps)  # noqa: F821
plt.plot(t, x, "--")
plt.plot(t, y)
[<matplotlib.lines.Line2D at 0x16bb5c890>]
../../_images/3cf60ecc74017a5d654a9885344d4247315fee53c365a27609a43bbac51b4044.png

numba#

numba is a library that attempts to automatically do type-based optimizations like we did with Cython. To use numba, you decorate functions with @jit.

import numba


@numba.jit(nopython=True)
def blur_numba(x, steps=1024):
    """identical to blur_py, other than the decorator"""
    x = 1 * x  # copy
    x_k = np.empty_like(x)
    x_k[0] = x[0]
    x_k[-1] = x[-1]
    for _ in range(steps):
        for i in range(1, len(x) - 1):
            x_k[i] = 0.25 * (x[i - 1] + 2 * x[i] + x[i + 1])
        x, x_k = x_k, x  # swap for next step
    return x


y = blur_numba(x, steps)
nb = %timeit -o blur_numba(x, steps)
t_nb = nb.best
times["numba"] = t_nb
plot_times()
2.01 ms ± 2.48 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
../../_images/4e96730a2f8d5d15ece891f0bdb1390c3e47b69f4ec81752a81686b18eb8004d.png
timing_table(times)
implementationspeed
python1.0 (normalized)
numpy89.2x
cython (no hints)1.05x
cython (loops)3.38x
cython (types)1,560x
cython (optimized)1,634x
numba1,472x

How did numba do that without any type info?

blur_numba.inspect_types()
blur_numba (Array(float64, 1, 'C', False, aligned=True), int64)
--------------------------------------------------------------------------------
# File: /var/folders/qr/3vxfnp1x2t1fw55dr288mphc0000gn/T/ipykernel_35659/4028818462.py
# --- LINE 4 --- 
# label 0
#   x = arg(0, name=x)  :: array(float64, 1d, C)
#   steps = arg(1, name=steps)  :: int64

@numba.jit(nopython=True)

# --- LINE 5 --- 

def blur_numba(x, steps=1024):

    # --- LINE 6 --- 

    """identical to blur_py, other than the decorator"""

    # --- LINE 7 --- 
    #   $const4.0 = const(int, 1)  :: Literal[int](1)
    #   del $const4.0
    #   x.1 = arrayexpr(expr=(<built-in function mul>, [const(int, 1), Var(x, 4028818462.py:4)]), ty=array(float64, 1d, C))  :: array(float64, 1d, C)
    #   del x
    #   x.1.2 = x.1  :: array(float64, 1d, C)

    x = 1 * x  # copy

    # --- LINE 8 --- 
    #   $14load_global.3 = global(np: <module 'numpy' from '/Users/minrk/conda/lib/python3.11/site-packages/numpy/__init__.py'>)  :: Module(<module 'numpy' from '/Users/minrk/conda/lib/python3.11/site-packages/numpy/__init__.py'>)
    #   $26load_method.5 = getattr(value=$14load_global.3, attr=empty_like)  :: Function(<function empty_like at 0x10602fd80>)
    #   del $14load_global.3
    #   x_k = call $26load_method.5(x.1, func=$26load_method.5, args=[Var(x.1, 4028818462.py:7)], kws=(), vararg=None, varkwarg=None, target=None)  :: (Array(float64, 1, 'C', False, aligned=True), omitted(default=None)) -> array(float64, 1d, C)
    #   del $26load_method.5
    #   x_k.2 = x_k  :: array(float64, 1d, C)

    x_k = np.empty_like(x)

    # --- LINE 9 --- 
    #   $const68.9 = const(int, 0)  :: Literal[int](0)
    #   $70binary_subscr.10 = static_getitem(value=x.1, index=0, index_var=$const68.9, fn=<built-in function getitem>)  :: float64
    #   del $const68.9
    #   $const82.12 = const(int, 0)  :: Literal[int](0)
    #   x_k[0] = $70binary_subscr.10
    #   del $const82.12
    #   del $70binary_subscr.10

    x_k[0] = x[0]

    # --- LINE 10 --- 
    #   $const90.14 = const(int, -1)  :: Literal[int](-1)
    #   $92binary_subscr.15 = static_getitem(value=x.1, index=-1, index_var=$const90.14, fn=<built-in function getitem>)  :: float64
    #   del x.1
    #   del $const90.14
    #   $const104.17 = const(int, -1)  :: Literal[int](-1)
    #   x_k[-1] = $92binary_subscr.15
    #   del x_k
    #   del $const104.17
    #   del $92binary_subscr.15

    x_k[-1] = x[-1]

    # --- LINE 11 --- 
    #   $110load_global.18 = global(range: <class 'range'>)  :: Function(<class 'range'>)
    #   $128call.21 = call $110load_global.18(steps, func=$110load_global.18, args=[Var(steps, 4028818462.py:4)], kws=(), vararg=None, varkwarg=None, target=None)  :: (int64,) -> range_state_int64
    #   del steps
    #   del $110load_global.18
    #   $138get_iter.22 = getiter(value=$128call.21)  :: range_iter_int64
    #   del $128call.21
    #   $phi140.0 = $138get_iter.22  :: range_iter_int64
    #   del $138get_iter.22
    #   jump 140
    # label 140
    #   $140for_iter.1 = iternext(value=$phi140.0)  :: pair<int64, bool>
    #   $140for_iter.2 = pair_first(value=$140for_iter.1)  :: int64
    #   $140for_iter.3 = pair_second(value=$140for_iter.1)  :: bool
    #   del $140for_iter.1
    #   $phi142.1 = $140for_iter.2  :: int64
    #   del $140for_iter.2
    #   branch $140for_iter.3, 142, 306
    # label 142
    #   del $140for_iter.3
    #   _ = $phi142.1  :: int64
    #   del _
    #   del $phi142.1

    for _ in range(steps):

        # --- LINE 12 --- 
        #   $144load_global.2 = global(range: <class 'range'>)  :: Function(<class 'range'>)
        #   $const156.4 = const(int, 1)  :: Literal[int](1)
        #   $158load_global.5 = global(len: <built-in function len>)  :: Function(<built-in function len>)
        #   $176call.8 = call $158load_global.5(x.1.2, func=$158load_global.5, args=[Var(x.1.2, 4028818462.py:11)], kws=(), vararg=None, varkwarg=None, target=None)  :: (Array(float64, 1, 'C', False, aligned=True),) -> int64
        #   del $158load_global.5
        #   $const186.9 = const(int, 1)  :: Literal[int](1)
        #   $binop_sub188.10 = $176call.8 - $const186.9  :: int64
        #   del $const186.9
        #   del $176call.8
        #   $196call.11 = call $144load_global.2($const156.4, $binop_sub188.10, func=$144load_global.2, args=[Var($const156.4, 4028818462.py:12), Var($binop_sub188.10, 4028818462.py:12)], kws=(), vararg=None, varkwarg=None, target=None)  :: (int64, int64) -> range_state_int64
        #   del $const156.4
        #   del $binop_sub188.10
        #   del $144load_global.2
        #   $206get_iter.12 = getiter(value=$196call.11)  :: range_iter_int64
        #   del $196call.11
        #   $phi208.1 = $206get_iter.12  :: range_iter_int64
        #   del $206get_iter.12
        #   jump 208
        # label 208
        #   $208for_iter.2 = iternext(value=$phi208.1)  :: pair<int64, bool>
        #   $208for_iter.3 = pair_first(value=$208for_iter.2)  :: int64
        #   $208for_iter.4 = pair_second(value=$208for_iter.2)  :: bool
        #   del $208for_iter.2
        #   $phi210.2 = $208for_iter.3  :: int64
        #   del $208for_iter.3
        #   branch $208for_iter.4, 210, 296
        # label 210
        #   del $208for_iter.4
        #   i = $phi210.2  :: int64
        #   del $phi210.2

        for i in range(1, len(x) - 1):

            # --- LINE 13 --- 
            #   $const212.3 = const(float, 0.25)  :: float64
            #   $const218.6 = const(int, 1)  :: Literal[int](1)
            #   $binop_sub220.7 = i - $const218.6  :: int64
            #   del $const218.6
            #   $224binary_subscr.8 = getitem(value=x.1.2, index=$binop_sub220.7, fn=<built-in function getitem>)  :: float64
            #   del $binop_sub220.7
            #   $const234.9 = const(int, 2)  :: Literal[int](2)
            #   $240binary_subscr.12 = getitem(value=x.1.2, index=i, fn=<built-in function getitem>)  :: float64
            #   $binop_mul250.13 = $const234.9 * $240binary_subscr.12  :: float64
            #   del $const234.9
            #   del $240binary_subscr.12
            #   $binop_add254.14 = $224binary_subscr.8 + $binop_mul250.13  :: float64
            #   del $binop_mul250.13
            #   del $224binary_subscr.8
            #   $const262.17 = const(int, 1)  :: Literal[int](1)
            #   $binop_add264.18 = i + $const262.17  :: int64
            #   del $const262.17
            #   $268binary_subscr.19 = getitem(value=x.1.2, index=$binop_add264.18, fn=<built-in function getitem>)  :: float64
            #   del $binop_add264.18
            #   $binop_add278.20 = $binop_add254.14 + $268binary_subscr.19  :: float64
            #   del $binop_add254.14
            #   del $268binary_subscr.19
            #   $binop_mul282.21 = $const212.3 * $binop_add278.20  :: float64
            #   del $const212.3
            #   del $binop_add278.20
            #   x_k.2[i] = $binop_mul282.21  :: (Array(float64, 1, 'C', False, aligned=True), int64, float64) -> none
            #   del i
            #   del $binop_mul282.21
            #   jump 208

            x_k[i] = 0.25 * (x[i - 1] + 2 * x[i] + x[i + 1])

        # --- LINE 14 --- 
        # label 296
        #   del $phi210.2
        #   del $phi208.1
        #   del $208for_iter.4
        #   $x_k296.1 = x_k.2  :: array(float64, 1d, C)
        #   x_k.1 = x.1.2  :: array(float64, 1d, C)
        #   x_k.2 = x_k.1  :: array(float64, 1d, C)
        #   del x_k.1
        #   x.1.1 = $x_k296.1  :: array(float64, 1d, C)
        #   del $x_k296.1
        #   x.1.2 = x.1.1  :: array(float64, 1d, C)
        #   del x.1.1
        #   jump 140

        x, x_k = x_k, x  # swap for next step

    # --- LINE 15 --- 
    # label 306
    #   del x_k.2
    #   del $phi142.1
    #   del $phi140.0
    #   del $140for_iter.3
    #   $308return_value.1 = cast(value=x.1.2)  :: array(float64, 1d, C)
    #   del x.1.2
    #   return $308return_value.1

    return x


================================================================================

What’s impressive about numba in this case is that it is able to beat all but the most optimized of our implementations without any help. Like Cython, numba can do an even better job when you provide it with more information about how a function will be called.

Profiling#

A script can be run with

python3 -m cProfile myscript.py
import sys
%prun list(os.walk(sys.prefix))
         3244810 function calls (2789533 primitive calls) in 6.431 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   741078    2.552    0.000    2.552    0.000 {built-in method builtins.next}
    56634    1.431    0.000    1.431    0.000 {built-in method posix.lstat}
   684494    1.138    0.000    1.138    0.000 {method 'is_dir' of 'posix.DirEntry' objects}
511862/56585    0.607    0.000    6.418    0.000 <frozen os>:345(_walk)
    56584    0.532    0.000    0.532    0.000 {built-in method posix.scandir}
    56634    0.063    0.000    0.095    0.000 <frozen posixpath>:71(join)
   684494    0.032    0.000    0.032    0.000 {method 'append' of 'list' objects}
    56634    0.022    0.000    1.456    0.000 <frozen posixpath>:164(islink)
        1    0.013    0.013    6.431    6.431 <string>:1(<module>)
    56634    0.012    0.000    0.016    0.000 <frozen posixpath>:41(_get_sep)
    56634    0.007    0.000    0.007    0.000 {method 'startswith' of 'str' objects}
    56634    0.006    0.000    0.006    0.000 {method 'endswith' of 'str' objects}
    56584    0.005    0.000    0.005    0.000 {method '__exit__' of 'posix.ScandirIterator' objects}
    56634    0.004    0.000    0.004    0.000 {built-in method builtins.isinstance}
    56634    0.004    0.000    0.004    0.000 {built-in method _stat.S_ISLNK}
    56635    0.003    0.000    0.003    0.000 {built-in method posix.fspath}
        1    0.000    0.000    6.431    6.431 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 <frozen os>:282(walk)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 cycler.py:239(__iter__)
        1    0.000    0.000    0.000    0.000 {built-in method sys.audit}
 

Snakeviz instaleld with:

python3 -m pip install snakeviz
%load_ext snakeviz
%snakeviz -t blur_py(x, steps)
 
*** Profile stats marshalled to file '/var/folders/qr/3vxfnp1x2t1fw55dr288mphc0000gn/T/tmpzxarik_z'.
Opening SnakeViz in a new tab...
import hashlib
%%snakeviz -t
for dirpath, dirnames, filenames in os.walk('/usr/local'):
    for filename in filenames:
        if filename.endswith('.txt'):
            full_path = os.path.join(dirpath, filename)
            with open(full_path, 'rb') as f:
                hashlib.md5(f.read())
 
*** Profile stats marshalled to file '/var/folders/qr/3vxfnp1x2t1fw55dr288mphc0000gn/T/tmpf9tbru6q'.
Opening SnakeViz in a new tab...

line_profiler installed with:

python3 -m pip install line_profiler
%load_ext line_profiler
%lprun -f blur_py blur_py(x, steps)
Timer unit: 1e-09 s

Total time: 5.55848 s
File: /var/folders/qr/3vxfnp1x2t1fw55dr288mphc0000gn/T/ipykernel_35659/2340594174.py
Function: blur_py at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def blur_py(x0, steps=1024):
     2                                               """Advance a Gaussian blur `steps` steps"""
     3         1     101000.0 101000.0      0.0      x = 1 * x0  # copy
     4         1      32000.0  32000.0      0.0      x_k = np.empty_like(x)
     5         1       7000.0   7000.0      0.0      x_k[0] = x[0]
     6         1       1000.0   1000.0      0.0      x_k[-1] = x[-1]
     7     10001     803000.0     80.3      0.0      for k in range(steps):
     8  10230000  793656000.0     77.6     14.3          for i in range(1, len(x) - 1):
     9  10220000 4762891000.0    466.0     85.7              x_k[i] = 0.25 * (x[i - 1] + 2 * x[i] + x[i + 1])
    10     10000     988000.0     98.8      0.0          x, x_k = x_k, x  # swap for next step
    11         1          0.0      0.0      0.0      return x
%lprun -f blur_np blur_np(x, steps)
Timer unit: 1e-09 s

Total time: 0.037441 s
File: /var/folders/qr/3vxfnp1x2t1fw55dr288mphc0000gn/T/ipykernel_35659/109327557.py
Function: blur_np at line 4

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     4                                           def blur_np(x, steps=1024):
     5         1      21000.0  21000.0      0.1      x = 1 * x
     6         1       5000.0   5000.0      0.0      x_k = np.empty_like(x)
     7         1       1000.0   1000.0      0.0      x_k[0] = x[0]
     8         1          0.0      0.0      0.0      x_k[-1] = x[-1]
     9     10001     796000.0     79.6      2.1      for _ in range(steps):
    10     10000   35757000.0   3575.7     95.5          x_k[1:-1] = 0.25 * (x[0:-2] + 2 * x[1:-1] + x[2:])
    11     10000     861000.0     86.1      2.3          x, x_k = x_k, x
    12         1          0.0      0.0      0.0      return x
%lprun -f blur_numba -f blur_numba.__wrapped__ blur_numba(x, steps)
/Users/minrk/conda/lib/python3.11/site-packages/line_profiler/ipython_extension.py:97: UserWarning: Adding a function with a __wrapped__ attribute. You may want to profile the wrapped function by adding blur_numba.__wrapped__ instead.
  profile = LineProfiler(*funcs)
Timer unit: 1e-09 s

Total time: 0 s
File: /var/folders/qr/3vxfnp1x2t1fw55dr288mphc0000gn/T/ipykernel_35659/4028818462.py
Function: blur_numba at line 4

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     4                                           @numba.jit(nopython=True)
     5                                           def blur_numba(x, steps=1024):
     6                                               """identical to blur_py, other than the decorator"""
     7                                               x = 1 * x  # copy
     8                                               x_k = np.empty_like(x)
     9                                               x_k[0] = x[0]
    10                                               x_k[-1] = x[-1]
    11                                               for _ in range(steps):
    12                                                   for i in range(1, len(x) - 1):
    13                                                       x_k[i] = 0.25 * (x[i - 1] + 2 * x[i] + x[i + 1])
    14                                                   x, x_k = x_k, x  # swap for next step
    15                                               return x
%lprun -f blur_cython blur_cython(x, steps)
Timer unit: 1e-09 s
%snakeviz -t np.dot(x, x)
 
*** Profile stats marshalled to file '/var/folders/qr/3vxfnp1x2t1fw55dr288mphc0000gn/T/tmpykxzyuqd'.
Opening SnakeViz in a new tab...