2.4. Optimizing code¶

Author: Gaël Varoquaux

This chapter deals with strategies to make Python code go faster.

2.4.1. Optimization workflow ¶

Make it work: write the code in a simple legible ways.
Make it work reliably: write automated test cases, make really sure that your algorithm is right and that if you break it, the tests will capture the breakage.
Optimize the code by profiling simple use-cases to find the bottlenecks and speeding up these bottleneck, finding a better algorithm or implementation. Keep in mind that a trade off should be found between profiling on a realistic example and the simplicity and speed of execution of the code. For efficient work, it is best to work with profiling runs lasting around 10s.

2.4.2. Profiling Python code ¶

2.4.2.1. Timeit ¶

In IPython, use timeit (https://docs.python.org/3/library/timeit.html) to time elementary operations:

In [1]: import numpy as np
In [2]: a=np.arange(1000)
In [3]: %timeit a ** 2
944 ns +- 3.08 ns per loop (mean +- std. dev. of 7 runs, 1,000,000 loops each)
In [4]: %timeit a ** 2.1
16 us +- 103 ns per loop (mean +- std. dev. of 7 runs, 100,000 loops each)
In [5]: %timeit a * a
1.06 us +- 4.72 ns per loop (mean +- std. dev. of 7 runs, 1,000,000 loops each)

Use this to guide your choice between strategies.

Note

For long running calls, using %time instead of %timeit; it is less precise but faster

This is a combination of two unsupervised learning techniques, principal component analysis (PCA) and independent component analysis (ICA). PCA is a technique for dimensionality reduction, i.e. an algorithm to explain the observed variance in your data using less dimensions. ICA is a source separation technique, for example to unmix multiple signals that have been recorded through multiple sensors. Doing a PCA first and then an ICA can be useful if you have more sensors than signals. For more information see: the FastICA example from scikits-learn.

To run it, you also need to download the ica module. In IPython we can time the script:

In [6]: %run -t demo.py
IPython CPU timings (estimated):
    User  :    14.3929 s.
    System:   0.256016 s.

and profile it:

In [7]: %run -p demo.py
function calls in 14.551 CPU seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall filename:lineno (function)
 14.457   14.457   14.479   14.479 decomp.py:849 (svd)
  0.054    0.054    0.054    0.054 {method 'random_sample' of 'mtrand.RandomState' objects}
  0.017    0.017    0.021    0.021 function_base.py:645 (asarray_chkfinite)
  0.011    0.000    0.011    0.000 {numpy.core._dotblas.dot}
  0.005    0.002    0.005    0.002 {method 'any' of 'numpy.ndarray' objects}
  0.001    0.000    0.001    0.000 ica.py:195 (gprime)
  0.001    0.000    0.001    0.000 ica.py:192 (g)
  0.001    0.000    0.001    0.000 {numpy.linalg.lapack_lite.dsyevd}
  0.001    0.000    0.001    0.000 twodim_base.py:204 (diag)
  0.001    0.001    0.008    0.008 ica.py:69 (_ica_par)
  0.001    0.001   14.551   14.551 {execfile}
  0.000    0.000    0.001    0.000 defmatrix.py:239 (__array_finalize__)
  0.000    0.000    0.004    0.001 ica.py:58 (_sym_decorrelation)
  0.000    0.000    0.002    0.000 linalg.py:841 (eigh)
  0.000    0.000    0.000    0.000 {isinstance}
  0.000    0.000   14.551   14.551 demo.py:1 (<module>)
  0.000    0.000    0.000    0.000 numeric.py:180 (asarray)
  0.000    0.000    0.000    0.000 defmatrix.py:193 (__new__)
  0.000    0.000    0.001    0.000 defmatrix.py:43 (asmatrix)
  0.000    0.000    0.001    0.000 defmatrix.py:287 (__mul__)
  0.000    0.000    0.000    0.000 {numpy.core.multiarray.zeros}
  0.000    0.000    0.000    0.000 {method 'transpose' of 'numpy.ndarray' objects}
  0.000    0.000    0.008    0.008 ica.py:97 (fastica)
     ...

Clearly the svd (in decomp.py) is what takes most of our time, a.k.a. the bottleneck. We have to find a way to make this step go faster, or to avoid this step (algorithmic optimization). Spending time on the rest of the code is useless.

2.4.2.3. Line-profiler ¶

The profiler tells us which function takes most of the time, but not where it is called.

For this, we use the line_profiler: in the source file, we decorate a few functions that we want to inspect with @profile (no need to import it)

@profile
def test():
rng=np.random.default_rng()
data=rng.random((5000,100))
u,s,v=linalg.svd(data)
pca=u[:,:10]@data
results=fastica(pca.T,whiten=False)

Then we run the script using the kernprof command, with switches -l, --line-by-line and -v, --view to use the line-by-line profiler and view the results in addition to saving them:

$ kernprof -l -v demo.py
Wrote profile results to demo.py.lprof
Timer unit: 1e-06 s
Total time: 1.27874 s
File: demo.py
Function: test at line 9
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     9                                           @profile
    10                                           def test():
    11         1         69.0     69.0      0.0      rng = np.random.default_rng()
    12         1       2453.0   2453.0      0.2      data = rng.random((5000, 100))
    13         1    1274715.0 1274715.0     99.7      u, s, v = sp.linalg.svd(data)
    14         1        413.0    413.0      0.0      pca = u[:, :10].T @ data
    15         1       1094.0   1094.0      0.1      results = fastica(pca.T, whiten=False)

The SVD is taking all the time. We need to optimise this line.

2.4.3. Making code go faster ¶

Once we have identified the bottlenecks, we need to make the corresponding code go faster.

2.4.3.1. Algorithmic optimization ¶

The first thing to look for is algorithmic optimization: are there ways to compute less, or better?

For a high-level view of the problem, a good understanding of the maths behind the algorithm helps. However, it is not uncommon to find simple changes, like moving computation or memory allocation outside a for loop, that bring in big gains.

Example of the SVD ¶

In both examples above, the SVD - Singular Value Decomposition - is what takes most of the time. Indeed, the computational cost of this algorithm is roughly $n^3$ in the size of the input matrix.

However, in both of these example, we are not using all the output of the SVD, but only the first few rows of its first return argument. If we use the svd implementation of SciPy, we can ask for an incomplete version of the SVD. Note that implementations of linear algebra in SciPy are richer then those in NumPy and should be preferred.

In [8]: %timeit np.linalg.svd(data)
1 loops, best of 3: 14.5 s per loop
In [9]: import scipy as sp
In [10]: %timeit sp.linalg.svd(data)
1 loops, best of 3: 14.2 s per loop
In [11]: %timeit sp.linalg.svd(data, full_matrices=False)
1 loops, best of 3: 295 ms per loop
In [12]: %timeit np.linalg.svd(data, full_matrices=False)
1 loops, best of 3: 293 ms per loop

We can then use this insight to optimize the previous code:

def test():
rng=np.random.default_rng()
data=rng.random((5000,100))
u,s,v=sp.linalg.svd(data,full_matrices=False)
pca=u[:,:10].T@data
results=fastica(pca.T,whiten=False)

In [13]: import demo
In [14]: %timeit demo.
demo.fastica   demo.np        demo.prof.pdf  demo.py        demo.pyc
demo.linalg    demo.prof      demo.prof.png  demo.py.lprof  demo.test
In [15]: %timeit demo.test()
ica.py:65: RuntimeWarning: invalid value encountered in sqrt
  W = (u * np.diag(1.0/np.sqrt(s)) * u.T) * W  # W = (W * W.T) ^{-1/2} * W
1 loops, best of 3: 17.5 s per loop
In [16]: import demo_opt
In [17]: %timeit demo_opt.test()
1 loops, best of 3: 208 ms per loop

Real incomplete SVDs, e.g. computing only the first 10 eigenvectors, can be computed with arpack, available in scipy.sparse.linalg.eigsh.

2.4.4. Writing faster numerical code ¶

A complete discussion on advanced use of NumPy is found in chapter Advanced NumPy, or in the article The NumPy array: a structure for efficient numerical computation by van der Walt et al. Here we discuss only some commonly encountered tricks to make code faster.

Vectorizing for loops

Find tricks to avoid for loops using NumPy arrays. For this, masks and indices arrays can be useful.
Broadcasting

Use broadcasting to do operations on arrays as small as possible before combining them.

In place operations

In [18]: a=np.zeros(1e7)
In [19]: %timeit global a ; a = 0*a
10 loops, best of 3: 111 ms per loop
In [20]: %timeit global a ; a *= 0
10 loops, best of 3: 48.4 ms per loop

note: we need global a in the timeit so that it work, as it is assigning to a, and thus considers it as a local variable.

Be easy on the memory: use views, and not copies

Copying big arrays is as costly as making simple numerical operations on them:

In [21]: a=np.zeros(1e7)
In [22]: %timeit a.copy()
10 loops, best of 3: 124 ms per loop
In [23]: %timeit a + 1
10 loops, best of 3: 112 ms per loop

Beware of cache effects

Memory access is cheaper when it is grouped: accessing a big array in a continuous way is much faster than random access. This implies amongst other things that smaller strides are faster (see CPU cache effects):

In [24]: c=np.zeros((1e4,1e4),order='C')
In [25]: %timeit c.sum(axis=0)
1 loops, best of 3: 3.89 s per loop
In [26]: %timeit c.sum(axis=1)
1 loops, best of 3: 188 ms per loop
In [27]: c.strides
Out[27]: (80000, 8)

This is the reason why Fortran ordering or C ordering may make a big difference on operations:

In [28]: rng=np.random.default_rng()
In [29]: a=rng.random((20,2**18))
In [30]: b=rng.random((20,2**18))
In [31]: %timeit b @ a.T
9.39 ms +- 11.3 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
In [32]: c=np.ascontiguousarray(a.T)
In [33]: %timeit b @ c
10.4 ms +- 32.6 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

Note that copying the data to work around this effect may not be worth it:

In [34]: %timeit c = np.ascontiguousarray(a.T)
15.5 ms +- 61.2 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

Using numexpr can be useful to automatically optimize code for such effects.

Use compiled code

The last resort, once you are sure that all the high-level optimizations have been explored, is to transfer the hot spots, i.e. the few lines or functions in which most of the time is spent, to compiled code. For compiled code, the preferred option is to use Cython: it is easy to transform exiting Python code in compiled code, and with a good use of the NumPy support yields efficient code on NumPy arrays, for instance by unrolling loops.

Warning

For all the above: profile and time your choices. Don’t base your optimization on theoretical considerations.

2.4.4.1. Additional Links ¶

If you need to profile memory usage, you could try the memory_profiler
If you need to profile down into C extensions, you could try using gperftools from Python with yep.
If you would like to track performance of your code across time, i.e. as you make new commits to your repository, you could try: asv
If you need some interactive visualization why not try RunSnakeRun

2.4. Optimizing code¶

2.4.1. Optimization workflow ¶

2.4.2. Profiling Python code ¶

2.4.2.1. Timeit ¶

2.4.2.2. Profiler ¶

2.4.2.3. Line-profiler ¶

2.4.3. Making code go faster ¶

2.4.3.1. Algorithmic optimization ¶

Example of the SVD ¶

2.4.4. Writing faster numerical code ¶

2.4.4.1. Additional Links ¶

Table of Contents

Previous topic

Next topic

This Page