---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.18.0-dev
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

(advanced-numpy)=

# Advanced NumPy

**Author**: _Pauli Virtanen_

NumPy is at the base of Python's scientific stack of tools. Its purpose is to
implement efficient operations on many items in a block of memory.
Understanding how it works in detail helps in making efficient use of its
flexibility and in taking useful shortcuts.

This section covers:

- Anatomy of NumPy arrays, and their consequences. Tips and tricks.
- Universal functions: what, why, and what to do if you want a new one.
- Integration with other tools: NumPy offers several ways to wrap any data in
  an ndarray, without unnecessary copies.
- Recently added features, and what's in them: PEP 3118 buffers, generalized
  ufuncs, ...

:::{admonition} Prerequisites

- NumPy
- Cython
- Pillow (Python imaging library, used in a couple of examples)
:::

```{code-cell}
# Import NumPy module.
import numpy as np
# Import Matplotlib (for later).
import matplotlib.pyplot as plt
```

## Life of ndarray

### It's...

::: {admonition} What is an **ndarray**

An **ndarray** is:

- A block of memory and
- an indexing scheme and
- a data type descriptor.
:::

Put another way, an ndarray has **raw data**, and algorithms to:

- locate an element
- interpret an element

::: {image} threefundamental.png
:::

```c
typedef struct PyArrayObject {
    PyObject_HEAD

    /* Block of memory */
    char *data;

    /* Data type descriptor */
    PyArray_Descr *descr;

    /* Indexing scheme */
    int nd;
    npy_intp *dimensions;
    npy_intp *strides;

    /* Other stuff */
    PyObject *base;
    int flags;
    PyObject *weakreflist;
} PyArrayObject;
```

### Block of memory

```{code-cell}
x = np.array([1, 2, 3], dtype=np.int32)
x.data
```

```{code-cell}
bytes(x.data)
```

Memory address of the data:

```{code-cell}
x.__array_interface__['data'][0]
```

The whole `__array_interface__`:

```{code-cell}
x.__array_interface__
```

Reminder: two {class}`ndarrays <numpy.ndarray>` may share the same memory:

```{code-cell}
x = np.array([1, 2, 3, 4])
y = x[:-1]
x[0] = 9
y
```

Memory does not need to be owned by an {class}`ndarray`:

```{code-cell}
x = b'1234'
```

`x` is a byte string (a `bytes` object in Python 3); we can represent its data
as an array of ints:

```{code-cell}
y = np.frombuffer(x, dtype=np.int8)
y.data
```

```{code-cell}
y.base is x
```

```{code-cell}
y.flags
```

The `owndata` and `writeable` flags indicate the status of the memory block.
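As a minimal sketch of what these two flags mean in practice (the variable names `raw`, `borrowed`, and `owned` are illustrative and not part of the original text), the following compares an array that borrows read-only memory from a `bytes` object with an explicit copy that owns its own block:

```python
import numpy as np

# The bytes object owns the memory; the array only borrows it.
raw = b"1234"
borrowed = np.frombuffer(raw, dtype=np.int8)

# Borrowed from an immutable object: neither owned nor writeable.
print(borrowed.flags.owndata, borrowed.flags.writeable)

# Writing into the read-only block is refused with a ValueError.
try:
    borrowed[0] = 0
    write_failed = False
except ValueError:
    write_failed = True

# An explicit copy allocates a fresh block that the new array owns,
# so it can be modified without touching the original bytes.
owned = borrowed.copy()
owned[0] = 0
print(owned.flags.owndata, owned.flags.writeable)
```

Copying is the simplest way to get a writeable array here; it trades the zero-copy sharing for independence from the original buffer.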
:::{admonition} See also [array interface](https://numpy.org/doc/stable/reference/arrays.interface.html) ::: ### Data types #### The descriptor {class}`dtype` describes a single item in the array: ::: {list-table} **Dtypes** - - type - **scalar type** of the data, one of: - int8, int16, float64, _et al._ (fixed size) - str, unicode, void (flexible size) - - itemsize - **size** of the data block - - byteorder - **byte order**: - big-endian `>` - little-endian `<` - not applicable `|` - - fields - sub-dtypes, if it's a **structured data type** - - shape - shape of the array, if it's a **sub-array** ::: ```{code-cell} np.dtype(int).type ``` ```{code-cell} np.dtype(int).itemsize ``` ```{code-cell} np.dtype(int).byteorder ``` #### Example: reading `.wav` files The `.wav` file header: | | | | --------------- | ------------------------------------- | | chunk_id | `"RIFF"` | | chunk_size | 4-byte unsigned little-endian integer | | format | `"WAVE"` | | fmt_id | `"fmt "` | | fmt_size | 4-byte unsigned little-endian integer | | audio_fmt | 2-byte unsigned little-endian integer | | num_channels | 2-byte unsigned little-endian integer | | sample_rate | 4-byte unsigned little-endian integer | | byte_rate | 4-byte unsigned little-endian integer | | block_align | 2-byte unsigned little-endian integer | | bits_per_sample | 2-byte unsigned little-endian integer | | data_id | `"data"` | | data_size | 4-byte unsigned little-endian integer | - 44-byte block of raw data (in the beginning of the file) - ... followed by `data_size` bytes of actual sound data. The `.wav` file header as a NumPy _structured_ data type: ```{code-cell} wav_header_dtype = np.dtype([ ("chunk_id", (bytes, 4)), # flexible-sized scalar type, item size 4 ("chunk_size", " 1000`. Use it to determine which `c` are in the Mandelbrot set. Our function is a simple one, so make use of the `PyUFunc_*` helpers. 
Write it in Cython.

:::{admonition} See also
mandel.pyx, mandelplot.py
:::

:::{only} latex

```{literalinclude} examples/mandel.pyx
```

:::

**Reminder**: some pre-made ufunc loops:

|                |                                                                                   |
| -------------- | --------------------------------------------------------------------------------- |
| `PyUFunc_f_f`  | `float elementwise_func(float input_1)`                                           |
| `PyUFunc_ff_f` | `float elementwise_func(float input_1, float input_2)`                            |
| `PyUFunc_d_d`  | `double elementwise_func(double input_1)`                                         |
| `PyUFunc_dd_d` | `double elementwise_func(double input_1, double input_2)`                         |
| `PyUFunc_D_D`  | `elementwise_func(complex_double *input, complex_double* output)`                 |
| `PyUFunc_DD_D` | `elementwise_func(complex_double *in1, complex_double *in2, complex_double* out)` |

Type codes:

```
NPY_BOOL, NPY_BYTE, NPY_UBYTE, NPY_SHORT, NPY_USHORT, NPY_INT, NPY_UINT,
NPY_LONG, NPY_ULONG, NPY_LONGLONG, NPY_ULONGLONG, NPY_FLOAT, NPY_DOUBLE,
NPY_LONGDOUBLE, NPY_CFLOAT, NPY_CDOUBLE, NPY_CLONGDOUBLE, NPY_DATETIME,
NPY_TIMEDELTA, NPY_OBJECT, NPY_STRING, NPY_UNICODE, NPY_VOID
```

::: {exercise-end}
:::

::: {solution-start} mandelbrot-ufunc
:class: dropdown
:::

```{literalinclude} examples/mandel-answer.pyx
:language: python
```

```{literalinclude} examples/mandelplot.py
:language: python
```

::: {image} mandelbrot.png
:::

:::{note}
Most of the boilerplate could be automated by these Cython modules:
:::

**Several accepted input types**

E.g. supporting both single- and double-precision versions

```cython
cdef void mandel_single_point(double complex *z_in,
                              double complex *c_in,
                              double complex *z_out) nogil:
    ...

cdef void mandel_single_point_singleprec(float complex *z_in,
                                         float complex *c_in,
                                         float complex *z_out) nogil:
    ...

cdef PyUFuncGenericFunction loop_funcs[2]
cdef char input_output_types[3*2]
cdef void *elementwise_funcs[1*2]

loop_funcs[0] = PyUFunc_DD_D
input_output_types[0] = NPY_CDOUBLE
input_output_types[1] = NPY_CDOUBLE
input_output_types[2] = NPY_CDOUBLE
elementwise_funcs[0] = <void*>mandel_single_point

loop_funcs[1] = PyUFunc_FF_F
input_output_types[3] = NPY_CFLOAT
input_output_types[4] = NPY_CFLOAT
input_output_types[5] = NPY_CFLOAT
elementwise_funcs[1] = <void*>mandel_single_point_singleprec

mandel = PyUFunc_FromFuncAndData(
    loop_funcs,
    elementwise_funcs,
    input_output_types,
    2,  # number of supported input types   <----------------
    2,  # number of input args
    1,  # number of output args
    0,  # `identity` element, never mind this
    "mandel",  # function name
    "mandel(z, c) -> computes iterated z*z + c",  # docstring
    0  # unused
)
```

::: {solution-end}
:::

### Generalized ufuncs

**ufunc**

`output = elementwise_function(input)`

Both `output` and `input` can be a single array element only.

**generalized ufunc**

`output` and `input` can be arrays with a fixed number of dimensions

For example, matrix trace (sum of diagonal elements):

```text
input shape = (n, n)
output shape = ()    # i.e. scalar

(n, n) -> ()
```

Matrix product:

```text
input_1 shape = (m, n)
input_2 shape = (n, p)
output shape = (m, p)

(m, n), (n, p) -> (m, p)
```

- This is called the _"signature"_ of the generalized ufunc
- The dimensions on which the g-ufunc acts are _"core dimensions"_

**Status in NumPy**

- g-ufuncs are in NumPy already ...
- new ones can be created with `PyUFunc_FromFuncAndDataAndSignature`
- most linear-algebra functions are implemented as g-ufuncs to enable working
  with stacked arrays:

```{code-cell}
import numpy as np
rng = np.random.default_rng(27446968)
np.linalg.det(rng.random((3, 5, 5)))
```

```{code-cell}
np.linalg._umath_linalg.det.signature
```

- matrix multiplication this way could be useful for operating on many small
  matrices at once
- Also see `tensordot` and `einsum`

**Generalized ufunc loop**

Matrix multiplication `(m,n),(n,p) -> (m,p)`

```c
void gufunc_loop(void **args, int *dimensions, int *steps, void *data)
{
    char *input_1 = (char*)args[0];  /* these are as previously */
    char *input_2 = (char*)args[1];
    char *output = (char*)args[2];

    int input_1_stride_m = steps[3];  /* strides for the core dimensions */
    int input_1_stride_n = steps[4];  /* are added after the non-core */
    int input_2_strides_n = steps[5]; /* steps */
    int input_2_strides_p = steps[6];
    int output_strides_m = steps[7];
    int output_strides_p = steps[8];

    int m = dimensions[1]; /* core dimensions are added after */
    int n = dimensions[2]; /* the main dimension; order as in */
    int p = dimensions[3]; /* signature */

    int i;

    for (i = 0; i < dimensions[0]; ++i) {
        matmul_for_strided_matrices(input_1, input_2, output,
                                    strides for each array...);

        input_1 += steps[0];
        input_2 += steps[1];
        output += steps[2];
    }
}
```

## Interoperability features

### Sharing multidimensional, typed data

Suppose you

1. Write a library that handles (multidimensional) binary data,
2. Want to make it easy to manipulate the data with NumPy, or whatever other
   library,
3. ... but would **not** like to have NumPy as a dependency.

Currently, 3 solutions:

1. the "old" buffer interface
2. the array interface
3. the "new" buffer interface ({pep}`3118`)

### The old buffer protocol

- Only 1-D buffers
- No data type information
- C-level interface; `PyBufferProcs tp_as_buffer` in the type object
- But it's integrated into Python (e.g.
  strings support it)

Mini-exercise using [Pillow](https://python-pillow.org/) (Python Imaging Library):

:::{admonition} See also
pilbuffer.py
:::

::: {exercise-start}
:label: pil-buffer
:class: dropdown
:::

```{code-cell}
from PIL import Image
data = np.zeros((200, 200, 4), dtype=np.uint8)
data[:, :] = [255, 0, 0, 255] # Red
# In PIL, RGBA images consist of 32-bit integers whose bytes are [RR,GG,BB,AA]
data = data.view(np.int32).squeeze()
img = Image.frombuffer("RGBA", (200, 200), data, "raw", "RGBA", 0, 1)
img.save('test.png')
```

**The question**

What happens if `data` is now modified, and `img` saved again?

::: {exercise-end}
:::

### The old buffer protocol

Show how to exchange data between NumPy and a library that only knows the
buffer interface:

```{code-cell}
# Make a sample image, RGBA format
x = np.zeros((200, 200, 4), dtype=np.uint8)
x[:, :, 0] = 255  # red
x[:, :, 3] = 255  # opaque

data_i32 = x.view(np.int32)  # Check that you understand why this is OK!

img = Image.frombuffer("RGBA", (200, 200), data_i32)
img.save("test_red.png")

# Modify the original data, and save again.
x[:, :, 1] = 255
img.save("test_recolored.png")
```

::: {image} test_red.png
:::

::: {image} test_recolored.png
:::

### Array interface protocol

- Multidimensional buffers
- Data type information present
- NumPy-specific approach; slowly deprecated (but not going away)
- Not integrated in Python otherwise

:::{admonition} See also
Documentation:
:::

```{code-cell}
x = np.array([[1, 2], [3, 4]])
x.__array_interface__
```

```{code-cell}
:tags: [hide-input]

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import os
if not os.path.exists('data'):
    os.mkdir('data')
plt.imsave('data/test.png', data)
```

```{code-cell}
from PIL import Image
img = Image.open('data/test.png')
img.__array_interface__
```

```{code-cell}
x = np.asarray(img)
x.shape
```

:::{note}
A more C-friendly variant of the array interface is also defined.
:::

(array-siblings)=

## Array siblings: {class}`chararray`, {class}`MaskedArray`

### {class}`chararray`: vectorized string operations

```{code-cell}
x = np.char.asarray(['a', ' bbb', ' ccc'])
x
```

```{code-cell}
x.upper()
```

### {class}`MaskedArray`: missing data

Masked arrays are arrays that may have missing or invalid entries.

For example, suppose we have an array where the fourth entry is invalid:

```{code-cell}
x = np.array([1, 2, 3, -99, 5])
```

One way to describe this is to create a masked array:

```{code-cell}
mx = np.ma.MaskedArray(x, mask=[0, 0, 0, 1, 0])
mx
```

The masked mean ignores masked data:

```{code-cell}
mx.mean()
```

```{code-cell}
np.mean(mx)
```

:::{warning}
Not all NumPy functions respect masks, for instance `np.dot`, so check the
return types.
:::

The `MaskedArray` returns a **view** of the original array:

```{code-cell}
mx[1] = 9
x
```

#### The mask

You can modify the mask by assigning:

```{code-cell}
mx[1] = np.ma.masked
mx
```

The mask is cleared on assignment:

```{code-cell}
mx[1] = 9
mx
```

The mask is also available directly:

```{code-cell}
mx.mask
```

The masked entries can be filled with a given value to get an ordinary array
back:

```{code-cell}
x2 = mx.filled(-1)
x2
```

The mask can also be cleared:

```{code-cell}
mx.mask = np.ma.nomask
mx
```

#### Domain-aware functions

The masked array package also contains domain-aware functions:

```{code-cell}
np.ma.log(np.array([1, 2, -1, -2, 3, -5]))
```

:::{note}
Streamlined and more seamless support for dealing with missing data in arrays
is making its way into NumPy 1.7. Stay tuned!
:::

**Example: Masked statistics**

Canadian rangers were distracted when counting hares and lynxes in 1903-1910
and 1917-1918, and got the numbers wrong. (Carrot farmers stayed alert,
though.) Compute the mean populations over time, ignoring the invalid numbers.
```{code-cell}
data = np.loadtxt('data/populations.txt')
populations = np.ma.MaskedArray(data[:,1:])
year = data[:, 0]
```

```{code-cell}
bad_years = (((year >= 1903) & (year <= 1910))
             | ((year >= 1917) & (year <= 1918)))
# '&' means 'and' and '|' means 'or'
populations[bad_years, 0] = np.ma.masked
populations[bad_years, 1] = np.ma.masked
```

```{code-cell}
populations.mean(axis=0)
```

```{code-cell}
populations.std(axis=0)
```

Note that Matplotlib knows about masked arrays:

```{code-cell}
plt.plot(year, populations, 'o-')
```

### `np.recarray`: purely convenience

```{code-cell}
arr = np.array([('a', 1), ('b', 2)], dtype=[('x', 'S1'), ('y', int)])
arr2 = arr.view(np.recarray)
arr2.x
```

```{code-cell}
arr2.y
```

## Summary

- Anatomy of the ndarray: data, dtype, strides.
- Universal functions: elementwise operations, how to make new ones
- Ndarray subclasses
- Various buffer interfaces for integration with other tools
- Recent additions: PEP 3118, generalized ufuncs

## Contributing to NumPy/SciPy

Get this tutorial:

### Why

- "There's a bug?"
- "I don't understand what this is supposed to do?"
- "I have this fancy code. Would you like to have it?"
- "I'd like to help! What can I do?"

### Reporting bugs

- Bug tracker (prefer **this**)
  - Click the "Sign up" link to get an account
- Mailing lists
  - If you're unsure
  - No replies in a week or so? Just file a bug ticket.

#### Good bug report

```text
Title: numpy.random.permutations fails for non-integer arguments

I'm trying to generate random permutations, using numpy.random.permutations

When calling numpy.random.permutation with non-integer arguments
it fails with a cryptic error message::

    >>> rng.permutation(12)
    array([ 2,  6,  4,  1,  8, 11, 10,  5,  9,  3,  7,  0])
    >>> rng.permutation(12.)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "_generator.pyx", line 4844, in numpy.random._generator.Generator.permutation
    numpy.exceptions.AxisError: axis 0 is out of bounds for array of dimension 0

This also happens with long arguments, and so
np.random.permutation(X.shape[0]) where X is an array fails on 64
bit windows (where shape is a tuple of longs).

It would be great if it could cast to integer or at least raise a
proper error for non-integer types.

I'm using NumPy 1.4.1, built from the official tarball, on Windows
64 with Visual studio 2008, on Python.org 64-bit Python.
```

0. What are you trying to do?
1. **Small code snippet reproducing the bug** (if possible)
   - What actually happens
   - What you'd expect
2. Platform (Windows / Linux / OSX, 32/64 bits, x86/PPC, ...)
3. Version of NumPy/SciPy

```{code-cell}
print(np.__version__)
```

**Check that the following is what you expect**

```{code-cell}
print(np.__file__)
```

In case you have old/broken NumPy installations lying around.

If unsure, try to remove existing NumPy installations, and reinstall...

### Contributing to documentation

1. Documentation editor

   - Registration
     - Register an account
     - Subscribe to `scipy-dev` mailing list (subscribers-only)
     - Problem with mailing lists: you get mail
       - But: **you can turn mail delivery off**
       - "change your subscription options", at the bottom of
     - Send a mail @ `scipy-dev` mailing list; ask for activation:

       ```text
       To: scipy-dev@scipy.org

       Hi,

       I'd like to edit NumPy/SciPy docstrings. My account is XXXXX

       Cheers,
       N. N.
       ```

   - Check the style guide:
     - Don't be intimidated; to fix a small thing, just fix it
   - Edit

2. Edit sources and send patches (as for bugs)
3. Complain on the mailing list

### Contributing features

The contribution of features is documented on

### How to help, in general

- Bug fixes always welcome!
  - What irks you most
  - Browse the tracker
- Documentation work
  - API docs: improvements to docstrings
    - Know some SciPy module well?
  - _User guide_
- Ask on communication channels:
  - `numpy-discussion` list
  - `scipy-dev` list