User Manual
===========
.. toctree::
5minguide.rst
overview.rst
installing.rst
jit.rst
generated-jit.rst
vectorize.rst
jitclass.rst
cfunc.rst
pycc.rst
parallel.rst
stencil.rst
withobjmode.rst
jit-module.rst
performance-tips.rst
threading-layer.rst
cli.rst
troubleshoot.rst
faq.rst
examples.rst
talks.rst
Installation
============
Compatibility
-------------
For software compatibility, please see the section on :ref:`version support
information<numba_support_info>` for details.
Our supported platforms are:
* Linux x86_64
* Linux ppc64le (POWER8, POWER9)
* Windows 10 and later (64-bit)
* OS X 10.9 and later (64-bit and unofficial support on M1/Arm64)
* \*BSD (unofficial support only)
* NVIDIA GPUs of compute capability 5.0 and later

  * Compute capabilities 3.5 and 3.7 are supported, but deprecated.
* ARMv8 (64-bit little-endian, such as the NVIDIA Jetson)
:ref:`numba-parallel` is only available on 64-bit platforms.
Installing using conda on x86/x86_64/POWER Platforms
----------------------------------------------------
The easiest way to install Numba and get updates is by using ``conda``,
a cross-platform package manager and software distribution maintained
by Anaconda, Inc. You can either use `Anaconda
<https://www.anaconda.com/download>`_ to get the full stack in one download,
or `Miniconda <https://conda.io/miniconda.html>`_ which will install
the minimum packages required for a conda environment.
Once you have conda installed, just type::
$ conda install numba
or::
$ conda update numba
Note that Numba, like Anaconda, only supports PPC in 64-bit little-endian mode.
To enable CUDA GPU support for Numba, install the latest `graphics drivers from
NVIDIA <https://www.nvidia.com/Download/index.aspx>`_ for your platform.
(Note that the open source Nouveau drivers shipped by default with many Linux
distributions do not support CUDA.) Then install the ``cudatoolkit`` package::
$ conda install cudatoolkit
You do not need to install the CUDA SDK from NVIDIA.
Installing using pip on x86/x86_64 Platforms
--------------------------------------------
Binary wheels for Windows, Mac, and Linux are also available from `PyPI
<https://pypi.org/project/numba/>`_. You can install Numba using ``pip``::
$ pip install numba
This will download all of the needed dependencies as well. You do not need to
have LLVM installed to use Numba (in fact, Numba will ignore all LLVM
versions installed on the system) as the required components are bundled into
the llvmlite wheel.
To use CUDA with Numba installed by ``pip``, you need to install the `CUDA SDK
<https://developer.nvidia.com/cuda-downloads>`_ from NVIDIA. Please refer to
:ref:`cudatoolkit-lookup` for details. Numba can also detect CUDA libraries
installed system-wide on Linux.
Installing on Linux ARMv8 (AArch64) Platforms
---------------------------------------------
We build and test conda packages on the `NVIDIA Jetson TX2
<https://www.nvidia.com/en-us/autonomous-machines/embedded-systems-dev-kits-modules/>`_,
but they are likely to work for other AArch64 platforms. (Note that while the
CPUs in the Raspberry Pi 3, 4, and Zero 2 W are 64-bit, Raspberry Pi OS may be
running in 32-bit mode depending on the OS image in use).
Conda-forge support for AArch64 is still quite experimental and packages are limited,
but it does work enough for Numba to build and pass tests. To set up the environment:
* Install `miniforge <https://github.com/conda-forge/miniforge>`_.
This will create a minimal conda environment.
* Then you can install Numba from the ``numba`` channel::
$ conda install -c numba numba
On CUDA-enabled systems, like the Jetson, the CUDA toolkit should be
automatically detected in the environment.
.. _numba-source-install-instructions:
Installing from source
----------------------
Installing Numba from source is fairly straightforward (similar to other
Python packages), but installing `llvmlite
<https://github.com/numba/llvmlite>`_ can be quite challenging due to the need
for a special LLVM build. If you are building from source for the purposes of
Numba development, see :ref:`buildenv` for details on how to create a Numba
development environment with conda.
If you are building Numba from source for other reasons, first follow the
`llvmlite installation guide <https://llvmlite.readthedocs.io/en/latest/admin-guide/install.html>`_.
Once that is completed, you can download the latest Numba source code from
`Github <https://github.com/numba/numba>`_::
$ git clone https://github.com/numba/numba.git
Source archives of the latest release can also be found on
`PyPI <https://pypi.org/project/numba/>`_. In addition to ``llvmlite``, you will also need:
* A C compiler compatible with your Python installation. If you are using
Anaconda, you can use the following conda packages:
* Linux ``x86_64``: ``gcc_linux-64`` and ``gxx_linux-64``
* Linux ``POWER``: ``gcc_linux-ppc64le`` and ``gxx_linux-ppc64le``
* Linux ``ARM``: no conda packages, use the system compiler
* Mac OSX: ``clang_osx-64`` and ``clangxx_osx-64`` or the system compiler at
``/usr/bin/clang`` (Mojave onwards)
* Mac OSX (M1): ``clang_osx-arm64`` and ``clangxx_osx-arm64``
* Windows: a version of Visual Studio appropriate for the Python version in
use
* `NumPy <http://www.numpy.org/>`_
Then you can build and install Numba from the top level of the source tree::
$ python setup.py install
If you wish to run the test suite, see the instructions in the
:ref:`developer documentation <running-tests>`.
.. _numba-source-install-env_vars:
Build time environment variables and configuration of optional components
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The following environment variables alter how Numba builds by default, along
with information on configuring the optional components.
.. envvar:: NUMBA_DISABLE_OPENMP (default: not set)
To disable compilation of the OpenMP threading backend set this environment
variable to a non-empty string when building. If not set (default):
* For Linux and Windows it is necessary to provide OpenMP C headers and
runtime libraries compatible with the compiler tool chain mentioned above,
and for these to be accessible to the compiler via standard flags.
* For OSX the conda package ``llvm-openmp`` provides suitable C headers and
libraries. If the compilation requirements are not met the OpenMP threading
backend will not be compiled.
.. envvar:: NUMBA_DISABLE_TBB (default: not set)
To disable the compilation of the TBB threading backend set this environment
variable to a non-empty string when building. If not set (default) the TBB C
headers and libraries must be available at compile time. If building with
``conda build`` this requirement can be met by installing the ``tbb-devel``
package. If not building with ``conda build`` the requirement can be met via a
system installation of TBB or through the use of the ``TBBROOT`` environment
variable to provide the location of the TBB installation. For more
information about setting ``TBBROOT`` see the `Intel documentation <https://software.intel.com/content/www/us/en/develop/documentation/advisor-user-guide/top/appendix/adding-parallelism-to-your-program/adding-the-parallel-framework-to-your-build-environment/defining-the-tbbroot-environment-variable.html>`_.
.. _numba-source-install-check:
Dependency List
---------------
Numba has numerous required and optional dependencies which may vary with the
target operating system and hardware. The following lists them all
(as of July 2020).
* Required build time:
* ``setuptools``
* ``numpy``
* ``llvmlite``
* Compiler toolchain mentioned above
* Required run time:
* ``numpy``
* ``llvmlite``
* Optional build time:
See :ref:`numba-source-install-env_vars` for more details about additional
options for the configuration and specification of these optional components.
* ``llvm-openmp`` (OSX) - provides headers for compiling OpenMP support into
Numba's threading backend
* ``tbb-devel`` - provides TBB headers/libraries for compiling TBB support
into Numba's threading backend (version >= 2021.6 required).
* ``importlib_metadata`` (for Python versions < 3.9)
* Optional run time:
* ``scipy`` - provides cython bindings used in Numba's ``np.linalg.*``
support
* ``tbb`` - provides the TBB runtime libraries used by Numba's TBB threading
backend (version >= 2021 required).
* ``jinja2`` - for "pretty" type annotation output (HTML) via the ``numba``
CLI
* ``cffi`` - permits use of CFFI bindings in Numba compiled functions
* ``llvm-openmp`` - (OSX) provides OpenMP library support for Numba's OpenMP
threading backend.
* ``intel-openmp`` - (OSX) provides an alternative OpenMP library for use with
Numba's OpenMP threading backend.
* ``ipython`` - if IPython is in use, caching still works but uses IPython's
cache directories
* ``pyyaml`` - permits the use of a ``.numba_config.yaml``
file for storing per project configuration options
* ``colorama`` - makes error message highlighting work
* ``intel-cmplr-lib-rt`` - allows Numba to use Intel SVML for extra
performance
* ``pygments`` - for "pretty" type annotation
* ``gdb`` as an executable on the ``$PATH`` - if you would like to use the gdb
support
* ``setuptools`` - permits the use of ``pycc`` for Ahead-of-Time (AOT)
compilation
* Compiler toolchain mentioned above, if you would like to use ``pycc`` for
Ahead-of-Time (AOT) compilation
* ``r2pipe`` - required for assembly CFG inspection.
* ``radare2`` as an executable on the ``$PATH`` - required for assembly CFG
inspection. `See here <https://github.com/radareorg/radare2>`_ for
information on obtaining and installing.
* ``graphviz`` - for some CFG inspection functionality.
* ``typeguard`` - used by ``runtests.py`` for
:ref:`runtime type-checking <type_anno_check>`.
* ``cuda-python`` - The NVIDIA CUDA Python bindings. See :ref:`cuda-bindings`.
Numba requires Version 11.6 or greater.
* ``cubinlinker`` and ``ptxcompiler`` to support
:ref:`minor-version-compatibility`.
* To build the documentation:
* ``sphinx``
* ``pygments``
* ``sphinx_rtd_theme``
* ``numpydoc``
* ``make`` as an executable on the ``$PATH``
.. _numba_support_info:
Version support information
---------------------------
This is the canonical reference for information concerning which versions of
Numba's dependencies were tested and known to work against a given version of
Numba. Other versions of the dependencies (especially NumPy) may work reasonably
well but were not tested. The use of ``x`` in a version number indicates all
patch levels supported. The use of ``?`` as a version is due to missing
information.
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| Numba | Release date | Python | NumPy | llvmlite | LLVM | TBB |
+===========+==============+===========================+============================+==============================+===================+=============================+
| 0.58.1 | 2023-10-17 | 3.8.x <= version < 3.12 | 1.22 <= version < 1.27 | 0.41.x | 14.x | 2021.6 <= version |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.58.0 | 2023-09-20 | 3.8.x <= version < 3.12 | 1.22 <= version < 1.26 | 0.41.x | 14.x | 2021.6 <= version |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.57.1 | 2023-06-21 | 3.8.x <= version < 3.12 | 1.21 <= version < 1.25 | 0.40.x | 14.x | 2021.6 <= version |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.57.0 | 2023-05-01 | 3.8.x <= version < 3.12 | 1.21 <= version < 1.25 | 0.40.x | 14.x | 2021.6 <= version |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.56.4 | 2022-11-03 | 3.7.x <= version < 3.11 | 1.18 <= version < 1.24 | 0.39.x | 11.x | 2021.x |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.56.3 | 2022-10-13 | 3.7.x <= version < 3.11 | 1.18 <= version < 1.24 | 0.39.x | 11.x | 2021.x |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.56.2 | 2022-09-01 | 3.7.x <= version < 3.11 | 1.18 <= version < 1.24 | 0.39.x | 11.x | 2021.x |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.56.1 | NO RELEASE | | | | | |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.56.0 | 2022-07-25 | 3.7.x <= version < 3.11 | 1.18 <= version < 1.23 | 0.39.x | 11.x | 2021.x |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.55.2 | 2022-05-25 | 3.7.x <= version < 3.11 | 1.18 <= version < 1.23 | 0.38.x | 11.x | 2021.x |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.55.{0,1}| 2022-01-13 | 3.7.x <= version < 3.11 | 1.18 <= version < 1.22 | 0.38.x | 11.x | 2021.x |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.54.x | 2021-08-19 | 3.6.x <= version < 3.10 | 1.17 <= version < 1.21 | 0.37.x | 11.x | 2021.x |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.53.x | 2021-03-11 | 3.6.x <= version < 3.10 | 1.15 <= version < 1.21 | 0.36.x | 11.x | 2019.5 <= version < 2021.4 |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.52.x | 2020-11-30 | 3.6.x <= version < 3.9 | 1.15 <= version < 1.20 | 0.35.x | 10.x | 2019.5 <= version < 2020.3 |
| | | | | | (9.x for aarch64) | |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.51.x | 2020-08-12 | 3.6.x <= version < 3.9 | 1.15 <= version < 1.19 | 0.34.x | 10.x | 2019.5 <= version < 2020.0 |
| | | | | | (9.x for aarch64) | |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.50.x | 2020-06-10 | 3.6.x <= version < 3.9 | 1.15 <= version < 1.19 | 0.33.x | 9.x | 2019.5 <= version < 2020.0 |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.49.x | 2020-04-16 | 3.6.x <= version < 3.9 | 1.15 <= version < 1.18 | 0.31.x <= version < 0.33.x | 9.x | 2019.5 <= version < 2020.0 |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.48.x | 2020-01-27 | 3.6.x <= version < 3.9 | 1.15 <= version < 1.18 | 0.31.x | 8.x | 2018.0.5 <= version < ? |
| | | | | | (7.x for ppc64le) | |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.47.x | 2020-01-02 | 3.5.x <= version < 3.9; | 1.15 <= version < 1.18 | 0.30.x | 8.x | 2018.0.5 <= version < ? |
| | | version == 2.7.x | | | (7.x for ppc64le) | |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
Checking your installation
--------------------------
You should be able to import Numba from the Python prompt::
$ python
Python 3.10.2 | packaged by conda-forge | (main, Jan 14 2022, 08:02:09) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numba
>>> numba.__version__
'0.55.1'
You can also try executing the ``numba --sysinfo`` (or ``numba -s`` for short)
command to report information about your system capabilities. See :ref:`cli` for
further information.
::
$ numba -s
System info:
--------------------------------------------------------------------------------
__Time Stamp__
Report started (local time) : 2022-01-18 10:35:08.981319
__Hardware Information__
Machine : x86_64
CPU Name : skylake-avx512
CPU Count : 12
CPU Features :
64bit adx aes avx avx2 avx512bw avx512cd avx512dq avx512f avx512vl bmi bmi2
clflushopt clwb cmov cx16 cx8 f16c fma fsgsbase fxsr invpcid lzcnt mmx
movbe pclmul pku popcnt prfchw rdrnd rdseed rtm sahf sse sse2 sse3 sse4.1
sse4.2 ssse3 xsave xsavec xsaveopt xsaves
__OS Information__
Platform Name : Linux-5.4.0-94-generic-x86_64-with-glibc2.31
Platform Release : 5.4.0-94-generic
OS Name : Linux
OS Version : #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022
__Python Information__
Python Compiler : GCC 9.4.0
Python Implementation : CPython
Python Version : 3.10.2
Python Locale : en_GB.UTF-8
__LLVM information__
LLVM Version : 11.1.0
__CUDA Information__
Found 1 CUDA devices
id 0 b'Quadro RTX 8000' [SUPPORTED]
Compute Capability: 7.5
PCI Device ID: 0
PCI Bus ID: 21
UUID: GPU-e6489c45-5b68-3b03-bab7-0e7c8e809643
Watchdog: Enabled
FP32/FP64 Performance Ratio: 32
(output truncated due to length)
.. _jit-module:
============================================
Automatic module jitting with ``jit_module``
============================================
A common usage pattern is to have an entire module containing user-defined
functions that all need to be jitted. One option to accomplish this is to
manually apply the ``@jit`` decorator to each function definition. This approach
works and is great in many cases. However, for large modules with many functions,
manually ``jit``-wrapping each function definition can be tedious. For these
situations, Numba provides another option, the ``jit_module`` function, to
automatically replace functions declared in a module with their ``jit``-wrapped
equivalents.
It's important to note the conditions under which ``jit_module`` will *not*
impact a function:
1. Functions which have already been wrapped with a Numba decorator (e.g.
``jit``, ``vectorize``, ``cfunc``, etc.) are not impacted by ``jit_module``.
2. Functions which are declared outside the module from which ``jit_module``
is called are not automatically ``jit``-wrapped.
3. Function declarations which occur logically after calling ``jit_module``
are not impacted.
All other functions in a module will have the ``@jit`` decorator automatically
applied to them. See the following section for an example use case.
.. note:: This feature is for use by module authors. ``jit_module`` should not
be called outside the context of a module containing functions to be jitted.
Example usage
=============
Let's assume we have a Python module we've created, ``mymodule.py`` (shown
below), which contains several functions. Some of these functions are defined
in ``mymodule.py`` while others are imported from other modules. We wish to have
all the functions which are defined in ``mymodule.py`` jitted using
``jit_module``.
.. _jit-module-usage:
.. code-block:: python

    # mymodule.py
    from numba import jit, jit_module

    def inc(x):
        return x + 1

    def add(x, y):
        return x + y

    import numpy as np

    # Use NumPy's mean function
    mean = np.mean

    @jit(nogil=True)
    def mul(a, b):
        return a * b

    jit_module(nopython=True, error_model="numpy")

    def div(a, b):
        return a / b
There are several things to note in the above example:
- Both the ``inc`` and ``add`` functions will be replaced with their
``jit``-wrapped equivalents with :ref:`compilation options <jit-options>`
``nopython=True`` and ``error_model="numpy"``.
- The ``mean`` function, because it's defined *outside* of ``mymodule.py`` in
NumPy, will not be modified.
- ``mul`` will not be modified because it has been manually decorated with
``jit``.
- ``div`` will not be automatically ``jit``-wrapped because it is declared
after ``jit_module`` is called.
When the above module is imported, we have:
.. code-block:: python
>>> import mymodule
>>> mymodule.inc
CPUDispatcher(<function inc at 0x1032f86a8>)
>>> mymodule.mean
<function mean at 0x1096b8950>
API
===
.. warning:: This feature is experimental. The supported features may change
with or without notice.
.. autofunction:: numba.jit_module
.. _jit:
===================================
Compiling Python code with ``@jit``
===================================
Numba provides several utilities for code generation, but its central
feature is the :func:`numba.jit` decorator. Using this decorator, you can mark
a function for optimization by Numba's JIT compiler. Various invocation
modes trigger differing compilation options and behaviours.
Basic usage
===========
.. _jit-lazy:
Lazy compilation
----------------
The recommended way to use the ``@jit`` decorator is to let Numba decide
when and how to optimize::

    from numba import jit

    @jit
    def f(x, y):
        # A somewhat trivial example
        return x + y
In this mode, compilation will be deferred until the first function
execution. Numba will infer the argument types at call time, and generate
optimized code based on this information. Numba will also be able to
compile separate specializations depending on the input types. For example,
calling the ``f()`` function above with integer or complex numbers will
generate different code paths::
>>> f(1, 2)
3
>>> f(1j, 2)
(2+1j)
Eager compilation
-----------------
You can also tell Numba the function signature you are expecting. The
function ``f()`` would now look like::

    from numba import jit, int32

    @jit(int32(int32, int32))
    def f(x, y):
        # A somewhat trivial example
        return x + y
``int32(int32, int32)`` is the function's signature. In this case, the
corresponding specialization will be compiled by the ``@jit`` decorator,
and no other specialization will be allowed. This is useful if you want
fine-grained control over types chosen by the compiler (for example,
to use single-precision floats).
If you omit the return type, e.g. by writing ``(int32, int32)`` instead of
``int32(int32, int32)``, Numba will try to infer it for you. Function
signatures can also be strings, and you can pass several of them as a list;
see the :func:`numba.jit` documentation for more details.
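For illustration, here is a minimal sketch (not part of the original example)
showing signatures written as strings and several signatures passed as a list,
each of which triggers an eager specialization::

    from numba import jit

    # each signature in the list is compiled ahead of the first call
    @jit(["int32(int32, int32)", "float64(float64, float64)"])
    def f(x, y):
        return x + y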
Of course, the compiled function gives the expected results::
>>> f(1,2)
3
and if we specified ``int32`` as return type, the higher-order bits get
discarded::
>>> f(2**31, 2**31 + 1)
1
Calling and inlining other functions
====================================
Numba-compiled functions can call other compiled functions. The function
calls may even be inlined in the native code, depending on optimizer
heuristics. For example::

    import math

    from numba import jit

    @jit
    def square(x):
        return x ** 2

    @jit
    def hypot(x, y):
        return math.sqrt(square(x) + square(y))
The ``@jit`` decorator *must* be added to any such library function,
otherwise Numba may generate much slower code.
Signature specifications
========================
Explicit ``@jit`` signatures can use a number of types. Here are some
common ones:
* ``void`` is the return type of functions returning nothing (which
actually return :const:`None` when called from Python)
* ``intp`` and ``uintp`` are pointer-sized integers (signed and unsigned,
respectively)
* ``intc`` and ``uintc`` are equivalent to C ``int`` and ``unsigned int``
integer types
* ``int8``, ``uint8``, ``int16``, ``uint16``, ``int32``, ``uint32``,
``int64``, ``uint64`` are fixed-width integers of the corresponding bit
width (signed and unsigned)
* ``float32`` and ``float64`` are single- and double-precision floating-point
numbers, respectively
* ``complex64`` and ``complex128`` are single- and double-precision complex
numbers, respectively
* array types can be specified by indexing any numeric type, e.g. ``float32[:]``
for a one-dimensional single-precision array or ``int8[:,:]`` for a
two-dimensional array of 8-bit integers.
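As a hedged sketch of an array-typed signature (the ``total`` function below is
purely illustrative), a one-dimensional ``float64`` array argument and a scalar
``float64`` return type can be declared as follows::

    from numba import jit, float64

    # float64[:] declares a one-dimensional float64 array argument
    @jit(float64(float64[:]))
    def total(arr):
        acc = 0.0
        for i in range(arr.shape[0]):
            acc += arr[i]
        return acc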
.. _jit-options:
Compilation options
===================
A number of keyword-only arguments can be passed to the ``@jit`` decorator.
.. _jit-nopython:
``nopython``
------------
Numba has two compilation modes: :term:`nopython mode` and
:term:`object mode`. The former produces much faster code, but has
limitations that can force Numba to fall back to the latter. To prevent
Numba from falling back, and instead raise an error, pass ``nopython=True``.
::

    @jit(nopython=True)
    def f(x, y):
        return x + y
.. seealso:: :ref:`numba-troubleshooting`
.. _jit-nogil:
``nogil``
---------
Whenever Numba optimizes Python code to native code that only works on
native types and variables (rather than Python objects), it is not necessary
anymore to hold Python's :py:term:`global interpreter lock` (GIL).
Numba will release the GIL when entering such a compiled function if you
passed ``nogil=True``.
::

    @jit(nogil=True)
    def f(x, y):
        return x + y
Code running with the GIL released runs concurrently with other
threads executing Python or Numba code (either the same compiled function,
or another one), allowing you to take advantage of multi-core systems.
This will not be possible if the function is compiled in :term:`object mode`.
When using ``nogil=True``, you'll have to be wary of the usual pitfalls
of multi-threaded programming (consistency, synchronization, race conditions,
etc.).
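To illustrate, here is a minimal sketch (the ``chunk_sum`` function and the use
of ``concurrent.futures`` are illustrative assumptions, not part of the original
text) of running a ``nogil`` compiled function from several Python threads::

    from concurrent.futures import ThreadPoolExecutor

    import numpy as np
    from numba import jit

    @jit(nopython=True, nogil=True)
    def chunk_sum(a):
        total = 0.0
        for i in range(a.shape[0]):
            total += a[i]
        return total

    data = np.random.rand(4_000_000)
    chunks = np.array_split(data, 4)
    with ThreadPoolExecutor(max_workers=4) as pool:
        # the GIL is released inside chunk_sum, so the chunks are summed concurrently
        results = list(pool.map(chunk_sum, chunks))
    print(sum(results))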
.. _jit-cache:
``cache``
---------
To avoid compilation times each time you invoke a Python program,
you can instruct Numba to write the result of function compilation into
a file-based cache. This is done by passing ``cache=True``::

    @jit(cache=True)
    def f(x, y):
        return x + y
.. note::
Caching of compiled functions has several known limitations:
- The caching of compiled functions is not performed on a
function-by-function basis. The cached function is the main jit
function, and all secondary functions (those called by the main
function) are incorporated in the cache of the main function.
- Cache invalidation fails to recognize changes in functions defined in a
different file. This means that when a main jit function calls
functions that were imported from a different module, a change in those
other modules will not be detected and the cache will not be updated.
This carries the risk that "old" function code might be used in the
calculations.
- Global variables are treated as constants. The cache will remember the value
of the global variable at compilation time. On cache load, the cached
function will not rebind to the new value of the global variable.
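As a hedged illustration of the last point (the names below are hypothetical),
a global captured by a cached function keeps the value it had when the machine
code was compiled and written to the cache::

    from numba import jit

    FACTOR = 10  # treated as a compile-time constant by the compiled function

    @jit(cache=True)
    def scale(x):
        return x * FACTOR

    # The first run compiles and populates the on-disk cache.  If FACTOR is
    # later edited and the program re-run, the cached machine code is loaded
    # and still uses the old value of FACTOR rather than rebinding to the new one.
    print(scale(2))  # 20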
.. _parallel_jit_option:
``parallel``
------------
Enables automatic parallelization (and related optimizations) for those
operations in the function known to have parallel semantics. For a list of
supported operations, see :ref:`numba-parallel`. This feature is enabled by
passing ``parallel=True`` and must be used in conjunction with
``nopython=True``::

    @jit(nopython=True, parallel=True)
    def f(x, y):
        return x + y
.. seealso:: :ref:`numba-parallel`
.. _jitclass:
===========================================
Compiling Python classes with ``@jitclass``
===========================================
.. note::
This is an early version of jitclass support. Not all compiling features are
exposed or implemented yet.
Numba supports code generation for classes via the
:func:`numba.experimental.jitclass` decorator. A class can be marked for
optimization using this decorator along with a specification of the types of
each field. We call the resulting class object a *jitclass*. All methods of a
jitclass are compiled into nopython functions. The data of a jitclass instance
is allocated on the heap as a C-compatible structure so that any compiled
functions can have direct access to the underlying data, bypassing the
interpreter.
Basic usage
===========
Here's an example of a jitclass:
.. literalinclude:: ../../../numba/tests/doc_examples/test_jitclass.py
    :language: python
    :start-after: magictoken.ex_jitclass.begin
    :end-before: magictoken.ex_jitclass.end
    :dedent: 8
In the above example, a ``spec`` is provided as a list of 2-tuples. The tuples
contain the name of the field and the Numba type of the field. Alternatively,
a user can use a dictionary (preferably an ``OrderedDict`` for stable field
ordering) that maps field names to types.
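For example, a minimal sketch of a dictionary-based ``spec`` (the
``Accumulator`` class and its fields are purely illustrative) might look like:

.. code-block:: python

    from collections import OrderedDict

    import numpy as np
    from numba import float64, int32
    from numba.experimental import jitclass

    # the same information as a list of 2-tuples, expressed as a mapping
    spec = OrderedDict()
    spec["count"] = int32
    spec["values"] = float64[:]

    @jitclass(spec)
    class Accumulator:
        def __init__(self, n):
            self.count = n
            self.values = np.zeros(n)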
The definition of the class requires at least an ``__init__`` method for
initializing each defined field. Uninitialized fields contain garbage data.
Methods and properties (getters and setters only) can be defined. They will be
automatically compiled.
Inferred class member types from type annotations with ``as_numba_type``
========================================================================
Fields of a ``jitclass`` can also be inferred from Python type annotations.
.. literalinclude:: ../../../numba/tests/doc_examples/test_jitclass.py
    :language: python
    :start-after: magictoken.ex_jitclass_type_hints.begin
    :end-before: magictoken.ex_jitclass_type_hints.end
    :dedent: 8
Any type annotations on the class will be used to extend the spec if that field
is not already present. The Numba type corresponding to the given Python type
is inferred using ``as_numba_type``. For example, if we have the class
.. code-block:: python
@jitclass([("w", int32), ("y", float64[:])])
class Foo:
w: int
x: float
y: np.ndarray
z: SomeOtherType
def __init__(self, w: int, x: float, y: np.ndarray, z: SomeOtherType):
...
then the full spec used for ``Foo`` will be:
* ``"w": int32`` (specified in the ``spec``)
* ``"x": float64`` (added from type annotation)
* ``"y": array(float64, 1d, A)`` (specified in the ``spec``)
* ``"z": numba.as_numba_type(SomeOtherType)`` (added from type annotation)
Here ``SomeOtherType`` could be any supported Python type (e.g.
``bool``, ``typing.Dict[int, typing.Tuple[float, float]]``, or another
``jitclass``).
Note that only type annotations on the class will be used to infer spec
elements. Method type annotations (e.g. those of ``__init__`` above) are
ignored.
Numba requires knowing the dtype and rank of NumPy arrays, which cannot
currently be expressed with type annotations. Because of this, NumPy arrays need
to be included in the ``spec`` explicitly.
Specifying ``numba.typed`` containers as class members explicitly
=================================================================
The following patterns demonstrate how to specify a ``numba.typed.Dict`` or
``numba.typed.List`` explicitly as part of the ``spec`` passed to ``jitclass``.
First, using explicit Numba types and explicit construction.
.. code-block:: python

    from numba import types, typed
    from numba.experimental import jitclass

    # key and value types
    kv_ty = (types.int64, types.unicode_type)

    # A container class with:
    # * member 'd' holding a typed dictionary of int64 -> unicode string (kv_ty)
    # * member 'l' holding a typed list of float64
    @jitclass([('d', types.DictType(*kv_ty)),
               ('l', types.ListType(types.float64))])
    class ContainerHolder(object):
        def __init__(self):
            # initialize the containers
            self.d = typed.Dict.empty(*kv_ty)
            self.l = typed.List.empty_list(types.float64)

    container = ContainerHolder()
    container.d[1] = "apple"
    container.d[2] = "orange"
    container.l.append(123.)
    container.l.append(456.)
    print(container.d)  # {1: apple, 2: orange}
    print(container.l)  # [123.0, 456.0]
Another useful pattern is to use the ``numba.typed`` container attribute
``_numba_type_`` to find the type of a container; this attribute can be accessed
directly from an instance of the container in the Python interpreter. The same
information can be obtained by calling :func:`numba.typeof` on the instance. For
example:
.. code-block:: python

    from numba import typed, typeof
    from numba.experimental import jitclass

    d = typed.Dict()
    d[1] = "apple"
    d[2] = "orange"

    l = typed.List()
    l.append(123.)
    l.append(456.)

    @jitclass([('d', typeof(d)), ('l', typeof(l))])
    class ContainerInstHolder(object):
        def __init__(self, dict_inst, list_inst):
            self.d = dict_inst
            self.l = list_inst

    container = ContainerInstHolder(d, l)
    print(container.d)  # {1: apple, 2: orange}
    print(container.l)  # [123.0, 456.0]
It is worth noting that the instance of the container in a ``jitclass`` must be
initialized before use. For example, the following causes an invalid memory
access because ``self.d`` is written to without ``d`` first being initialized as
a ``typed.Dict`` instance of the specified type.
.. code-block:: python

    from numba import types
    from numba.experimental import jitclass

    dict_ty = types.DictType(types.int64, types.unicode_type)

    @jitclass([('d', dict_ty)])
    class NotInitialisingContainer(object):
        def __init__(self):
            self.d[10] = "apple"  # this is invalid, `d` is not initialized

    NotInitialisingContainer()  # segmentation fault/memory access violation
Supported operations
====================
The following operations of jitclasses work in both the interpreter and Numba
compiled functions:
* calling the jitclass class object to construct a new instance
(e.g. ``mybag = Bag(123)``);
* read/write access to attributes and properties (e.g. ``mybag.value``);
* calling methods (e.g. ``mybag.increment(3)``);
* calling static methods as instance attributes (e.g. ``mybag.add(1, 1)``);
* calling static methods as class attributes (e.g. ``Bag.add(1, 2)``);
* using select dunder methods (e.g. ``__add__`` with ``mybag + otherbag``);
Using jitclasses in Numba compiled functions is more efficient.
Short methods can be inlined (at the discretion of the LLVM inliner).
Attribute access is simply a read from a C structure.
Using jitclasses from the interpreter has the same overhead as calling any
Numba compiled function from the interpreter: arguments and return values
must be unboxed or boxed between Python objects and the native representation.
Values encapsulated by a jitclass do not get boxed into Python objects when
the jitclass instance is handed to the interpreter; it is during attribute
access that the field values are boxed.
Calling static methods as class attributes is only supported outside of the
class definition (i.e. code cannot call ``Bag.add()`` from within another method
of ``Bag``).
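The following minimal sketch (the ``Counter`` class here is hypothetical, not
the ``Bag`` example referenced above) shows a jitclass being constructed and
used from within a Numba compiled function:

.. code-block:: python

    from numba import float64, njit
    from numba.experimental import jitclass

    @jitclass([("value", float64)])
    class Counter:
        def __init__(self, value):
            self.value = value

        def increment(self, by):
            self.value += by
            return self.value

    @njit
    def run(start):
        # construct and manipulate the jitclass instance entirely in compiled code
        c = Counter(start)
        c.increment(3.0)
        return c.value

    print(run(1.0))  # 4.0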
Supported dunder methods
------------------------
The following dunder methods may be defined for jitclasses:
* ``__abs__``
* ``__bool__``
* ``__complex__``
* ``__contains__``
* ``__float__``
* ``__getitem__``
* ``__hash__``
* ``__index__``
* ``__int__``
* ``__len__``
* ``__setitem__``
* ``__str__``
* ``__eq__``
* ``__ne__``
* ``__ge__``
* ``__gt__``
* ``__le__``
* ``__lt__``
* ``__add__``
* ``__floordiv__``
* ``__lshift__``
* ``__matmul__``
* ``__mod__``
* ``__mul__``
* ``__neg__``
* ``__pos__``
* ``__pow__``
* ``__rshift__``
* ``__sub__``
* ``__truediv__``
* ``__and__``
* ``__or__``
* ``__xor__``
* ``__iadd__``
* ``__ifloordiv__``
* ``__ilshift__``
* ``__imatmul__``
* ``__imod__``
* ``__imul__``
* ``__ipow__``
* ``__irshift__``
* ``__isub__``
* ``__itruediv__``
* ``__iand__``
* ``__ior__``
* ``__ixor__``
* ``__radd__``
* ``__rfloordiv__``
* ``__rlshift__``
* ``__rmatmul__``
* ``__rmod__``
* ``__rmul__``
* ``__rpow__``
* ``__rrshift__``
* ``__rsub__``
* ``__rtruediv__``
* ``__rand__``
* ``__ror__``
* ``__rxor__``
Refer to the `Python Data Model documentation
<https://docs.python.org/3/reference/datamodel.html>`_ for descriptions of
these methods.
Limitations
===========
* A jitclass class object is treated as a function (the constructor) inside
a Numba compiled function.
* ``isinstance()`` only works in the interpreter.
* Manipulating jitclass instances in the interpreter is not optimized, yet.
* Support for jitclasses is available on CPU only.
(Note: Support for GPU devices is planned for a future release.)
The decorator: ``@jitclass``
============================
.. autofunction:: numba.experimental.jitclass
Overview
========
Numba is a compiler for Python array and numerical functions that gives
you the power to speed up your applications with high performance
functions written directly in Python.
Numba generates optimized machine code from pure Python code using
the `LLVM compiler infrastructure <http://llvm.org/>`_. With a few simple
annotations, array-oriented and math-heavy Python code can be
just-in-time optimized to performance similar to that of C, C++ and Fortran,
without having to switch languages or Python interpreters.
Numba's main features are:
* :ref:`on-the-fly code generation <jit>` (at import time or runtime, at the
user's preference)
* native code generation for the CPU (default) and
:doc:`GPU hardware <../cuda/index>`
* integration with the Python scientific software stack (thanks to Numpy)
Here is what a Numba-optimized function, taking a Numpy array as argument,
might look like::

    @numba.jit
    def sum2d(arr):
        M, N = arr.shape
        result = 0.0
        for i in range(M):
            for j in range(N):
                result += arr[i,j]
        return result
.. Copyright (c) 2017 Intel Corporation
SPDX-License-Identifier: BSD-2-Clause
.. _numba-parallel:
=======================================
Automatic parallelization with ``@jit``
=======================================
Setting the :ref:`parallel_jit_option` option for :func:`~numba.jit` enables
a Numba transformation pass that attempts to automatically parallelize and
perform other optimizations on (part of) a function. At the moment, this
feature only works on CPUs.
Some operations inside a user defined function, e.g. adding a scalar value to
an array, are known to have parallel semantics. A user program may contain
many such operations and while each operation could be parallelized
individually, such an approach often has lackluster performance due to poor
cache behavior. Instead, with auto-parallelization, Numba attempts to
identify such operations in a user program, and fuse adjacent ones together,
to form one or more kernels that are automatically run in parallel.
The process is fully automated without modifications to the user program,
which is in contrast to Numba's :func:`~numba.vectorize` or
:func:`~numba.guvectorize` mechanism, where manual effort is required
to create parallel kernels.
.. _numba-parallel-supported:
Supported Operations
====================
In this section, we give a list of all the array operations that have
parallel semantics and which we attempt to parallelize.
#. All numba array operations that are supported by :ref:`case-study-array-expressions`,
which include common arithmetic functions between Numpy arrays, and between
arrays and scalars, as well as Numpy ufuncs. They are often called
`element-wise` or `point-wise` array operations:
* unary operators: ``+`` ``-`` ``~``
* binary operators: ``+`` ``-`` ``*`` ``/`` ``/?`` ``%`` ``|`` ``>>`` ``^`` ``<<`` ``&`` ``**`` ``//``
* comparison operators: ``==`` ``!=`` ``<`` ``<=`` ``>`` ``>=``
* :ref:`Numpy ufuncs <supported_ufuncs>` that are supported in :term:`nopython mode`.
* User defined :class:`~numba.DUFunc` through :func:`~numba.vectorize`.
#. Numpy reduction functions ``sum``, ``prod``, ``min``, ``max``, ``argmin``,
and ``argmax``. Also, array math functions ``mean``, ``var``, and ``std``.
#. Numpy array creation functions ``zeros``, ``ones``, ``arange``, ``linspace``,
and several random functions (rand, randn, ranf, random_sample, sample,
random, standard_normal, chisquare, weibull, power, geometric, exponential,
poisson, rayleigh, normal, uniform, beta, binomial, f, gamma, lognormal,
laplace, randint, triangular).
#. Numpy ``dot`` function between a matrix and a vector, or two vectors.
In all other cases, Numba's default implementation is used.
#. Multi-dimensional arrays are also supported for the above operations
when operands have matching dimension and size. The full semantics of
Numpy broadcast between arrays with mixed dimensionality or size is
not supported, nor is the reduction across a selected dimension.
#. Array assignment in which the target is an array selection using a slice
or a boolean array, and the value being assigned is either a scalar or
another selection where the slice range or bitarray are inferred to be
compatible.
#. The ``reduce`` operator of ``functools`` is supported for specifying parallel
reductions on 1D Numpy arrays but the initial value argument is mandatory.
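A minimal sketch of such a reduction (the ``array_sum`` function is illustrative
and assumes a 1D ``float64`` array) might look like::

    from functools import reduce

    import numpy as np
    from numba import njit

    @njit(parallel=True)
    def array_sum(a):
        # the initial value (0.0 here) is mandatory for parallel reductions
        return reduce(lambda x, y: x + y, a, 0.0)

    print(array_sum(np.arange(10.0)))  # 45.0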
.. _numba-prange:
Explicit Parallel Loops
========================
Another feature of the code transformation pass (when ``parallel=True``) is
support for explicit parallel loops. One can use Numba's ``prange`` instead of
``range`` to specify that a loop can be parallelized. The user is required to
make sure that the loop does not have cross iteration dependencies except for
supported reductions.
A reduction is inferred automatically if a variable is updated by a supported binary
function/operator using its previous value in the loop body. The following
functions/operators are supported: ``+=``, ``+``, ``-=``, ``-``, ``*=``,
``*``, ``/=``, ``/``, ``max()``, ``min()``.
The initial value of the reduction is inferred automatically for the
supported operators (i.e., not the ``max`` and ``min`` functions).
Note that the ``//=`` operator is not supported because
in the general case the result depends on the order in which the divisors are
applied. However, if all divisors are integers then the programmer may be
able to rewrite the ``//=`` reduction as a ``*=`` reduction followed by
a single floor division after the parallel region where the divisor is the
accumulated product.
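A hedged sketch of that rewrite, assuming all divisors are integers (the
function name and values are illustrative)::

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def floordiv_reduction(divisors):
        # accumulate the product of the divisors with a supported *= reduction ...
        prod = 1
        for i in prange(divisors.shape[0]):
            prod *= divisors[i]
        # ... then apply a single floor division after the parallel region
        return 1_000_000 // prod

    print(floordiv_reduction(np.array([2, 5, 10], dtype=np.int64)))  # 10000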
For the ``max`` and ``min`` functions, the reduction variable should hold the identity
value right before entering the ``prange`` loop. Reductions in this manner
are supported for scalars and for arrays of arbitrary dimensions.
The example below demonstrates a parallel loop with a
reduction (``A`` is a one-dimensional Numpy array)::

    from numba import njit, prange

    @njit(parallel=True)
    def prange_test(A):
        s = 0
        # Without "parallel=True" in the jit-decorator
        # the prange statement is equivalent to range
        for i in prange(A.shape[0]):
            s += A[i]
        return s
The following example demonstrates a product reduction on a two-dimensional array::

    from numba import njit, prange
    import numpy as np

    @njit(parallel=True)
    def two_d_array_reduction_prod(n):
        shp = (13, 17)
        result1 = 2 * np.ones(shp, np.int_)
        tmp = 2 * np.ones_like(result1)
        for i in prange(n):
            result1 *= tmp
        return result1
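The following sketch (illustrative, not from the original examples) shows a
``max`` reduction where the reduction variable holds the identity value
immediately before entering the ``prange`` loop::

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def prange_max(A):
        m = -np.inf  # identity value for max, set right before the prange loop
        for i in prange(A.shape[0]):
            m = max(m, A[i])
        return m

    print(prange_max(np.random.rand(1000)))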
.. note:: When using Python's ``range`` to induce a loop, Numba types the
induction variable as a signed integer. This is also the case for
Numba's ``prange`` when ``parallel=False``. However, for
``parallel=True``, if the range is identifiable as strictly positive,
the type of the induction variable will be ``uint64``. The impact of
a ``uint64`` induction variable is often most noticeable when
undertaking operations involving it and a signed integer. Under
Numba's type coercion rules, such a case will commonly result in the
operation producing a floating point result type.
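The following minimal sketch (hypothetical, and dependent on the type inference
of the Numba version in use) illustrates the effect described in the note
above::

    from numba import njit, prange

    @njit(parallel=True)
    def offset_sum(n):
        s = 0
        for i in prange(n):   # `i` may be typed as uint64 here
            s += i - 1        # mixing uint64 with a signed literal may coerce to float64
        return s

    # the result may come back as a float (e.g. 35.0) rather than an int
    print(offset_sum(10))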
Care should be taken, however, when reducing into slices or elements of an array
if the elements specified by the slice or index are written to simultaneously by
multiple parallel threads. The compiler may not detect such cases and then a race condition
would occur.
The following example demonstrates such a case where a race condition in the execution of the
parallel for-loop results in an incorrect return value::

    from numba import njit, prange
    import numpy as np

    @njit(parallel=True)
    def prange_wrong_result(x):
        n = x.shape[0]
        y = np.zeros(4)
        for i in prange(n):
            # accumulating into the same element of `y` from different
            # parallel iterations of the loop results in a race condition
            y[:] += x[i]
        return y
as does the following example where the accumulating element is explicitly specified::

    from numba import njit, prange
    import numpy as np

    @njit(parallel=True)
    def prange_wrong_result(x):
        n = x.shape[0]
        y = np.zeros(4)
        for i in prange(n):
            # accumulating into the same element of `y` from different
            # parallel iterations of the loop results in a race condition
            y[i % 4] += x[i]
        return y
whereas performing a whole array reduction is fine::

    from numba import njit, prange
    import numpy as np

    @njit(parallel=True)
    def prange_ok_result_whole_arr(x):
        n = x.shape[0]
        y = np.zeros(4)
        for i in prange(n):
            y += x[i]
        return y
as is creating a slice reference outside of the parallel reduction loop::

    from numba import njit, prange
    import numpy as np

    @njit(parallel=True)
    def prange_ok_result_outer_slice(x):
        n = x.shape[0]
        y = np.zeros(4)
        z = y[:]
        for i in prange(n):
            z += x[i]
        return y
Examples
========
In this section, we give an example of how this feature helps
parallelize Logistic Regression::

    @numba.jit(nopython=True, parallel=True)
    def logistic_regression(Y, X, w, iterations):
        for i in range(iterations):
            w -= np.dot(((1.0 / (1.0 + np.exp(-Y * np.dot(X, w))) - 1.0) * Y), X)
        return w
We will not discuss details of the algorithm, but instead focus on how
this program behaves with auto-parallelization:
1. Input ``Y`` is a vector of size ``N``, ``X`` is an ``N x D`` matrix,
and ``w`` is a vector of size ``D``.
2. The function body is an iterative loop that updates variable ``w``.
The loop body consists of a sequence of vector and matrix operations.
3. The inner ``dot`` operation produces a vector of size ``N``, followed by a
sequence of arithmetic operations either between a scalar and vector of
size ``N``, or two vectors both of size ``N``.
4. The outer ``dot`` produces a vector of size ``D``, followed by an inplace
array subtraction on variable ``w``.
5. With auto-parallelization, all operations that produce array of size
``N`` are fused together to become a single parallel kernel. This includes
the inner ``dot`` operation and all point-wise array operations following it.
6. The outer ``dot`` operation produces a result array of different dimension,
and is not fused with the above kernel.
Here, the only thing required to take advantage of parallel hardware is to set
the :ref:`parallel_jit_option` option for :func:`~numba.jit`, with no
modifications to the ``logistic_regression`` function itself. If we were to
give an equivalent parallel implementation using :func:`~numba.guvectorize`,
it would require a pervasive change that rewrites the code to extract kernel
computation that can be parallelized, which is both tedious and challenging.
Unsupported Operations
======================
This section contains a non-exhaustive list of commonly encountered but
currently unsupported features:
#. **Mutating a list is not threadsafe**
Concurrent write operations on container types (i.e. lists, sets and
dictionaries) in a ``prange`` parallel region are not threadsafe e.g.::

    @njit(parallel=True)
    def invalid():
        z = []
        for i in prange(10000):
            z.append(i)
        return z
It is highly likely that the above will result in corruption or an access
violation as containers require thread-safety under mutation but this feature
is not implemented.
#. **Induction variables are not associated with thread ID**
The use of the induction variable induced by a ``prange`` based loop in
conjunction with ``get_num_threads`` as a method of ensuring safe writes into
a pre-sized container is not valid e.g.::

    @njit(parallel=True)
    def invalid():
        n = get_num_threads()
        z = [0 for _ in range(n)]
        for i in prange(100):
            z[i % n] += i
        return z
The above can on occasion appear to work, but it does so by luck. There's no
guarantee about which indexes are assigned to which executing threads or the
order in which the loop iterations execute.
.. _numba-parallel-diagnostics:
Diagnostics
===========
.. note:: At present not all parallel transforms and functions can be tracked
through the code generation process. Occasionally diagnostics about
some loops or transforms may be missing.
The :ref:`parallel_jit_option` option for :func:`~numba.jit` can produce
diagnostic information about the transforms undertaken in automatically
parallelizing the decorated code. This information can be accessed in two ways,
the first is by setting the environment variable
:envvar:`NUMBA_PARALLEL_DIAGNOSTICS`, the second is by calling
:meth:`~Dispatcher.parallel_diagnostics`, both methods give the same information
and print to ``STDOUT``. The level of verbosity in the diagnostic information is
controlled by an integer argument of value between 1 and 4 inclusive, 1 being
the least verbose and 4 the most. For example::

    @njit(parallel=True)
    def test(x):
        n = x.shape[0]
        a = np.sin(x)
        b = np.cos(a * a)
        acc = 0
        for i in prange(n - 2):
            for j in prange(n - 1):
                acc += b[i] + b[j + 1]
        return acc

    test(np.arange(10))
    test.parallel_diagnostics(level=4)
produces::
================================================================================
======= Parallel Accelerator Optimizing: Function test, example.py (4) =======
================================================================================
Parallel loop listing for Function test, example.py (4)
--------------------------------------|loop #ID
@njit(parallel=True) |
def test(x): |
n = x.shape[0] |
a = np.sin(x)---------------------| #0
b = np.cos(a * a)-----------------| #1
acc = 0 |
for i in prange(n - 2):-----------| #3
for j in prange(n - 1):-------| #2
acc += b[i] + b[j + 1] |
return acc |
--------------------------------- Fusing loops ---------------------------------
Attempting fusion of parallel loops (combines loops with similar properties)...
Trying to fuse loops #0 and #1:
- fusion succeeded: parallel for-loop #1 is fused into for-loop #0.
Trying to fuse loops #0 and #3:
- fusion failed: loop dimension mismatched in axis 0. slice(0, x_size0.1, 1)
!= slice(0, $40.4, 1)
----------------------------- Before Optimization ------------------------------
Parallel region 0:
+--0 (parallel)
+--1 (parallel)
Parallel region 1:
+--3 (parallel)
+--2 (parallel)
--------------------------------------------------------------------------------
------------------------------ After Optimization ------------------------------
Parallel region 0:
+--0 (parallel, fused with loop(s): 1)
Parallel region 1:
+--3 (parallel)
+--2 (serial)
Parallel region 0 (loop #0) had 1 loop(s) fused.
Parallel region 1 (loop #3) had 0 loop(s) fused and 1 loop(s) serialized as part
of the larger parallel loop (#3).
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
---------------------------Loop invariant code motion---------------------------
Instruction hoisting:
loop #0:
Failed to hoist the following:
dependency: $arg_out_var.10 = getitem(value=x, index=$parfor__index_5.99)
dependency: $0.6.11 = getattr(value=$0.5, attr=sin)
dependency: $expr_out_var.9 = call $0.6.11($arg_out_var.10, func=$0.6.11, args=[Var($arg_out_var.10, example.py (7))], kws=(), vararg=None)
dependency: $arg_out_var.17 = $expr_out_var.9 * $expr_out_var.9
dependency: $0.10.20 = getattr(value=$0.9, attr=cos)
dependency: $expr_out_var.16 = call $0.10.20($arg_out_var.17, func=$0.10.20, args=[Var($arg_out_var.17, example.py (8))], kws=(), vararg=None)
loop #3:
Has the following hoisted:
$const58.3 = const(int, 1)
$58.4 = _n_23 - $const58.3
--------------------------------------------------------------------------------
To aid users unfamiliar with the transforms undertaken when the
:ref:`parallel_jit_option` option is used, and to assist in the understanding of
the subsequent sections, the following definitions are provided:
* Loop fusion
`Loop fusion <https://en.wikipedia.org/wiki/Loop_fission_and_fusion>`_ is a
technique whereby loops with equivalent bounds may be combined under certain
conditions to produce a loop with a larger body (aiming to improve data
locality).
* Loop serialization
Loop serialization occurs when any number of ``prange`` driven loops are
present inside another ``prange`` driven loop. In this case the outermost
of all the ``prange`` loops executes in parallel and any inner ``prange``
loops (nested or otherwise) are treated as standard ``range`` based loops.
Essentially, nested parallelism does not occur.
* Loop invariant code motion
`Loop invariant code motion
<https://en.wikipedia.org/wiki/Loop-invariant_code_motion>`_ is an
optimization technique that analyses a loop to look for statements that can
be moved outside the loop body without changing the result of executing the
loop, these statements are then "hoisted" out of the loop to save repeated
computation.
* Allocation hoisting
Allocation hoisting is a specialized case of loop invariant code motion that
is possible due to the design of some common NumPy allocation methods.
Explanation of this technique is best driven by an example:
.. code-block:: python

    @njit(parallel=True)
    def test(n):
        for i in prange(n):
            temp = np.zeros((50, 50))  # <--- Allocate a temporary array with np.zeros()
            for j in range(50):
                temp[j, j] = i
            # ...do something with temp
internally, this is transformed to approximately the following:
.. code-block:: python

    @njit(parallel=True)
    def test(n):
        for i in prange(n):
            temp = np.empty((50, 50))  # <--- np.zeros() is rewritten as np.empty()
            temp[:] = 0                # <--- and then a zero initialisation
            for j in range(50):
                temp[j, j] = i
            # ...do something with temp
then after hoisting:
.. code-block:: python

    @njit(parallel=True)
    def test(n):
        temp = np.empty((50, 50))  # <--- allocation is hoisted as a loop invariant as `np.empty` is considered pure
        for i in prange(n):
            temp[:] = 0            # <--- this remains as assignment is a side effect
            for j in range(50):
                temp[j, j] = i
            # ...do something with temp
it can be seen that the ``np.zeros`` allocation is split into an allocation
and an assignment, and then the allocation is hoisted out of the loop in
``i``, thus producing more efficient code as the allocation only occurs
once.
The parallel diagnostics report sections
----------------------------------------
The report is split into the following sections:
#. Code annotation
This is the first section and contains the source code of the decorated
function with loops that have parallel semantics identified and enumerated.
The ``loop #ID`` column on the right of the source code lines up with
identified parallel loops. From the example, ``#0`` is ``np.sin``, ``#1``
is ``np.cos`` and ``#2`` and ``#3`` are ``prange()``:
.. code-block:: python
Parallel loop listing for Function test, example.py (4)
--------------------------------------|loop #ID
@njit(parallel=True) |
def test(x): |
n = x.shape[0] |
a = np.sin(x)---------------------| #0
b = np.cos(a * a)-----------------| #1
acc = 0 |
for i in prange(n - 2):-----------| #3
for j in prange(n - 1):-------| #2
acc += b[i] + b[j + 1] |
return acc |
It is worth noting that the loop IDs are enumerated in the order they are
discovered which is not necessarily the same order as present in the source.
Further, it should also be noted that the parallel transforms use a static
counter for loop ID indexing. As a consequence it is possible for the loop
ID index to not start at 0 due to use of the same counter for internal
optimizations/transforms taking place that are invisible to the user.
#. Fusing loops
This section describes the attempts made at fusing discovered
loops noting which succeeded and which failed. In the case of failure to
fuse a reason is given (e.g. dependency on other data). From the example:
.. code-block:: text
--------------------------------- Fusing loops ---------------------------------
Attempting fusion of parallel loops (combines loops with similar properties)...
Trying to fuse loops #0 and #1:
- fusion succeeded: parallel for-loop #1 is fused into for-loop #0.
Trying to fuse loops #0 and #3:
- fusion failed: loop dimension mismatched in axis 0. slice(0, x_size0.1, 1)
!= slice(0, $40.4, 1)
It can be seen that fusion of loops ``#0`` and ``#1`` was attempted and this
succeeded (both are based on the same dimensions of ``x``). Following the
successful fusion of ``#0`` and ``#1``, fusion was attempted between ``#0``
(now including the fused ``#1`` loop) and ``#3``. This fusion failed because
there is a loop dimension mismatch, ``#0`` is size ``x.shape`` whereas
``#3`` is size ``x.shape[0] - 2``.
#. Before Optimization
This section shows the structure of the parallel regions in the code before
any optimization has taken place, but with loops associated with their final
parallel region (this is to make before/after optimization output directly
comparable). Multiple parallel regions may exist if there are loops which
cannot be fused, in this case code within each region will execute in
parallel, but each parallel region will run sequentially. From the example:
.. code-block:: text
Parallel region 0:
+--0 (parallel)
+--1 (parallel)
Parallel region 1:
+--3 (parallel)
+--2 (parallel)
As alluded to by the `Fusing loops` section, there are necessarily two
parallel regions in the code. The first contains loops ``#0`` and ``#1``,
the second contains ``#3`` and ``#2``, all loops are marked ``parallel`` as
no optimization has taken place yet.
#. After Optimization
This section shows the structure of the parallel regions in the code after
optimization has taken place. Again, parallel regions are enumerated with
their corresponding loops but this time loops which are fused or serialized
are noted and a summary is presented. From the example:
.. code-block:: text
Parallel region 0:
+--0 (parallel, fused with loop(s): 1)
Parallel region 1:
+--3 (parallel)
+--2 (serial)
Parallel region 0 (loop #0) had 1 loop(s) fused.
Parallel region 1 (loop #3) had 0 loop(s) fused and 1 loop(s) serialized as part
of the larger parallel loop (#3).
It can be noted that parallel region 0 contains loop ``#0`` and, as seen in
the `fusing loops` section, loop ``#1`` is fused into loop ``#0``. It can
also be noted that parallel region 1 contains loop ``#3`` and that loop
``#2`` (the inner ``prange()``) has been serialized for execution in the
body of loop ``#3``.
#. Loop invariant code motion
This section shows for each loop, after optimization has occurred:
* the instructions that failed to be hoisted and the reason for failure
(dependency/impure).
* the instructions that were hoisted.
* any allocation hoisting that may have occurred.
From the example:
.. code-block:: text
Instruction hoisting:
loop #0:
Failed to hoist the following:
dependency: $arg_out_var.10 = getitem(value=x, index=$parfor__index_5.99)
dependency: $0.6.11 = getattr(value=$0.5, attr=sin)
dependency: $expr_out_var.9 = call $0.6.11($arg_out_var.10, func=$0.6.11, args=[Var($arg_out_var.10, example.py (7))], kws=(), vararg=None)
dependency: $arg_out_var.17 = $expr_out_var.9 * $expr_out_var.9
dependency: $0.10.20 = getattr(value=$0.9, attr=cos)
dependency: $expr_out_var.16 = call $0.10.20($arg_out_var.17, func=$0.10.20, args=[Var($arg_out_var.17, example.py (8))], kws=(), vararg=None)
loop #3:
Has the following hoisted:
$const58.3 = const(int, 1)
$58.4 = _n_23 - $const58.3
The first thing to note is that this information is for advanced users as it
refers to the :term:`Numba IR` of the function being transformed. As an
example, the expression ``a * a`` in the example source partly translates to
the expression ``$arg_out_var.17 = $expr_out_var.9 * $expr_out_var.9`` in
the IR, this clearly cannot be hoisted out of ``loop #0`` because it is not
loop invariant! Whereas in ``loop #3``, the expression
``$const58.3 = const(int, 1)`` comes from the source ``b[j + 1]``, the
number ``1`` is clearly a constant and so can be hoisted out of the loop.
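To reproduce a report like the one dissected above, a minimal sketch is shown
below; it assumes the dispatcher's ``parallel_diagnostics`` method (with a
verbosity ``level`` between 1 and 4) emits the report once the function has
been compiled::

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def test(x):
        n = x.shape[0]
        a = np.sin(x)
        b = np.cos(a * a)
        acc = 0
        for i in prange(n - 2):
            for j in prange(n - 1):
                acc += b[i] + b[j + 1]
        return acc

    test(np.arange(10.))

    # emit the most verbose form of the diagnostics report to STDOUT
    test.parallel_diagnostics(level=4)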
.. _numba-parallel-scheduling:
Scheduling
==========
By default, Numba divides the iterations of a parallel region into approximately equal
sized chunks and gives one such chunk to each configured thread.
(See :ref:`setting_the_number_of_threads`).
This scheduling approach is equivalent to OpenMP's static schedule with no specified
chunk size and is appropriate when the work required for each iteration is nearly constant.
Conversely, if the work required per iteration varies significantly, as in the
``prange`` loop below, then this static scheduling approach can lead to load
imbalances and longer execution times.
.. literalinclude:: ../../../numba/tests/doc_examples/test_parallel_chunksize.py
:language: python
:caption: from ``test_unbalanced_example`` of ``numba/tests/doc_examples/test_parallel_chunksize.py``
:start-after: magictoken.ex_unbalanced.begin
:end-before: magictoken.ex_unbalanced.end
:dedent: 12
:linenos:
In such cases,
Numba provides a mechanism to control how many iterations of a parallel region
(i.e., the chunk size) go into each chunk.
Numba then computes the number of required chunks, which is equal to the number
of iterations divided by the chunk size, truncated to an integer. All of these
chunks are then approximately equally sized.
Numba then gives one such chunk to each configured
thread as above and when a thread finishes a chunk, Numba gives that thread the next
available chunk.
This scheduling approach is similar to OpenMP's dynamic scheduling
option with the specified chunk size.
(Note that Numba is only capable of supporting this dynamic scheduling
of parallel regions if the underlying Numba threading backend,
:ref:`numba-threading-layer`, is also capable of dynamic scheduling.
At the moment, only the ``tbb`` backend is capable of dynamic
scheduling and so is required if any performance
benefit is to be achieved from this chunk size selection mechanism.)
To minimize execution time, the programmer must
pick a chunk size that strikes a balance between greater load balancing with smaller
chunk sizes and less scheduling overhead with larger chunk sizes.
See :ref:`chunk-details-label` for additional details on the internal implementation
of chunk sizes.
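As a purely illustrative sketch of the arithmetic described above (all numbers
here are hypothetical)::

    # hypothetical values, purely to illustrate the chunk computation
    iterations = 1_000_000
    chunk_size = 64

    # integer (truncating) division gives the number of chunks that are
    # handed out to threads dynamically
    num_chunks = iterations // chunk_size   # -> 15625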
The number of iterations of a parallel region in a chunk is stored as a thread-local
variable and can be set using
:func:`numba.set_parallel_chunksize`. This function takes one integer parameter
whose value must be greater than
or equal to 0. A value of 0 is the default value and instructs Numba to use the
static scheduling approach above. Values greater than 0 instruct Numba to use that value
as the chunk size in the dynamic scheduling approach described above.
:func:`numba.set_parallel_chunksize` returns the previous value of the chunk size.
The current value of this thread local variable is used as the chunk size for all
subsequent parallel regions invoked by this thread.
However, upon entering a parallel region, Numba sets the chunk size thread local variable
for each of the threads executing that parallel region back to the default of 0,
since it is unlikely
that any nested parallel regions would require the same chunk size. If the same thread is
used to execute a sequential and parallel region then that thread's chunk size
variable is set to 0 at the beginning of the parallel region and restored to
its original value upon exiting the parallel region.
This behavior is demonstrated in ``func1`` in the example below in that the
reported chunk size inside the ``prange`` parallel region is 0 but is 4 outside
the parallel region. Note that if the ``prange`` is not executed in parallel for
any reason (e.g., setting ``parallel=False``) then the chunk size reported inside
the non-parallel prange would be reported as 4.
This behavior may initially be counterintuitive to programmers as it differs from
how thread local variables typically behave in other languages.
A programmer may use
the chunk size API described in this section within the threads executing a parallel
region if the programmer wishes to specify a chunk size for any nested parallel regions
that may be launched.
The current value of the parallel chunk size can be obtained by calling
:func:`numba.get_parallel_chunksize`.
Both of these functions can be used from standard Python and from within Numba JIT compiled functions
as shown below. Both invocations of ``func1`` would be executed with a chunk size of 4 whereas
``func2`` would use a chunk size of 8.
.. literalinclude:: ../../../numba/tests/doc_examples/test_parallel_chunksize.py
:language: python
:caption: from ``test_chunksize_manual`` of ``numba/tests/doc_examples/test_parallel_chunksize.py``
:start-after: magictoken.ex_chunksize_manual.begin
:end-before: magictoken.ex_chunksize_manual.end
:dedent: 12
:linenos:
Since this idiom of saving and restoring is so common, Numba provides the
:func:`parallel_chunksize` context manager, used in a ``with`` statement, to
simplify the idiom. As shown below, this context manager can be used from both
standard Python and within Numba JIT compiled functions. As with other Numba
context managers, be aware that raising exceptions is not supported from within
a context-managed block that is part of a Numba JIT compiled function.
.. literalinclude:: ../../../numba/tests/doc_examples/test_parallel_chunksize.py
:language: python
:caption: from ``test_chunksize_with`` of ``numba/tests/doc_examples/test_parallel_chunksize.py``
:start-after: magictoken.ex_chunksize_with.begin
:end-before: magictoken.ex_chunksize_with.end
:dedent: 12
:linenos:
Note that these functions to set the chunk size only have an effect on
Numba automatic parallelization with the :ref:`parallel_jit_option` option.
Chunk size specification has no effect on the :func:`~numba.vectorize` decorator
or the :func:`~numba.guvectorize` decorator.
.. seealso:: :ref:`parallel_jit_option`, :ref:`Parallel FAQs <parallel_FAQs>`
.. _performance-tips:
Performance Tips
================
This is a short guide to features present in Numba that can help with obtaining
the best performance from code. Two examples are used, both are entirely
contrived and exist purely for pedagogical reasons to motivate discussion.
The first is the computation of the trigonometric identity
``cos(x)^2 + sin(x)^2``, the second is a simple element wise square root of a
vector with reduction over summation. All performance numbers are indicative
only and unless otherwise stated were taken from running on an Intel ``i7-4790``
CPU (4 hardware threads) with an input of ``np.arange(1.e7)``.
.. note::
A reasonably effective approach to achieving high performance code is to
profile the code running with real data and use that to guide performance
tuning. The information presented here is to demonstrate features, not to act
as canonical guidance!
No Python mode vs Object mode
-----------------------------
A common pattern is to decorate functions with ``@jit`` as this is the most
flexible decorator offered by Numba. ``@jit`` essentially encompasses two modes
of compilation: first it will try to compile the decorated function in no
Python mode; if this fails it will try again to compile the function using
object mode. Whilst the use of looplifting in object mode can enable some
performance increase, getting functions to compile under no Python mode is
really the key to good performance. To make it such that only no Python mode is
used, and that an exception is raised if compilation fails, the decorators
``@njit`` and ``@jit(nopython=True)`` can be used (the first is an alias of the
second for convenience).
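For example, the following two decorations both request no Python mode only
and will raise an error if compilation in that mode fails (a minimal sketch)::

    from numba import jit, njit

    @njit                   # alias for @jit(nopython=True)
    def add(a, b):
        return a + b

    @jit(nopython=True)     # equivalent, spelled out explicitly
    def also_add(a, b):
        return a + b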
Loops
-----
Whilst NumPy has developed a strong idiom around the use of vector operations,
Numba is perfectly happy with loops too. For users familiar with C or Fortran,
writing Python in this style will work fine in Numba (after all, LLVM gets a
lot of use in compiling C lineage languages). For example::
@njit
def ident_np(x):
return np.cos(x) ** 2 + np.sin(x) ** 2
@njit
def ident_loops(x):
r = np.empty_like(x)
n = len(x)
for i in range(n):
r[i] = np.cos(x[i]) ** 2 + np.sin(x[i]) ** 2
return r
The above run at almost identical speeds when decorated with ``@njit``; without
the decorator the vectorized function is a couple of orders of magnitude faster.
+-----------------+-------+----------------+
| Function Name | @njit | Execution time |
+=================+=======+================+
| ``ident_np`` | No | 0.581s |
+-----------------+-------+----------------+
| ``ident_np`` | Yes | 0.659s |
+-----------------+-------+----------------+
| ``ident_loops`` | No | 25.2s |
+-----------------+-------+----------------+
| ``ident_loops`` | Yes | 0.670s |
+-----------------+-------+----------------+
.. _fast-math:
Fastmath
--------
In certain classes of applications strict IEEE 754 compliance is less
important. As a result it is possible to relax some numerical rigour with a
view to gaining additional performance. The way to achieve this behaviour in
Numba is through the use of the ``fastmath`` keyword argument::
@njit(fastmath=False)
def do_sum(A):
acc = 0.
# without fastmath, this loop must accumulate in strict order
for x in A:
acc += np.sqrt(x)
return acc
@njit(fastmath=True)
def do_sum_fast(A):
acc = 0.
# with fastmath, the reduction can be vectorized as floating point
# reassociation is permitted.
for x in A:
acc += np.sqrt(x)
return acc
+-----------------+-----------------+
| Function Name | Execution time |
+=================+=================+
| ``do_sum`` | 35.2 ms |
+-----------------+-----------------+
| ``do_sum_fast`` | 17.8 ms |
+-----------------+-----------------+
In some cases you may wish to opt-in to only a subset of possible fast-math
optimizations. This can be done by supplying a set of `LLVM fast-math flags
<https://llvm.org/docs/LangRef.html#fast-math-flags>`_ to ``fastmath``.::
def add_assoc(x, y):
return (x - y) + y
print(njit(fastmath=False)(add_assoc)(0, np.inf)) # nan
print(njit(fastmath=True) (add_assoc)(0, np.inf)) # 0.0
print(njit(fastmath={'reassoc', 'nsz'})(add_assoc)(0, np.inf)) # 0.0
print(njit(fastmath={'reassoc'}) (add_assoc)(0, np.inf)) # nan
print(njit(fastmath={'nsz'}) (add_assoc)(0, np.inf)) # nan
Parallel=True
-------------
If code contains operations that are parallelisable (:ref:`and supported
<numba-parallel-supported>`) Numba can compile a version that will run in
parallel on multiple native threads (no GIL!). This parallelisation is performed
automatically and is enabled by simply adding the ``parallel`` keyword
argument::
@njit(parallel=True)
def ident_parallel(x):
return np.cos(x) ** 2 + np.sin(x) ** 2
Execution times are as follows:
+--------------------+-----------------+
| Function Name | Execution time |
+====================+=================+
| ``ident_parallel`` | 112 ms |
+--------------------+-----------------+
The execution speed of this function with ``parallel=True`` present is
approximately 5x that of the NumPy equivalent and 6x that of standard
``@njit``.
Numba parallel execution also has support for explicit parallel loop
declaration similar to that in OpenMP. To indicate that a loop should be
executed in parallel the ``numba.prange`` function should be used, this function
behaves like Python ``range`` and if ``parallel=True`` is not set it acts
simply as an alias of ``range``. Loops induced with ``prange`` can be used for
embarrassingly parallel computation and also reductions.
Revisiting the reduce over sum example, assuming it is safe for the sum to be
accumulated out of order, the loop in ``n`` can be parallelised through the use
of ``prange``. Further, the ``fastmath=True`` keyword argument can be added
without concern in this case as the assumption that out of order execution is
valid has already been made through the use of ``parallel=True`` (as each thread
computes a partial sum).
::
@njit(parallel=True)
def do_sum_parallel(A):
# each thread can accumulate its own partial sum, and then a cross
# thread reduction is performed to obtain the result to return
n = len(A)
acc = 0.
for i in prange(n):
acc += np.sqrt(A[i])
return acc
@njit(parallel=True, fastmath=True)
def do_sum_parallel_fast(A):
n = len(A)
acc = 0.
for i in prange(n):
acc += np.sqrt(A[i])
return acc
Execution times are as follows; ``fastmath`` again improves performance.
+-------------------------+-----------------+
| Function Name | Execution time |
+=========================+=================+
| ``do_sum_parallel`` | 9.81 ms |
+-------------------------+-----------------+
| ``do_sum_parallel_fast``| 5.37 ms |
+-------------------------+-----------------+
.. _intel-svml:
Intel SVML
----------
Intel provides a short vector math library (SVML) that contains a large number
of optimised transcendental functions available for use as compiler
intrinsics. If the ``intel-cmplr-lib-rt`` package is present in the
environment (or the SVML libraries are simply locatable!) then Numba
automatically configures the LLVM back end to use the SVML intrinsic functions
wherever possible. SVML provides both high and low accuracy versions of each
intrinsic and the version that is used is determined through the use of the
``fastmath`` keyword. The default is to use high accuracy which is accurate to
within ``1 ULP``, however if ``fastmath`` is set to ``True`` then the lower
accuracy versions of the intrinsics are used (answers to within ``4 ULP``).
First obtain SVML, using conda for example::
conda install intel-cmplr-lib-rt
.. note::
The SVML library was previously provided through the ``icc_rt`` conda
package. The ``icc_rt`` package has since become a meta-package and as of
version ``2021.1.1`` it has ``intel-cmplr-lib-rt`` amongst other packages as
a dependency. Installing the recommended ``intel-cmplr-lib-rt`` package
directly results in fewer installed packages.
Rerunning the identity function example ``ident_np`` from above with various
combinations of options to ``@njit`` and with/without SVML yields the following
performance results (input size ``np.arange(1.e8)``). For reference, with just
NumPy the function executed in ``5.84s``:
+-----------------------------------+--------+-------------------+
| ``@njit`` kwargs | SVML | Execution time |
+===================================+========+===================+
| ``None`` | No | 5.95s |
+-----------------------------------+--------+-------------------+
| ``None`` | Yes | 2.26s |
+-----------------------------------+--------+-------------------+
| ``fastmath=True`` | No | 5.97s |
+-----------------------------------+--------+-------------------+
| ``fastmath=True`` | Yes | 1.8s |
+-----------------------------------+--------+-------------------+
| ``parallel=True`` | No | 1.36s |
+-----------------------------------+--------+-------------------+
| ``parallel=True`` | Yes | 0.624s |
+-----------------------------------+--------+-------------------+
| ``parallel=True, fastmath=True`` | No | 1.32s |
+-----------------------------------+--------+-------------------+
| ``parallel=True, fastmath=True`` | Yes | 0.576s |
+-----------------------------------+--------+-------------------+
It is evident that SVML significantly increases the performance of this
function. The impact of ``fastmath`` in the case of SVML not being present is
zero; this is expected as there is nothing in the original function that would
benefit from relaxing numerical strictness.
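To check whether SVML was actually detected, the ``numba -s`` diagnostic output
contains an SVML information section; alternatively, the following sketch
assumes (as an assumption, not a guarantee) that the ``numba.config.USING_SVML``
flag reflects the detection result::

    from numba import config

    # assumed attribute name: True if Numba detected SVML at start-up
    print(config.USING_SVML)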
Linear algebra
--------------
Numba supports most of ``numpy.linalg`` in no Python mode. The internal
implementation relies on a LAPACK and BLAS library to do the numerical work
and it obtains the bindings for the necessary functions from SciPy. Therefore,
to achieve good performance in ``numpy.linalg`` functions with Numba it is
necessary to use a SciPy built against a well optimised LAPACK/BLAS library.
In the case of the Anaconda distribution SciPy is built against Intel's MKL
which is highly optimised and as a result Numba makes use of this performance.
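As a short sketch of what this looks like in practice (any ``numpy.linalg``
routine supported in no Python mode would serve equally well)::

    import numpy as np
    from numba import njit

    @njit
    def solve(a, b):
        # dispatched to the LAPACK bindings obtained via SciPy
        return np.linalg.solve(a, b)

    a = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([9.0, 8.0])
    print(solve(a, b))   # [2. 3.]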
============================
Compiling code ahead of time
============================
.. _pycc:
While Numba's main use case is :term:`Just-in-Time compilation`, it also
provides a facility for :term:`Ahead-of-Time compilation` (AOT).
.. note:: To use this feature the ``setuptools`` package is required at
compilation time, but it is not a runtime dependency of the
extension module produced.
.. note:: This module is pending deprecation. Please see
:ref:`deprecation-numba-pycc` for more information.
Overview
========
Benefits
--------
#. AOT compilation produces a compiled extension module which does not depend
on Numba: you can distribute the module on machines which do not have
Numba installed (but NumPy is required).
#. There is no compilation overhead at runtime (but see the
``@jit`` :ref:`cache <jit-cache>` option), nor any overhead of importing
Numba.
.. seealso::
Compiled extension modules are discussed in the
`Python packaging user guide <https://packaging.python.org/en/latest/guides/packaging-binary-extensions/>`_.
Limitations
-----------
#. AOT compilation only allows for regular functions, not :term:`ufuncs <ufunc>`.
#. You have to specify function signatures explicitly.
#. Each exported function can have only one signature (but you can export
several different signatures under different names).
#. Exported functions do not check the types of the arguments that are passed
to them; the caller is expected to provide arguments of the correct type.
#. AOT compilation produces generic code for your CPU's architectural family
(for example "x86-64"), while JIT compilation produces code optimized
for your particular CPU model.
Usage
=====
Standalone example
------------------
::
from numba.pycc import CC
cc = CC('my_module')
# Uncomment the following line to print out the compilation steps
#cc.verbose = True
@cc.export('multf', 'f8(f8, f8)')
@cc.export('multi', 'i4(i4, i4)')
def mult(a, b):
return a * b
@cc.export('square', 'f8(f8)')
def square(a):
return a ** 2
if __name__ == "__main__":
cc.compile()
If you run this Python script, it will generate an extension module named
``my_module``. Depending on your platform, the actual filename may be
``my_module.so``, ``my_module.pyd``, ``my_module.cpython-34m.so``, etc.
The generated module has three functions: ``multf``, ``multi`` and ``square``.
``multi`` operates on 32-bit integers (``i4``), while ``multf`` and ``square``
operate on double-precision floats (``f8``)::
>>> import my_module
>>> my_module.multi(3, 4)
12
>>> my_module.square(1.414)
1.9993959999999997
Distutils integration
---------------------
You can also integrate the compilation step for your extension modules
in your ``setup.py`` script, using distutils or setuptools::
from distutils.core import setup
from source_module import cc
setup(...,
ext_modules=[cc.distutils_extension()])
The ``source_module`` above is the module defining the ``cc`` object.
Extensions compiled like this will be automatically included in the
build files for your Python project, so you can distribute them inside
binary packages such as wheels or Conda packages. Note that in the case of
using conda, the compilers used for AOT need to be those that are available
in the Anaconda distribution.
Signature syntax
----------------
The syntax for exported signatures is the same as in the ``@jit``
decorator. You can read more about it in the :ref:`types <numba-types>`
reference.
Here is an example of exporting an implementation of the second-order
centered difference on a 1d array::
@cc.export('centdiff_1d', 'f8[:](f8[:], f8)')
def centdiff_1d(u, dx):
D = np.empty_like(u)
D[0] = 0
D[-1] = 0
for i in range(1, len(D) - 1):
D[i] = (u[i+1] - 2 * u[i] + u[i-1]) / dx**2
return D
.. (example from http://nbviewer.ipython.org/gist/ketch/ae87a94f4ef0793d5d52)
You can also omit the return type, which will then be inferred by Numba::
@cc.export('centdiff_1d', '(f8[:], f8)')
def centdiff_1d(u, dx):
# Same code as above
...
.. Copyright (c) 2017 Intel Corporation
SPDX-License-Identifier: BSD-2-Clause
.. _numba-stencil:
================================
Using the ``@stencil`` decorator
================================
Stencils are a common computational pattern in which array elements
are updated according to some fixed pattern called the stencil kernel.
Numba provides the ``@stencil`` decorator so that users may
easily specify a stencil kernel and Numba then generates the looping
code necessary to apply that kernel to some input array. Thus, the
stencil decorator allows clearer, more concise code and in conjunction
with :ref:`the parallel jit option <parallel_jit_option>` enables higher
performance through parallelization of the stencil execution.
Basic usage
===========
An example use of the ``@stencil`` decorator::
from numba import stencil
@stencil
def kernel1(a):
return 0.25 * (a[0, 1] + a[1, 0] + a[0, -1] + a[-1, 0])
The stencil kernel is specified by what looks like a standard Python
function definition but there are different semantics with
respect to array indexing.
Stencils produce an output array of the same size and shape as the
input array although, depending on the kernel definition, it may have a
different type.
Conceptually, the stencil kernel is run once for each element in the
output array. The return value from the stencil kernel is the value
written into the output array for that particular element.
The parameter ``a`` represents the input array over which the
kernel is applied.
Indexing into this array takes place with respect to the current element
of the output array being processed. For example, if element ``(x, y)``
is being processed then ``a[0, 0]`` in the stencil kernel corresponds to
``a[x + 0, y + 0]`` in the input array. Similarly, ``a[-1, 1]`` in the stencil
kernel corresponds to ``a[x - 1, y + 1]`` in the input array.
Depending on the specified kernel, the kernel may not be applicable to the
borders of the output array as this may cause the input array to be
accessed out-of-bounds. The way in which the stencil decorator handles
this situation is dependent upon which :ref:`stencil-mode` is selected.
The default mode is for the stencil decorator to set the border elements
of the output array to zero.
To invoke a stencil on an input array, call the stencil as if it were
a regular function and pass the input array as the argument. For example, using
the kernel defined above::
>>> import numpy as np
>>> input_arr = np.arange(100).reshape((10, 10))
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
[40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
[50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
[60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
[70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
[80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
[90, 91, 92, 93, 94, 95, 96, 97, 98, 99]])
>>> output_arr = kernel1(input_arr)
array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 11., 12., 13., 14., 15., 16., 17., 18., 0.],
[ 0., 21., 22., 23., 24., 25., 26., 27., 28., 0.],
[ 0., 31., 32., 33., 34., 35., 36., 37., 38., 0.],
[ 0., 41., 42., 43., 44., 45., 46., 47., 48., 0.],
[ 0., 51., 52., 53., 54., 55., 56., 57., 58., 0.],
[ 0., 61., 62., 63., 64., 65., 66., 67., 68., 0.],
[ 0., 71., 72., 73., 74., 75., 76., 77., 78., 0.],
[ 0., 81., 82., 83., 84., 85., 86., 87., 88., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
>>> input_arr.dtype
dtype('int64')
>>> output_arr.dtype
dtype('float64')
Note that the stencil decorator has determined that the output type
of the specified stencil kernel is ``float64`` and has thus created the
output array as ``float64`` while the input array is of type ``int64``.
Stencil Parameters
==================
Stencil kernel definitions may take any number of arguments with
the following provisions. The first argument must be an array.
The size and shape of the output array will be the same as that of the
first argument. Additional arguments may either be scalars or
arrays. For array arguments, those arrays must be at least as large
as the first argument (array) in each dimension. Array indexing is relative for
all such input array arguments.
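For example, the following sketch passes an additional scalar weight alongside
the input array (the kernel and variable names are illustrative only)::

    import numpy as np
    from numba import stencil

    @stencil
    def kernel_weighted(a, w):
        # `a` is indexed relatively; `w` is a scalar argument
        return w * (a[0, 1] + a[1, 0] + a[0, -1] + a[-1, 0])

    out = kernel_weighted(np.arange(25.0).reshape(5, 5), 0.25)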
.. _stencil-kernel-shape-inference:
Kernel shape inference and border handling
==========================================
In the above example and in most cases, the array indexing in the
stencil kernel will exclusively use ``Integer`` literals.
In such cases, the stencil decorator is able to analyze the stencil
kernel to determine its size. In the above example, the stencil
decorator determines that the kernel is ``3 x 3`` in shape since indices
``-1`` to ``1`` are used for both the first and second dimensions. Note that
the stencil decorator also correctly handles non-symmetric and
non-square stencil kernels.
Based on the size of the stencil kernel, the stencil decorator is
able to compute the size of the border in the output array. If
applying the kernel to some element of input array would cause
an index to be out-of-bounds then that element belongs to the border
of the output array. In the above example, points ``-1`` and ``+1`` are
accessed in each dimension and thus the output array has a border
of size one in all dimensions.
The parallel mode is able to infer kernel indices as constants from
simple expressions if possible. For example::
@njit(parallel=True)
def stencil_test(A):
c = 2
B = stencil(
lambda a, c: 0.3 * (a[-c+1] + a[0] + a[c-1]))(A, c)
return B
Stencil decorator options
=========================
.. note::
The stencil decorator may be augmented in the future to provide additional
mechanisms for border handling. At present, only one behaviour is
implemented, ``"constant"`` (see ``func_or_mode`` below for details).
.. _stencil-neighborhood:
``neighborhood``
----------------
Sometimes it may be inconvenient to write the stencil kernel
exclusively with ``Integer`` literals. For example, let us say we
would like to compute the trailing 30-day moving average of a
time series of data. One could write
``(a[-29] + a[-28] + ... + a[-1] + a[0]) / 30`` but the stencil
decorator offers a more concise form using the ``neighborhood``
option::
@stencil(neighborhood = ((-29, 0),))
def kernel2(a):
cumul = 0
for i in range(-29, 1):
cumul += a[i]
return cumul / 30
The neighborhood option is a tuple of tuples. The outer tuple's
length is equal to the number of dimensions of the input array.
The inner tuples always have length two because the elements of each inner
tuple are the minimum and maximum index offsets used in the corresponding
dimension.
If a user specifies a neighborhood but the kernel accesses elements outside the
specified neighborhood, **the behavior is undefined.**
.. _stencil-mode:
``func_or_mode``
----------------
The optional ``func_or_mode`` parameter controls how the border of the output array
is handled. Currently, there is only one supported value, ``"constant"``.
In ``constant`` mode, the stencil kernel is not applied in cases where
the kernel would access elements outside the valid range of the input
array. In such cases, those elements in the output array are assigned
to a constant value, as specified by the ``cval`` parameter.
``cval``
--------
The optional cval parameter defaults to zero but can be set to any
desired value, which is then used for the border of the output array
if the ``func_or_mode`` parameter is set to ``constant``. The cval parameter is
ignored in all other modes. The type of the cval parameter must match
the return type of the stencil kernel. If the user wishes the output
array to be constructed from a particular type then they should ensure
that the stencil kernel returns that type.
``standard_indexing``
---------------------
By default, all array accesses in a stencil kernel are processed as
relative indices as described above. However, sometimes it may be
advantageous to pass an auxiliary array (e.g. an array of weights)
to a stencil kernel and have that array use standard Python indexing
rather than relative indexing. For this purpose, there is the
stencil decorator option ``standard_indexing`` whose value is a
collection of strings whose names match those parameters to the
stencil function that are to be accessed with standard Python indexing
rather than relative indexing::
@stencil(standard_indexing=("b",))
def kernel3(a, b):
return a[-1] * b[0] + a[0] + b[1]
``StencilFunc``
===============
The stencil decorator returns a callable object of type ``StencilFunc``. A
``StencilFunc`` object contains a number of attributes but the only one of
potential interest to users is the ``neighborhood`` attribute.
If the ``neighborhood`` option was passed to the stencil decorator then
the provided neighborhood is stored in this attribute. Else, upon
first execution or compilation, the system calculates the neighborhood
as described above and then stores the computed neighborhood into this
attribute. A user may then inspect the attribute if they wish to verify
that the calculated neighborhood is correct.
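A minimal sketch of inspecting the attribute, reusing ``kernel1`` from the
basic usage example above (the value in the comment is an expectation rather
than verified output)::

    import numpy as np

    input_arr = np.arange(100).reshape((10, 10))
    kernel1(input_arr)            # trigger compilation and neighborhood inference
    print(kernel1.neighborhood)   # expected: ((-1, 1), (-1, 1)) for the 3 x 3 kernel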
Stencil invocation options
==========================
Internally, the stencil decorator transforms the specified stencil
kernel into a regular Python function. This function will have the
same parameters as specified in the stencil kernel definition but will
also include the following optional parameter.
.. _stencil-function-out:
``out``
-------
The optional ``out`` parameter is added to every stencil function
generated by Numba. If specified, the ``out`` parameter tells
Numba that the user is providing their own pre-allocated array
to be used for the output of the stencil. In this case, the
stencil function will not allocate its own output array.
Users should ensure that the return type of the stencil kernel can
be safely cast to the element-type of the user-specified output array
following the `NumPy ufunc casting rules`_.
.. _`NumPy ufunc casting rules`: http://docs.scipy.org/doc/numpy/reference/ufuncs.html#casting-rules
An example usage is shown below::
>>> import numpy as np
>>> input_arr = np.arange(100).reshape((10, 10))
>>> output_arr = np.full(input_arr.shape, 0.0)
>>> kernel1(input_arr, out=output_arr)
Talks and Tutorials
===================
.. note:: This is a selection of talks and tutorials that have been given by members of
the Numba team as well as Numba users. If you know of a Numba-related talk
that should be included on this list, please `open an issue <https://github.com/numba/numba/issues>`_.
Talks on Numba
--------------
* AnacondaCON 2018 - Accelerating Scientific Workloads with Numba - Siu Kwan Lam (`Video <https://www.youtube.com/watch?v=6oXedk2tGfk>`__)
* `DIANA-HEP Meeting, 23 April 2018 <https://indico.cern.ch/event/709711/>`__ - Overview of Numba - Stan Seibert
Talks on Applications of Numba
------------------------------
* GPU Technology Conference 2016 - Accelerating a Spectral Algorithm for Plasma Physics with Python/Numba on GPU - Manuel Kirchen & Rémi Lehe (`Slides <http://on-demand.gputechconf.com/gtc/2016/presentation/s6353-manuel-kirchen-spectral-algorithm-plasma-physics.pdf>`__)
* `DIANA-HEP Meeting, 23 April 2018 <https://indico.cern.ch/event/709711/>`_ - Use of Numba in XENONnT - Chris Tunnell
* `DIANA-HEP Meeting, 23 April 2018 <https://indico.cern.ch/event/709711/>`_ - Extending Numba for HEP data types - Jim Pivarski
* STAC Summit, Nov 1 2017 - Scaling High-Performance Python with Minimal Effort - Ehsan Totoni (`Video <https://stacresearch.com/STAC-Summit-1-Nov-2017-Intel-Totoni>`__, `Slides <https://stacresearch.com/system/files/resource/files/STAC-Summit-1-Nov-2017-Intel-Totoni.pdf>`__)
* SciPy 2018 - UMAP: Uniform Manifold Approximation and Projection for Dimensional Reduction - Leland McInnes (`Video <https://www.youtube.com/watch?v=nq6iPZVUxZU>`__, `Github <https://github.com/lmcinnes/umap>`__)
* PyData Berlin 2018 - Extending Pandas using Apache Arrow and Numba - Uwe L. Korn (`Video <https://www.youtube.com/watch?v=tvmX8YAFK80>`__, `Blog <https://uwekorn.com/2018/08/03/use-numba-to-work-with-apache-arrow-in-pure-python.html>`__)
* FOSDEM 2019 - Extending Numba - Joris Geessels (`Video, Slides & Examples <https://fosdem.org/2019/schedule/event/python_extending_numba/>`__)
* PyCon India 2019 - Real World Numba: Taking the Path of Least Resistance - Ankit Mahato (`Video <https://www.youtube.com/watch?v=rhbegsr8stc>`__)
* SciPy 2019 - How to Accelerate an Existing Codebase with Numba - Siu Kwan Lam & Stanley Seibert (`Video <https://www.youtube.com/watch?v=-4tD8kNHdXs>`__)
* SciPy 2019 - Real World Numba: Creating a Skeleton Analysis Library - Juan Nunez-Iglesias (`Video <https://www.youtube.com/watch?v=0pUPNMglnaE>`__)
* SciPy 2019 - Fast Gradient Boosting Decision Trees with PyGBM and Numba - Nicholas Hug (`Video <https://www.youtube.com/watch?v=cLpIh8Aiy2w>`__)
* PyCon Sweden 2020 - Accelerating Scientific Computing using Numba - Ankit Mahato (`Video <https://www.youtube.com/watch?v=d_21Q0UoWrQ>`__)
Tutorials
---------
* SciPy 2017 - Numba: Tell those C++ Bullies to Get Lost - Gil Forsyth & Lorena Barba (`Video <https://www.youtube.com/watch?v=1AwG0T4gaO0>`__, `Notebooks <https://github.com/gforsyth/numba_tutorial_scipy2017>`__)
* GPU Technology Conference 2018 - GPU Computing in Python with Numba - Stan Seibert (`Notebooks <https://github.com/ContinuumIO/gtc2018-numba>`__)
* PyData Amsterdam 2019 - Create CUDA kernels from Python using Numba and CuPy - Valentin Haenel (`Video <https://www.youtube.com/watch?v=CQDsT81GyS8>`__)
.. _numba-threading-layer:
The Threading Layers
====================
This section is about the Numba threading layer; this is the library that is
used internally to perform the parallel execution that occurs through the use of
the ``parallel`` targets for CPUs, namely:
* The use of the ``parallel=True`` kwarg in ``@jit`` and ``@njit``.
* The use of the ``target='parallel'`` kwarg in ``@vectorize`` and
``@guvectorize``.
.. note::
If a code base does not use the ``threading`` or ``multiprocessing``
modules (or any other sort of parallelism) the defaults for the threading
layer that ship with Numba will work well, no further action is required!
Which threading layers are available?
-------------------------------------
There are three threading layers available and they are named as follows:
* ``tbb`` - A threading layer backed by Intel TBB.
* ``omp`` - A threading layer backed by OpenMP.
* ``workqueue`` - A simple built-in work-sharing task scheduler.
In practice, the only threading layer guaranteed to be present is ``workqueue``.
The ``omp`` layer requires the presence of a suitable OpenMP runtime library.
The ``tbb`` layer requires the presence of Intel's TBB libraries; these can be
obtained via the conda command::
$ conda install tbb
If you installed Numba with ``pip``, TBB can be enabled by running::
$ pip install tbb
Due to compatibility issues with manylinux1 and other portability concerns,
the OpenMP threading layer is disabled in the Numba binary wheels on PyPI.
.. note::
The default manner in which Numba searches for and loads a threading layer
is tolerant of missing libraries, incompatible runtimes etc.
.. _numba-threading-layer-setting-mech:
Setting the threading layer
---------------------------
The threading layer is set via the environment variable
``NUMBA_THREADING_LAYER`` or through assignment to
``numba.config.THREADING_LAYER``. If the programmatic approach to setting the
threading layer is used it must occur logically before any Numba based
compilation for a parallel target has occurred. There are two approaches to
choosing a threading layer: the first is to select a threading layer that is
safe under various forms of parallel execution; the second is explicit
selection via the threading layer name (e.g. ``tbb``).
Setting the threading layer selection priority
----------------------------------------------
By default the threading layers are searched in the order of ``'tbb'``,
``'omp'``, then ``'workqueue'``. To change this search order whilst
maintaining the selection of a threading layer based on availability, the
environment variable :envvar:`NUMBA_THREADING_LAYER_PRIORITY` can be used.
Note that it can also be set via
:py:data:`numba.config.THREADING_LAYER_PRIORITY`.
Similar to :py:data:`numba.config.THREADING_LAYER`,
it must occur logically before any Numba based
compilation for a parallel target has occurred.
For example, to instruct Numba to choose ``omp`` first if available,
then ``tbb`` and so on, set the environment variable as
``NUMBA_THREADING_LAYER_PRIORITY="omp tbb workqueue"``.
Or programmatically,
``numba.config.THREADING_LAYER_PRIORITY = ["omp", "tbb", "workqueue"]``.
Selecting a threading layer for safe parallel execution
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Parallel execution is fundamentally derived from core Python libraries in four
forms (the first three also apply to code using parallel execution via other
means!):
* ``threads`` from the ``threading`` module.
* ``spawn`` ing processes from the ``multiprocessing`` module via ``spawn``
(default on Windows, only available in Python 3.4+ on Unix)
* ``fork`` ing processes from the ``multiprocessing`` module via ``fork``
(default on Unix).
* ``fork`` ing processes from the ``multiprocessing`` module through the use of
a ``forkserver`` (only available in Python 3 on Unix). Essentially a new
process is spawned and then forks are made from this new process on request.
Any library in use with these forms of parallelism must exhibit safe behaviour
under the given paradigm. As a result, the threading layer selection methods
are designed to provide a way to choose a threading layer library that is safe
for a given paradigm in an easy, cross platform and environment tolerant manner.
The options that can be supplied to the
:ref:`setting mechanisms <numba-threading-layer-setting-mech>` are as
follows:
* ``default`` provides no specific safety guarantee and is the default.
* ``safe`` is both fork and thread safe, this requires the ``tbb`` package
(Intel TBB libraries) to be installed.
* ``forksafe`` provides a fork safe library.
* ``threadsafe`` provides a thread safe library.
To discover the threading layer that was selected, the function
``numba.threading_layer()`` may be called after parallel execution. For example,
on a Linux machine with no TBB installed::
from numba import config, njit, threading_layer
import numpy as np
# set the threading layer before any parallel target compilation
config.THREADING_LAYER = 'threadsafe'
@njit(parallel=True)
def foo(a, b):
return a + b
x = np.arange(10.)
y = x.copy()
# this will force the compilation of the function, select a threading layer
# and then execute in parallel
foo(x, y)
# demonstrate the threading layer chosen
print("Threading layer chosen: %s" % threading_layer())
which produces::
Threading layer chosen: omp
and this makes sense as GNU OpenMP, as present on Linux, is thread safe.
Selecting a named threading layer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Advanced users may wish to select a specific threading layer for their use
case; this is done by directly supplying the threading layer name to the
:ref:`setting mechanisms <numba-threading-layer-setting-mech>`. The options
and requirements are as follows:
+----------------------+-----------+-------------------------------------------+
| Threading Layer Name | Platform | Requirements |
+======================+===========+===========================================+
| ``tbb`` | All | The ``tbb`` package (``$ conda install |
| | | tbb``) |
+----------------------+-----------+-------------------------------------------+
| ``omp`` | Linux | GNU OpenMP libraries (very likely this |
| | | will already exist) |
| | | |
| | Windows | MS OpenMP libraries (very likely this will|
| | | already exist) |
| | | |
| | OSX | Either the ``intel-openmp`` package or the|
| | | ``llvm-openmp`` package |
| | | (``conda install`` the package as named). |
+----------------------+-----------+-------------------------------------------+
| ``workqueue`` | All | None |
+----------------------+-----------+-------------------------------------------+
Should the threading layer not load correctly, Numba will detect this and provide
a hint about how to resolve the problem. It should also be noted that the Numba
diagnostic command ``numba -s`` has a section
``__Threading Layer Information__`` that reports on the availability of
threading layers in the current environment.
Extra notes
-----------
The threading layers have fairly complex interactions with CPython internals
and system level libraries; some additional things to note:
* The installation of Intel's TBB libraries vastly widens the options available
in the threading layer selection process.
* On Linux, the ``omp`` threading layer is not fork safe due to the GNU OpenMP
runtime library (``libgomp``) not being fork safe. If a fork occurs in a
program that is using the ``omp`` threading layer, a detection mechanism is
present that will try and gracefully terminate the forked child and print an
error message to ``STDERR``.
* On systems with the ``fork(2)`` system call available, if the TBB backed
threading layer is in use and a ``fork`` call is made from a thread other than
the thread that launched TBB (typically the main thread) then this results in
undefined behaviour and a warning will be displayed on ``STDERR``. As
``spawn`` is essentially ``fork`` followed by ``exec`` it is safe to ``spawn``
from a non-main thread, but as this cannot be differentiated from just a
``fork`` call the warning message will still be displayed.
* On OSX, the ``intel-openmp`` package is required to enable the OpenMP based
threading layer.
.. _setting_the_number_of_threads:
Setting the Number of Threads
-----------------------------
The number of threads used by numba is based on the number of CPU cores
available (see :obj:`numba.config.NUMBA_DEFAULT_NUM_THREADS`), but it can be
overridden with the :envvar:`NUMBA_NUM_THREADS` environment variable.
The total number of threads that numba launches is in the variable
:obj:`numba.config.NUMBA_NUM_THREADS`.
For some use cases, it may be desirable to set the number of threads to a
lower value, so that numba can be used with higher level parallelism.
The number of threads can be set dynamically at runtime using
:func:`numba.set_num_threads`. Note that :func:`~.set_num_threads` only allows
setting the number of threads to a smaller value than
:obj:`~.NUMBA_NUM_THREADS`. Numba always launches
:obj:`numba.config.NUMBA_NUM_THREADS` threads, but :func:`~.set_num_threads`
causes it to mask out unused threads so they aren't used in computations.
The current number of threads used by numba can be accessed with
:func:`numba.get_num_threads`. Both functions work inside of a jitted
function.
.. _numba-threading-layer-thread-masking:
Example of Limiting the Number of Threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this example, suppose the machine we are running on has 8 cores (so
:obj:`numba.config.NUMBA_NUM_THREADS` would be ``8``). Suppose we want to run
some code with ``@njit(parallel=True)``, but we also want to run our code
concurrently in 4 different processes. With the default number of threads,
each Python process would run 8 threads, for a total of 4*8 = 32 threads,
which is oversubscription for our 8 cores. We should rather limit each process
to 2 threads, so that the total will be 4*2 = 8, which matches our number of
physical cores.
There are two ways to do this. One is to set the :envvar:`NUMBA_NUM_THREADS`
environment variable to ``2``.
.. code:: bash
$ NUMBA_NUM_THREADS=2 python ourcode.py
However, there are two downsides to this approach:
1. :envvar:`NUMBA_NUM_THREADS` must be set before Numba is imported, and
ideally before Python is launched. As soon as Numba is imported the
environment variable is read and that number of threads is locked in as the
number of threads Numba launches.
2. If we want to later increase the number of threads used by the process, we
cannot. :envvar:`NUMBA_NUM_THREADS` sets the *maximum* number of threads
that are launched for a process. Calling :func:`~.set_num_threads()` with a
value greater than :obj:`numba.config.NUMBA_NUM_THREADS` results in an
error.
The advantage of this approach is that we can do it from outside of the
process without changing the code.
Another approach is to use the :func:`numba.set_num_threads` function in our code:
.. code:: python
from numba import njit, set_num_threads
@njit(parallel=True)
def func():
...
set_num_threads(2)
func()
If we call ``set_num_threads(2)`` before executing our parallel code, it has
the same effect as calling the process with ``NUMBA_NUM_THREADS=2``, in that
the parallel code will only execute on 2 threads. However, we can later call
``set_num_threads(8)`` to increase the number of threads back to the default
size. And we do not have to worry about setting it before Numba gets imported.
It only needs to be called before the parallel function is run.
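A minimal sketch of these two functions together (assuming the default thread
count on the machine is at least 2)::

    from numba import njit, get_num_threads, set_num_threads

    set_num_threads(2)

    @njit
    def threads_inside():
        # get_num_threads also works inside JIT compiled functions
        return get_num_threads()

    print(get_num_threads())    # 2
    print(threads_inside())     # 2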
.. _numba-threading-layer-thread-id:
Getting a Thread ID
-------------------
In some cases it may be beneficial to have access to a unique identifier for the
current thread that is executing a parallel region in Numba. For that purpose,
Numba provides the :func:`numba.get_thread_id` function. This function is the
counterpart of OpenMP's function ``omp_get_thread_num`` and returns an integer
between 0 (inclusive) and the number of configured threads as described above
(exclusive).
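A minimal sketch of its use inside a parallel region (the set of identifiers
observed depends on the configured thread count)::

    import numpy as np
    from numba import njit, prange, get_thread_id

    @njit(parallel=True)
    def which_thread(n):
        # record which thread executed each iteration
        ids = np.empty(n, dtype=np.int64)
        for i in prange(n):
            ids[i] = get_thread_id()
        return ids

    print(np.unique(which_thread(10000)))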
API Reference
~~~~~~~~~~~~~
.. py:data:: numba.config.NUMBA_NUM_THREADS
The total (maximum) number of threads launched by numba.
Defaults to :obj:`numba.config.NUMBA_DEFAULT_NUM_THREADS`, but can be
overridden with the :envvar:`NUMBA_NUM_THREADS` environment variable.
.. py:data:: numba.config.NUMBA_DEFAULT_NUM_THREADS
The number of usable CPU cores on the system (as determined by
``len(os.sched_getaffinity(0))``, if supported by the OS, or
``multiprocessing.cpu_count()`` if not).
This is the default value for :obj:`numba.config.NUMBA_NUM_THREADS` unless
the :envvar:`NUMBA_NUM_THREADS` environment variable is set.
.. autofunction:: numba.set_num_threads
.. autofunction:: numba.get_num_threads
.. autofunction:: numba.get_thread_id
.. _numba-troubleshooting:
========================
Troubleshooting and tips
========================
.. _what-to-compile:
What to compile
===============
The general recommendation is that you should only try to compile the
critical paths in your code. If you have a piece of performance-critical
computational code amongst some higher-level code, you may factor out
the performance-critical code into a separate function and compile the
separate function with Numba. Letting Numba focus on that small piece
of performance-critical code has several advantages:
* it reduces the risk of hitting unsupported features;
* it reduces the compilation times;
* it allows you to evolve the higher-level code which is outside of the
  compiled function much more easily.
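For example, a sketch of this pattern (the function names are illustrative
only)::

    import numpy as np
    from numba import njit

    @njit
    def kernel(a):
        # the small, performance-critical piece compiled by Numba
        total = 0.0
        for i in range(a.shape[0]):
            total += a[i] * a[i]
        return total

    def analyse(values):
        # higher-level orchestration stays as regular Python
        data = np.asarray(values, dtype=np.float64)
        return kernel(data)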
.. _code-doesnt-compile:
My code doesn't compile
=======================
There can be various reasons why Numba cannot compile your code, and raises
an error instead. One common reason is that your code relies on an
unsupported Python feature, especially in :term:`nopython mode`.
Please see the list of :ref:`pysupported`. If you find something that
is listed there and still fails compiling, please
:ref:`report a bug <report-numba-bugs>`.
When Numba tries to compile your code it first tries to work out the types of
all the variables in use; this is so it can generate a type-specific
implementation of your code that can be compiled down to machine code. A common
reason for Numba failing to compile (especially in :term:`nopython mode`) is a
type inference failure; essentially Numba cannot work out what the types of all
the variables in your code should be.
For example, let's consider this trivial function::
@jit(nopython=True)
def f(x, y):
return x + y
If you call it with two numbers, Numba is able to infer the types properly::
>>> f(1, 2)
3
If however you call it with a tuple and a number, Numba is unable to say
what the result of adding a tuple and number is, and therefore compilation
errors out::
>>> f(1, (2,))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<path>/numba/numba/dispatcher.py", line 339, in _compile_for_args
reraise(type(e), e, None)
File "<path>/numba/numba/six.py", line 658, in reraise
raise value.with_traceback(tb)
numba.errors.TypingError: Failed at nopython (nopython frontend)
Invalid use of + with parameters (int64, tuple(int64 x 1))
Known signatures:
* (int64, int64) -> int64
* (int64, uint64) -> int64
* (uint64, int64) -> int64
* (uint64, uint64) -> uint64
* (float32, float32) -> float32
* (float64, float64) -> float64
* (complex64, complex64) -> complex64
* (complex128, complex128) -> complex128
* (uint16,) -> uint64
* (uint8,) -> uint64
* (uint64,) -> uint64
* (uint32,) -> uint64
* (int16,) -> int64
* (int64,) -> int64
* (int8,) -> int64
* (int32,) -> int64
* (float32,) -> float32
* (float64,) -> float64
* (complex64,) -> complex64
* (complex128,) -> complex128
* parameterized
[1] During: typing of intrinsic-call at <stdin> (3)
File "<stdin>", line 3:
The error message helps you find out what went wrong:
"Invalid use of + with parameters (int64, tuple(int64 x 1))" is to be
interpreted as "Numba encountered an addition of variables typed as integer
and 1-tuple of integer, respectively, and doesn't know about any such
operation".
Note that if you allow object mode::
@jit
def g(x, y):
return x + y
compilation will succeed and the compiled function will raise at runtime as
Python would do::
>>> g(1, (2,))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'int' and 'tuple'
My code has a type unification problem
======================================
Another common reason for Numba not being able to compile your code is that it
cannot statically determine the return type of a function. The most likely
cause of this is the return type depending on a value that is available only at
runtime. Again, this is most often problematic when using
:term:`nopython mode`. The concept of type unification is simply trying to find
a type in which two variables could safely be represented. For example a 64 bit
float and a 64 bit complex number could both be represented in a 128 bit complex
number.
As an example of type unification failure, this function has a return type that
is determined at runtime based on the value of `x`::
In [1]: from numba import jit
In [2]: @jit(nopython=True)
...: def f(x):
...: if x > 10:
...: return (1,)
...: else:
...: return 1
...:
In [3]: f(10)
Trying to execute this function, errors out as follows::
TypingError: Failed at nopython (nopython frontend)
Can't unify return type from the following types: tuple(int64 x 1), int64
Return of: IR name '$8.2', type '(int64 x 1)', location:
File "<ipython-input-2-51ef1cc64bea>", line 4:
def f(x):
<source elided>
if x > 10:
return (1,)
^
Return of: IR name '$12.2', type 'int64', location:
File "<ipython-input-2-51ef1cc64bea>", line 6:
def f(x):
<source elided>
else:
return 1
The error message "Can't unify return type from the following types:
tuple(int64 x 1), int64" should be read as "Numba cannot find a type that
can safely represent a 1-tuple of integer and an integer".
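One hedged sketch of a fix is to make every return statement produce the same
type, for example::

    from numba import jit

    @jit(nopython=True)
    def f(x):
        if x > 10:
            return (1,)
        else:
            return (0,)   # both branches now return a 1-tuple of integers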
.. _code-has-untyped-list:
My code has an untyped list problem
===================================
As :ref:`noted previously <code-doesnt-compile>` the first part of Numba
compiling your code involves working out what the types of all the variables
are. In the case of lists, a list must contain items that are of the same type
or can be empty if the type can be inferred from some later operation. What is
not possible is to have a list which is defined as empty and has no inferable
type (i.e. an untyped list).
For example, this is using a list of a known type::
from numba import jit
@jit(nopython=True)
def f():
return [1, 2, 3] # this list is defined on construction with `int` type
This is using an empty list, but the type can be inferred::
from numba import jit
@jit(nopython=True)
def f(x):
tmp = [] # defined empty
for i in range(x):
tmp.append(i) # list type can be inferred from the type of `i`
return tmp
This is using an empty list and the type cannot be inferred::
from numba import jit
@jit(nopython=True)
def f(x):
tmp = [] # defined empty
return (tmp, x) # ERROR: the type of `tmp` is unknown
Whilst slightly contrived, if you need an empty list and the type cannot be
inferred but you know what type you want the list to be, this "trick" can be
used to instruct the typing mechanism::
from numba import jit
import numpy as np
@jit(nopython=True)
def f(x):
# define empty list, but instruct that the type is np.complex64
tmp = [np.complex64(x) for x in range(0)]
return (tmp, x) # the type of `tmp` is known, but it is still empty
The compiled code is too slow
=============================
The most common reason for slowness of a compiled JIT function is that
compiling in :term:`nopython mode` has failed and the Numba compiler has
fallen back to :term:`object mode`. :term:`object mode` currently provides
little to no speedup compared to regular Python interpretation, and its
main point is to allow an internal optimization known as
:term:`loop-lifting`: this optimization allows inner loops to be compiled in
:term:`nopython mode` regardless of what code surrounds those inner loops.
To find out if type inference succeeded on your function, you can use
the :meth:`~Dispatcher.inspect_types` method on the compiled function.
For example, let's take the following function::
@jit
def f(a, b):
s = a + float(b)
return s
When called with numbers, this function should be fast as Numba is able
to convert number types to floating-point numbers. Let's see::
>>> f(1, 2)
3.0
>>> f.inspect_types()
f (int64, int64)
--------------------------------------------------------------------------------
# --- LINE 7 ---
@jit
# --- LINE 8 ---
def f(a, b):
# --- LINE 9 ---
# label 0
# a.1 = a :: int64
# del a
# b.1 = b :: int64
# del b
# $0.2 = global(float: <class 'float'>) :: Function(<class 'float'>)
# $0.4 = call $0.2(b.1, ) :: (int64,) -> float64
# del b.1
# del $0.2
# $0.5 = a.1 + $0.4 :: float64
# del a.1
# del $0.4
# s = $0.5 :: float64
# del $0.5
s = a + float(b)
# --- LINE 10 ---
# $0.7 = cast(value=s) :: float64
# del s
# return $0.7
return s
Without trying to understand too much of the Numba intermediate representation,
it is still visible that all variables and temporary values have had their
types inferred properly: for example *a* has the type ``int64``, *$0.5* has
the type ``float64``, etc.
However, if *b* is passed as a string, compilation will fall back on object
mode as the float() constructor with a string is currently not supported
by Numba::
>>> f(1, "2")
3.0
>>> f.inspect_types()
[... snip annotations for other signatures, see above ...]
================================================================================
f (int64, str)
--------------------------------------------------------------------------------
# --- LINE 7 ---
@jit
# --- LINE 8 ---
def f(a, b):
# --- LINE 9 ---
# label 0
# a.1 = a :: pyobject
# del a
# b.1 = b :: pyobject
# del b
# $0.2 = global(float: <class 'float'>) :: pyobject
# $0.4 = call $0.2(b.1, ) :: pyobject
# del b.1
# del $0.2
# $0.5 = a.1 + $0.4 :: pyobject
# del a.1
# del $0.4
# s = $0.5 :: pyobject
# del $0.5
s = a + float(b)
# --- LINE 10 ---
# $0.7 = cast(value=s) :: pyobject
# del s
# return $0.7
return s
Here we see that all variables end up typed as ``pyobject``. This means
that the function was compiled in object mode and values are passed
around as generic Python objects, without Numba trying to look into them
to reason about their raw values. This is a situation you want to avoid
when caring about the speed of your code.
If a function fails to compile in ``nopython`` mode, warnings will be emitted
with an explanation of why compilation failed. For example, with the ``f()``
function above (slightly edited for documentation purposes)::
>>> f(1, 2)
3.0
>>> f(1, "2")
example.py:7: NumbaWarning:
Compilation is falling back to object mode WITH looplifting enabled because Function "f" failed type inference due to: Invalid use of Function(<class 'float'>) with argument(s) of type(s): (unicode_type)
* parameterized
In definition 0:
TypeError: float() only support for numbers
raised from <path>/numba/typing/builtins.py:880
In definition 1:
TypeError: float() only support for numbers
raised from <path>/numba/typing/builtins.py:880
This error is usually caused by passing an argument of a type that is unsupported by the named function.
[1] During: resolving callee type: Function(<class 'float'>)
[2] During: typing of call at example.py (9)
File "example.py", line 9:
def f(a, b):
s = a + float(b)
^
<path>/numba/compiler.py:722: NumbaWarning: Function "f" was compiled in object mode without forceobj=True.
File "example.py", line 8:
@jit
def f(a, b):
^
3.0
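If such a fall-back is not acceptable, compiling with ``nopython=True`` (or the
equivalent ``@njit`` alias) turns the failure into a hard error: a
``TypingError`` is raised instead of slow object-mode code being produced. A
minimal sketch based on the example above:

.. code-block:: python

   from numba import njit

   @njit  # same as @jit(nopython=True): no object-mode fall-back
   def f(a, b):
       return a + float(b)

   f(1, 2)    # fine, compiled in nopython mode
   f(1, "2")  # raises TypingError rather than silently using object mode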
Disabling JIT compilation
=========================
In order to debug code, it is possible to disable JIT compilation, which makes
the ``jit`` decorator (and the ``njit`` decorator) act as if
they perform no operation, and the invocation of decorated functions calls the
original Python function instead of a compiled version. This can be toggled by
setting the :envvar:`NUMBA_DISABLE_JIT` environment variable to ``1``.
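For example, an existing script can be run under plain Python semantics without
modifying its source (``myscript.py`` below is a placeholder for your own
program):

.. code-block:: none

   $ NUMBA_DISABLE_JIT=1 python myscript.py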
When this mode is enabled, the ``vectorize`` and ``guvectorize`` decorators will
still result in compilation of a ufunc, as there is no straightforward pure
Python implementation of these functions.
.. _debugging-jit-compiled-code:
Debugging JIT compiled code with GDB
====================================
Setting the ``debug`` keyword argument in the ``jit`` decorator
(e.g. ``@jit(debug=True)``) enables the emission of debug info in the jitted
code. To debug, GDB version 7.0 or above is required. Currently, the following
debug info is available:
* Function name will be shown in the backtrace along with type information and
values (if available).
* Source location (filename and line number) is available. For example,
users can set a breakpoint by the absolute filename and line number;
e.g. ``break /path/to/myfile.py:6``.
* Arguments to the current function can be shown with ``info args``.
* Local variables in the current function can be shown with ``info locals``.
* The type of variables can be shown with ``whatis myvar``.
* The value of variables can be shown with ``print myvar`` or ``display myvar``.
* Simple numeric types, i.e. int, float and double, are shown in their
native representation.
* Other types are shown as a structure based on Numba's memory model
representation of the type.
Further, the Numba ``gdb`` printing extension can be loaded into ``gdb`` (if the
``gdb`` has Python support) to permit the printing of variables as they would be
in native Python. The extension does this by reinterpreting Numba's memory model
representations as Python types. Information about the ``gdb`` installation that
Numba is using, including the path to load the ``gdb`` printing extension, can
be displayed by using the ``numba -g`` command. For best results ensure that the
Python that ``gdb`` is using has a NumPy module accessible. An example output
of the ``gdb`` information follows:
.. code-block:: none
:emphasize-lines: 1
$ numba -g
GDB info:
--------------------------------------------------------------------------------
Binary location : <some path>/gdb
Print extension location : <some python path>/numba/misc/gdb_print_extension.py
Python version : 3.8
NumPy version : 1.20.0
Numba printing extension supported : True
To load the Numba gdb printing extension, execute the following from the gdb prompt:
source <some python path>/numba/misc/gdb_print_extension.py
--------------------------------------------------------------------------------
Known issues:
* Stepping depends heavily on optimization level. At full optimization
(equivalent to O3), most of the variables are optimized out. It is often
beneficial to use the jit option ``_dbg_optnone=True``
or the environment variable :envvar:`NUMBA_OPT` to adjust the
optimization level and the jit option ``_dbg_extend_lifetimes=True``
(which is on by default if ``debug=True``) or
:envvar:`NUMBA_EXTEND_VARIABLE_LIFETIMES` to extend
the lifetime of variables to the end of their scope so as to get a debugging
experience closer to the semantics of Python execution.
* Memory consumption increases significantly with debug info enabled.
The compiler emits extra information (`DWARF <http://www.dwarfstd.org/>`_)
along with the instructions. The emitted object code can be 2x bigger with
debug info.
Internal details:
* Since Python semantics allow variables to bind to values of different types,
Numba internally creates multiple versions of the variable for each type.
So for code like::
x = 1 # type int
x = 2.3 # type float
x = (1, 2, 3) # type 3-tuple of int
Each assignment stores to a different variable name. In the debugger,
the variables will be ``x``, ``x$1`` and ``x$2``. (In the Numba IR, they are
``x``, ``x.1`` and ``x.2``.)
* When debug is enabled, inlining of functions at LLVM IR level is disabled.
JIT options for debug
---------------------
* ``debug`` (bool). Set to ``True`` to enable debug info. Defaults to ``False``.
* ``_dbg_optnone`` (bool). Set to ``True`` to disable all LLVM optimization passes
on the function. Defaults to ``False``. See :envvar:`NUMBA_OPT` for a global setting
to disable optimization.
* ``_dbg_extend_lifetimes`` (bool). Set to ``True`` to extend the lifetime of
objects such that they more closely follow the semantics of Python.
Automatically set to ``True`` when
``debug=True``; otherwise, defaults to ``False``. Users can explicitly set this option
to ``False`` to retain the normal execution semantics of compiled code.
See :envvar:`NUMBA_EXTEND_VARIABLE_LIFETIMES` for a global option to extend object
lifetimes.
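As an illustrative sketch combining the options above (this is not additional
API, just the options described in this section):

.. code-block:: python

   from numba import njit

   # debug=True already implies _dbg_extend_lifetimes=True; _dbg_optnone=True
   # additionally disables LLVM optimisation so variables remain observable.
   @njit(debug=True, _dbg_optnone=True)
   def foo(a):
       b = a + 1
       return b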
Example debug usage
-------------------
The Python source:
.. code-block:: python
:linenos:
from numba import njit
@njit(debug=True)
def foo(a):
b = a + 1
c = a * 2.34
d = (a, b, c)
print(a, b, c, d)
r = foo(123)
print(r)
In the terminal:
.. code-block:: none
:emphasize-lines: 1, 3, 7, 12, 14, 16, 20, 22, 26, 28, 30, 32, 34, 36
$ NUMBA_OPT=0 NUMBA_EXTEND_VARIABLE_LIFETIMES=1 gdb -q python
Reading symbols from python...
(gdb) break test1.py:5
No source file named test1.py.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (test1.py:5) pending.
(gdb) run test1.py
Starting program: <path>/bin/python test1.py
...
Breakpoint 1, __main__::foo_241[abi:c8tJTC_2fWgEeGLSgydRTQUgiqKEZ6gEoDvQJmaQIA](long long) (a=123) at test1.py:5
5 b = a + 1
(gdb) info args
a = 123
(gdb) n
6 c = a * 2.34
(gdb) info locals
b = 124
c = 0
d = {f0 = 0, f1 = 0, f2 = 0}
(gdb) n
7 d = (a, b, c)
(gdb) info locals
b = 124
c = 287.81999999999999
d = {f0 = 0, f1 = 0, f2 = 0}
(gdb) whatis b
type = int64
(gdb) whatis d
type = Tuple(int64, int64, float64) ({i64, i64, double})
(gdb) n
8 print(a, b, c, d)
(gdb) print b
$1 = 124
(gdb) print d
$2 = {f0 = 123, f1 = 124, f2 = 287.81999999999999}
(gdb) bt
#0 __main__::foo_241[abi:c8tJTC_2fWgEeGLSgydRTQUgiqKEZ6gEoDvQJmaQIA](long long) (a=123) at test1.py:8
#1 0x00007ffff06439fa in cpython::__main__::foo_241[abi:c8tJTC_2fWgEeGLSgydRTQUgiqKEZ6gEoDvQJmaQIA](long long) ()
Another example follows that makes use of the Numba ``gdb`` printing extension
mentioned above; note the change in the print format once the extension is
loaded with ``source``:
The Python source:
.. code-block:: python
:linenos:
from numba import njit
import numpy as np
@njit(debug=True)
def foo(n):
x = np.arange(n)
y = (x[0], x[-1])
return x, y
foo(4)
In the terminal:
.. code-block:: none
:emphasize-lines: 1, 3, 4, 7, 12, 14, 16, 17, 20
$ NUMBA_OPT=0 NUMBA_EXTEND_VARIABLE_LIFETIMES=1 gdb -q python
Reading symbols from python...
(gdb) set breakpoint pending on
(gdb) break test2.py:8
No source file named test2.py.
Breakpoint 1 (test2.py:8) pending.
(gdb) run test2.py
Starting program: <path>/bin/python test2.py
...
Breakpoint 1, __main__::foo_241[abi:c8tJTC_2fWgEeGLSgydRTQUgiqKEZ6gEoDvQJmaQIA](long long) (n=4) at test2.py:8
8 return x, y
(gdb) print x
$1 = {meminfo = 0x55555688f470 "\001", parent = 0x0, nitems = 4, itemsize = 8, data = 0x55555688f4a0, shape = {4}, strides = {8}}
(gdb) print y
$2 = {0, 3}
(gdb) source numba/misc/gdb_print_extension.py
(gdb) print x
$3 =
[0 1 2 3]
(gdb) print y
$4 = (0, 3)
Globally override debug setting
-------------------------------
It is possible to enable debug for the full application by setting the
environment variable ``NUMBA_DEBUGINFO=1``. This sets the default value of the
``debug`` option in ``jit``. Debug can be turned off on individual functions by
setting ``debug=False``.
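For example, to run a whole program with debug info enabled globally
(``myapp.py`` is a placeholder for your own program):

.. code-block:: none

   $ NUMBA_DEBUGINFO=1 python myapp.py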
Beware that enabling debug info significantly increases the memory consumption
for each compiled function. For large applications, this may cause
out-of-memory errors.
Using Numba's direct ``gdb`` bindings in ``nopython`` mode
===========================================================
Numba (version 0.42.0 and later) has some additional functions relating to
``gdb`` support for CPUs that make it easier to debug programs. All the ``gdb``
related functions described in the following work in the same manner
irrespective of whether they are called from the standard CPython interpreter or
code compiled in either :term:`nopython mode` or :term:`object mode`.
.. note:: This feature is experimental!
.. warning:: This feature does unexpected things if used from Jupyter or
alongside the ``pdb`` module. Its behaviour is harmless, just hard
to predict!
Set up
------
Numba's ``gdb`` related functions make use of a ``gdb`` binary, the location and
name of this binary can be configured via the :envvar:`NUMBA_GDB_BINARY`
environment variable if desired.
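For example, to point Numba at a specific ``gdb`` build for a single run (the
path shown is illustrative):

.. code-block:: none

   $ NUMBA_GDB_BINARY=/usr/local/bin/gdb python myscript.py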
.. note:: Numba's ``gdb`` support requires the ability for ``gdb`` to attach to
another process. On some systems (notably Ubuntu Linux) default
security restrictions placed on ``ptrace`` prevent this from being
possible. This restriction is enforced at the system level by the
Linux security module `Yama`. Documentation for this module and the
security implications of making changes to its behaviour can be found
in the `Linux Kernel documentation <https://www.kernel.org/doc/Documentation/admin-guide/LSM/Yama.rst>`_.
The `Ubuntu Linux security documentation <https://wiki.ubuntu.com/Security/Features#ptrace>`_
discusses how to adjust the behaviour of `Yama` with regard to
``ptrace_scope`` so as to permit the required behaviour.
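As a sketch, on such systems the restriction can usually be relaxed until the
next reboot via the ``kernel.yama.ptrace_scope`` sysctl; review the security
documentation linked above before doing so:

.. code-block:: none

   $ sudo sysctl -w kernel.yama.ptrace_scope=0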
Basic ``gdb`` support
---------------------
.. warning:: Calling :func:`numba.gdb` and/or :func:`numba.gdb_init` more than
once in the same program is not advisable, unexpected things may
happen. If multiple breakpoints are desired within a program,
launch ``gdb`` once via :func:`numba.gdb` or :func:`numba.gdb_init`
and then use :func:`numba.gdb_breakpoint` to register additional
breakpoint locations.
The simplest function for adding ``gdb`` support is :func:`numba.gdb`, which,
at the call location, will:
* launch ``gdb`` and attach it to the running process.
* create a breakpoint at the site of the :func:`numba.gdb()` function call; the
attached ``gdb`` will pause execution here awaiting user input.
Use of this functionality is best motivated by example, continuing with the
example used above:
.. code-block:: python
:linenos:
from numba import njit, gdb
@njit(debug=True)
def foo(a):
b = a + 1
gdb() # instruct Numba to attach gdb at this location and pause execution
c = a * 2.34
d = (a, b, c)
print(a, b, c, d)
r= foo(123)
print(r)
In the terminal (``...`` on a line by itself indicates output that is not
presented for brevity):
.. code-block:: none
:emphasize-lines: 1, 4, 8, 13, 24, 26, 28, 30, 32, 37
$ NUMBA_OPT=0 NUMBA_EXTEND_VARIABLE_LIFETIMES=1 python demo_gdb.py
...
Breakpoint 1, 0x00007fb75238d830 in numba_gdb_breakpoint () from numba/_helperlib.cpython-39-x86_64-linux-gnu.so
(gdb) s
Single stepping until exit from function numba_gdb_breakpoint,
which has no line number information.
0x00007fb75233e1cf in numba::misc::gdb_hook::hook_gdb::_3clocals_3e::impl_242[abi:c8tJTIeFCjyCbUFRqqOAK_2f6h0phxApMogijRBAA_3d](StarArgTuple) ()
(gdb) s
Single stepping until exit from function _ZN5numba4misc8gdb_hook8hook_gdb12_3clocals_3e8impl_242B44c8tJTIeFCjyCbUFRqqOAK_2f6h0phxApMogijRBAA_3dE12StarArgTuple,
which has no line number information.
__main__::foo_241[abi:c8tJTC_2fWgEeGLSgydRTQUgiqKEZ6gEoDvQJmaQIA](long long) (a=123) at demo_gdb.py:7
7 c = a * 2.34
(gdb) l
2
3 @njit(debug=True)
4 def foo(a):
5 b = a + 1
6 gdb() # instruct Numba to attach gdb at this location and pause execution
7 c = a * 2.34
8 d = (a, b, c)
9 print(a, b, c, d)
10
11 r= foo(123)
(gdb) p a
$1 = 123
(gdb) p b
$2 = 124
(gdb) p c
$3 = 0
(gdb) b 9
Breakpoint 2 at 0x7fb73d1f7287: file demo_gdb.py, line 9.
(gdb) c
Continuing.
Breakpoint 2, __main__::foo_241[abi:c8tJTC_2fWgEeGLSgydRTQUgiqKEZ6gEoDvQJmaQIA](long long) (a=123) at demo_gdb.py:9
9 print(a, b, c, d)
(gdb) info locals
b = 124
c = 287.81999999999999
d = {f0 = 123, f1 = 124, f2 = 287.81999999999999}
It can be seen in the above example that execution of the code is paused at the
location of the ``gdb()`` function call, at the end of the ``numba_gdb_breakpoint``
function (this is the Numba internal symbol registered as a breakpoint with
``gdb``). Issuing a ``step`` twice at this point moves to the stack frame of the
compiled Python source. From there, it can be seen that the variables ``a`` and
``b`` have been evaluated but ``c`` has not, as demonstrated by printing their
values; this is precisely as expected given the location of the ``gdb()`` call.
Issuing a ``break`` on line 9 and then continuing execution leads to the
evaluation of line ``7``. The variable ``c`` is assigned a value as a result of
this execution, and that can be seen in the output of ``info locals`` when the
breakpoint is hit.
Running with ``gdb`` enabled
----------------------------
The functionality provided by :func:`numba.gdb` (launch and attach ``gdb`` to
the executing process and pause on a breakpoint) is also available as two
separate functions:
* :func:`numba.gdb_init` this function injects code at the call site to launch
and attach ``gdb`` to the executing process but does not pause execution.
* :func:`numba.gdb_breakpoint` this function injects code at the call site that
will call the special ``numba_gdb_breakpoint`` function that is registered as
a breakpoint in Numba's ``gdb`` support. This is demonstrated in the next
section.
This functionality enables more complex debugging capabilities. Again, motivated
by example, debugging a 'segfault' (memory access violation signalling
``SIGSEGV``):
.. code-block:: python
:linenos:
from numba import njit, gdb_init
import numpy as np
# NOTE debug=True switches bounds-checking on, but for the purposes of this
# example it is explicitly turned off so that the out of bounds index is
# not caught!
@njit(debug=True, boundscheck=False)
def foo(a, index):
gdb_init() # instruct Numba to attach gdb at this location, but not to pause execution
b = a + 1
c = a * 2.34
d = c[index] # access an address that is a) invalid b) out of the page
print(a, b, c, d)
bad_index = int(1e9) # this index is invalid
z = np.arange(10)
r = foo(z, bad_index)
print(r)
In the terminal (``...`` on a line by itself indicates output that is not
presented for brevity):
.. code-block:: none
:emphasize-lines: 1, 6, 8, 10, 12
$ NUMBA_OPT=0 python demo_gdb_segfault.py
...
Program received signal SIGSEGV, Segmentation fault.
0x00007f5a4ca655eb in __main__::foo_241[abi:c8tJTC_2fWgEeGLSgydRTQUgiqKEZ6gEoDvQJmaQIA](Array<long long, 1, C, mutable, aligned>, long long) (a=..., index=1000000000) at demo_gdb_segfault.py:12
12 d = c[index] # access an address that is a) invalid b) out of the page
(gdb) p index
$1 = 1000000000
(gdb) p c
$2 = {meminfo = 0x5586cfb95830 "\001", parent = 0x0, nitems = 10, itemsize = 8, data = 0x5586cfb95860, shape = {10}, strides = {8}}
(gdb) whatis c
type = array(float64, 1d, C) ({i8*, i8*, i64, i64, double*, [1 x i64], [1 x i64]})
(gdb) p c.nitems
$3 = 10
In the ``gdb`` output it can be noted that a ``SIGSEGV`` signal was caught, and
the line in which the access violation occurred is printed.
Continuing the example as a debugging session demonstration, first ``index``
can be printed, and it is evidently 1e9. Printing ``c`` shows that it is a
structure, so its type needs looking up; ``whatis`` shows that it is an
``array(float64, 1d, C)`` type. Given the segfault came from an invalid access,
it would be informative to check the number of items in the array and compare
that to the index requested. Inspecting the ``nitems`` member of the structure
``c`` shows 10 items. It's therefore clear that the segfault comes from an
invalid access of index ``1000000000`` in an array containing ``10`` items.
Adding breakpoints to code
--------------------------
The next example demonstrates using multiple breakpoints that are defined
through the invocation of the :func:`numba.gdb_breakpoint` function:
.. code-block:: python
:linenos:
from numba import njit, gdb_init, gdb_breakpoint
@njit(debug=True)
def foo(a):
gdb_init() # instruct Numba to attach gdb at this location
b = a + 1
gdb_breakpoint() # instruct gdb to break at this location
c = a * 2.34
d = (a, b, c)
gdb_breakpoint() # and to break again at this location
print(a, b, c, d)
r= foo(123)
print(r)
In the terminal (``...`` on a line by itself indicates output that is not
presented for brevity):
.. code-block:: none
:emphasize-lines: 1, 4, 9, 20, 22, 24, 29, 31
$ NUMBA_OPT=0 python demo_gdb_breakpoints.py
...
Breakpoint 1, 0x00007fb65bb4c830 in numba_gdb_breakpoint () from numba/_helperlib.cpython-39-x86_64-linux-gnu.so
(gdb) step
Single stepping until exit from function numba_gdb_breakpoint,
which has no line number information.
__main__::foo_241[abi:c8tJTC_2fWgEeGLSgydRTQUgiqKEZ6gEoDvQJmaQIA](long long) (a=123) at demo_gdb_breakpoints.py:8
8 c = a * 2.34
(gdb) l
3 @njit(debug=True)
4 def foo(a):
5 gdb_init() # instruct Numba to attach gdb at this location
6 b = a + 1
7 gdb_breakpoint() # instruct gdb to break at this location
8 c = a * 2.34
9 d = (a, b, c)
10 gdb_breakpoint() # and to break again at this location
11 print(a, b, c, d)
12
(gdb) p b
$1 = 124
(gdb) p c
$2 = 0
(gdb) c
Continuing.
Breakpoint 1, 0x00007fb65bb4c830 in numba_gdb_breakpoint ()
from numba/_helperlib.cpython-39-x86_64-linux-gnu.so
(gdb) step
11 print(a, b, c, d)
(gdb) p c
$3 = 287.81999999999999
From the ``gdb`` output it can be seen that execution paused at line 8 as a
breakpoint was hit, and after a ``continue`` was issued, it broke again at line
11 where the next breakpoint was hit.
Debugging in parallel regions
-----------------------------
The following example is quite involved: it executes with ``gdb``
instrumentation from the outset as per the example above, but it also uses
threads and makes use of the breakpoint functionality. Further, the last
iteration of the parallel section calls the function ``work``, which is
actually just a binding to ``glibc``'s ``free(3)`` in this case, but could
equally be some involved function that is presenting a segfault for unknown
reasons.
.. code-block:: python
:linenos:
from numba import njit, prange, gdb_init, gdb_breakpoint
import ctypes
def get_free():
lib = ctypes.cdll.LoadLibrary('libc.so.6')
free_binding = lib.free
free_binding.argtypes = [ctypes.c_void_p,]
free_binding.restype = None
return free_binding
work = get_free()
@njit(debug=True, parallel=True)
def foo():
gdb_init() # instruct Numba to attach gdb at this location, but not to pause execution
counter = 0
n = 9
for i in prange(n):
if i > 3 and i < 8: # iterations 4, 5, 6, 7 will break here
gdb_breakpoint()
if i == 8: # last iteration segfaults
work(0xBADADD)
counter += 1
return counter
r = foo()
print(r)
In the terminal (``...`` on a line by itself indicates output that is not
presented for brevity), note the setting of ``NUMBA_NUM_THREADS`` to 4 to ensure
that there are 4 threads running in the parallel section:
.. code-block:: none
:emphasize-lines: 1, 19, 29, 44, 50, 56, 62, 69
$ NUMBA_NUM_THREADS=4 NUMBA_OPT=0 python demo_gdb_threads.py
Attaching to PID: 21462
...
Attaching to process 21462
[New LWP 21467]
[New LWP 21468]
[New LWP 21469]
[New LWP 21470]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f59ec31756d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
81 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 1 at 0x7f59d631e8f0: file numba/_helperlib.c, line 1090.
Continuing.
[Switching to Thread 0x7f59d1fd1700 (LWP 21470)]
Thread 5 "python" hit Breakpoint 1, numba_gdb_breakpoint () at numba/_helperlib.c:1090
1090 }
(gdb) info threads
Id Target Id Frame
1 Thread 0x7f59eca2f740 (LWP 21462) "python" pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
2 Thread 0x7f59d37d4700 (LWP 21467) "python" pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
3 Thread 0x7f59d2fd3700 (LWP 21468) "python" pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
4 Thread 0x7f59d27d2700 (LWP 21469) "python" numba_gdb_breakpoint () at numba/_helperlib.c:1090
* 5 Thread 0x7f59d1fd1700 (LWP 21470) "python" numba_gdb_breakpoint () at numba/_helperlib.c:1090
(gdb) thread apply 2-5 info locals
Thread 2 (Thread 0x7f59d37d4700 (LWP 21467)):
No locals.
Thread 3 (Thread 0x7f59d2fd3700 (LWP 21468)):
No locals.
Thread 4 (Thread 0x7f59d27d2700 (LWP 21469)):
No locals.
Thread 5 (Thread 0x7f59d1fd1700 (LWP 21470)):
sched$35 = '\000' <repeats 55 times>
counter__arr = '\000' <repeats 16 times>, "\001\000\000\000\000\000\000\000\b\000\000\000\000\000\000\000\370B]\"hU\000\000\001", '\000' <repeats 14 times>
counter = 0
(gdb) continue
Continuing.
[Switching to Thread 0x7f59d27d2700 (LWP 21469)]
Thread 4 "python" hit Breakpoint 1, numba_gdb_breakpoint () at numba/_helperlib.c:1090
1090 }
(gdb) continue
Continuing.
[Switching to Thread 0x7f59d1fd1700 (LWP 21470)]
Thread 5 "python" hit Breakpoint 1, numba_gdb_breakpoint () at numba/_helperlib.c:1090
1090 }
(gdb) continue
Continuing.
[Switching to Thread 0x7f59d27d2700 (LWP 21469)]
Thread 4 "python" hit Breakpoint 1, numba_gdb_breakpoint () at numba/_helperlib.c:1090
1090 }
(gdb) continue
Continuing.
Thread 5 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f59d1fd1700 (LWP 21470)]
__GI___libc_free (mem=0xbadadd) at malloc.c:2935
2935 if (chunk_is_mmapped(p)) /* release mmapped memory. */
(gdb) bt
#0 __GI___libc_free (mem=0xbadadd) at malloc.c:2935
#1 0x00007f59d37ded84 in $3cdynamic$3e::__numba_parfor_gufunc__0x7ffff80a61ae3e31$244(Array<unsigned long long, 1, C, mutable, aligned>, Array<long long, 1, C, mutable, aligned>) () at <string>:24
#2 0x00007f59d17ce326 in __gufunc__._ZN13$3cdynamic$3e45__numba_parfor_gufunc__0x7ffff80a61ae3e31$244E5ArrayIyLi1E1C7mutable7alignedE5ArrayIxLi1E1C7mutable7alignedE ()
#3 0x00007f59d37d7320 in thread_worker ()
from <path>/numba/numba/npyufunc/workqueue.cpython-37m-x86_64-linux-gnu.so
#4 0x00007f59ec626e25 in start_thread (arg=0x7f59d1fd1700) at pthread_create.c:308
#5 0x00007f59ec350bad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
In the output it can be seen that 4 threads are launched and that they all
break at the breakpoint; further, ``Thread 5`` receives a ``SIGSEGV`` signal,
and backtracing shows that it came from ``__GI___libc_free`` with the
invalid address in ``mem``, as expected.
Using the ``gdb`` command language
----------------------------------
Both the :func:`numba.gdb` and :func:`numba.gdb_init` functions accept unlimited
string arguments which will be passed directly to ``gdb`` as command line
arguments when it initializes. This makes it easy to set breakpoints on other
functions and perform repeated debugging tasks without having to manually type
them every time. For example, this code runs with ``gdb`` attached and sets a
breakpoint on ``dgesdd_`` (say, for example, the arguments passed to LAPACK's
double precision divide-and-conquer SVD function need debugging).
.. code-block:: python
:linenos:
from numba import njit, gdb
import numpy as np
@njit(debug=True)
def foo(a):
# instruct Numba to attach gdb at this location and on launch, switch
# breakpoint pending on, and then set a breakpoint on the function
# dgesdd_, continue execution, and once the breakpoint is hit, backtrace
gdb('-ex', 'set breakpoint pending on',
'-ex', 'b dgesdd_',
'-ex','c',
'-ex','bt')
b = a + 10
u, s, vh = np.linalg.svd(b)
return s # just return singular values
z = np.arange(70.).reshape(10, 7)
r = foo(z)
print(r)
In the terminal (``...`` on a line by itself indicates output that is not
presented for brevity), note that no interaction is required to break and
backtrace:
.. code-block:: none
:emphasize-lines: 1
$ NUMBA_OPT=0 python demo_gdb_args.py
Attaching to PID: 22300
GNU gdb (GDB) Red Hat Enterprise Linux 8.0.1-36.el7
...
Attaching to process 22300
Reading symbols from <py_env>/bin/python3.7...done.
0x00007f652305a550 in __nanosleep_nocancel () at ../sysdeps/unix/syscall-template.S:81
81 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 1 at 0x7f650d0618f0: file numba/_helperlib.c, line 1090.
Continuing.
Breakpoint 1, numba_gdb_breakpoint () at numba/_helperlib.c:1090
1090 }
Breakpoint 2 at 0x7f65102322e0 (2 locations)
Continuing.
Breakpoint 2, 0x00007f65182be5f0 in mkl_lapack.dgesdd_ ()
from <py_env>/lib/python3.7/site-packages/numpy/core/../../../../libmkl_rt.so
#0 0x00007f65182be5f0 in mkl_lapack.dgesdd_ ()
from <py_env>/lib/python3.7/site-packages/numpy/core/../../../../libmkl_rt.so
#1 0x00007f650d065b71 in numba_raw_rgesdd (kind=kind@entry=100 'd', jobz=<optimized out>, jobz@entry=65 'A', m=m@entry=10,
n=n@entry=7, a=a@entry=0x561c6fbb20c0, lda=lda@entry=10, s=0x561c6facf3a0, u=0x561c6fb680e0, ldu=10, vt=0x561c6fd375c0,
ldvt=7, work=0x7fff4c926c30, lwork=-1, iwork=0x7fff4c926c40, info=0x7fff4c926c20) at numba/_lapack.c:1277
#2 0x00007f650d06768f in numba_ez_rgesdd (ldvt=7, vt=0x561c6fd375c0, ldu=10, u=0x561c6fb680e0, s=0x561c6facf3a0, lda=10,
a=0x561c6fbb20c0, n=7, m=10, jobz=65 'A', kind=<optimized out>) at numba/_lapack.c:1307
#3 numba_ez_gesdd (kind=<optimized out>, jobz=<optimized out>, m=10, n=7, a=0x561c6fbb20c0, lda=10, s=0x561c6facf3a0,
u=0x561c6fb680e0, ldu=10, vt=0x561c6fd375c0, ldvt=7) at numba/_lapack.c:1477
#4 0x00007f650a3147a3 in numba::targets::linalg::svd_impl::$3clocals$3e::svd_impl$243(Array<double, 2, C, mutable, aligned>, omitted$28default$3d1$29) ()
#5 0x00007f650a1c0489 in __main__::foo$241(Array<double, 2, C, mutable, aligned>) () at demo_gdb_args.py:15
#6 0x00007f650a1c2110 in cpython::__main__::foo$241(Array<double, 2, C, mutable, aligned>) ()
#7 0x00007f650cd096a4 in call_cfunc ()
from <path>/numba/numba/_dispatcher.cpython-37m-x86_64-linux-gnu.so
...
How does the ``gdb`` binding work?
----------------------------------
For advanced users and debuggers of Numba applications it's important to know
some of the internal implementation details of the outlined ``gdb`` bindings.
The :func:`numba.gdb` and :func:`numba.gdb_init` functions work by injecting the
following into the function's LLVM IR:
* At the call site of the function first inject a call to ``getpid(3)`` to get
the PID of the executing process and store this for use later, then inject a
``fork(3)`` call:
* In the parent:
* Inject a call ``sleep(3)`` (hence the pause whilst ``gdb`` loads).
* Inject a call to the ``numba_gdb_breakpoint`` function (only
:func:`numba.gdb` does this).
* In the child:
* Inject a call to ``execl(3)`` with the arguments
``numba.config.GDB_BINARY``, the ``attach`` command and the PID recorded
earlier. Numba has a special ``gdb`` command file that contains
instructions to break on the symbol ``numba_gdb_breakpoint`` and then
``finish``, this is to make sure that the program stops on the
breakpoint but the frame it stops in is the compiled Python frame (or
one ``step`` away from it, depending on optimisation). This command file is
also added to the arguments, and finally any user-specified arguments are
added.
At the call site of a :func:`numba.gdb_breakpoint` a call is injected to the
special ``numba_gdb_breakpoint`` symbol, which is already registered and
instrumented as a place to break and ``finish`` immediately.
As a result of this, a call to e.g. :func:`numba.gdb` will cause a fork in the
program: the parent will sleep whilst the child launches ``gdb``, attaches it
to the parent and tells the parent to continue. The launched ``gdb`` has the
``numba_gdb_breakpoint`` symbol registered as a breakpoint and, when the parent
continues and stops sleeping, it will immediately call ``numba_gdb_breakpoint``,
on which the attached ``gdb`` will break. Additional :func:`numba.gdb_breakpoint`
calls create calls to the registered breakpoint, hence the program will also
break at these locations.
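The mechanism can be approximated in plain Python. The following is only a
conceptual sketch of the fork/attach pattern described above; Numba injects the
equivalent calls at the LLVM IR level, and its real command file and breakpoint
handling are more involved:

.. code-block:: python

   import os
   import time

   def attach_gdb_sketch(extra_args=()):
       pid = os.getpid()  # PID of the process to be debugged
       if os.fork() == 0:
           # Child: replace this process with gdb attached to the parent.
           os.execvp("gdb", ["gdb", "-p", str(pid), *extra_args])
       else:
           # Parent: give gdb time to attach before carrying on.
           time.sleep(10)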
.. _debugging-cuda-python-code:
Debugging CUDA Python code
==========================
Using the simulator
-------------------
CUDA Python code can be run in the Python interpreter using the CUDA Simulator,
allowing it to be debugged with the Python debugger or with print statements. To
enable the CUDA simulator, set the environment variable
:envvar:`NUMBA_ENABLE_CUDASIM` to 1. For more information on the CUDA Simulator,
see :ref:`the CUDA Simulator documentation <simulator>`.
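For example (``my_cuda_script.py`` is a placeholder for your own program):

.. code-block:: none

   $ NUMBA_ENABLE_CUDASIM=1 python my_cuda_script.py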
Debug Info
----------
By setting the ``debug`` argument to ``cuda.jit`` to ``True``
(``@cuda.jit(debug=True)``), Numba will emit source location information in the
compiled CUDA code. Unlike the CPU target, only filename and line information
are available; no variable type information is emitted. This information
is sufficient to debug memory errors with
`cuda-memcheck <http://docs.nvidia.com/cuda/cuda-memcheck/index.html>`_.
For example, given the following CUDA Python code:
.. code-block:: python
:linenos:
import numpy as np
from numba import cuda
@cuda.jit(debug=True)
def foo(arr):
arr[cuda.threadIdx.x] = 1
arr = np.arange(30)
foo[1, 32](arr) # more threads than array elements
We can use ``cuda-memcheck`` to find the memory error:
.. code-block:: none
$ cuda-memcheck python chk_cuda_debug.py
========= CUDA-MEMCHECK
========= Invalid __global__ write of size 8
========= at 0x00000148 in /home/user/chk_cuda_debug.py:6:cudapy::__main__::foo$241(Array<__int64, int=1, C, mutable, aligned>)
========= by thread (31,0,0) in block (0,0,0)
========= Address 0x500a600f8 is out of bounds
...
=========
========= Invalid __global__ write of size 8
========= at 0x00000148 in /home/user/chk_cuda_debug.py:6:cudapy::__main__::foo$241(Array<__int64, int=1, C, mutable, aligned>)
========= by thread (30,0,0) in block (0,0,0)
========= Address 0x500a600f0 is out of bounds
...
==================================
Creating NumPy universal functions
==================================
There are two types of universal functions:
* Those which operate on scalars; these are "universal functions" or *ufuncs*
(see ``@vectorize`` below).
* Those which operate on higher dimensional arrays and scalars; these are
"generalized universal functions" or *gufuncs* (see ``@guvectorize`` below).
.. _vectorize:
The ``@vectorize`` decorator
============================
Numba's vectorize allows Python functions taking scalar input arguments to
be used as NumPy `ufuncs`_. Creating a traditional NumPy ufunc is
not the most straightforward process and involves writing some C code.
Numba makes this easy. Using the :func:`~numba.vectorize` decorator, Numba
can compile a pure Python function into a ufunc that operates over NumPy
arrays as fast as traditional ufuncs written in C.
.. _ufuncs: http://docs.scipy.org/doc/numpy/reference/ufuncs.html
Using :func:`~numba.vectorize`, you write your function as operating over
input scalars, rather than arrays. Numba will generate the surrounding
loop (or *kernel*) allowing efficient iteration over the actual inputs.
The :func:`~numba.vectorize` decorator has two modes of operation:
* Eager, or decoration-time, compilation: If you pass one or more type
signatures to the decorator, you will be building a NumPy universal
function (ufunc). The rest of this subsection describes building
ufuncs using decoration-time compilation.
* Lazy, or call-time, compilation: When not given any signatures, the
decorator will give you a Numba dynamic universal function
(:class:`~numba.DUFunc`) that dynamically compiles a new kernel when
called with a previously unsupported input type. A later
subsection, ":ref:`dynamic-universal-functions`", describes this mode in
more depth.
As described above, if you pass a list of signatures to the
:func:`~numba.vectorize` decorator, your function will be compiled
into a NumPy ufunc. In the basic case, only one signature will be
passed:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_vectorize_one_signature`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_vectorize_one_signature.begin
:end-before: magictoken.ex_vectorize_one_signature.end
:dedent: 12
:linenos:
If you pass several signatures, beware that you have to pass most specific
signatures before least specific ones (e.g., single-precision floats
before double-precision floats), otherwise type-based dispatching will not work
as expected:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_vectorize_multiple_signatures`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_vectorize_multiple_signatures.begin
:end-before: magictoken.ex_vectorize_multiple_signatures.end
:dedent: 12
:linenos:
The function will work as expected over the specified array types:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_vectorize_multiple_signatures`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_vectorize_return_call_one.begin
:end-before: magictoken.ex_vectorize_return_call_one.end
:dedent: 12
:linenos:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_vectorize_multiple_signatures`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_vectorize_return_call_two.begin
:end-before: magictoken.ex_vectorize_return_call_two.end
:dedent: 12
:linenos:
but it will fail working on other types::
>>> a = np.linspace(0, 1+1j, 6)
>>> f(a, a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: ufunc 'ufunc' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
You might ask yourself, "why would I go through this instead of compiling
a simple iteration loop using the :ref:`@jit <jit>` decorator?". The
answer is that NumPy ufuncs automatically get other features such as
reduction, accumulation or broadcasting. Using the example above:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_vectorize_multiple_signatures`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_vectorize_return_call_three.begin
:end-before: magictoken.ex_vectorize_return_call_three.end
:dedent: 12
:linenos:
.. seealso::
`Standard features of ufuncs <http://docs.scipy.org/doc/numpy/reference/ufuncs.html#ufunc>`_ (NumPy documentation).
.. note::
Only the broadcasting features of ufuncs are supported in compiled code.
The :func:`~numba.vectorize` decorator supports multiple ufunc targets:
================= ===============================================================
Target Description
================= ===============================================================
cpu Single-threaded CPU
parallel Multi-core CPU
cuda CUDA GPU
.. NOTE:: This creates a *ufunc-like* object.
See `documentation for CUDA ufunc <../cuda/ufunc.html>`_ for detail.
================= ===============================================================
A general guideline is to choose different targets for different data sizes
and algorithms.
The "cpu" target works well for small data sizes (approx. less than 1KB) and low
compute intensity algorithms. It has the least amount of overhead.
The "parallel" target works well for medium data sizes (approx. less than 1MB).
Threading adds a small delay.
The "cuda" target works well for big data sizes (approx. greater than 1MB) and
high compute intensity algorithms. Transferring memory to and from the GPU adds
significant overhead.
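As a sketch, the target is selected via the ``target`` keyword argument; the
function and signature below are purely illustrative:

.. code-block:: python

   from numba import vectorize

   @vectorize(["float64(float64, float64)"], target="parallel")
   def rel_diff(x, y):
       return 2 * (x - y) / (x + y)

Changing only the ``target`` string moves the same kernel between the backends
listed above (the ``cuda`` target additionally requires a suitable GPU and
driver).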
.. _guvectorize:
The ``@guvectorize`` decorator
==============================
While :func:`~numba.vectorize` allows you to write ufuncs that work on one
element at a time, the :func:`~numba.guvectorize` decorator takes the concept
one step further and allows you to write ufuncs that will work on an
arbitrary number of elements of input arrays, and take and return arrays of
differing dimensions. The typical example is a running median or a
convolution filter.
Contrary to :func:`~numba.vectorize` functions, :func:`~numba.guvectorize`
functions don't return their result value: they take it as an array
argument, which must be filled in by the function. This is because the
array is actually allocated by NumPy's dispatch mechanism, which calls into
the Numba-generated code.
Similar to the :func:`~numba.vectorize` decorator, :func:`~numba.guvectorize`
also has two modes of operation: eager, or decoration-time, compilation and
lazy, or call-time, compilation.
Here is a very simple example:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize.begin
:end-before: magictoken.ex_guvectorize.end
:dedent: 12
:linenos:
The underlying Python function simply adds a given scalar (``y``) to all
elements of a 1-dimensional array. What's more interesting is the declaration.
There are two things there:
* the declaration of input and output *layouts*, in symbolic form:
``(n),()->(n)`` tells NumPy that the function takes an *n*-element one-dimensional
array, a scalar (symbolically denoted by the empty tuple ``()``) and
returns an *n*-element one-dimensional array;
* the list of supported concrete *signatures* as per ``@vectorize``; here,
as in the above example, we demonstrate ``int64`` arrays.
.. note::
A 1D array type can also receive scalar arguments (those with shape ``()``).
In the above example, the second argument also could be declared as
``int64[:]``. In that case, the value must be read by ``y[0]``.
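A sketch of the variant described in the note, with the scalar declared as a
1D array (the function body is illustrative):

.. code-block:: python

   from numba import guvectorize, int64

   @guvectorize([(int64[:], int64[:], int64[:])], "(n),()->(n)")
   def g(x, y, res):
       for i in range(x.shape[0]):
           res[i] = x[i] + y[0]  # the scalar arrives as a 1-element array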
We can now check what the compiled ufunc does, over a simple example:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize_call_one.begin
:end-before: magictoken.ex_guvectorize_call_one.end
:dedent: 12
:linenos:
The nice thing is that NumPy will automatically dispatch over more
complicated inputs, depending on their shapes:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize_call_two.begin
:end-before: magictoken.ex_guvectorize_call_two.end
:dedent: 12
:linenos:
.. note::
Both :func:`~numba.vectorize` and :func:`~numba.guvectorize` support
passing ``nopython=True`` :ref:`as in the @jit decorator <jit-nopython>`.
Use it to ensure the generated code does not fall back to
:term:`object mode`.
.. _scalar-return-values:
Scalar return values
--------------------
Now suppose we want to return a scalar value from
:func:`~numba.guvectorize`. To do this, we need to:
* in the signatures, declare the scalar return with ``[:]`` like
a 1-dimensional array (eg. ``int64[:]``),
* in the layout, declare it as ``()``,
* in the implementation, write to the first element (e.g. ``res[0] = acc``).
The following example function computes the sum of the 1-dimensional
array (``x``) plus the scalar (``y``) and returns it as a scalar:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize_scalar_return`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize_scalar_return.begin
:end-before: magictoken.ex_guvectorize_scalar_return.end
:dedent: 12
:linenos:
Now if we apply the wrapped function over the array, we get a scalar
value as the output:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize_scalar_return`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize_scalar_return_call.begin
:end-before: magictoken.ex_guvectorize_scalar_return_call.end
:dedent: 12
:linenos:
.. _overwriting-input-values:
Overwriting input values
------------------------
In most cases, writing to inputs may also appear to work - however, this
behaviour cannot be relied on. Consider the following example function:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize_overwrite`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize_overwrite.begin
:end-before: magictoken.ex_guvectorize_overwrite.end
:dedent: 12
:linenos:
Calling the `init_values` function with an array of `float64` type results in
visible changes to the input:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize_overwrite`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize_overwrite_call_one.begin
:end-before: magictoken.ex_guvectorize_overwrite_call_one.end
:dedent: 12
:linenos:
This works because NumPy can pass the input data directly into the `init_values`
function as the data `dtype` matches that of the declared argument. However, it
may also create and pass in a temporary array, in which case changes to the
input are lost. For example, this can occur when casting is required. To
demonstrate, we can use an array of `float32` with the `init_values` function:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize_overwrite`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize_overwrite_call_two.begin
:end-before: magictoken.ex_guvectorize_overwrite_call_two.end
:dedent: 12
:linenos:
In this case, there is no change to the `invals` array because the temporary
casted array was mutated instead.
To solve this problem, one needs to tell the GUFunc engine that the ``invals``
argument is writable. This can be achieved by passing ``writable_args=('invals',)``
(specifying by name), or ``writable_args=(0,)`` (specifying by position) to
``@guvectorize``. Now, the code above works as expected:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize_overwrite`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize_overwrite_call_three.begin
:end-before: magictoken.ex_guvectorize_overwrite_call_three.end
:dedent: 12
:linenos:
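The decorator call looks like the following sketch (a generic gufunc, not the
``init_values`` function from the linked example; the signature, layout and
body are illustrative):

.. code-block:: python

   from numba import guvectorize, float64

   # Declare the first argument writable so that mutations propagate back to
   # the caller's array even when NumPy has to cast the input.
   @guvectorize([(float64[:], float64[:])], "(n)->(n)", writable_args=(0,))
   def fill_and_copy(invals, out):
       for i in range(invals.shape[0]):
           invals[i] = 1.0
           out[i] = invals[i]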
.. _dynamic-universal-functions:
Dynamic universal functions
===========================
As described above, if you do not pass any signatures to the
:func:`~numba.vectorize` decorator, your Python function will be used
to build a dynamic universal function, or :class:`~numba.DUFunc`. For
example:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_vectorize_dynamic`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_vectorize_dynamic.begin
:end-before: magictoken.ex_vectorize_dynamic.end
:dedent: 12
:linenos:
The resulting :func:`f` is a :class:`~numba.DUFunc` instance that
starts with no supported input types. As you make calls to :func:`f`,
Numba generates new kernels whenever you pass a previously unsupported
input type. Given the example above, the following set of interpreter
interactions illustrate how dynamic compilation works::
>>> f
<numba._DUFunc 'f'>
>>> f.ufunc
<ufunc 'f'>
>>> f.ufunc.types
[]
The example above shows that :class:`~numba.DUFunc` instances are not
ufuncs. Rather than subclassing ufuncs, :class:`~numba.DUFunc`
instances work by keeping a :attr:`~numba.DUFunc.ufunc` member, and
then delegating ufunc property reads and method calls to this member
(also known as type aggregation). When we look at the initial types
supported by the ufunc, we can verify there are none.
Let's try to make a call to :func:`f`:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_vectorize_dynamic`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_vectorize_dynamic_call_one.begin
:end-before: magictoken.ex_vectorize_dynamic_call_one.end
:dedent: 12
:linenos:
If this was a normal NumPy ufunc, we would have seen an exception
complaining that the ufunc couldn't handle the input types. When we
call :func:`f` with integer arguments, not only do we receive an
answer, but we can verify that Numba created a loop supporting C
:code:`long` integers.
We can add additional loops by calling :func:`f` with different inputs:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_vectorize_dynamic`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_vectorize_dynamic_call_two.begin
:end-before: magictoken.ex_vectorize_dynamic_call_two.end
:dedent: 12
:linenos:
We can now verify that Numba added a second loop for dealing with
floating-point inputs, :code:`"dd->d"`.
If we mix input types to :func:`f`, we can verify that `NumPy ufunc
casting rules`_ are still in effect:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_vectorize_dynamic`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_vectorize_dynamic_call_three.begin
:end-before: magictoken.ex_vectorize_dynamic_call_three.end
:dedent: 12
:linenos:
.. _`NumPy ufunc casting rules`: http://docs.scipy.org/doc/numpy/reference/ufuncs.html#casting-rules
This example demonstrates that calling :func:`f` with mixed types
caused NumPy to select the floating-point loop, and cast the integer
argument to a floating-point value. Thus, Numba did not create a
special :code:`"dl->d"` kernel.
This :class:`~numba.DUFunc` behavior leads us to a point similar to
the warning given above in "`The @vectorize decorator`_" subsection,
but instead of signature declaration order in the decorator, call
order matters. If we had passed in floating-point arguments first,
any calls with integer arguments would be cast to double-precision
floating-point values. For example:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_vectorize_dynamic`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_vectorize_dynamic_call_four.begin
:end-before: magictoken.ex_vectorize_dynamic_call_four.end
:dedent: 12
:linenos:
If you require precise support for various type signatures, you should
specify them in the :func:`~numba.vectorize` decorator, and not rely
on dynamic compilation.
Dynamic generalized universal functions
=======================================
Similar to a dynamic universal function, if you do not specify any types to
the :func:`~numba.guvectorize` decorator, your Python function will be used
to build a dynamic generalized universal function, or :class:`~numba.GUFunc`.
For example:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize_dynamic`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize_dynamic.begin
:end-before: magictoken.ex_guvectorize_dynamic.end
:dedent: 12
:linenos:
We can verify the resulting function :func:`g` is a :class:`~numba.GUFunc`
instance that starts with no supported input types. For instance::
>>> g
<numba._GUFunc 'g'>
>>> g.ufunc
<ufunc 'g'>
>>> g.ufunc.types
[]
Similar to a :class:`~numba.DUFunc`, as one makes calls to :func:`g()`,
Numba generates new kernels for previously unsupported input types. The
following set of interpreter interactions will illustrate how dynamic
compilation works for a :class:`~numba.GUFunc`:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize_dynamic`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize_dynamic_call_one.begin
:end-before: magictoken.ex_guvectorize_dynamic_call_one.end
:dedent: 12
:linenos:
If this was a normal :func:`guvectorize` function, we would have seen an
exception complaining that the ufunc could not handle the given input types.
When we call :func:`g()` with the input arguments, Numba creates a new loop
for the input types.
We can add additional loops by calling :func:`g` with new arguments:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize_dynamic`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize_dynamic_call_two.begin
:end-before: magictoken.ex_guvectorize_dynamic_call_two.end
:dedent: 12
:linenos:
We can now verify that Numba added a second loop for dealing with
floating-point inputs, :code:`"dd->d"`.
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize_dynamic`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize_dynamic_call_three.begin
:end-before: magictoken.ex_guvectorize_dynamic_call_three.end
:dedent: 12
:linenos:
One can also verify that NumPy ufunc casting rules are working as expected:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize_dynamic`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize_dynamic_call_four.begin
:end-before: magictoken.ex_guvectorize_dynamic_call_four.end
:dedent: 12
:linenos:
If you need precise support for various type signatures, you should not rely
on dynamic compilation and should instead specify the types as the first
argument in the :func:`~numba.guvectorize` decorator.
============================================================
Callback into the Python Interpreter from within JIT'ed code
============================================================
There are rare but real cases when a nopython-mode function needs to call back
into the Python interpreter to invoke code that cannot be compiled by Numba.
Such cases include:
- logging progress for long-running JIT'ed functions;
- using data structures that are not currently supported by Numba;
- debugging inside JIT'ed code using the Python debugger.
When Numba calls back into the Python interpreter, the following has to happen:
- acquire the GIL;
- convert values in native representation back into Python objects;
- call back into the Python interpreter;
- convert returned values from the Python code into native representation;
- release the GIL.
These steps can be expensive. Users **should not** rely on the feature
described here on performance-critical paths.
.. _with_objmode:
The ``objmode`` context-manager
===============================
.. warning:: This feature can be easily mis-used. Users should first consider
alternative approaches to achieve their intended goal before using
this feature.
.. autofunction:: numba.objmode
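As a minimal sketch of typical usage (``py_work`` is just a stand-in for
whatever interpreter-only code needs to be invoked, and the type given to
``objmode`` is the type the variable ``y`` will have when execution leaves the
block and re-enters compiled code):

.. code-block:: python

   from numba import njit, objmode

   def py_work(x):
       # arbitrary interpreter-only work, e.g. logging or an unsupported library
       print("called back into the interpreter with", x)
       return x + 1

   @njit
   def f(x):
       y = x * 2
       with objmode(y='int64'):  # 'y' comes back typed as int64
           y = py_work(y)
       return y + 1

   f(10)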
Changelog
=========
This directory contains "news fragments" which are short files that contain a
small **ReST**-formatted text that will be added to the next what's new page.
Make sure to use full sentences with correct case and punctuation, and please
try to use Sphinx intersphinx using backticks. The fragment should have a
header line and an underline using ``""""""""`` followed by a description of
your user-facing changes as they should appear in the release notes.
Each file should be named like ``<PULL REQUEST>.<TYPE>.rst``, where
``<PULL REQUEST>`` is a pull request number, and ``<TYPE>`` is one of:
* ``highlight``: Adds a highlighted bullet point to use as a possible highlight
of the release.
* ``np_support``: Addition of new NumPy functionality.
* ``deprecation``: Changes to existing code that will now emit a deprecation warning.
* ``expired``: Removal of a deprecated part of the API.
* ``compatibility``: A change which requires users to change code and is not
backwards compatible. (Not to be used for removal of deprecated features.)
* ``cuda``: Changes in the CUDA target implementation.
* ``new_feature``: New user facing features like ``kwargs``.
* ``improvement``: General improvements and edge-case changes which are
not new features or compatibility related.
* ``performance``: Performance changes that should not affect other behaviour.
* ``change``: Other changes
* ``doc``: Documentation related changes.
* ``infrastructure``: Infrastructure/CI related changes.
* ``bug_fix``: Bug fixes for existing features/functionality.
If you are unsure what pull request type to use, don't hesitate to ask in your
PR.
You can install ``towncrier`` and run ``towncrier build --draft``
if you want to get a preview of how your change will look in the final release
notes.
.. note::
This README was adapted from the NumPy changelog readme under the terms of
the `BSD-3 licence <https://github.com/numpy/numpy/blob/c1ffdbc0c29d48ece717acb5bfbf811c935b41f6/LICENSE.txt>`_.
{% set title = "Version {} (Release Date)".format(versiondata.version) %}
{{ title }}
{{ "-" * title|length }}
{% for section, _ in sections.items() %}
{% if section %}{{ section }}
{{ "~" * section|length }}
{% endif %}
{% if sections[section] %}
{% for category, val in definitions.items() if category in sections[section] %}
{{ definitions[category]['name'] }}
{{ "~" * definitions[category]['name']|length }}
{% if definitions[category]['showcontent'] %}
{% for text, values in sections[section][category].items() %}
{{ text }}
{{ get_indent(text) }}({{values|join(', ') }})
{% endfor %}
{% else %}
- {{ sections[section][category]['']|join(', ') }}
{% endif %}
{% if sections[section][category]|length == 0 %}
No significant changes.
{% else %}
{% endif %}
{% endfor %}
{% else %}
No significant changes.
{% endif %}
{% endfor %}
#! /usr/bin/env python
import sys
import os
from github3 import login
import github3
def fetch(orgname, reponame, last_num, gh):
    """Collect issues and pull requests from ``orgname/reponame`` newer than ``last_num``."""
repo = gh.repository(orgname, reponame)
issues = repo.issues(state='all')
opened_issues = []
closed_issues = []
opened_prs = []
closed_prs = []
max_iss_num = 0
for issue in issues:
info = issue.as_dict()
iss_num = int(info['number'])
max_iss_num = max(max_iss_num, iss_num)
if iss_num <= last_num:
break
merged = False
if issue.pull_request_urls:
# Is PR?
merged = bool(info['pull_request'].get("merged_at"))
where = {'opened': opened_prs, 'closed': closed_prs}
else:
# Is a plain issue (not a PR)
where = {'opened': opened_issues, 'closed': closed_issues}
line = f"{' - merged ' if merged else ''}- [{reponame}"\
f"#{info['number']}]({info['html_url']}) - {info['title']}"
# Is the issue/PR already closed?
if issue.is_closed():
where['closed'].append(line)
else:
where['opened'].append(line)
return {
'opened_issues': opened_issues,
'closed_issues': closed_issues,
'opened_prs': opened_prs,
'closed_prs': closed_prs,
'max_iss_num': max_iss_num,
}
def display(data):
print("## 1. New Issues")
for line in reversed(data['opened_issues']):
print(line)
print()
print("### Closed Issues")
for line in reversed(data['closed_issues']):
print(line)
print()
print("## 2. New PRs")
for line in reversed(data['opened_prs']):
print(line)
print()
print("### Closed PRs")
for line in reversed(data['closed_prs']):
print(line)
print()
def main(numba_last_num, llvmlite_last_num, user=None, password=None):
if user is not None and password is not None:
gh = login(str(user), password=str(password))
else:
gh = github3
numba_data = fetch("numba", "numba", numba_last_num, gh)
llvmlite_data = fetch("numba", "llvmlite", llvmlite_last_num, gh)
# combine data
data = {
'opened_issues':
llvmlite_data['opened_issues'] +
numba_data['opened_issues'],
'closed_issues':
llvmlite_data['closed_issues'] +
numba_data['closed_issues'],
'opened_prs':
llvmlite_data['opened_prs'] +
numba_data['opened_prs'],
'closed_prs':
llvmlite_data['closed_prs'] +
numba_data['closed_prs'],
}
display(data)
print(f"(last numba: {numba_data['max_iss_num']};"
f"llvmlite {llvmlite_data['max_iss_num']})")
help_msg = """
Usage:
{program_name} <numba_last_num> <llvmlite_last_num>
"""
if __name__ == '__main__':
program_name = sys.argv[0]
try:
[numba_last_num, llvmlite_last_num] = sys.argv[1:]
except ValueError:
print(help_msg.format(program_name=program_name))
else:
main(int(numba_last_num),
int(llvmlite_last_num),
user=os.environ.get("GHUSER"),
password=os.environ.get("GHPASS"))
"""gitlog2changelog.py
Usage:
gitlog2changelog.py (-h | --help)
gitlog2changelog.py --version
gitlog2changelog.py --token=<token> --beginning=<tag> --repo=<repo> \
--digits=<digits> [--summary]
Options:
-h --help Show this screen.
--version Show version.
--beginning=<tag> Where in the history to begin
--repo=<repo> Which repository to look at on GitHub
--token=<token> The GitHub token to talk to the API
--digits=<digits> The number of digits to use in the issue finding regex
--summary Show total count for each section
"""
import re
from git import Repo
from docopt import docopt
from github import Github
ghrepo = None
def get_pr(pr_number):
return ghrepo.get_pull(pr_number)
def hyperlink_user(user_obj):
return "`%s <%s>`_" % (user_obj.login, user_obj.html_url)
if __name__ == '__main__':
arguments = docopt(__doc__, version='1.0')
beginning = arguments['--beginning']
target_ghrepo = arguments['--repo']
github_token = arguments['--token']
regex_digits = arguments['--digits']
summary = arguments["--summary"]
ghrepo = Github(github_token).get_repo(target_ghrepo)
repo = Repo('.')
all_commits = [x for x in repo.iter_commits(f'{beginning}..HEAD')]
merge_commits = [x for x in all_commits
if 'Merge pull request' in x.message]
prmatch = re.compile(
f'^Merge pull request #([0-9]{{{regex_digits}}}) from.*')
ordered = {}
authors = set()
for x in merge_commits:
match = prmatch.match(x.message)
if match:
issue_id = match.groups()[0]
ordered[issue_id] = "%s" % (x.message.splitlines()[2])
print("Pull-Requests:\n")
for k in sorted(ordered.keys()):
pull = get_pr(int(k))
hyperlink = "`#%s <%s>`_" % (k, pull.html_url)
# get all users for all commits
pr_authors = set()
for c in pull.get_commits():
if c.author:
pr_authors.add(c.author)
if c.committer and c.committer.login != "web-flow":
pr_authors.add(c.committer)
print("* PR %s: %s (%s)" % (hyperlink, ordered[k],
" ".join([hyperlink_user(u) for u in
pr_authors])))
for a in pr_authors:
authors.add(a)
if summary:
print("\nTotal PRs: %s\n" % len(ordered))
else:
print()
print("Authors:\n")
for author in sorted(authors, key=lambda x: x.login.lower()):
    print('* %s' % hyperlink_user(author))
if summary:
print("\nTotal authors: %s" % len(authors))
# Global options:
[mypy]
warn_unused_configs = True
follow_imports = silent
show_error_context = True
files = **/numba/core/types/*.py, **/numba/core/datamodel/*.py, **/numba/core/rewrites/*.py, **/numba/core/unsafe/*.py, **/numba/core/rvsdg_frontend/*.py, **/numba/core/rvsdg_frontend/rvsdg/*.py
# Per-module options:
# To classify a given module as Level 1, 2 or 3 it must be added both to "files" (the variable above) and to the relevant list below; an illustrative, commented-out example is shown after the Level 2 settings.
# Level 1 - modules checked on the strictest settings.
;[mypy-]
;warn_return_any = True
;disallow_any_expr = True
;disallow_any_explicit = True
;disallow_any_generics = True
;disallow_subclassing_any = True
;disallow_untyped_calls = True
;disallow_untyped_defs = True
;disallow_incomplete_defs = True
;check_untyped_defs = True
;disallow_untyped_decorators = True
;warn_unused_ignores = True
;follow_imports = normal
;warn_unreachable = True
;strict_equality = True
# Level 2 - modules that pass reasonably strict settings.
# No untyped functions allowed. Imports must be typed or explicitly ignored.
;[mypy-]
;warn_return_any = True
;disallow_untyped_defs = True
;disallow_incomplete_defs = True
;follow_imports = normal
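# For illustration only: a hypothetical Level 2 entry. The module path below is
# an assumption for this sketch; to actually enable it, the matching glob
# (e.g. **/numba/core/typing/*.py) would also need to be added to the "files"
# setting in the global section above.
;[mypy-numba.core.typing.*]
;warn_return_any = True
;disallow_untyped_defs = True
;disallow_incomplete_defs = True
;follow_imports = normal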
# Level 3 - modules that pass mypy default settings (only those in `files` global setting and not in previous levels)
# Function/variables are annotated to avoid mypy errors, but annotations are not complete.
[mypy-numba.core.*]
warn_return_any = True
# Level 4 - modules that do not pass mypy check: they are excluded from "files" setting in global section
# External packages that lack annotations
[mypy-llvmlite.*]
ignore_missing_imports = True
[mypy-numpy.*]
ignore_missing_imports = True
[mypy-winreg.*]
# this can be removed after Mypy 0.78 is out with the latest typeshed
ignore_missing_imports = True
[mypy-numba_rvsdg.*]
ignore_missing_imports = True
[mypy-graphviz.*]
ignore_missing_imports = True