User Manual
===========
.. toctree::
5minguide.rst
overview.rst
installing.rst
jit.rst
generated-jit.rst
vectorize.rst
jitclass.rst
cfunc.rst
pycc.rst
parallel.rst
stencil.rst
withobjmode.rst
jit-module.rst
performance-tips.rst
threading-layer.rst
cli.rst
troubleshoot.rst
faq.rst
examples.rst
talks.rst
Installation
============
Compatibility
-------------
For software compatibility, please see the section on :ref:`version support
information<numba_support_info>` for details.
Our supported platforms are:
* Linux x86_64
* Linux ppc64le (POWER8, POWER9)
* Windows 10 and later (64-bit)
* OS X 10.9 and later (64-bit and unofficial support on M1/Arm64)
* \*BSD (unofficial support only)
* NVIDIA GPUs of compute capability 5.0 and later

  * Compute capabilities 3.5 and 3.7 are supported, but deprecated.
* ARMv8 (64-bit little-endian, such as the NVIDIA Jetson)
:ref:`numba-parallel` is only available on 64-bit platforms.
Installing using conda on x86/x86_64/POWER Platforms
----------------------------------------------------
The easiest way to install Numba and get updates is by using ``conda``,
a cross-platform package manager and software distribution maintained
by Anaconda, Inc. You can either use `Anaconda
<https://www.anaconda.com/download>`_ to get the full stack in one download,
or `Miniconda <https://conda.io/miniconda.html>`_ which will install
the minimum packages required for a conda environment.
Once you have conda installed, just type::
$ conda install numba
or::
$ conda update numba
Note that Numba, like Anaconda, only supports PPC in 64-bit little-endian mode.
To enable CUDA GPU support for Numba, install the latest `graphics drivers from
NVIDIA <https://www.nvidia.com/Download/index.aspx>`_ for your platform.
(Note that the open source Nouveau drivers shipped by default with many Linux
distributions do not support CUDA.) Then install the ``cudatoolkit`` package::
$ conda install cudatoolkit
You do not need to install the CUDA SDK from NVIDIA.
Installing using pip on x86/x86_64 Platforms
--------------------------------------------
Binary wheels for Windows, Mac, and Linux are also available from `PyPI
<https://pypi.org/project/numba/>`_. You can install Numba using ``pip``::
$ pip install numba
This will download all of the needed dependencies as well. You do not need to
have LLVM installed to use Numba (in fact, Numba will ignore all LLVM
versions installed on the system) as the required components are bundled into
the llvmlite wheel.
To use CUDA with Numba installed by ``pip``, you need to install the `CUDA SDK
<https://developer.nvidia.com/cuda-downloads>`_ from NVIDIA. Please refer to
:ref:`cudatoolkit-lookup` for details. Numba can also detect CUDA libraries
installed system-wide on Linux.
Installing on Linux ARMv8 (AArch64) Platforms
---------------------------------------------
We build and test conda packages on the `NVIDIA Jetson TX2
<https://www.nvidia.com/en-us/autonomous-machines/embedded-systems-dev-kits-modules/>`_,
but they are likely to work for other AArch64 platforms. (Note that while the
CPUs in the Raspberry Pi 3, 4, and Zero 2 W are 64-bit, Raspberry Pi OS may be
running in 32-bit mode depending on the OS image in use).
Conda-forge support for AArch64 is still quite experimental and packages are limited,
but it does work enough for Numba to build and pass tests. To set up the environment:
* Install `miniforge <https://github.com/conda-forge/miniforge>`_.
This will create a minimal conda environment.
* Then you can install Numba from the ``numba`` channel::
$ conda install -c numba numba
On CUDA-enabled systems, like the Jetson, the CUDA toolkit should be
automatically detected in the environment.
.. _numba-source-install-instructions:
Installing from source
----------------------
Installing Numba from source is fairly straightforward (similar to other
Python packages), but installing `llvmlite
<https://github.com/numba/llvmlite>`_ can be quite challenging due to the need
for a special LLVM build. If you are building from source for the purposes of
Numba development, see :ref:`buildenv` for details on how to create a Numba
development environment with conda.
If you are building Numba from source for other reasons, first follow the
`llvmlite installation guide <https://llvmlite.readthedocs.io/en/latest/admin-guide/install.html>`_.
Once that is completed, you can download the latest Numba source code from
`Github <https://github.com/numba/numba>`_::
$ git clone https://github.com/numba/numba.git
Source archives of the latest release can also be found on
`PyPI <https://pypi.org/project/numba/>`_. In addition to ``llvmlite``, you will also need:
* A C compiler compatible with your Python installation. If you are using
Anaconda, you can use the following conda packages:
* Linux ``x86_64``: ``gcc_linux-64`` and ``gxx_linux-64``
* Linux ``POWER``: ``gcc_linux-ppc64le`` and ``gxx_linux-ppc64le``
* Linux ``ARM``: no conda packages, use the system compiler
* Mac OSX: ``clang_osx-64`` and ``clangxx_osx-64`` or the system compiler at
``/usr/bin/clang`` (Mojave onwards)
* Mac OSX (M1): ``clang_osx-arm64`` and ``clangxx_osx-arm64``
* Windows: a version of Visual Studio appropriate for the Python version in
use
* `NumPy <http://www.numpy.org/>`_
Then you can build and install Numba from the top level of the source tree::
$ python setup.py install
If you wish to run the test suite, see the instructions in the
:ref:`developer documentation <running-tests>`.
.. _numba-source-install-env_vars:
Build time environment variables and configuration of optional components
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The following environment variables alter how Numba builds by default, along
with information on configuring the optional components.
.. envvar:: NUMBA_DISABLE_OPENMP (default: not set)
To disable compilation of the OpenMP threading backend set this environment
variable to a non-empty string when building. If not set (default):
* For Linux and Windows it is necessary to provide OpenMP C headers and
runtime libraries compatible with the compiler tool chain mentioned above,
and for these to be accessible to the compiler via standard flags.
* For OSX the conda package ``llvm-openmp`` provides suitable C headers and
libraries. If the compilation requirements are not met the OpenMP threading
backend will not be compiled.
.. envvar:: NUMBA_DISABLE_TBB (default: not set)
To disable the compilation of the TBB threading backend set this environment
variable to a non-empty string when building. If not set (default) the TBB C
headers and libraries must be available at compile time. If building with
``conda build`` this requirement can be met by installing the ``tbb-devel``
package. If not building with ``conda build`` the requirement can be met via a
system installation of TBB or through the use of the ``TBBROOT`` environment
variable to provide the location of the TBB installation. For more
information about setting ``TBBROOT`` see the `Intel documentation <https://software.intel.com/content/www/us/en/develop/documentation/advisor-user-guide/top/appendix/adding-parallelism-to-your-program/adding-the-parallel-framework-to-your-build-environment/defining-the-tbbroot-environment-variable.html>`_.
.. _numba-source-install-check:
Dependency List
---------------
Numba has numerous required and optional dependencies which may vary with the
target operating system and hardware. The following lists them all
(as of July 2020).
* Required build time:
* ``setuptools``
* ``numpy``
* ``llvmlite``
* Compiler toolchain mentioned above
* Required run time:
* ``numpy``
* ``llvmlite``
* Optional build time:
See :ref:`numba-source-install-env_vars` for more details about additional
options for the configuration and specification of these optional components.
* ``llvm-openmp`` (OSX) - provides headers for compiling OpenMP support into
Numba's threading backend
* ``tbb-devel`` - provides TBB headers/libraries for compiling TBB support
into Numba's threading backend (version >= 2021.6 required).
* ``importlib_metadata`` (for Python versions < 3.9)
* Optional run time:
* ``scipy`` - provides cython bindings used in Numba's ``np.linalg.*``
support
* ``tbb`` - provides the TBB runtime libraries used by Numba's TBB threading
backend (version >= 2021 required).
* ``jinja2`` - for "pretty" type annotation output (HTML) via the ``numba``
CLI
* ``cffi`` - permits use of CFFI bindings in Numba compiled functions
* ``llvm-openmp`` - (OSX) provides OpenMP library support for Numba's OpenMP
threading backend.
* ``intel-openmp`` - (OSX) provides an alternative OpenMP library for use with
Numba's OpenMP threading backend.
* ``ipython`` - if IPython is in use, caching still works but uses IPython's
cache directories
* ``pyyaml`` - permits the use of a ``.numba_config.yaml``
file for storing per project configuration options
* ``colorama`` - makes error message highlighting work
* ``intel-cmplr-lib-rt`` - allows Numba to use Intel SVML for extra
performance
* ``pygments`` - for "pretty" type annotation
* ``gdb`` as an executable on the ``$PATH`` - if you would like to use the gdb
support
* ``setuptools`` - permits the use of ``pycc`` for Ahead-of-Time (AOT)
compilation
* Compiler toolchain mentioned above, if you would like to use ``pycc`` for
Ahead-of-Time (AOT) compilation
* ``r2pipe`` - required for assembly CFG inspection.
* ``radare2`` as an executable on the ``$PATH`` - required for assembly CFG
inspection. `See here <https://github.com/radareorg/radare2>`_ for
information on obtaining and installing.
* ``graphviz`` - for some CFG inspection functionality.
* ``typeguard`` - used by ``runtests.py`` for
:ref:`runtime type-checking <type_anno_check>`.
* ``cuda-python`` - The NVIDIA CUDA Python bindings. See :ref:`cuda-bindings`.
Numba requires Version 11.6 or greater.
* ``cubinlinker`` and ``ptxcompiler`` to support
:ref:`minor-version-compatibility`.
* To build the documentation:
* ``sphinx``
* ``pygments``
* ``sphinx_rtd_theme``
* ``numpydoc``
* ``make`` as an executable on the ``$PATH``
.. _numba_support_info:
Version support information
---------------------------
This is the canonical reference for information concerning which versions of
Numba's dependencies were tested and known to work against a given version of
Numba. Other versions of the dependencies (especially NumPy) may work reasonably
well but were not tested. The use of ``x`` in a version number indicates all
patch levels supported. The use of ``?`` as a version is due to missing
information.
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| Numba | Release date | Python | NumPy | llvmlite | LLVM | TBB |
+===========+==============+===========================+============================+==============================+===================+=============================+
| 0.58.1 | 2023-10-17 | 3.8.x <= version < 3.12 | 1.22 <= version < 1.27 | 0.41.x | 14.x | 2021.6 <= version |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.58.0 | 2023-09-20 | 3.8.x <= version < 3.12 | 1.22 <= version < 1.26 | 0.41.x | 14.x | 2021.6 <= version |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.57.1 | 2023-06-21 | 3.8.x <= version < 3.12 | 1.21 <= version < 1.25 | 0.40.x | 14.x | 2021.6 <= version |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.57.0 | 2023-05-01 | 3.8.x <= version < 3.12 | 1.21 <= version < 1.25 | 0.40.x | 14.x | 2021.6 <= version |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.56.4 | 2022-11-03 | 3.7.x <= version < 3.11 | 1.18 <= version < 1.24 | 0.39.x | 11.x | 2021.x |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.56.3 | 2022-10-13 | 3.7.x <= version < 3.11 | 1.18 <= version < 1.24 | 0.39.x | 11.x | 2021.x |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.56.2 | 2022-09-01 | 3.7.x <= version < 3.11 | 1.18 <= version < 1.24 | 0.39.x | 11.x | 2021.x |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.56.1 | NO RELEASE | | | | | |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.56.0 | 2022-07-25 | 3.7.x <= version < 3.11 | 1.18 <= version < 1.23 | 0.39.x | 11.x | 2021.x |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.55.2 | 2022-05-25 | 3.7.x <= version < 3.11 | 1.18 <= version < 1.23 | 0.38.x | 11.x | 2021.x |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.55.{0,1}| 2022-01-13 | 3.7.x <= version < 3.11 | 1.18 <= version < 1.22 | 0.38.x | 11.x | 2021.x |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.54.x | 2021-08-19 | 3.6.x <= version < 3.10 | 1.17 <= version < 1.21 | 0.37.x | 11.x | 2021.x |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.53.x | 2021-03-11 | 3.6.x <= version < 3.10 | 1.15 <= version < 1.21 | 0.36.x | 11.x | 2019.5 <= version < 2021.4 |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.52.x | 2020-11-30 | 3.6.x <= version < 3.9 | 1.15 <= version < 1.20 | 0.35.x | 10.x | 2019.5 <= version < 2020.3 |
| | | | | | (9.x for aarch64) | |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.51.x | 2020-08-12 | 3.6.x <= version < 3.9 | 1.15 <= version < 1.19 | 0.34.x | 10.x | 2019.5 <= version < 2020.0 |
| | | | | | (9.x for aarch64) | |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.50.x | 2020-06-10 | 3.6.x <= version < 3.9 | 1.15 <= version < 1.19 | 0.33.x | 9.x | 2019.5 <= version < 2020.0 |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.49.x | 2020-04-16 | 3.6.x <= version < 3.9 | 1.15 <= version < 1.18 | 0.31.x <= version < 0.33.x | 9.x | 2019.5 <= version < 2020.0 |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.48.x | 2020-01-27 | 3.6.x <= version < 3.9 | 1.15 <= version < 1.18 | 0.31.x | 8.x | 2018.0.5 <= version < ? |
| | | | | | (7.x for ppc64le) | |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
| 0.47.x | 2020-01-02 | 3.5.x <= version < 3.9; | 1.15 <= version < 1.18 | 0.30.x | 8.x | 2018.0.5 <= version < ? |
| | | version == 2.7.x | | | (7.x for ppc64le) | |
+-----------+--------------+---------------------------+----------------------------+------------------------------+-------------------+-----------------------------+
Checking your installation
--------------------------
You should be able to import Numba from the Python prompt::
$ python
Python 3.10.2 | packaged by conda-forge | (main, Jan 14 2022, 08:02:09) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numba
>>> numba.__version__
'0.55.1'
You can also try executing the ``numba --sysinfo`` (or ``numba -s`` for short)
command to report information about your system capabilities. See :ref:`cli` for
further information.
::
$ numba -s
System info:
--------------------------------------------------------------------------------
__Time Stamp__
Report started (local time) : 2022-01-18 10:35:08.981319
__Hardware Information__
Machine : x86_64
CPU Name : skylake-avx512
CPU Count : 12
CPU Features :
64bit adx aes avx avx2 avx512bw avx512cd avx512dq avx512f avx512vl bmi bmi2
clflushopt clwb cmov cx16 cx8 f16c fma fsgsbase fxsr invpcid lzcnt mmx
movbe pclmul pku popcnt prfchw rdrnd rdseed rtm sahf sse sse2 sse3 sse4.1
sse4.2 ssse3 xsave xsavec xsaveopt xsaves
__OS Information__
Platform Name : Linux-5.4.0-94-generic-x86_64-with-glibc2.31
Platform Release : 5.4.0-94-generic
OS Name : Linux
OS Version : #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022
__Python Information__
Python Compiler : GCC 9.4.0
Python Implementation : CPython
Python Version : 3.10.2
Python Locale : en_GB.UTF-8
__LLVM information__
LLVM Version : 11.1.0
__CUDA Information__
Found 1 CUDA devices
id 0 b'Quadro RTX 8000' [SUPPORTED]
Compute Capability: 7.5
PCI Device ID: 0
PCI Bus ID: 21
UUID: GPU-e6489c45-5b68-3b03-bab7-0e7c8e809643
Watchdog: Enabled
FP32/FP64 Performance Ratio: 32
(output truncated due to length)
.. _jit-module:
============================================
Automatic module jitting with ``jit_module``
============================================
A common usage pattern is to have an entire module containing user-defined
functions that all need to be jitted. One option to accomplish this is to
manually apply the ``@jit`` decorator to each function definition. This approach
works and is great in many cases. However, for large modules with many functions,
manually ``jit``-wrapping each function definition can be tedious. For these
situations, Numba provides another option, the ``jit_module`` function, to
automatically replace functions declared in a module with their ``jit``-wrapped
equivalents.
It's important to note the conditions under which ``jit_module`` will *not*
impact a function:
1. Functions which have already been wrapped with a Numba decorator (e.g.
``jit``, ``vectorize``, ``cfunc``, etc.) are not impacted by ``jit_module``.
2. Functions which are declared outside the module from which ``jit_module``
is called are not automatically ``jit``-wrapped.
3. Function declarations which occur logically after calling ``jit_module``
are not impacted.
All other functions in a module will have the ``@jit`` decorator automatically
applied to them. See the following section for an example use case.
.. note:: This feature is for use by module authors. ``jit_module`` should not
be called outside the context of a module containing functions to be jitted.
Example usage
=============
Let's assume we have a Python module we've created, ``mymodule.py`` (shown
below), which contains several functions. Some of these functions are defined
in ``mymodule.py`` while others are imported from other modules. We wish to have
all the functions which are defined in ``mymodule.py`` jitted using
``jit_module``.
.. _jit-module-usage:
.. code-block:: python

    # mymodule.py
    from numba import jit, jit_module

    def inc(x):
        return x + 1

    def add(x, y):
        return x + y

    import numpy as np

    # Use NumPy's mean function
    mean = np.mean

    @jit(nogil=True)
    def mul(a, b):
        return a * b

    jit_module(nopython=True, error_model="numpy")

    def div(a, b):
        return a / b
There are several things to note in the above example:
- Both the ``inc`` and ``add`` functions will be replaced with their
``jit``-wrapped equivalents with :ref:`compilation options <jit-options>`
``nopython=True`` and ``error_model="numpy"``.
- The ``mean`` function, because it's defined *outside* of ``mymodule.py`` in
NumPy, will not be modified.
- ``mul`` will not be modified because it has been manually decorated with
``jit``.
- ``div`` will not be automatically ``jit``-wrapped because it is declared
after ``jit_module`` is called.
When the above module is imported, we have:
.. code-block:: python
>>> import mymodule
>>> mymodule.inc
CPUDispatcher(<function inc at 0x1032f86a8>)
>>> mymodule.mean
<function mean at 0x1096b8950>
API
===
.. warning:: This feature is experimental. The supported features may change
with or without notice.
.. autofunction:: numba.jit_module
.. _jit:
===================================
Compiling Python code with ``@jit``
===================================
Numba provides several utilities for code generation, but its central
feature is the :func:`numba.jit` decorator. Using this decorator, you can mark
a function for optimization by Numba's JIT compiler. Various invocation
modes trigger differing compilation options and behaviours.
Basic usage
===========
.. _jit-lazy:
Lazy compilation
----------------
The recommended way to use the ``@jit`` decorator is to let Numba decide
when and how to optimize::

    from numba import jit

    @jit
    def f(x, y):
        # A somewhat trivial example
        return x + y
In this mode, compilation will be deferred until the first function
execution. Numba will infer the argument types at call time, and generate
optimized code based on this information. Numba will also be able to
compile separate specializations depending on the input types. For example,
calling the ``f()`` function above with integer or complex numbers will
generate different code paths::
>>> f(1, 2)
3
>>> f(1j, 2)
(2+1j)
Eager compilation
-----------------
You can also tell Numba the function signature you are expecting. The
function ``f()`` would now look like::

    from numba import jit, int32

    @jit(int32(int32, int32))
    def f(x, y):
        # A somewhat trivial example
        return x + y
``int32(int32, int32)`` is the function's signature. In this case, the
corresponding specialization will be compiled by the ``@jit`` decorator,
and no other specialization will be allowed. This is useful if you want
fine-grained control over types chosen by the compiler (for example,
to use single-precision floats).
If you omit the return type, e.g. by writing ``(int32, int32)`` instead of
``int32(int32, int32)``, Numba will try to infer it for you. Function
signatures can also be strings, and you can pass several of them as a list;
see the :func:`numba.jit` documentation for more details.
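For illustration, here is a minimal sketch (not part of the original example)
showing signatures written as strings and several signatures passed as a list,
each of which triggers an eager specialization::

    from numba import jit

    # each signature in the list is compiled ahead of the first call
    @jit(["int32(int32, int32)", "float64(float64, float64)"])
    def f(x, y):
        return x + y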
Of course, the compiled function gives the expected results::
>>> f(1,2)
3
and if we specified ``int32`` as return type, the higher-order bits get
discarded::
>>> f(2**31, 2**31 + 1)
1
Calling and inlining other functions
====================================
Numba-compiled functions can call other compiled functions. The function
calls may even be inlined in the native code, depending on optimizer
heuristics. For example::

    import math

    from numba import jit

    @jit
    def square(x):
        return x ** 2

    @jit
    def hypot(x, y):
        return math.sqrt(square(x) + square(y))
The ``@jit`` decorator *must* be added to any such library function,
otherwise Numba may generate much slower code.
Signature specifications
========================
Explicit ``@jit`` signatures can use a number of types. Here are some
common ones:
* ``void`` is the return type of functions returning nothing (which
actually return :const:`None` when called from Python)
* ``intp`` and ``uintp`` are pointer-sized integers (signed and unsigned,
respectively)
* ``intc`` and ``uintc`` are equivalent to C ``int`` and ``unsigned int``
integer types
* ``int8``, ``uint8``, ``int16``, ``uint16``, ``int32``, ``uint32``,
``int64``, ``uint64`` are fixed-width integers of the corresponding bit
width (signed and unsigned)
* ``float32`` and ``float64`` are single- and double-precision floating-point
numbers, respectively
* ``complex64`` and ``complex128`` are single- and double-precision complex
numbers, respectively
* array types can be specified by indexing any numeric type, e.g. ``float32[:]``
for a one-dimensional single-precision array or ``int8[:,:]`` for a
two-dimensional array of 8-bit integers.
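As a hedged sketch of an array-typed signature (the ``total`` function below is
purely illustrative), a one-dimensional ``float64`` array argument and a scalar
``float64`` return type can be declared as follows::

    from numba import jit, float64

    # float64[:] declares a one-dimensional float64 array argument
    @jit(float64(float64[:]))
    def total(arr):
        acc = 0.0
        for i in range(arr.shape[0]):
            acc += arr[i]
        return acc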
.. _jit-options:
Compilation options
===================
A number of keyword-only arguments can be passed to the ``@jit`` decorator.
.. _jit-nopython:
``nopython``
------------
Numba has two compilation modes: :term:`nopython mode` and
:term:`object mode`. The former produces much faster code, but has
limitations that can force Numba to fall back to the latter. To prevent
Numba from falling back, and instead raise an error, pass ``nopython=True``.
::

    @jit(nopython=True)
    def f(x, y):
        return x + y
.. seealso:: :ref:`numba-troubleshooting`
.. _jit-nogil:
``nogil``
---------
Whenever Numba optimizes Python code to native code that only works on
native types and variables (rather than Python objects), it is not necessary
anymore to hold Python's :py:term:`global interpreter lock` (GIL).
Numba will release the GIL when entering such a compiled function if you
passed ``nogil=True``.
::

    @jit(nogil=True)
    def f(x, y):
        return x + y
Code running with the GIL released runs concurrently with other
threads executing Python or Numba code (either the same compiled function,
or another one), allowing you to take advantage of multi-core systems.
This will not be possible if the function is compiled in :term:`object mode`.
When using ``nogil=True``, you'll have to be wary of the usual pitfalls
of multi-threaded programming (consistency, synchronization, race conditions,
etc.).
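To illustrate, here is a minimal sketch (the ``chunk_sum`` function and the use
of ``concurrent.futures`` are illustrative assumptions, not part of the original
text) of running a ``nogil`` compiled function from several Python threads::

    from concurrent.futures import ThreadPoolExecutor

    import numpy as np
    from numba import jit

    @jit(nopython=True, nogil=True)
    def chunk_sum(a):
        total = 0.0
        for i in range(a.shape[0]):
            total += a[i]
        return total

    data = np.random.rand(4_000_000)
    chunks = np.array_split(data, 4)
    with ThreadPoolExecutor(max_workers=4) as pool:
        # the GIL is released inside chunk_sum, so the chunks are summed concurrently
        results = list(pool.map(chunk_sum, chunks))
    print(sum(results))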
.. _jit-cache:
``cache``
---------
To avoid compilation times each time you invoke a Python program,
you can instruct Numba to write the result of function compilation into
a file-based cache. This is done by passing ``cache=True``::

    @jit(cache=True)
    def f(x, y):
        return x + y
.. note::
Caching of compiled functions has several known limitations:
- The caching of compiled functions is not performed on a
function-by-function basis. The cached function is the main jit
function, and all secondary functions (those called by the main
function) are incorporated in the cache of the main function.
- Cache invalidation fails to recognize changes in functions defined in a
different file. This means that when a main jit function calls
functions that were imported from a different module, a change in those
other modules will not be detected and the cache will not be updated.
This carries the risk that "old" function code might be used in the
calculations.
- Global variables are treated as constants. The cache will remember the value
of the global variable at compilation time. On cache load, the cached
function will not rebind to the new value of the global variable.
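As a hedged illustration of the last point (the names below are hypothetical),
a global captured by a cached function keeps the value it had when the machine
code was compiled and written to the cache::

    from numba import jit

    FACTOR = 10  # treated as a compile-time constant by the compiled function

    @jit(cache=True)
    def scale(x):
        return x * FACTOR

    # The first run compiles and populates the on-disk cache.  If FACTOR is
    # later edited and the program re-run, the cached machine code is loaded
    # and still uses the old value of FACTOR rather than rebinding to the new one.
    print(scale(2))  # 20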
.. _parallel_jit_option:
``parallel``
------------
Enables automatic parallelization (and related optimizations) for those
operations in the function known to have parallel semantics. For a list of
supported operations, see :ref:`numba-parallel`. This feature is enabled by
passing ``parallel=True`` and must be used in conjunction with
``nopython=True``::

    @jit(nopython=True, parallel=True)
    def f(x, y):
        return x + y
.. seealso:: :ref:`numba-parallel`
.. _jitclass:
===========================================
Compiling Python classes with ``@jitclass``
===========================================
.. note::
This is an early version of jitclass support. Not all compiling features are
exposed or implemented yet.
Numba supports code generation for classes via the
:func:`numba.experimental.jitclass` decorator. A class can be marked for
optimization using this decorator along with a specification of the types of
each field. We call the resulting class object a *jitclass*. All methods of a
jitclass are compiled into nopython functions. The data of a jitclass instance
is allocated on the heap as a C-compatible structure so that any compiled
functions can have direct access to the underlying data, bypassing the
interpreter.
Basic usage
===========
Here's an example of a jitclass:
.. literalinclude:: ../../../numba/tests/doc_examples/test_jitclass.py
    :language: python
    :start-after: magictoken.ex_jitclass.begin
    :end-before: magictoken.ex_jitclass.end
    :dedent: 8
In the above example, a ``spec`` is provided as a list of 2-tuples. The tuples
contain the name of the field and the Numba type of the field. Alternatively,
a user can use a dictionary (preferably an ``OrderedDict`` for stable field
ordering) that maps field names to types.
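For example, a minimal sketch of a dictionary-based ``spec`` (the
``Accumulator`` class and its fields are purely illustrative) might look like:

.. code-block:: python

    from collections import OrderedDict

    import numpy as np
    from numba import float64, int32
    from numba.experimental import jitclass

    # the same information as a list of 2-tuples, expressed as a mapping
    spec = OrderedDict()
    spec["count"] = int32
    spec["values"] = float64[:]

    @jitclass(spec)
    class Accumulator:
        def __init__(self, n):
            self.count = n
            self.values = np.zeros(n)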
The definition of the class requires at least an ``__init__`` method for
initializing each defined field. Uninitialized fields contain garbage data.
Methods and properties (getters and setters only) can be defined. They will be
automatically compiled.
Inferred class member types from type annotations with ``as_numba_type``
========================================================================
Fields of a ``jitclass`` can also be inferred from Python type annotations.
.. literalinclude:: ../../../numba/tests/doc_examples/test_jitclass.py
    :language: python
    :start-after: magictoken.ex_jitclass_type_hints.begin
    :end-before: magictoken.ex_jitclass_type_hints.end
    :dedent: 8
Any type annotations on the class will be used to extend the spec if that field
is not already present. The Numba type corresponding to the given Python type
is inferred using ``as_numba_type``. For example, if we have the class
.. code-block:: python
@jitclass([("w", int32), ("y", float64[:])])
class Foo:
w: int
x: float
y: np.ndarray
z: SomeOtherType
def __init__(self, w: int, x: float, y: np.ndarray, z: SomeOtherType):
...
then the full spec used for ``Foo`` will be:
* ``"w": int32`` (specified in the ``spec``)
* ``"x": float64`` (added from type annotation)
* ``"y": array(float64, 1d, A)`` (specified in the ``spec``)
* ``"z": numba.as_numba_type(SomeOtherType)`` (added from type annotation)
Here ``SomeOtherType`` could be any supported Python type (e.g.
``bool``, ``typing.Dict[int, typing.Tuple[float, float]]``, or another
``jitclass``).
Note that only type annotations on the class will be used to infer spec
elements. Method type annotations (e.g. those of ``__init__`` above) are
ignored.
Numba requires knowing the dtype and rank of NumPy arrays, which cannot
currently be expressed with type annotations. Because of this, NumPy arrays need
to be included in the ``spec`` explicitly.
Specifying ``numba.typed`` containers as class members explicitly
=================================================================
The following patterns demonstrate how to specify a ``numba.typed.Dict`` or
``numba.typed.List`` explicitly as part of the ``spec`` passed to ``jitclass``.
First, using explicit Numba types and explicit construction.
.. code-block:: python

    from numba import types, typed
    from numba.experimental import jitclass

    # key and value types
    kv_ty = (types.int64, types.unicode_type)

    # A container class with:
    # * member 'd' holding a typed dictionary of int64 -> unicode string (kv_ty)
    # * member 'l' holding a typed list of float64
    @jitclass([('d', types.DictType(*kv_ty)),
               ('l', types.ListType(types.float64))])
    class ContainerHolder(object):
        def __init__(self):
            # initialize the containers
            self.d = typed.Dict.empty(*kv_ty)
            self.l = typed.List.empty_list(types.float64)

    container = ContainerHolder()
    container.d[1] = "apple"
    container.d[2] = "orange"
    container.l.append(123.)
    container.l.append(456.)
    print(container.d)  # {1: apple, 2: orange}
    print(container.l)  # [123.0, 456.0]
Another useful pattern is to use the ``numba.typed`` container attribute
``_numba_type_`` to find the type of a container; this attribute can be accessed
directly from an instance of the container in the Python interpreter. The same
information can be obtained by calling :func:`numba.typeof` on the instance. For
example:
.. code-block:: python

    from numba import typed, typeof
    from numba.experimental import jitclass

    d = typed.Dict()
    d[1] = "apple"
    d[2] = "orange"

    l = typed.List()
    l.append(123.)
    l.append(456.)

    @jitclass([('d', typeof(d)), ('l', typeof(l))])
    class ContainerInstHolder(object):
        def __init__(self, dict_inst, list_inst):
            self.d = dict_inst
            self.l = list_inst

    container = ContainerInstHolder(d, l)
    print(container.d)  # {1: apple, 2: orange}
    print(container.l)  # [123.0, 456.0]
It is worth noting that the instance of the container in a ``jitclass`` must be
initialized before use. For example, the following causes an invalid memory
access because ``self.d`` is written to without ``d`` first being initialized as
a ``typed.Dict`` instance of the specified type.
.. code-block:: python

    from numba import types
    from numba.experimental import jitclass

    dict_ty = types.DictType(types.int64, types.unicode_type)

    @jitclass([('d', dict_ty)])
    class NotInitialisingContainer(object):
        def __init__(self):
            self.d[10] = "apple"  # this is invalid, `d` is not initialized

    NotInitialisingContainer()  # segmentation fault/memory access violation
Supported operations
====================
The following operations of jitclasses work in both the interpreter and Numba
compiled functions:
* calling the jitclass class object to construct a new instance
(e.g. ``mybag = Bag(123)``);
* read/write access to attributes and properties (e.g. ``mybag.value``);
* calling methods (e.g. ``mybag.increment(3)``);
* calling static methods as instance attributes (e.g. ``mybag.add(1, 1)``);
* calling static methods as class attributes (e.g. ``Bag.add(1, 2)``);
* using select dunder methods (e.g. ``__add__`` with ``mybag + otherbag``);
Using jitclasses in Numba compiled functions is more efficient.
Short methods can be inlined (at the discretion of the LLVM inliner).
Attribute access is simply a read from a C structure.
Using jitclasses from the interpreter has the same overhead as calling any
Numba compiled function from the interpreter: arguments and return values
must be unboxed or boxed between Python objects and the native representation.
Values encapsulated by a jitclass do not get boxed into Python objects when
the jitclass instance is handed to the interpreter; it is during attribute
access that the field values are boxed.
Calling static methods as class attributes is only supported outside of the
class definition (i.e. code cannot call ``Bag.add()`` from within another method
of ``Bag``).
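The following minimal sketch (the ``Counter`` class here is hypothetical, not
the ``Bag`` example referenced above) shows a jitclass being constructed and
used from within a Numba compiled function:

.. code-block:: python

    from numba import float64, njit
    from numba.experimental import jitclass

    @jitclass([("value", float64)])
    class Counter:
        def __init__(self, value):
            self.value = value

        def increment(self, by):
            self.value += by
            return self.value

    @njit
    def run(start):
        # construct and manipulate the jitclass instance entirely in compiled code
        c = Counter(start)
        c.increment(3.0)
        return c.value

    print(run(1.0))  # 4.0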
Supported dunder methods
------------------------
The following dunder methods may be defined for jitclasses:
* ``__abs__``
* ``__bool__``
* ``__complex__``
* ``__contains__``
* ``__float__``
* ``__getitem__``
* ``__hash__``
* ``__index__``
* ``__int__``
* ``__len__``
* ``__setitem__``
* ``__str__``
* ``__eq__``
* ``__ne__``
* ``__ge__``
* ``__gt__``
* ``__le__``
* ``__lt__``
* ``__add__``
* ``__floordiv__``
* ``__lshift__``
* ``__matmul__``
* ``__mod__``
* ``__mul__``
* ``__neg__``
* ``__pos__``
* ``__pow__``
* ``__rshift__``
* ``__sub__``
* ``__truediv__``
* ``__and__``
* ``__or__``
* ``__xor__``
* ``__iadd__``
* ``__ifloordiv__``
* ``__ilshift__``
* ``__imatmul__``
* ``__imod__``
* ``__imul__``
* ``__ipow__``
* ``__irshift__``
* ``__isub__``
* ``__itruediv__``
* ``__iand__``
* ``__ior__``
* ``__ixor__``
* ``__radd__``
* ``__rfloordiv__``
* ``__rlshift__``
* ``__rmatmul__``
* ``__rmod__``
* ``__rmul__``
* ``__rpow__``
* ``__rrshift__``
* ``__rsub__``
* ``__rtruediv__``
* ``__rand__``
* ``__ror__``
* ``__rxor__``
Refer to the `Python Data Model documentation
<https://docs.python.org/3/reference/datamodel.html>`_ for descriptions of
these methods.
Limitations
===========
* A jitclass class object is treated as a function (the constructor) inside
a Numba compiled function.
* ``isinstance()`` only works in the interpreter.
* Manipulating jitclass instances in the interpreter is not optimized, yet.
* Support for jitclasses is available on CPU only.
(Note: Support for GPU devices is planned for a future release.)
The decorator: ``@jitclass``
============================
.. autofunction:: numba.experimental.jitclass
Overview
========
Numba is a compiler for Python array and numerical functions that gives
you the power to speed up your applications with high performance
functions written directly in Python.
Numba generates optimized machine code from pure Python code using
the `LLVM compiler infrastructure <http://llvm.org/>`_. With a few simple
annotations, array-oriented and math-heavy Python code can be
just-in-time optimized to performance similar to that of C, C++ and Fortran,
without having to switch languages or Python interpreters.
Numba's main features are:
* :ref:`on-the-fly code generation <jit>` (at import time or runtime, at the
user's preference)
* native code generation for the CPU (default) and
:doc:`GPU hardware <../cuda/index>`
* integration with the Python scientific software stack (thanks to Numpy)
Here is what a Numba-optimized function, taking a Numpy array as argument,
might look like::

    @numba.jit
    def sum2d(arr):
        M, N = arr.shape
        result = 0.0
        for i in range(M):
            for j in range(N):
                result += arr[i,j]
        return result
.. Copyright (c) 2017 Intel Corporation
SPDX-License-Identifier: BSD-2-Clause
.. _numba-parallel:
=======================================
Automatic parallelization with ``@jit``
=======================================
Setting the :ref:`parallel_jit_option` option for :func:`~numba.jit` enables
a Numba transformation pass that attempts to automatically parallelize and
perform other optimizations on (part of) a function. At the moment, this
feature only works on CPUs.
Some operations inside a user defined function, e.g. adding a scalar value to
an array, are known to have parallel semantics. A user program may contain
many such operations and while each operation could be parallelized
individually, such an approach often has lackluster performance due to poor
cache behavior. Instead, with auto-parallelization, Numba attempts to
identify such operations in a user program, and fuse adjacent ones together,
to form one or more kernels that are automatically run in parallel.
The process is fully automated without modifications to the user program,
which is in contrast to Numba's :func:`~numba.vectorize` or
:func:`~numba.guvectorize` mechanism, where manual effort is required
to create parallel kernels.
.. _numba-parallel-supported:
Supported Operations
====================
In this section, we give a list of all the array operations that have
parallel semantics and which we attempt to parallelize.
#. All numba array operations that are supported by :ref:`case-study-array-expressions`,
which include common arithmetic functions between Numpy arrays, and between
arrays and scalars, as well as Numpy ufuncs. They are often called
`element-wise` or `point-wise` array operations:
* unary operators: ``+`` ``-`` ``~``
* binary operators: ``+`` ``-`` ``*`` ``/`` ``/?`` ``%`` ``|`` ``>>`` ``^`` ``<<`` ``&`` ``**`` ``//``
* comparison operators: ``==`` ``!=`` ``<`` ``<=`` ``>`` ``>=``
* :ref:`Numpy ufuncs <supported_ufuncs>` that are supported in :term:`nopython mode`.
* User defined :class:`~numba.DUFunc` through :func:`~numba.vectorize`.
#. Numpy reduction functions ``sum``, ``prod``, ``min``, ``max``, ``argmin``,
and ``argmax``. Also, array math functions ``mean``, ``var``, and ``std``.
#. Numpy array creation functions ``zeros``, ``ones``, ``arange``, ``linspace``,
and several random functions (rand, randn, ranf, random_sample, sample,
random, standard_normal, chisquare, weibull, power, geometric, exponential,
poisson, rayleigh, normal, uniform, beta, binomial, f, gamma, lognormal,
laplace, randint, triangular).
#. Numpy ``dot`` function between a matrix and a vector, or two vectors.
In all other cases, Numba's default implementation is used.
#. Multi-dimensional arrays are also supported for the above operations
when operands have matching dimension and size. The full semantics of
Numpy broadcast between arrays with mixed dimensionality or size is
not supported, nor is the reduction across a selected dimension.
#. Array assignment in which the target is an array selection using a slice
or a boolean array, and the value being assigned is either a scalar or
another selection where the slice range or bitarray are inferred to be
compatible.
#. The ``reduce`` operator of ``functools`` is supported for specifying parallel
reductions on 1D Numpy arrays but the initial value argument is mandatory.
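A minimal sketch of such a reduction (the ``array_sum`` function is illustrative
and assumes a 1D ``float64`` array) might look like::

    from functools import reduce

    import numpy as np
    from numba import njit

    @njit(parallel=True)
    def array_sum(a):
        # the initial value (0.0 here) is mandatory for parallel reductions
        return reduce(lambda x, y: x + y, a, 0.0)

    print(array_sum(np.arange(10.0)))  # 45.0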
.. _numba-prange:
Explicit Parallel Loops
========================
Another feature of the code transformation pass (when ``parallel=True``) is
support for explicit parallel loops. One can use Numba's ``prange`` instead of
``range`` to specify that a loop can be parallelized. The user is required to
make sure that the loop does not have cross iteration dependencies except for
supported reductions.
A reduction is inferred automatically if a variable is updated by a supported binary
function/operator using its previous value in the loop body. The following
functions/operators are supported: ``+=``, ``+``, ``-=``, ``-``, ``*=``,
``*``, ``/=``, ``/``, ``max()``, ``min()``.
The initial value of the reduction is inferred automatically for the
supported operators (i.e., not the ``max`` and ``min`` functions).
Note that the ``//=`` operator is not supported because
in the general case the result depends on the order in which the divisors are
applied. However, if all divisors are integers then the programmer may be
able to rewrite the ``//=`` reduction as a ``*=`` reduction followed by
a single floor division after the parallel region where the divisor is the
accumulated product.
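A hedged sketch of that rewrite, assuming all divisors are integers (the
function name and values are illustrative)::

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def floordiv_reduction(divisors):
        # accumulate the product of the divisors with a supported *= reduction ...
        prod = 1
        for i in prange(divisors.shape[0]):
            prod *= divisors[i]
        # ... then apply a single floor division after the parallel region
        return 1_000_000 // prod

    print(floordiv_reduction(np.array([2, 5, 10], dtype=np.int64)))  # 10000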
For the ``max`` and ``min`` functions, the reduction variable should hold the identity
value right before entering the ``prange`` loop. Reductions in this manner
are supported for scalars and for arrays of arbitrary dimensions.
The example below demonstrates a parallel loop with a
reduction (``A`` is a one-dimensional Numpy array)::

    from numba import njit, prange

    @njit(parallel=True)
    def prange_test(A):
        s = 0
        # Without "parallel=True" in the jit-decorator
        # the prange statement is equivalent to range
        for i in prange(A.shape[0]):
            s += A[i]
        return s
The following example demonstrates a product reduction on a two-dimensional array::

    from numba import njit, prange
    import numpy as np

    @njit(parallel=True)
    def two_d_array_reduction_prod(n):
        shp = (13, 17)
        result1 = 2 * np.ones(shp, np.int_)
        tmp = 2 * np.ones_like(result1)
        for i in prange(n):
            result1 *= tmp
        return result1
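The following sketch (illustrative, not from the original examples) shows a
``max`` reduction where the reduction variable holds the identity value
immediately before entering the ``prange`` loop::

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def prange_max(A):
        m = -np.inf  # identity value for max, set right before the prange loop
        for i in prange(A.shape[0]):
            m = max(m, A[i])
        return m

    print(prange_max(np.random.rand(1000)))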
.. note:: When using Python's ``range`` to induce a loop, Numba types the
induction variable as a signed integer. This is also the case for
Numba's ``prange`` when ``parallel=False``. However, for
``parallel=True``, if the range is identifiable as strictly positive,
the type of the induction variable will be ``uint64``. The impact of
a ``uint64`` induction variable is often most noticeable when
undertaking operations involving it and a signed integer. Under
Numba's type coercion rules, such a case will commonly result in the
operation producing a floating point result type.
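The following minimal sketch (hypothetical, and dependent on the type inference
of the Numba version in use) illustrates the effect described in the note
above::

    from numba import njit, prange

    @njit(parallel=True)
    def offset_sum(n):
        s = 0
        for i in prange(n):   # `i` may be typed as uint64 here
            s += i - 1        # mixing uint64 with a signed literal may coerce to float64
        return s

    # the result may come back as a float (e.g. 35.0) rather than an int
    print(offset_sum(10))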
Care should be taken, however, when reducing into slices or elements of an array
if the elements specified by the slice or index are written to simultaneously by
multiple parallel threads. The compiler may not detect such cases and then a race condition
would occur.
The following example demonstrates such a case where a race condition in the execution of the
parallel for-loop results in an incorrect return value::

    from numba import njit, prange
    import numpy as np

    @njit(parallel=True)
    def prange_wrong_result(x):
        n = x.shape[0]
        y = np.zeros(4)
        for i in prange(n):
            # accumulating into the same element of `y` from different
            # parallel iterations of the loop results in a race condition
            y[:] += x[i]
        return y
as does the following example where the accumulating element is explicitly specified::

    from numba import njit, prange
    import numpy as np

    @njit(parallel=True)
    def prange_wrong_result(x):
        n = x.shape[0]
        y = np.zeros(4)
        for i in prange(n):
            # accumulating into the same element of `y` from different
            # parallel iterations of the loop results in a race condition
            y[i % 4] += x[i]
        return y
whereas performing a whole array reduction is fine::

    from numba import njit, prange
    import numpy as np

    @njit(parallel=True)
    def prange_ok_result_whole_arr(x):
        n = x.shape[0]
        y = np.zeros(4)
        for i in prange(n):
            y += x[i]
        return y
as is creating a slice reference outside of the parallel reduction loop::

    from numba import njit, prange
    import numpy as np

    @njit(parallel=True)
    def prange_ok_result_outer_slice(x):
        n = x.shape[0]
        y = np.zeros(4)
        z = y[:]
        for i in prange(n):
            z += x[i]
        return y
Examples
========
In this section, we give an example of how this feature helps
parallelize Logistic Regression::

    @numba.jit(nopython=True, parallel=True)
    def logistic_regression(Y, X, w, iterations):
        for i in range(iterations):
            w -= np.dot(((1.0 / (1.0 + np.exp(-Y * np.dot(X, w))) - 1.0) * Y), X)
        return w
We will not discuss details of the algorithm, but instead focus on how
this program behaves with auto-parallelization:
1. Input ``Y`` is a vector of size ``N``, ``X`` is an ``N x D`` matrix,
and ``w`` is a vector of size ``D``.
2. The function body is an iterative loop that updates variable ``w``.
The loop body consists of a sequence of vector and matrix operations.
3. The inner ``dot`` operation produces a vector of size ``N``, followed by a
sequence of arithmetic operations either between a scalar and vector of
size ``N``, or two vectors both of size ``N``.
4. The outer ``dot`` produces a vector of size ``D``, followed by an inplace
array subtraction on variable ``w``.
5. With auto-parallelization, all operations that produce array of size
``N`` are fused together to become a single parallel kernel. This includes
the inner ``dot`` operation and all point-wise array operations following it.
6. The outer ``dot`` operation produces a result array of different dimension,
and is not fused with the above kernel.
Here, the only thing required to take advantage of parallel hardware is to set
the :ref:`parallel_jit_option` option for :func:`~numba.jit`, with no
modifications to the ``logistic_regression`` function itself. If we were to
give an equivalent parallel implementation using :func:`~numba.guvectorize`,
it would require a pervasive change that rewrites the code to extract kernel
computation that can be parallelized, which is both tedious and challenging.
Unsupported Operations
======================
This section contains a non-exhaustive list of commonly encountered but
currently unsupported features:
#. **Mutating a list is not threadsafe**
Concurrent write operations on container types (i.e. lists, sets and
dictionaries) in a ``prange`` parallel region are not threadsafe e.g.::

    @njit(parallel=True)
    def invalid():
        z = []
        for i in prange(10000):
            z.append(i)
        return z
It is highly likely that the above will result in corruption or an access
violation as containers require thread-safety under mutation but this feature
is not implemented.
#. **Induction variables are not associated with thread ID**
The use of the induction variable induced by a ``prange`` based loop in
conjunction with ``get_num_threads`` as a method of ensuring safe writes into
a pre-sized container is not valid e.g.::

    @njit(parallel=True)
    def invalid():
        n = get_num_threads()
        z = [0 for _ in range(n)]
        for i in prange(100):
            z[i % n] += i
        return z
The above can on occasion appear to work, but it does so by luck. There's no
guarantee about which indexes are assigned to which executing threads or the
order in which the loop iterations execute.
.. _numba-parallel-diagnostics:
Diagnostics
===========
.. note:: At present not all parallel transforms and functions can be tracked
through the code generation process. Occasionally diagnostics about
some loops or transforms may be missing.
The :ref:`parallel_jit_option` option for :func:`~numba.jit` can produce
diagnostic information about the transforms undertaken in automatically
parallelizing the decorated code. This information can be accessed in two ways,
the first is by setting the environment variable
:envvar:`NUMBA_PARALLEL_DIAGNOSTICS`, the second is by calling
:meth:`~Dispatcher.parallel_diagnostics`, both methods give the same information
and print to ``STDOUT``. The level of verbosity in the diagnostic information is
controlled by an integer argument of value between 1 and 4 inclusive, 1 being
the least verbose and 4 the most. For example::

    @njit(parallel=True)
    def test(x):
        n = x.shape[0]
        a = np.sin(x)
        b = np.cos(a * a)
        acc = 0
        for i in prange(n - 2):
            for j in prange(n - 1):
                acc += b[i] + b[j + 1]
        return acc

    test(np.arange(10))
    test.parallel_diagnostics(level=4)
produces::
================================================================================
======= Parallel Accelerator Optimizing: Function test, example.py (4) =======
================================================================================
Parallel loop listing for Function test, example.py (4)
--------------------------------------|loop #ID
@njit(parallel=True) |
def test(x): |
n = x.shape[0] |
a = np.sin(x)---------------------| #0
b = np.cos(a * a)-----------------| #1
acc = 0 |
for i in prange(n - 2):-----------| #3
for j in prange(n - 1):-------| #2
acc += b[i] + b[j + 1] |
return acc |
--------------------------------- Fusing loops ---------------------------------
Attempting fusion of parallel loops (combines loops with similar properties)...
Trying to fuse loops #0 and #1:
- fusion succeeded: parallel for-loop #1 is fused into for-loop #0.
Trying to fuse loops #0 and #3:
- fusion failed: loop dimension mismatched in axis 0. slice(0, x_size0.1, 1)
!= slice(0, $40.4, 1)
----------------------------- Before Optimization ------------------------------
Parallel region 0:
+--0 (parallel)
+--1 (parallel)
Parallel region 1:
+--3 (parallel)
+--2 (parallel)
--------------------------------------------------------------------------------
------------------------------ After Optimization ------------------------------
Parallel region 0:
+--0 (parallel, fused with loop(s): 1)
Parallel region 1:
+--3 (parallel)
+--2 (serial)
Parallel region 0 (loop #0) had 1 loop(s) fused.
Parallel region 1 (loop #3) had 0 loop(s) fused and 1 loop(s) serialized as part
of the larger parallel loop (#3).
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
---------------------------Loop invariant code motion---------------------------
Instruction hoisting:
loop #0:
Failed to hoist the following:
dependency: $arg_out_var.10 = getitem(value=x, index=$parfor__index_5.99)
dependency: $0.6.11 = getattr(value=$0.5, attr=sin)
dependency: $expr_out_var.9 = call $0.6.11($arg_out_var.10, func=$0.6.11, args=[Var($arg_out_var.10, example.py (7))], kws=(), vararg=None)
dependency: $arg_out_var.17 = $expr_out_var.9 * $expr_out_var.9
dependency: $0.10.20 = getattr(value=$0.9, attr=cos)
dependency: $expr_out_var.16 = call $0.10.20($arg_out_var.17, func=$0.10.20, args=[Var($arg_out_var.17, example.py (8))], kws=(), vararg=None)
loop #3:
Has the following hoisted:
$const58.3 = const(int, 1)
$58.4 = _n_23 - $const58.3
--------------------------------------------------------------------------------
To aid users unfamiliar with the transforms undertaken when the
:ref:`parallel_jit_option` option is used, and to assist in the understanding of
the subsequent sections, the following definitions are provided:
* Loop fusion
`Loop fusion <https://en.wikipedia.org/wiki/Loop_fission_and_fusion>`_ is a
technique whereby loops with equivalent bounds may be combined under certain
conditions to produce a loop with a larger body (aiming to improve data
locality).
* Loop serialization
Loop serialization occurs when any number of ``prange`` driven loops are
present inside another ``prange`` driven loop. In this case the outermost
of all the ``prange`` loops executes in parallel and any inner ``prange``
loops (nested or otherwise) are treated as standard ``range`` based loops.
Essentially, nested parallelism does not occur.
* Loop invariant code motion
`Loop invariant code motion
<https://en.wikipedia.org/wiki/Loop-invariant_code_motion>`_ is an
optimization technique that analyses a loop to look for statements that can
be moved outside the loop body without changing the result of executing the
loop, these statements are then "hoisted" out of the loop to save repeated
computation.
* Allocation hoisting
Allocation hoisting is a specialized case of loop invariant code motion that
is possible due to the design of some common NumPy allocation methods.
Explanation of this technique is best driven by an example:
.. code-block:: python

    @njit(parallel=True)
    def test(n):
        for i in prange(n):
            temp = np.zeros((50, 50))  # <--- Allocate a temporary array with np.zeros()
            for j in range(50):
                temp[j, j] = i
            # ...do something with temp
internally, this is transformed to approximately the following:
.. code-block:: python

    @njit(parallel=True)
    def test(n):
        for i in prange(n):
            temp = np.empty((50, 50))  # <--- np.zeros() is rewritten as np.empty()
            temp[:] = 0                # <--- and then a zero initialisation
            for j in range(50):
                temp[j, j] = i
            # ...do something with temp
then after hoisting:
.. code-block:: python

    @njit(parallel=True)
    def test(n):
        temp = np.empty((50, 50))  # <--- allocation is hoisted as a loop invariant as `np.empty` is considered pure
        for i in prange(n):
            temp[:] = 0            # <--- this remains as assignment is a side effect
            for j in range(50):
                temp[j, j] = i
            # ...do something with temp
it can be seen that the ``np.zeros`` allocation is split into an allocation
and an assignment, and then the allocation is hoisted out of the loop in
``i``, thus producing more efficient code as the allocation only occurs
once.
The parallel diagnostics report sections
----------------------------------------
The report is split into the following sections:
#. Code annotation
This is the first section and contains the source code of the decorated
function with loops that have parallel semantics identified and enumerated.
The ``loop #ID`` column on the right of the source code lines up with
identified parallel loops. From the example, ``#0`` is ``np.sin``, ``#1``
is ``np.cos`` and ``#2`` and ``#3`` are ``prange()``:
.. code-block:: python
Parallel loop listing for Function test, example.py (4)
--------------------------------------|loop #ID
@njit(parallel=True) |
def test(x): |
n = x.shape[0] |
a = np.sin(x)---------------------| #0
b = np.cos(a * a)-----------------| #1
acc = 0 |
for i in prange(n - 2):-----------| #3
for j in prange(n - 1):-------| #2
acc += b[i] + b[j + 1] |
return acc |
It is worth noting that the loop IDs are enumerated in the order they are
discovered which is not necessarily the same order as present in the source.
Further, it should also be noted that the parallel transforms use a static
counter for loop ID indexing. As a consequence it is possible for the loop
ID index to not start at 0 due to use of the same counter for internal
optimizations/transforms taking place that are invisible to the user.
#. Fusing loops
This section describes the attempts made at fusing discovered
loops noting which succeeded and which failed. In the case of failure to
fuse a reason is given (e.g. dependency on other data). From the example:
.. code-block:: text
--------------------------------- Fusing loops ---------------------------------
Attempting fusion of parallel loops (combines loops with similar properties)...
Trying to fuse loops #0 and #1:
- fusion succeeded: parallel for-loop #1 is fused into for-loop #0.
Trying to fuse loops #0 and #3:
- fusion failed: loop dimension mismatched in axis 0. slice(0, x_size0.1, 1)
!= slice(0, $40.4, 1)
It can be seen that fusion of loops ``#0`` and ``#1`` was attempted and this
succeeded (both are based on the same dimensions of ``x``). Following the
successful fusion of ``#0`` and ``#1``, fusion was attempted between ``#0``
(now including the fused ``#1`` loop) and ``#3``. This fusion failed because
there is a loop dimension mismatch, ``#0`` is size ``x.shape`` whereas
``#3`` is size ``x.shape[0] - 2``.
#. Before Optimization
This section shows the structure of the parallel regions in the code before
any optimization has taken place, but with loops associated with their final
parallel region (this is to make before/after optimization output directly
comparable). Multiple parallel regions may exist if there are loops which
cannot be fused, in this case code within each region will execute in
parallel, but each parallel region will run sequentially. From the example:
.. code-block:: text
Parallel region 0:
+--0 (parallel)
+--1 (parallel)
Parallel region 1:
+--3 (parallel)
+--2 (parallel)
As alluded to by the `Fusing loops` section, there are necessarily two
parallel regions in the code. The first contains loops ``#0`` and ``#1``,
the second contains ``#3`` and ``#2``, all loops are marked ``parallel`` as
no optimization has taken place yet.
#. After Optimization
This section shows the structure of the parallel regions in the code after
optimization has taken place. Again, parallel regions are enumerated with
their corresponding loops but this time loops which are fused or serialized
are noted and a summary is presented. From the example:
.. code-block:: text
Parallel region 0:
+--0 (parallel, fused with loop(s): 1)
Parallel region 1:
+--3 (parallel)
+--2 (serial)
Parallel region 0 (loop #0) had 1 loop(s) fused.
Parallel region 1 (loop #3) had 0 loop(s) fused and 1 loop(s) serialized as part
of the larger parallel loop (#3).
It can be noted that parallel region 0 contains loop ``#0`` and, as seen in
the `fusing loops` section, loop ``#1`` is fused into loop ``#0``. It can
also be noted that parallel region 1 contains loop ``#3`` and that loop
``#2`` (the inner ``prange()``) has been serialized for execution in the
body of loop ``#3``.
#. Loop invariant code motion
This section shows for each loop, after optimization has occurred:
* the instructions that failed to be hoisted and the reason for failure
(dependency/impure).
* the instructions that were hoisted.
* any allocation hoisting that may have occurred.
From the example:
.. code-block:: text
Instruction hoisting:
loop #0:
Failed to hoist the following:
dependency: $arg_out_var.10 = getitem(value=x, index=$parfor__index_5.99)
dependency: $0.6.11 = getattr(value=$0.5, attr=sin)
dependency: $expr_out_var.9 = call $0.6.11($arg_out_var.10, func=$0.6.11, args=[Var($arg_out_var.10, example.py (7))], kws=(), vararg=None)
dependency: $arg_out_var.17 = $expr_out_var.9 * $expr_out_var.9
dependency: $0.10.20 = getattr(value=$0.9, attr=cos)
dependency: $expr_out_var.16 = call $0.10.20($arg_out_var.17, func=$0.10.20, args=[Var($arg_out_var.17, example.py (8))], kws=(), vararg=None)
loop #3:
Has the following hoisted:
$const58.3 = const(int, 1)
$58.4 = _n_23 - $const58.3
The first thing to note is that this information is for advanced users as it
refers to the :term:`Numba IR` of the function being transformed. As an
example, the expression ``a * a`` in the example source partly translates to
the expression ``$arg_out_var.17 = $expr_out_var.9 * $expr_out_var.9`` in
the IR, this clearly cannot be hoisted out of ``loop #0`` because it is not
loop invariant! Whereas in ``loop #3``, the expression
``$const58.3 = const(int, 1)`` comes from the source ``b[j + 1]``, the
number ``1`` is clearly a constant and so can be hoisted out of the loop.
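To reproduce a report like the one dissected above, a minimal sketch is shown
below; it assumes the dispatcher's ``parallel_diagnostics`` method (with a
verbosity ``level`` between 1 and 4) emits the report once the function has
been compiled::

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def test(x):
        n = x.shape[0]
        a = np.sin(x)
        b = np.cos(a * a)
        acc = 0
        for i in prange(n - 2):
            for j in prange(n - 1):
                acc += b[i] + b[j + 1]
        return acc

    test(np.arange(10.))

    # emit the most verbose form of the diagnostics report to STDOUT
    test.parallel_diagnostics(level=4)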
.. _numba-parallel-scheduling:
Scheduling
==========
By default, Numba divides the iterations of a parallel region into approximately equal
sized chunks and gives one such chunk to each configured thread.
(See :ref:`setting_the_number_of_threads`).
This scheduling approach is equivalent to OpenMP's static schedule with no specified
chunk size and is appropriate when the work required for each iteration is nearly constant.
Conversely, if the work required per iteration varies significantly, as in the
``prange`` loop below, then this static scheduling approach can lead to load
imbalances and longer execution times.
.. literalinclude:: ../../../numba/tests/doc_examples/test_parallel_chunksize.py
:language: python
:caption: from ``test_unbalanced_example`` of ``numba/tests/doc_examples/test_parallel_chunksize.py``
:start-after: magictoken.ex_unbalanced.begin
:end-before: magictoken.ex_unbalanced.end
:dedent: 12
:linenos:
In such cases,
Numba provides a mechanism to control how many iterations of a parallel region
(i.e., the chunk size) go into each chunk.
Numba then computes the number of required chunks, which is equal to the number
of iterations divided by the chunk size, truncated to an integer. All of these
chunks are then approximately equally sized.
Numba then gives one such chunk to each configured
thread as above and when a thread finishes a chunk, Numba gives that thread the next
available chunk.
This scheduling approach is similar to OpenMP's dynamic scheduling
option with the specified chunk size.
(Note that Numba is only capable of supporting this dynamic scheduling
of parallel regions if the underlying Numba threading backend,
:ref:`numba-threading-layer`, is also capable of dynamic scheduling.
At the moment, only the ``tbb`` backend is capable of dynamic
scheduling and so is required if any performance
benefit is to be achieved from this chunk size selection mechanism.)
To minimize execution time, the programmer must
pick a chunk size that strikes a balance between greater load balancing with smaller
chunk sizes and less scheduling overhead with larger chunk sizes.
See :ref:`chunk-details-label` for additional details on the internal implementation
of chunk sizes.
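As a purely illustrative sketch of the arithmetic described above (all numbers
here are hypothetical)::

    # hypothetical values, purely to illustrate the chunk computation
    iterations = 1_000_000
    chunk_size = 64

    # integer (truncating) division gives the number of chunks that are
    # handed out to threads dynamically
    num_chunks = iterations // chunk_size   # -> 15625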
The number of iterations of a parallel region in a chunk is stored as a thread-local
variable and can be set using
:func:`numba.set_parallel_chunksize`. This function takes one integer parameter
whose value must be greater than
or equal to 0. A value of 0 is the default value and instructs Numba to use the
static scheduling approach above. Values greater than 0 instruct Numba to use that value
as the chunk size in the dynamic scheduling approach described above.
:func:`numba.set_parallel_chunksize` returns the previous value of the chunk size.
The current value of this thread local variable is used as the chunk size for all
subsequent parallel regions invoked by this thread.
However, upon entering a parallel region, Numba sets the chunk size thread local variable
for each of the threads executing that parallel region back to the default of 0,
since it is unlikely
that any nested parallel regions would require the same chunk size. If the same thread is
used to execute a sequential and parallel region then that thread's chunk size
variable is set to 0 at the beginning of the parallel region and restored to
its original value upon exiting the parallel region.
This behavior is demonstrated in ``func1`` in the example below in that the
reported chunk size inside the ``prange`` parallel region is 0 but is 4 outside
the parallel region. Note that if the ``prange`` is not executed in parallel for
any reason (e.g., setting ``parallel=False``) then the chunk size reported inside
the non-parallel prange would be reported as 4.
This behavior may initially be counterintuitive to programmers as it differs from
how thread local variables typically behave in other languages.
A programmer may use
the chunk size API described in this section within the threads executing a parallel
region if the programmer wishes to specify a chunk size for any nested parallel regions
that may be launched.
The current value of the parallel chunk size can be obtained by calling
:func:`numba.get_parallel_chunksize`.
Both of these functions can be used from standard Python and from within Numba JIT compiled functions
as shown below. Both invocations of ``func1`` would be executed with a chunk size of 4 whereas
``func2`` would use a chunk size of 8.
.. literalinclude:: ../../../numba/tests/doc_examples/test_parallel_chunksize.py
:language: python
:caption: from ``test_chunksize_manual`` of ``numba/tests/doc_examples/test_parallel_chunksize.py``
:start-after: magictoken.ex_chunksize_manual.begin
:end-before: magictoken.ex_chunksize_manual.end
:dedent: 12
:linenos:
Since this idiom of saving and restoring is so common, Numba provides the
:func:`parallel_chunksize` context manager, used in a ``with`` statement, to
simplify the idiom. As shown below, this context manager can be used from both
standard Python and within Numba JIT compiled functions. As with other Numba
context managers, be aware that raising exceptions is not supported from within
a context-managed block that is part of a Numba JIT compiled function.
.. literalinclude:: ../../../numba/tests/doc_examples/test_parallel_chunksize.py
:language: python
:caption: from ``test_chunksize_with`` of ``numba/tests/doc_examples/test_parallel_chunksize.py``
:start-after: magictoken.ex_chunksize_with.begin
:end-before: magictoken.ex_chunksize_with.end
:dedent: 12
:linenos:
Note that these functions to set the chunk size only have an effect on
Numba automatic parallelization with the :ref:`parallel_jit_option` option.
Chunk size specification has no effect on the :func:`~numba.vectorize` decorator
or the :func:`~numba.guvectorize` decorator.
.. seealso:: :ref:`parallel_jit_option`, :ref:`Parallel FAQs <parallel_FAQs>`
.. _performance-tips:
Performance Tips
================
This is a short guide to features present in Numba that can help with obtaining
the best performance from code. Two examples are used, both are entirely
contrived and exist purely for pedagogical reasons to motivate discussion.
The first is the computation of the trigonometric identity
``cos(x)^2 + sin(x)^2``, the second is a simple element wise square root of a
vector with reduction over summation. All performance numbers are indicative
only and unless otherwise stated were taken from running on an Intel ``i7-4790``
CPU (4 hardware threads) with an input of ``np.arange(1.e7)``.
.. note::
A reasonably effective approach to achieving high performance code is to
profile the code running with real data and use that to guide performance
tuning. The information presented here is to demonstrate features, not to act
as canonical guidance!
No Python mode vs Object mode
-----------------------------
A common pattern is to decorate functions with ``@jit`` as this is the most
flexible decorator offered by Numba. ``@jit`` essentially encompasses two modes
of compilation: first it will try to compile the decorated function in no
Python mode; if this fails it will try again to compile the function using
object mode. Whilst the use of looplifting in object mode can enable some
performance increase, getting functions to compile under no Python mode is
really the key to good performance. To make it such that only no Python mode is
used, and that an exception is raised if compilation fails, the decorators
``@njit`` and ``@jit(nopython=True)`` can be used (the first is an alias of the
second for convenience).
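For example, the following two decorations both request no Python mode only
and will raise an error if compilation in that mode fails (a minimal sketch)::

    from numba import jit, njit

    @njit                   # alias for @jit(nopython=True)
    def add(a, b):
        return a + b

    @jit(nopython=True)     # equivalent, spelled out explicitly
    def also_add(a, b):
        return a + b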
Loops
-----
Whilst NumPy has developed a strong idiom around the use of vector operations,
Numba is perfectly happy with loops too. For users familiar with C or Fortran,
writing Python in this style will work fine in Numba (after all, LLVM gets a
lot of use in compiling C lineage languages). For example::
@njit
def ident_np(x):
return np.cos(x) ** 2 + np.sin(x) ** 2
@njit
def ident_loops(x):
r = np.empty_like(x)
n = len(x)
for i in range(n):
r[i] = np.cos(x[i]) ** 2 + np.sin(x[i]) ** 2
return r
The above run at almost identical speeds when decorated with ``@njit``; without
the decorator the vectorized function is a couple of orders of magnitude faster.
+-----------------+-------+----------------+
| Function Name | @njit | Execution time |
+=================+=======+================+
| ``ident_np`` | No | 0.581s |
+-----------------+-------+----------------+
| ``ident_np`` | Yes | 0.659s |
+-----------------+-------+----------------+
| ``ident_loops`` | No | 25.2s |
+-----------------+-------+----------------+
| ``ident_loops`` | Yes | 0.670s |
+-----------------+-------+----------------+
.. _fast-math:
Fastmath
--------
In certain classes of applications strict IEEE 754 compliance is less
important. As a result it is possible to relax some numerical rigour with a
view to gaining additional performance. The way to achieve this behaviour in
Numba is through the use of the ``fastmath`` keyword argument::
@njit(fastmath=False)
def do_sum(A):
acc = 0.
# without fastmath, this loop must accumulate in strict order
for x in A:
acc += np.sqrt(x)
return acc
@njit(fastmath=True)
def do_sum_fast(A):
acc = 0.
# with fastmath, the reduction can be vectorized as floating point
# reassociation is permitted.
for x in A:
acc += np.sqrt(x)
return acc
+-----------------+-----------------+
| Function Name | Execution time |
+=================+=================+
| ``do_sum`` | 35.2 ms |
+-----------------+-----------------+
| ``do_sum_fast`` | 17.8 ms |
+-----------------+-----------------+
In some cases you may wish to opt-in to only a subset of possible fast-math
optimizations. This can be done by supplying a set of `LLVM fast-math flags
<https://llvm.org/docs/LangRef.html#fast-math-flags>`_ to ``fastmath``.::
def add_assoc(x, y):
return (x - y) + y
print(njit(fastmath=False)(add_assoc)(0, np.inf)) # nan
print(njit(fastmath=True) (add_assoc)(0, np.inf)) # 0.0
print(njit(fastmath={'reassoc', 'nsz'})(add_assoc)(0, np.inf)) # 0.0
print(njit(fastmath={'reassoc'}) (add_assoc)(0, np.inf)) # nan
print(njit(fastmath={'nsz'}) (add_assoc)(0, np.inf)) # nan
Parallel=True
-------------
If code contains operations that are parallelisable (:ref:`and supported
<numba-parallel-supported>`) Numba can compile a version that will run in
parallel on multiple native threads (no GIL!). This parallelisation is performed
automatically and is enabled by simply adding the ``parallel`` keyword
argument::
@njit(parallel=True)
def ident_parallel(x):
return np.cos(x) ** 2 + np.sin(x) ** 2
Execution times are as follows:
+--------------------+-----------------+
| Function Name | Execution time |
+====================+=================+
| ``ident_parallel`` | 112 ms |
+--------------------+-----------------+
The execution speed of this function with ``parallel=True`` present is
approximately 5x that of the NumPy equivalent and 6x that of standard
``@njit``.
Numba parallel execution also has support for explicit parallel loop
declaration similar to that in OpenMP. To indicate that a loop should be
executed in parallel the ``numba.prange`` function should be used, this function
behaves like Python ``range`` and if ``parallel=True`` is not set it acts
simply as an alias of ``range``. Loops induced with ``prange`` can be used for
embarrassingly parallel computation and also reductions.
Revisiting the reduce over sum example, assuming it is safe for the sum to be
accumulated out of order, the loop in ``n`` can be parallelised through the use
of ``prange``. Further, the ``fastmath=True`` keyword argument can be added
without concern in this case as the assumption that out of order execution is
valid has already been made through the use of ``parallel=True`` (as each thread
computes a partial sum).
::
@njit(parallel=True)
def do_sum_parallel(A):
# each thread can accumulate its own partial sum, and then a cross
# thread reduction is performed to obtain the result to return
n = len(A)
acc = 0.
for i in prange(n):
acc += np.sqrt(A[i])
return acc
@njit(parallel=True, fastmath=True)
def do_sum_parallel_fast(A):
n = len(A)
acc = 0.
for i in prange(n):
acc += np.sqrt(A[i])
return acc
Execution times are as follows; ``fastmath`` again improves performance.
+-------------------------+-----------------+
| Function Name | Execution time |
+=========================+=================+
| ``do_sum_parallel`` | 9.81 ms |
+-------------------------+-----------------+
| ``do_sum_parallel_fast``| 5.37 ms |
+-------------------------+-----------------+
.. _intel-svml:
Intel SVML
----------
Intel provides a short vector math library (SVML) that contains a large number
of optimised transcendental functions available for use as compiler
intrinsics. If the ``intel-cmplr-lib-rt`` package is present in the
environment (or the SVML libraries are simply locatable!) then Numba
automatically configures the LLVM back end to use the SVML intrinsic functions
wherever possible. SVML provides both high and low accuracy versions of each
intrinsic and the version that is used is determined through the use of the
``fastmath`` keyword. The default is to use high accuracy which is accurate to
within ``1 ULP``, however if ``fastmath`` is set to ``True`` then the lower
accuracy versions of the intrinsics are used (answers to within ``4 ULP``).
First obtain SVML, using conda for example::
conda install intel-cmplr-lib-rt
.. note::
The SVML library was previously provided through the ``icc_rt`` conda
package. The ``icc_rt`` package has since become a meta-package and as of
version ``2021.1.1`` it has ``intel-cmplr-lib-rt`` amongst other packages as
a dependency. Installing the recommended ``intel-cmplr-lib-rt`` package
directly results in fewer installed packages.
Rerunning the identity function example ``ident_np`` from above with various
combinations of options to ``@njit`` and with/without SVML yields the following
performance results (input size ``np.arange(1.e8)``). For reference, with just
NumPy the function executed in ``5.84s``:
+-----------------------------------+--------+-------------------+
| ``@njit`` kwargs | SVML | Execution time |
+===================================+========+===================+
| ``None`` | No | 5.95s |
+-----------------------------------+--------+-------------------+
| ``None`` | Yes | 2.26s |
+-----------------------------------+--------+-------------------+
| ``fastmath=True`` | No | 5.97s |
+-----------------------------------+--------+-------------------+
| ``fastmath=True`` | Yes | 1.8s |
+-----------------------------------+--------+-------------------+
| ``parallel=True`` | No | 1.36s |
+-----------------------------------+--------+-------------------+
| ``parallel=True`` | Yes | 0.624s |
+-----------------------------------+--------+-------------------+
| ``parallel=True, fastmath=True`` | No | 1.32s |
+-----------------------------------+--------+-------------------+
| ``parallel=True, fastmath=True`` | Yes | 0.576s |
+-----------------------------------+--------+-------------------+
It is evident that SVML significantly increases the performance of this
function. The impact of ``fastmath`` in the case of SVML not being present is
zero; this is expected as there is nothing in the original function that would
benefit from relaxing numerical strictness.
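To check whether SVML was actually detected, the ``numba -s`` diagnostic output
contains an SVML information section; alternatively, the following sketch
assumes (as an assumption, not a guarantee) that the ``numba.config.USING_SVML``
flag reflects the detection result::

    from numba import config

    # assumed attribute name: True if Numba detected SVML at start-up
    print(config.USING_SVML)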
Linear algebra
--------------
Numba supports most of ``numpy.linalg`` in no Python mode. The internal
implementation relies on a LAPACK and BLAS library to do the numerical work
and it obtains the bindings for the necessary functions from SciPy. Therefore,
to achieve good performance in ``numpy.linalg`` functions with Numba it is
necessary to use a SciPy built against a well optimised LAPACK/BLAS library.
In the case of the Anaconda distribution SciPy is built against Intel's MKL
which is highly optimised and as a result Numba makes use of this performance.
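As a short sketch of what this looks like in practice (any ``numpy.linalg``
routine supported in no Python mode would serve equally well)::

    import numpy as np
    from numba import njit

    @njit
    def solve(a, b):
        # dispatched to the LAPACK bindings obtained via SciPy
        return np.linalg.solve(a, b)

    a = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([9.0, 8.0])
    print(solve(a, b))   # [2. 3.]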
============================
Compiling code ahead of time
============================
.. _pycc:
While Numba's main use case is :term:`Just-in-Time compilation`, it also
provides a facility for :term:`Ahead-of-Time compilation` (AOT).
.. note:: To use this feature the ``setuptools`` package is required at
compilation time, but it is not a runtime dependency of the
extension module produced.
.. note:: This module is pending deprecation. Please see
:ref:`deprecation-numba-pycc` for more information.
Overview
========
Benefits
--------
#. AOT compilation produces a compiled extension module which does not depend
on Numba: you can distribute the module on machines which do not have
Numba installed (but NumPy is required).
#. There is no compilation overhead at runtime (but see the
``@jit`` :ref:`cache <jit-cache>` option), nor any overhead of importing
Numba.
.. seealso::
Compiled extension modules are discussed in the
`Python packaging user guide <https://packaging.python.org/en/latest/guides/packaging-binary-extensions/>`_.
Limitations
-----------
#. AOT compilation only allows for regular functions, not :term:`ufuncs <ufunc>`.
#. You have to specify function signatures explicitly.
#. Each exported function can have only one signature (but you can export
several different signatures under different names).
#. Exported functions do not check the types of the arguments that are passed
to them; the caller is expected to provide arguments of the correct type.
#. AOT compilation produces generic code for your CPU's architectural family
(for example "x86-64"), while JIT compilation produces code optimized
for your particular CPU model.
Usage
=====
Standalone example
------------------
::
from numba.pycc import CC
cc = CC('my_module')
# Uncomment the following line to print out the compilation steps
#cc.verbose = True
@cc.export('multf', 'f8(f8, f8)')
@cc.export('multi', 'i4(i4, i4)')
def mult(a, b):
return a * b
@cc.export('square', 'f8(f8)')
def square(a):
return a ** 2
if __name__ == "__main__":
cc.compile()
If you run this Python script, it will generate an extension module named
``my_module``. Depending on your platform, the actual filename may be
``my_module.so``, ``my_module.pyd``, ``my_module.cpython-34m.so``, etc.
The generated module has three functions: ``multf``, ``multi`` and ``square``.
``multi`` operates on 32-bit integers (``i4``), while ``multf`` and ``square``
operate on double-precision floats (``f8``)::
>>> import my_module
>>> my_module.multi(3, 4)
12
>>> my_module.square(1.414)
1.9993959999999997
Distutils integration
---------------------
You can also integrate the compilation step for your extension modules
in your ``setup.py`` script, using distutils or setuptools::
from distutils.core import setup
from source_module import cc
setup(...,
ext_modules=[cc.distutils_extension()])
The ``source_module`` above is the module defining the ``cc`` object.
Extensions compiled like this will be automatically included in the
build files for your Python project, so you can distribute them inside
binary packages such as wheels or Conda packages. Note that in the case of
using conda, the compilers used for AOT need to be those that are available
in the Anaconda distribution.
Signature syntax
----------------
The syntax for exported signatures is the same as in the ``@jit``
decorator. You can read more about it in the :ref:`types <numba-types>`
reference.
Here is an example of exporting an implementation of the second-order
centered difference on a 1d array::
@cc.export('centdiff_1d', 'f8[:](f8[:], f8)')
def centdiff_1d(u, dx):
D = np.empty_like(u)
D[0] = 0
D[-1] = 0
for i in range(1, len(D) - 1):
D[i] = (u[i+1] - 2 * u[i] + u[i-1]) / dx**2
return D
.. (example from http://nbviewer.ipython.org/gist/ketch/ae87a94f4ef0793d5d52)
You can also omit the return type, which will then be inferred by Numba::
@cc.export('centdiff_1d', '(f8[:], f8)')
def centdiff_1d(u, dx):
# Same code as above
...
.. Copyright (c) 2017 Intel Corporation
SPDX-License-Identifier: BSD-2-Clause
.. _numba-stencil:
================================
Using the ``@stencil`` decorator
================================
Stencils are a common computational pattern in which array elements
are updated according to some fixed pattern called the stencil kernel.
Numba provides the ``@stencil`` decorator so that users may
easily specify a stencil kernel and Numba then generates the looping
code necessary to apply that kernel to some input array. Thus, the
stencil decorator allows clearer, more concise code and in conjunction
with :ref:`the parallel jit option <parallel_jit_option>` enables higher
performance through parallelization of the stencil execution.
Basic usage
===========
An example use of the ``@stencil`` decorator::
from numba import stencil
@stencil
def kernel1(a):
return 0.25 * (a[0, 1] + a[1, 0] + a[0, -1] + a[-1, 0])
The stencil kernel is specified by what looks like a standard Python
function definition but there are different semantics with
respect to array indexing.
Stencils produce an output array of the same size and shape as the
input array although, depending on the kernel definition, it may have a
different type.
Conceptually, the stencil kernel is run once for each element in the
output array. The return value from the stencil kernel is the value
written into the output array for that particular element.
The parameter ``a`` represents the input array over which the
kernel is applied.
Indexing into this array takes place with respect to the current element
of the output array being processed. For example, if element ``(x, y)``
is being processed then ``a[0, 0]`` in the stencil kernel corresponds to
``a[x + 0, y + 0]`` in the input array. Similarly, ``a[-1, 1]`` in the stencil
kernel corresponds to ``a[x - 1, y + 1]`` in the input array.
Depending on the specified kernel, the kernel may not be applicable to the
borders of the output array as this may cause the input array to be
accessed out-of-bounds. The way in which the stencil decorator handles
this situation is dependent upon which :ref:`stencil-mode` is selected.
The default mode is for the stencil decorator to set the border elements
of the output array to zero.
To invoke a stencil on an input array, call the stencil as if it were
a regular function and pass the input array as the argument. For example, using
the kernel defined above::
>>> import numpy as np
>>> input_arr = np.arange(100).reshape((10, 10))
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
[40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
[50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
[60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
[70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
[80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
[90, 91, 92, 93, 94, 95, 96, 97, 98, 99]])
>>> output_arr = kernel1(input_arr)
array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 11., 12., 13., 14., 15., 16., 17., 18., 0.],
[ 0., 21., 22., 23., 24., 25., 26., 27., 28., 0.],
[ 0., 31., 32., 33., 34., 35., 36., 37., 38., 0.],
[ 0., 41., 42., 43., 44., 45., 46., 47., 48., 0.],
[ 0., 51., 52., 53., 54., 55., 56., 57., 58., 0.],
[ 0., 61., 62., 63., 64., 65., 66., 67., 68., 0.],
[ 0., 71., 72., 73., 74., 75., 76., 77., 78., 0.],
[ 0., 81., 82., 83., 84., 85., 86., 87., 88., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
>>> input_arr.dtype
dtype('int64')
>>> output_arr.dtype
dtype('float64')
Note that the stencil decorator has determined that the output type
of the specified stencil kernel is ``float64`` and has thus created the
output array as ``float64`` while the input array is of type ``int64``.
Stencil Parameters
==================
Stencil kernel definitions may take any number of arguments with
the following provisions. The first argument must be an array.
The size and shape of the output array will be the same as that of the
first argument. Additional arguments may either be scalars or
arrays. For array arguments, those arrays must be at least as large
as the first argument (array) in each dimension. Array indexing is relative for
all such input array arguments.
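For example, the following sketch passes an additional scalar weight alongside
the input array (the kernel and variable names are illustrative only)::

    import numpy as np
    from numba import stencil

    @stencil
    def kernel_weighted(a, w):
        # `a` is indexed relatively; `w` is a scalar argument
        return w * (a[0, 1] + a[1, 0] + a[0, -1] + a[-1, 0])

    out = kernel_weighted(np.arange(25.0).reshape(5, 5), 0.25)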
.. _stencil-kernel-shape-inference:
Kernel shape inference and border handling
==========================================
In the above example and in most cases, the array indexing in the
stencil kernel will exclusively use ``Integer`` literals.
In such cases, the stencil decorator is able to analyze the stencil
kernel to determine its size. In the above example, the stencil
decorator determines that the kernel is ``3 x 3`` in shape since indices
``-1`` to ``1`` are used for both the first and second dimensions. Note that
the stencil decorator also correctly handles non-symmetric and
non-square stencil kernels.
Based on the size of the stencil kernel, the stencil decorator is
able to compute the size of the border in the output array. If
applying the kernel to some element of input array would cause
an index to be out-of-bounds then that element belongs to the border
of the output array. In the above example, points ``-1`` and ``+1`` are
accessed in each dimension and thus the output array has a border
of size one in all dimensions.
The parallel mode is able to infer kernel indices as constants from
simple expressions if possible. For example::
@njit(parallel=True)
def stencil_test(A):
c = 2
B = stencil(
lambda a, c: 0.3 * (a[-c+1] + a[0] + a[c-1]))(A, c)
return B
Stencil decorator options
=========================
.. note::
The stencil decorator may be augmented in the future to provide additional
mechanisms for border handling. At present, only one behaviour is
implemented, ``"constant"`` (see ``func_or_mode`` below for details).
.. _stencil-neighborhood:
``neighborhood``
----------------
Sometimes it may be inconvenient to write the stencil kernel
exclusively with ``Integer`` literals. For example, let us say we
would like to compute the trailing 30-day moving average of a
time series of data. One could write
``(a[-29] + a[-28] + ... + a[-1] + a[0]) / 30`` but the stencil
decorator offers a more concise form using the ``neighborhood``
option::
@stencil(neighborhood = ((-29, 0),))
def kernel2(a):
cumul = 0
for i in range(-29, 1):
cumul += a[i]
return cumul / 30
The neighborhood option is a tuple of tuples. The outer tuple's
length is equal to the number of dimensions of the input array.
The inner tuples always have length two because the elements of each inner
tuple are the minimum and maximum index offsets used in the corresponding
dimension.
If a user specifies a neighborhood but the kernel accesses elements outside the
specified neighborhood, **the behavior is undefined.**
.. _stencil-mode:
``func_or_mode``
----------------
The optional ``func_or_mode`` parameter controls how the border of the output array
is handled. Currently, there is only one supported value, ``"constant"``.
In ``constant`` mode, the stencil kernel is not applied in cases where
the kernel would access elements outside the valid range of the input
array. In such cases, those elements in the output array are assigned
to a constant value, as specified by the ``cval`` parameter.
``cval``
--------
The optional cval parameter defaults to zero but can be set to any
desired value, which is then used for the border of the output array
if the ``func_or_mode`` parameter is set to ``constant``. The cval parameter is
ignored in all other modes. The type of the cval parameter must match
the return type of the stencil kernel. If the user wishes the output
array to be constructed from a particular type then they should ensure
that the stencil kernel returns that type.
``standard_indexing``
---------------------
By default, all array accesses in a stencil kernel are processed as
relative indices as described above. However, sometimes it may be
advantageous to pass an auxiliary array (e.g. an array of weights)
to a stencil kernel and have that array use standard Python indexing
rather than relative indexing. For this purpose, there is the
stencil decorator option ``standard_indexing`` whose value is a
collection of strings whose names match those parameters to the
stencil function that are to be accessed with standard Python indexing
rather than relative indexing::
@stencil(standard_indexing=("b",))
def kernel3(a, b):
return a[-1] * b[0] + a[0] + b[1]
``StencilFunc``
===============
The stencil decorator returns a callable object of type ``StencilFunc``. A
``StencilFunc`` object contains a number of attributes but the only one of
potential interest to users is the ``neighborhood`` attribute.
If the ``neighborhood`` option was passed to the stencil decorator then
the provided neighborhood is stored in this attribute. Else, upon
first execution or compilation, the system calculates the neighborhood
as described above and then stores the computed neighborhood into this
attribute. A user may then inspect the attribute if they wish to verify
that the calculated neighborhood is correct.
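A minimal sketch of inspecting the attribute, reusing ``kernel1`` from the
basic usage example above (the value in the comment is an expectation rather
than verified output)::

    import numpy as np

    input_arr = np.arange(100).reshape((10, 10))
    kernel1(input_arr)            # trigger compilation and neighborhood inference
    print(kernel1.neighborhood)   # expected: ((-1, 1), (-1, 1)) for the 3 x 3 kernel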
Stencil invocation options
==========================
Internally, the stencil decorator transforms the specified stencil
kernel into a regular Python function. This function will have the
same parameters as specified in the stencil kernel definition but will
also include the following optional parameter.
.. _stencil-function-out:
``out``
-------
The optional ``out`` parameter is added to every stencil function
generated by Numba. If specified, the ``out`` parameter tells
Numba that the user is providing their own pre-allocated array
to be used for the output of the stencil. In this case, the
stencil function will not allocate its own output array.
Users should ensure that the return type of the stencil kernel can
be safely cast to the element-type of the user-specified output array
following the `NumPy ufunc casting rules`_.
.. _`NumPy ufunc casting rules`: http://docs.scipy.org/doc/numpy/reference/ufuncs.html#casting-rules
An example usage is shown below::
>>> import numpy as np
>>> input_arr = np.arange(100).reshape((10, 10))
>>> output_arr = np.full(input_arr.shape, 0.0)
>>> kernel1(input_arr, out=output_arr)
Talks and Tutorials
===================
.. note:: This is a selection of talks and tutorials that have been given by members of
the Numba team as well as Numba users. If you know of a Numba-related talk
that should be included on this list, please `open an issue <https://github.com/numba/numba/issues>`_.
Talks on Numba
--------------
* AnacondaCON 2018 - Accelerating Scientific Workloads with Numba - Siu Kwan Lam (`Video <https://www.youtube.com/watch?v=6oXedk2tGfk>`__)
* `DIANA-HEP Meeting, 23 April 2018 <https://indico.cern.ch/event/709711/>`__ - Overview of Numba - Stan Seibert
Talks on Applications of Numba
------------------------------
* GPU Technology Conference 2016 - Accelerating a Spectral Algorithm for Plasma Physics with Python/Numba on GPU - Manuel Kirchen & Rémi Lehe (`Slides <http://on-demand.gputechconf.com/gtc/2016/presentation/s6353-manuel-kirchen-spectral-algorithm-plasma-physics.pdf>`__)
* `DIANA-HEP Meeting, 23 April 2018 <https://indico.cern.ch/event/709711/>`_ - Use of Numba in XENONnT - Chris Tunnell
* `DIANA-HEP Meeting, 23 April 2018 <https://indico.cern.ch/event/709711/>`_ - Extending Numba for HEP data types - Jim Pivarski
* STAC Summit, Nov 1 2017 - Scaling High-Performance Python with Minimal Effort - Ehsan Totoni (`Video <https://stacresearch.com/STAC-Summit-1-Nov-2017-Intel-Totoni>`__, `Slides <https://stacresearch.com/system/files/resource/files/STAC-Summit-1-Nov-2017-Intel-Totoni.pdf>`__)
* SciPy 2018 - UMAP: Uniform Manifold Approximation and Projection for Dimensional Reduction - Leland McInnes (`Video <https://www.youtube.com/watch?v=nq6iPZVUxZU>`__, `Github <https://github.com/lmcinnes/umap>`__)
* PyData Berlin 2018 - Extending Pandas using Apache Arrow and Numba - Uwe L. Korn (`Video <https://www.youtube.com/watch?v=tvmX8YAFK80>`__, `Blog <https://uwekorn.com/2018/08/03/use-numba-to-work-with-apache-arrow-in-pure-python.html>`__)
* FOSDEM 2019 - Extending Numba - Joris Geessels (`Video, Slides & Examples <https://fosdem.org/2019/schedule/event/python_extending_numba/>`__)
* PyCon India 2019 - Real World Numba: Taking the Path of Least Resistance - Ankit Mahato (`Video <https://www.youtube.com/watch?v=rhbegsr8stc>`__)
* SciPy 2019 - How to Accelerate an Existing Codebase with Numba - Siu Kwan Lam & Stanley Seibert (`Video <https://www.youtube.com/watch?v=-4tD8kNHdXs>`__)
* SciPy 2019 - Real World Numba: Creating a Skeleton Analysis Library - Juan Nunez-Iglesias (`Video <https://www.youtube.com/watch?v=0pUPNMglnaE>`__)
* SciPy 2019 - Fast Gradient Boosting Decision Trees with PyGBM and Numba - Nicholas Hug (`Video <https://www.youtube.com/watch?v=cLpIh8Aiy2w>`__)
* PyCon Sweden 2020 - Accelerating Scientific Computing using Numba - Ankit Mahato (`Video <https://www.youtube.com/watch?v=d_21Q0UoWrQ>`__)
Tutorials
---------
* SciPy 2017 - Numba: Tell those C++ Bullies to Get Lost - Gil Forsyth & Lorena Barba (`Video <https://www.youtube.com/watch?v=1AwG0T4gaO0>`__, `Notebooks <https://github.com/gforsyth/numba_tutorial_scipy2017>`__)
* GPU Technology Conference 2018 - GPU Computing in Python with Numba - Stan Seibert (`Notebooks <https://github.com/ContinuumIO/gtc2018-numba>`__)
* PyData Amsterdam 2019 - Create CUDA kernels from Python using Numba and CuPy - Valentin Haenel (`Video <https://www.youtube.com/watch?v=CQDsT81GyS8>`__)
.. _numba-threading-layer:
The Threading Layers
====================
This section is about the Numba threading layer; this is the library that is
used internally to perform the parallel execution that occurs through the use of
the ``parallel`` targets for CPUs, namely:
* The use of the ``parallel=True`` kwarg in ``@jit`` and ``@njit``.
* The use of the ``target='parallel'`` kwarg in ``@vectorize`` and
``@guvectorize``.
.. note::
If a code base does not use the ``threading`` or ``multiprocessing``
modules (or any other sort of parallelism) the defaults for the threading
layer that ship with Numba will work well, no further action is required!
Which threading layers are available?
-------------------------------------
There are three threading layers available and they are named as follows:
* ``tbb`` - A threading layer backed by Intel TBB.
* ``omp`` - A threading layer backed by OpenMP.
* ``workqueue`` - A simple built-in work-sharing task scheduler.
In practice, the only threading layer guaranteed to be present is ``workqueue``.
The ``omp`` layer requires the presence of a suitable OpenMP runtime library.
The ``tbb`` layer requires the presence of Intel's TBB libraries; these can be
obtained via the conda command::
$ conda install tbb
If you installed Numba with ``pip``, TBB can be enabled by running::
$ pip install tbb
Due to compatibility issues with manylinux1 and other portability concerns,
the OpenMP threading layer is disabled in the Numba binary wheels on PyPI.
.. note::
The default manner in which Numba searches for and loads a threading layer
is tolerant of missing libraries, incompatible runtimes etc.
.. _numba-threading-layer-setting-mech:
Setting the threading layer
---------------------------
The threading layer is set via the environment variable
``NUMBA_THREADING_LAYER`` or through assignment to
``numba.config.THREADING_LAYER``. If the programmatic approach to setting the
threading layer is used it must occur logically before any Numba based
compilation for a parallel target has occurred. There are two approaches to
choosing a threading layer: the first is to select a threading layer that is
safe under various forms of parallel execution; the second is explicit
selection via the threading layer name (e.g. ``tbb``).
Setting the threading layer selection priority
----------------------------------------------
By default the threading layers are searched in the order of ``'tbb'``,
``'omp'``, then ``'workqueue'``. To change this search order whilst
maintaining the selection of a threading layer based on availability, the
environment variable :envvar:`NUMBA_THREADING_LAYER_PRIORITY` can be used.
Note that it can also be set via
:py:data:`numba.config.THREADING_LAYER_PRIORITY`.
Similar to :py:data:`numba.config.THREADING_LAYER`,
it must occur logically before any Numba based
compilation for a parallel target has occurred.
For example, to instruct Numba to choose ``omp`` first if available,
then ``tbb`` and so on, set the environment variable as
``NUMBA_THREADING_LAYER_PRIORITY="omp tbb workqueue"``.
Or programmatically,
``numba.config.THREADING_LAYER_PRIORITY = ["omp", "tbb", "workqueue"]``.
Selecting a threading layer for safe parallel execution
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Parallel execution is fundamentally derived from core Python libraries in four
forms (the first three also apply to code using parallel execution via other
means!):
* ``threads`` from the ``threading`` module.
* ``spawn`` ing processes from the ``multiprocessing`` module via ``spawn``
(default on Windows, only available in Python 3.4+ on Unix)
* ``fork`` ing processes from the ``multiprocessing`` module via ``fork``
(default on Unix).
* ``fork`` ing processes from the ``multiprocessing`` module through the use of
a ``forkserver`` (only available in Python 3 on Unix). Essentially a new
process is spawned and then forks are made from this new process on request.
Any library in use with these forms of parallelism must exhibit safe behaviour
under the given paradigm. As a result, the threading layer selection methods
are designed to provide a way to choose a threading layer library that is safe
for a given paradigm in an easy, cross platform and environment tolerant manner.
The options that can be supplied to the
:ref:`setting mechanisms <numba-threading-layer-setting-mech>` are as
follows:
* ``default`` provides no specific safety guarantee and is the default.
* ``safe`` is both fork and thread safe, this requires the ``tbb`` package
(Intel TBB libraries) to be installed.
* ``forksafe`` provides a fork safe library.
* ``threadsafe`` provides a thread safe library.
To discover the threading layer that was selected, the function
``numba.threading_layer()`` may be called after parallel execution. For example,
on a Linux machine with no TBB installed::
from numba import config, njit, threading_layer
import numpy as np
# set the threading layer before any parallel target compilation
config.THREADING_LAYER = 'threadsafe'
@njit(parallel=True)
def foo(a, b):
return a + b
x = np.arange(10.)
y = x.copy()
# this will force the compilation of the function, select a threading layer
# and then execute in parallel
foo(x, y)
# demonstrate the threading layer chosen
print("Threading layer chosen: %s" % threading_layer())
which produces::
Threading layer chosen: omp
and this makes sense as GNU OpenMP, as present on Linux, is thread safe.
Selecting a named threading layer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Advanced users may wish to select a specific threading layer for their use
case; this is done by directly supplying the threading layer name to the
:ref:`setting mechanisms <numba-threading-layer-setting-mech>`. The options
and requirements are as follows:
+----------------------+-----------+-------------------------------------------+
| Threading Layer Name | Platform | Requirements |
+======================+===========+===========================================+
| ``tbb`` | All | The ``tbb`` package (``$ conda install |
| | | tbb``) |
+----------------------+-----------+-------------------------------------------+
| ``omp`` | Linux | GNU OpenMP libraries (very likely this |
| | | will already exist) |
| | | |
| | Windows | MS OpenMP libraries (very likely this will|
| | | already exist) |
| | | |
| | OSX | Either the ``intel-openmp`` package or the|
| | | ``llvm-openmp`` package |
| | | (``conda install`` the package as named). |
+----------------------+-----------+-------------------------------------------+
| ``workqueue`` | All | None |
+----------------------+-----------+-------------------------------------------+
Should the threading layer not load correctly, Numba will detect this and provide
a hint about how to resolve the problem. It should also be noted that the Numba
diagnostic command ``numba -s`` has a section
``__Threading Layer Information__`` that reports on the availability of
threading layers in the current environment.
Extra notes
-----------
The threading layers have fairly complex interactions with CPython internals
and system level libraries; some additional things to note:
* The installation of Intel's TBB libraries vastly widens the options available
in the threading layer selection process.
* On Linux, the ``omp`` threading layer is not fork safe due to the GNU OpenMP
runtime library (``libgomp``) not being fork safe. If a fork occurs in a
program that is using the ``omp`` threading layer, a detection mechanism is
present that will try and gracefully terminate the forked child and print an
error message to ``STDERR``.
* On systems with the ``fork(2)`` system call available, if the TBB backed
threading layer is in use and a ``fork`` call is made from a thread other than
the thread that launched TBB (typically the main thread) then this results in
undefined behaviour and a warning will be displayed on ``STDERR``. As
``spawn`` is essentially ``fork`` followed by ``exec`` it is safe to ``spawn``
from a non-main thread, but as this cannot be differentiated from just a
``fork`` call the warning message will still be displayed.
* On OSX, the ``intel-openmp`` package is required to enable the OpenMP based
threading layer.
.. _setting_the_number_of_threads:
Setting the Number of Threads
-----------------------------
The number of threads used by numba is based on the number of CPU cores
available (see :obj:`numba.config.NUMBA_DEFAULT_NUM_THREADS`), but it can be
overridden with the :envvar:`NUMBA_NUM_THREADS` environment variable.
The total number of threads that numba launches is in the variable
:obj:`numba.config.NUMBA_NUM_THREADS`.
For some use cases, it may be desirable to set the number of threads to a
lower value, so that numba can be used with higher level parallelism.
The number of threads can be set dynamically at runtime using
:func:`numba.set_num_threads`. Note that :func:`~.set_num_threads` only allows
setting the number of threads to a smaller value than
:obj:`~.NUMBA_NUM_THREADS`. Numba always launches
:obj:`numba.config.NUMBA_NUM_THREADS` threads, but :func:`~.set_num_threads`
causes it to mask out unused threads so they aren't used in computations.
The current number of threads used by numba can be accessed with
:func:`numba.get_num_threads`. Both functions work inside of a jitted
function.
.. _numba-threading-layer-thread-masking:
Example of Limiting the Number of Threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this example, suppose the machine we are running on has 8 cores (so
:obj:`numba.config.NUMBA_NUM_THREADS` would be ``8``). Suppose we want to run
some code with ``@njit(parallel=True)``, but we also want to run our code
concurrently in 4 different processes. With the default number of threads,
each Python process would run 8 threads, for a total of 4*8 = 32 threads,
which is oversubscription for our 8 cores. We should rather limit each process
to 2 threads, so that the total will be 4*2 = 8, which matches our number of
physical cores.
There are two ways to do this. One is to set the :envvar:`NUMBA_NUM_THREADS`
environment variable to ``2``.
.. code:: bash
$ NUMBA_NUM_THREADS=2 python ourcode.py
However, there are two downsides to this approach:
1. :envvar:`NUMBA_NUM_THREADS` must be set before Numba is imported, and
ideally before Python is launched. As soon as Numba is imported the
environment variable is read and that number of threads is locked in as the
number of threads Numba launches.
2. If we want to later increase the number of threads used by the process, we
cannot. :envvar:`NUMBA_NUM_THREADS` sets the *maximum* number of threads
that are launched for a process. Calling :func:`~.set_num_threads()` with a
value greater than :obj:`numba.config.NUMBA_NUM_THREADS` results in an
error.
The advantage of this approach is that we can do it from outside of the
process without changing the code.
Another approach is to use the :func:`numba.set_num_threads` function in our code:
.. code:: python
from numba import njit, set_num_threads
@njit(parallel=True)
def func():
...
set_num_threads(2)
func()
If we call ``set_num_threads(2)`` before executing our parallel code, it has
the same effect as calling the process with ``NUMBA_NUM_THREADS=2``, in that
the parallel code will only execute on 2 threads. However, we can later call
``set_num_threads(8)`` to increase the number of threads back to the default
size. And we do not have to worry about setting it before Numba gets imported.
It only needs to be called before the parallel function is run.
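A minimal sketch of these two functions together (assuming the default thread
count on the machine is at least 2)::

    from numba import njit, get_num_threads, set_num_threads

    set_num_threads(2)

    @njit
    def threads_inside():
        # get_num_threads also works inside JIT compiled functions
        return get_num_threads()

    print(get_num_threads())    # 2
    print(threads_inside())     # 2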
.. _numba-threading-layer-thread-id:
Getting a Thread ID
-------------------
In some cases it may be beneficial to have access to a unique identifier for the
current thread that is executing a parallel region in Numba. For that purpose,
Numba provides the :func:`numba.get_thread_id` function. This function is the
counterpart of OpenMP's function ``omp_get_thread_num`` and returns an integer
between 0 (inclusive) and the number of configured threads as described above
(exclusive).
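A minimal sketch of its use inside a parallel region (the set of identifiers
observed depends on the configured thread count)::

    import numpy as np
    from numba import njit, prange, get_thread_id

    @njit(parallel=True)
    def which_thread(n):
        # record which thread executed each iteration
        ids = np.empty(n, dtype=np.int64)
        for i in prange(n):
            ids[i] = get_thread_id()
        return ids

    print(np.unique(which_thread(10000)))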
API Reference
~~~~~~~~~~~~~
.. py:data:: numba.config.NUMBA_NUM_THREADS
The total (maximum) number of threads launched by numba.
Defaults to :obj:`numba.config.NUMBA_DEFAULT_NUM_THREADS`, but can be
overridden with the :envvar:`NUMBA_NUM_THREADS` environment variable.
.. py:data:: numba.config.NUMBA_DEFAULT_NUM_THREADS
The number of usable CPU cores on the system (as determined by
``len(os.sched_getaffinity(0))``, if supported by the OS, or
``multiprocessing.cpu_count()`` if not).
This is the default value for :obj:`numba.config.NUMBA_NUM_THREADS` unless
the :envvar:`NUMBA_NUM_THREADS` environment variable is set.
.. autofunction:: numba.set_num_threads
.. autofunction:: numba.get_num_threads
.. autofunction:: numba.get_thread_id
.. _numba-troubleshooting:
========================
Troubleshooting and tips
========================
.. _what-to-compile:
What to compile
===============
The general recommendation is that you should only try to compile the
critical paths in your code. If you have a piece of performance-critical
computational code amongst some higher-level code, you may factor out
the performance-critical code into a separate function and compile the
separate function with Numba. Letting Numba focus on that small piece
of performance-critical code has several advantages:
* it reduces the risk of hitting unsupported features;
* it reduces the compilation times;
* it allows you to evolve the higher-level code which is outside of the
  compiled function much more easily.
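For example, a sketch of this pattern (the function names are illustrative
only)::

    import numpy as np
    from numba import njit

    @njit
    def kernel(a):
        # the small, performance-critical piece compiled by Numba
        total = 0.0
        for i in range(a.shape[0]):
            total += a[i] * a[i]
        return total

    def analyse(values):
        # higher-level orchestration stays as regular Python
        data = np.asarray(values, dtype=np.float64)
        return kernel(data)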
.. _code-doesnt-compile:
My code doesn't compile
=======================
There can be various reasons why Numba cannot compile your code, and raises
an error instead. One common reason is that your code relies on an
unsupported Python feature, especially in :term:`nopython mode`.
Please see the list of :ref:`pysupported`. If you find something that
is listed there and still fails compiling, please
:ref:`report a bug <report-numba-bugs>`.
When Numba tries to compile your code it first tries to work out the types of
all the variables in use; this is so it can generate a type-specific
implementation of your code that can be compiled down to machine code. A common
reason for Numba failing to compile (especially in :term:`nopython mode`) is a
type inference failure; essentially Numba cannot work out what the types of all
the variables in your code should be.
For example, let's consider this trivial function::
@jit(nopython=True)
def f(x, y):
return x + y
If you call it with two numbers, Numba is able to infer the types properly::
>>> f(1, 2)
3
If however you call it with a tuple and a number, Numba is unable to say
what the result of adding a tuple and number is, and therefore compilation
errors out::
>>> f(1, (2,))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<path>/numba/numba/dispatcher.py", line 339, in _compile_for_args
reraise(type(e), e, None)
File "<path>/numba/numba/six.py", line 658, in reraise
raise value.with_traceback(tb)
numba.errors.TypingError: Failed at nopython (nopython frontend)
Invalid use of + with parameters (int64, tuple(int64 x 1))
Known signatures:
* (int64, int64) -> int64
* (int64, uint64) -> int64
* (uint64, int64) -> int64
* (uint64, uint64) -> uint64
* (float32, float32) -> float32
* (float64, float64) -> float64
* (complex64, complex64) -> complex64
* (complex128, complex128) -> complex128
* (uint16,) -> uint64
* (uint8,) -> uint64
* (uint64,) -> uint64
* (uint32,) -> uint64
* (int16,) -> int64
* (int64,) -> int64
* (int8,) -> int64
* (int32,) -> int64
* (float32,) -> float32
* (float64,) -> float64
* (complex64,) -> complex64
* (complex128,) -> complex128
* parameterized
[1] During: typing of intrinsic-call at <stdin> (3)
File "<stdin>", line 3:
The error message helps you find out what went wrong:
"Invalid use of + with parameters (int64, tuple(int64 x 1))" is to be
interpreted as "Numba encountered an addition of variables typed as integer
and 1-tuple of integer, respectively, and doesn't know about any such
operation".
Note that if you allow object mode::
@jit
def g(x, y):
return x + y
compilation will succeed and the compiled function will raise at runtime as
Python would do::
>>> g(1, (2,))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'int' and 'tuple'
My code has a type unification problem
======================================
Another common reason for Numba not being able to compile your code is that it
cannot statically determine the return type of a function. The most likely
cause of this is the return type depending on a value that is available only at
runtime. Again, this is most often problematic when using
:term:`nopython mode`. The concept of type unification is simply trying to find
a type in which two variables could safely be represented. For example a 64 bit
float and a 64 bit complex number could both be represented in a 128 bit complex
number.
As an example of type unification failure, this function has a return type that
is determined at runtime based on the value of `x`::
In [1]: from numba import jit
In [2]: @jit(nopython=True)
...: def f(x):
...: if x > 10:
...: return (1,)
...: else:
...: return 1
...:
In [3]: f(10)
Trying to execute this function, errors out as follows::
TypingError: Failed at nopython (nopython frontend)
Can't unify return type from the following types: tuple(int64 x 1), int64
Return of: IR name '$8.2', type '(int64 x 1)', location:
File "<ipython-input-2-51ef1cc64bea>", line 4:
def f(x):
<source elided>
if x > 10:
return (1,)
^
Return of: IR name '$12.2', type 'int64', location:
File "<ipython-input-2-51ef1cc64bea>", line 6:
def f(x):
<source elided>
else:
return 1
The error message "Can't unify return type from the following types:
tuple(int64 x 1), int64" should be read as "Numba cannot find a type that
can safely represent a 1-tuple of integer and an integer".
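One hedged sketch of a fix is to make every return statement produce the same
type, for example::

    from numba import jit

    @jit(nopython=True)
    def f(x):
        if x > 10:
            return (1,)
        else:
            return (0,)   # both branches now return a 1-tuple of integers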
.. _code-has-untyped-list:
My code has an untyped list problem
===================================
As :ref:`noted previously <code-doesnt-compile>` the first part of Numba
compiling your code involves working out what the types of all the variables
are. In the case of lists, a list must contain items that are of the same type
or can be empty if the type can be inferred from some later operation. What is
not possible is to have a list which is defined as empty and has no inferable
type (i.e. an untyped list).
For example, this is using a list of a known type::
from numba import jit
@jit(nopython=True)
def f():
return [1, 2, 3] # this list is defined on construction with `int` type
This is using an empty list, but the type can be inferred::
from numba import jit
@jit(nopython=True)
def f(x):
tmp = [] # defined empty
for i in range(x):
tmp.append(i) # list type can be inferred from the type of `i`
return tmp
This is using an empty list and the type cannot be inferred::
from numba import jit
@jit(nopython=True)
def f(x):
tmp = [] # defined empty
return (tmp, x) # ERROR: the type of `tmp` is unknown
Whilst slightly contrived, if you need an empty list and the type cannot be
inferred but you know what type you want the list to be, this "trick" can be
used to instruct the typing mechanism::
from numba import jit
import numpy as np
@jit(nopython=True)
def f(x):
# define empty list, but instruct that the type is np.complex64
tmp = [np.complex64(x) for x in range(0)]
return (tmp, x) # the type of `tmp` is known, but it is still empty
The compiled code is too slow
=============================
The most common reason for slowness of a compiled JIT function is that
compiling in :term:`nopython mode` has failed and the Numba compiler has
fallen back to :term:`object mode`. :term:`object mode` currently provides
little to no speedup compared to regular Python interpretation, and its
main point is to allow an internal optimization known as
:term:`loop-lifting`: this optimization allows inner loops to be compiled in
:term:`nopython mode` regardless of what code surrounds those inner loops.
To find out if type inference succeeded on your function, you can use
the :meth:`~Dispatcher.inspect_types` method on the compiled function.
For example, let's take the following function::
@jit
def f(a, b):
s = a + float(b)
return s
When called with numbers, this function should be fast as Numba is able
to convert number types to floating-point numbers. Let's see::
>>> f(1, 2)
3.0
>>> f.inspect_types()
f (int64, int64)
--------------------------------------------------------------------------------
# --- LINE 7 ---
@jit
# --- LINE 8 ---
def f(a, b):
# --- LINE 9 ---
# label 0
# a.1 = a :: int64
# del a
# b.1 = b :: int64
# del b
# $0.2 = global(float: <class 'float'>) :: Function(<class 'float'>)
# $0.4 = call $0.2(b.1, ) :: (int64,) -> float64
# del b.1
# del $0.2
# $0.5 = a.1 + $0.4 :: float64
# del a.1
# del $0.4
# s = $0.5 :: float64
# del $0.5
s = a + float(b)
# --- LINE 10 ---
# $0.7 = cast(value=s) :: float64
# del s
# return $0.7
return s
Without trying to understand too much of the Numba intermediate representation,
it is still visible that all variables and temporary values have had their
types inferred properly: for example *a* has the type ``int64``, *$0.5* has
the type ``float64``, etc.
However, if *b* is passed as a string, compilation will fall back on object
mode as the float() constructor with a string is currently not supported
by Numba::
>>> f(1, "2")
3.0
>>> f.inspect_types()
[... snip annotations for other signatures, see above ...]
================================================================================
f (int64, str)
--------------------------------------------------------------------------------
# --- LINE 7 ---
@jit
# --- LINE 8 ---
def f(a, b):
# --- LINE 9 ---
# label 0
# a.1 = a :: pyobject
# del a
# b.1 = b :: pyobject
# del b
# $0.2 = global(float: <class 'float'>) :: pyobject
# $0.4 = call $0.2(b.1, ) :: pyobject
# del b.1
# del $0.2
# $0.5 = a.1 + $0.4 :: pyobject
# del a.1
# del $0.4
# s = $0.5 :: pyobject
# del $0.5
s = a + float(b)
# --- LINE 10 ---
# $0.7 = cast(value=s) :: pyobject
# del s
# return $0.7
return s
Here we see that all variables end up typed as ``pyobject``. This means
that the function was compiled in object mode and values are passed
around as generic Python objects, without Numba trying to look into them
to reason about their raw values. This is a situation you want to avoid
when caring about the speed of your code.
If a function fails to compile in ``nopython`` mode, warnings will be emitted
with an explanation of why compilation failed. For example, with the ``f()``
function above (slightly edited for documentation purposes)::
>>> f(1, 2)
3.0
>>> f(1, "2")
example.py:7: NumbaWarning:
Compilation is falling back to object mode WITH looplifting enabled because Function "f" failed type inference due to: Invalid use of Function(<class 'float'>) with argument(s) of type(s): (unicode_type)
* parameterized
In definition 0:
TypeError: float() only support for numbers
raised from <path>/numba/typing/builtins.py:880
In definition 1:
TypeError: float() only support for numbers
raised from <path>/numba/typing/builtins.py:880
This error is usually caused by passing an argument of a type that is unsupported by the named function.
[1] During: resolving callee type: Function(<class 'float'>)
[2] During: typing of call at example.py (9)
File "example.py", line 9:
def f(a, b):
s = a + float(b)
^
<path>/numba/compiler.py:722: NumbaWarning: Function "f" was compiled in object mode without forceobj=True.
File "example.py", line 8:
@jit
def f(a, b):
^
3.0
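If such a fall-back is not acceptable, compiling with ``nopython=True`` (or the
equivalent ``@njit`` alias) turns the failure into a hard error: a
``TypingError`` is raised instead of slow object-mode code being produced. A
minimal sketch based on the example above:

.. code-block:: python

   from numba import njit

   @njit  # same as @jit(nopython=True): no object-mode fall-back
   def f(a, b):
       return a + float(b)

   f(1, 2)    # fine, compiled in nopython mode
   f(1, "2")  # raises TypingError rather than silently using object mode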
Disabling JIT compilation
=========================
In order to debug code, it is possible to disable JIT compilation, which makes
the ``jit`` decorator (and the ``njit`` decorator) act as if
they perform no operation, and the invocation of decorated functions calls the
original Python function instead of a compiled version. This can be toggled by
setting the :envvar:`NUMBA_DISABLE_JIT` environment variable to ``1``.
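For example, an existing script can be run under plain Python semantics without
modifying its source (``myscript.py`` below is a placeholder for your own
program):

.. code-block:: none

   $ NUMBA_DISABLE_JIT=1 python myscript.py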
When this mode is enabled, the ``vectorize`` and ``guvectorize`` decorators will
still result in compilation of a ufunc, as there is no straightforward pure
Python implementation of these functions.
.. _debugging-jit-compiled-code:
Debugging JIT compiled code with GDB
====================================
Setting the ``debug`` keyword argument in the ``jit`` decorator
(e.g. ``@jit(debug=True)``) enables the emission of debug info in the jitted
code. To debug, GDB version 7.0 or above is required. Currently, the following
debug info is available:
* Function name will be shown in the backtrace along with type information and
values (if available).
* Source location (filename and line number) is available. For example,
users can set a breakpoint by the absolute filename and line number;
e.g. ``break /path/to/myfile.py:6``.
* Arguments to the current function can be shown with ``info args``.
* Local variables in the current function can be shown with ``info locals``.
* The type of variables can be shown with ``whatis myvar``.
* The value of variables can be shown with ``print myvar`` or ``display myvar``.
* Simple numeric types, i.e. int, float and double, are shown in their
native representation.
* Other types are shown as a structure based on Numba's memory model
representation of the type.
Further, the Numba ``gdb`` printing extension can be loaded into ``gdb`` (if the
``gdb`` has Python support) to permit the printing of variables as they would be
in native Python. The extension does this by reinterpreting Numba's memory model
representations as Python types. Information about the ``gdb`` installation that
Numba is using, including the path to load the ``gdb`` printing extension, can
be displayed by using the ``numba -g`` command. For best results ensure that the
Python that ``gdb`` is using has a NumPy module accessible. An example output
of the ``gdb`` information follows:
.. code-block:: none
:emphasize-lines: 1
$ numba -g
GDB info:
--------------------------------------------------------------------------------
Binary location : <some path>/gdb
Print extension location : <some python path>/numba/misc/gdb_print_extension.py
Python version : 3.8
NumPy version : 1.20.0
Numba printing extension supported : True
To load the Numba gdb printing extension, execute the following from the gdb prompt:
source <some python path>/numba/misc/gdb_print_extension.py
--------------------------------------------------------------------------------
Known issues:
* Stepping depends heavily on optimization level. At full optimization
(equivalent to O3), most of the variables are optimized out. It is often
beneficial to use the jit option ``_dbg_optnone=True``
or the environment variable :envvar:`NUMBA_OPT` to adjust the
optimization level and the jit option ``_dbg_extend_lifetimes=True``
(which is on by default if ``debug=True``) or
:envvar:`NUMBA_EXTEND_VARIABLE_LIFETIMES` to extend
the lifetime of variables to the end of their scope so as to get a debugging
experience closer to the semantics of Python execution.
* Memory consumption increases significantly with debug info enabled.
The compiler emits extra information (`DWARF <http://www.dwarfstd.org/>`_)
along with the instructions. The emitted object code can be 2x bigger with
debug info.
Internal details:
* Since Python semantics allow variables to bind to values of different types,
Numba internally creates multiple versions of the variable for each type.
So for code like::
x = 1 # type int
x = 2.3 # type float
x = (1, 2, 3) # type 3-tuple of int
Each assignment stores to a different variable name. In the debugger,
the variables will be ``x``, ``x$1`` and ``x$2``. (In the Numba IR, they are
``x``, ``x.1`` and ``x.2``.)
* When debug is enabled, inlining of functions at LLVM IR level is disabled.
JIT options for debug
---------------------
* ``debug`` (bool). Set to ``True`` to enable debug info. Defaults to ``False``.
* ``_dbg_optnone`` (bool). Set to ``True`` to disable all LLVM optimization passes
on the function. Defaults to ``False``. See :envvar:`NUMBA_OPT` for a global setting
to disable optimization.
* ``_dbg_extend_lifetimes`` (bool). Set to ``True`` to extend the lifetime of
objects such that they more closely follow the semantics of Python.
Automatically set to ``True`` when
``debug=True``; otherwise, defaults to ``False``. Users can explicitly set this option
to ``False`` to retain the normal execution semantics of compiled code.
See :envvar:`NUMBA_EXTEND_VARIABLE_LIFETIMES` for a global option to extend object
lifetimes.
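As an illustrative sketch combining the options above (this is not additional
API, just the options described in this section):

.. code-block:: python

   from numba import njit

   # debug=True already implies _dbg_extend_lifetimes=True; _dbg_optnone=True
   # additionally disables LLVM optimisation so variables remain observable.
   @njit(debug=True, _dbg_optnone=True)
   def foo(a):
       b = a + 1
       return b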
Example debug usage
-------------------
The Python source:
.. code-block:: python
:linenos:
from numba import njit
@njit(debug=True)
def foo(a):
b = a + 1
c = a * 2.34
d = (a, b, c)
print(a, b, c, d)
r = foo(123)
print(r)
In the terminal:
.. code-block:: none
:emphasize-lines: 1, 3, 7, 12, 14, 16, 20, 22, 26, 28, 30, 32, 34, 36
$ NUMBA_OPT=0 NUMBA_EXTEND_VARIABLE_LIFETIMES=1 gdb -q python
Reading symbols from python...
(gdb) break test1.py:5
No source file named test1.py.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (test1.py:5) pending.
(gdb) run test1.py
Starting program: <path>/bin/python test1.py
...
Breakpoint 1, __main__::foo_241[abi:c8tJTC_2fWgEeGLSgydRTQUgiqKEZ6gEoDvQJmaQIA](long long) (a=123) at test1.py:5
5 b = a + 1
(gdb) info args
a = 123
(gdb) n
6 c = a * 2.34
(gdb) info locals
b = 124
c = 0
d = {f0 = 0, f1 = 0, f2 = 0}
(gdb) n
7 d = (a, b, c)
(gdb) info locals
b = 124
c = 287.81999999999999
d = {f0 = 0, f1 = 0, f2 = 0}
(gdb) whatis b
type = int64
(gdb) whatis d
type = Tuple(int64, int64, float64) ({i64, i64, double})
(gdb) n
8 print(a, b, c, d)
(gdb) print b
$1 = 124
(gdb) print d
$2 = {f0 = 123, f1 = 124, f2 = 287.81999999999999}
(gdb) bt
#0 __main__::foo_241[abi:c8tJTC_2fWgEeGLSgydRTQUgiqKEZ6gEoDvQJmaQIA](long long) (a=123) at test1.py:8
#1 0x00007ffff06439fa in cpython::__main__::foo_241[abi:c8tJTC_2fWgEeGLSgydRTQUgiqKEZ6gEoDvQJmaQIA](long long) ()
Another example follows that makes use of the Numba ``gdb`` printing extension
mentioned above; note the change in the print format once the extension is
loaded with ``source``:
The Python source:
.. code-block:: python
:linenos:
from numba import njit
import numpy as np
@njit(debug=True)
def foo(n):
x = np.arange(n)
y = (x[0], x[-1])
return x, y
foo(4)
In the terminal:
.. code-block:: none
:emphasize-lines: 1, 3, 4, 7, 12, 14, 16, 17, 20
$ NUMBA_OPT=0 NUMBA_EXTEND_VARIABLE_LIFETIMES=1 gdb -q python
Reading symbols from python...
(gdb) set breakpoint pending on
(gdb) break test2.py:8
No source file named test2.py.
Breakpoint 1 (test2.py:8) pending.
(gdb) run test2.py
Starting program: <path>/bin/python test2.py
...
Breakpoint 1, __main__::foo_241[abi:c8tJTC_2fWgEeGLSgydRTQUgiqKEZ6gEoDvQJmaQIA](long long) (n=4) at test2.py:8
8 return x, y
(gdb) print x
$1 = {meminfo = 0x55555688f470 "\001", parent = 0x0, nitems = 4, itemsize = 8, data = 0x55555688f4a0, shape = {4}, strides = {8}}
(gdb) print y
$2 = {0, 3}
(gdb) source numba/misc/gdb_print_extension.py
(gdb) print x
$3 =
[0 1 2 3]
(gdb) print y
$4 = (0, 3)
Globally override debug setting
-------------------------------
It is possible to enable debug for the full application by setting the
environment variable ``NUMBA_DEBUGINFO=1``. This sets the default value of the
``debug`` option in ``jit``. Debug can be turned off on individual functions by
setting ``debug=False``.
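For example, to run a whole program with debug info enabled globally
(``myapp.py`` is a placeholder for your own program):

.. code-block:: none

   $ NUMBA_DEBUGINFO=1 python myapp.py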
Beware that enabling debug info significantly increases the memory consumption
for each compiled function. For large applications, this may cause
out-of-memory errors.
Using Numba's direct ``gdb`` bindings in ``nopython`` mode
===========================================================
Numba (version 0.42.0 and later) has some additional functions relating to
``gdb`` support for CPUs that make it easier to debug programs. All the ``gdb``
related functions described in the following work in the same manner
irrespective of whether they are called from the standard CPython interpreter or
code compiled in either :term:`nopython mode` or :term:`object mode`.
.. note:: This feature is experimental!
.. warning:: This feature does unexpected things if used from Jupyter or
alongside the ``pdb`` module. Its behaviour is harmless, just hard
to predict!
Set up
------
Numba's ``gdb`` related functions make use of a ``gdb`` binary, the location and
name of this binary can be configured via the :envvar:`NUMBA_GDB_BINARY`
environment variable if desired.
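For example, to point Numba at a specific ``gdb`` build for a single run (the
path shown is illustrative):

.. code-block:: none

   $ NUMBA_GDB_BINARY=/usr/local/bin/gdb python myscript.py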
.. note:: Numba's ``gdb`` support requires the ability for ``gdb`` to attach to
another process. On some systems (notably Ubuntu Linux) default
security restrictions placed on ``ptrace`` prevent this from being
possible. This restriction is enforced at the system level by the
Linux security module `Yama`. Documentation for this module and the
security implications of making changes to its behaviour can be found
in the `Linux Kernel documentation <https://www.kernel.org/doc/Documentation/admin-guide/LSM/Yama.rst>`_.
The `Ubuntu Linux security documentation <https://wiki.ubuntu.com/Security/Features#ptrace>`_
discusses how to adjust the behaviour of `Yama` with regard to
``ptrace_scope`` so as to permit the required behaviour.
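As a sketch, on such systems the restriction can usually be relaxed until the
next reboot via the ``kernel.yama.ptrace_scope`` sysctl; review the security
documentation linked above before doing so:

.. code-block:: none

   $ sudo sysctl -w kernel.yama.ptrace_scope=0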
Basic ``gdb`` support
---------------------
.. warning:: Calling :func:`numba.gdb` and/or :func:`numba.gdb_init` more than
once in the same program is not advisable, unexpected things may
happen. If multiple breakpoints are desired within a program,
launch ``gdb`` once via :func:`numba.gdb` or :func:`numba.gdb_init`
and then use :func:`numba.gdb_breakpoint` to register additional
breakpoint locations.
The simplest function for adding ``gdb`` support is :func:`numba.gdb`, which,
at the call location, will:
* launch ``gdb`` and attach it to the running process.
* create a breakpoint at the site of the :func:`numba.gdb()` function call; the
attached ``gdb`` will pause execution here awaiting user input.
Use of this functionality is best motivated by example, continuing with the
example used above:
.. code-block:: python
:linenos:
from numba import njit, gdb
@njit(debug=True)
def foo(a):
b = a + 1
gdb() # instruct Numba to attach gdb at this location and pause execution
c = a * 2.34
d = (a, b, c)
print(a, b, c, d)
r= foo(123)
print(r)
In the terminal (``...`` on a line by itself indicates output that is not
presented for brevity):
.. code-block:: none
:emphasize-lines: 1, 4, 8, 13, 24, 26, 28, 30, 32, 37
$ NUMBA_OPT=0 NUMBA_EXTEND_VARIABLE_LIFETIMES=1 python demo_gdb.py
...
Breakpoint 1, 0x00007fb75238d830 in numba_gdb_breakpoint () from numba/_helperlib.cpython-39-x86_64-linux-gnu.so
(gdb) s
Single stepping until exit from function numba_gdb_breakpoint,
which has no line number information.
0x00007fb75233e1cf in numba::misc::gdb_hook::hook_gdb::_3clocals_3e::impl_242[abi:c8tJTIeFCjyCbUFRqqOAK_2f6h0phxApMogijRBAA_3d](StarArgTuple) ()
(gdb) s
Single stepping until exit from function _ZN5numba4misc8gdb_hook8hook_gdb12_3clocals_3e8impl_242B44c8tJTIeFCjyCbUFRqqOAK_2f6h0phxApMogijRBAA_3dE12StarArgTuple,
which has no line number information.
__main__::foo_241[abi:c8tJTC_2fWgEeGLSgydRTQUgiqKEZ6gEoDvQJmaQIA](long long) (a=123) at demo_gdb.py:7
7 c = a * 2.34
(gdb) l
2
3 @njit(debug=True)
4 def foo(a):
5 b = a + 1
6 gdb() # instruct Numba to attach gdb at this location and pause execution
7 c = a * 2.34
8 d = (a, b, c)
9 print(a, b, c, d)
10
11 r= foo(123)
(gdb) p a
$1 = 123
(gdb) p b
$2 = 124
(gdb) p c
$3 = 0
(gdb) b 9
Breakpoint 2 at 0x7fb73d1f7287: file demo_gdb.py, line 9.
(gdb) c
Continuing.
Breakpoint 2, __main__::foo_241[abi:c8tJTC_2fWgEeGLSgydRTQUgiqKEZ6gEoDvQJmaQIA](long long) (a=123) at demo_gdb.py:9
9 print(a, b, c, d)
(gdb) info locals
b = 124
c = 287.81999999999999
d = {f0 = 123, f1 = 124, f2 = 287.81999999999999}
It can be seen in the above example that execution of the code is paused at the
location of the ``gdb()`` function call, at the end of the ``numba_gdb_breakpoint``
function (this is the Numba internal symbol registered as a breakpoint with
``gdb``). Issuing a ``step`` twice at this point moves to the stack frame of the
compiled Python source. From there, it can be seen that the variables ``a`` and
``b`` have been evaluated but ``c`` has not, as demonstrated by printing their
values; this is precisely as expected given the location of the ``gdb()`` call.
Issuing a ``break`` on line 9 and then continuing execution leads to the
evaluation of line ``7``. The variable ``c`` is assigned a value as a result of
this execution, and that can be seen in the output of ``info locals`` when the
breakpoint is hit.
Running with ``gdb`` enabled
----------------------------
The functionality provided by :func:`numba.gdb` (launch and attach ``gdb`` to
the executing process and pause on a breakpoint) is also available as two
separate functions:
* :func:`numba.gdb_init` this function injects code at the call site to launch
and attach ``gdb`` to the executing process but does not pause execution.
* :func:`numba.gdb_breakpoint` this function injects code at the call site that
will call the special ``numba_gdb_breakpoint`` function that is registered as
a breakpoint in Numba's ``gdb`` support. This is demonstrated in the next
section.
This functionality enables more complex debugging capabilities. Again, motivated
by example, debugging a 'segfault' (memory access violation signalling
``SIGSEGV``):
.. code-block:: python
:linenos:
from numba import njit, gdb_init
import numpy as np
# NOTE debug=True switches bounds-checking on, but for the purposes of this
# example it is explicitly turned off so that the out of bounds index is
# not caught!
@njit(debug=True, boundscheck=False)
def foo(a, index):
gdb_init() # instruct Numba to attach gdb at this location, but not to pause execution
b = a + 1
c = a * 2.34
d = c[index] # access an address that is a) invalid b) out of the page
print(a, b, c, d)
bad_index = int(1e9) # this index is invalid
z = np.arange(10)
r = foo(z, bad_index)
print(r)
In the terminal (``...`` on a line by itself indicates output that is not
presented for brevity):
.. code-block:: none
:emphasize-lines: 1, 6, 8, 10, 12
$ NUMBA_OPT=0 python demo_gdb_segfault.py
...
Program received signal SIGSEGV, Segmentation fault.
0x00007f5a4ca655eb in __main__::foo_241[abi:c8tJTC_2fWgEeGLSgydRTQUgiqKEZ6gEoDvQJmaQIA](Array<long long, 1, C, mutable, aligned>, long long) (a=..., index=1000000000) at demo_gdb_segfault.py:12
12 d = c[index] # access an address that is a) invalid b) out of the page
(gdb) p index
$1 = 1000000000
(gdb) p c
$2 = {meminfo = 0x5586cfb95830 "\001", parent = 0x0, nitems = 10, itemsize = 8, data = 0x5586cfb95860, shape = {10}, strides = {8}}
(gdb) whatis c
type = array(float64, 1d, C) ({i8*, i8*, i64, i64, double*, [1 x i64], [1 x i64]})
(gdb) p c.nitems
$3 = 10
In the ``gdb`` output it can be noted that a ``SIGSEGV`` signal was caught, and
the line in which the access violation occurred is printed.
Continuing the example as a debugging session demonstration, first ``index``
can be printed, and it is evidently 1e9. Printing ``c`` shows that it is a
structure, so its type needs looking up; ``whatis`` shows that it is an
``array(float64, 1d, C)`` type. Given the segfault came from an invalid access,
it would be informative to check the number of items in the array and compare
that to the index requested. Inspecting the ``nitems`` member of the structure
``c`` shows 10 items. It's therefore clear that the segfault comes from an
invalid access of index ``1000000000`` in an array containing ``10`` items.
Adding breakpoints to code
--------------------------
The next example demonstrates using multiple breakpoints that are defined
through the invocation of the :func:`numba.gdb_breakpoint` function:
.. code-block:: python
:linenos:
from numba import njit, gdb_init, gdb_breakpoint
@njit(debug=True)
def foo(a):
gdb_init() # instruct Numba to attach gdb at this location
b = a + 1
gdb_breakpoint() # instruct gdb to break at this location
c = a * 2.34
d = (a, b, c)
gdb_breakpoint() # and to break again at this location
print(a, b, c, d)
r= foo(123)
print(r)
In the terminal (``...`` on a line by itself indicates output that is not
presented for brevity):
.. code-block:: none
:emphasize-lines: 1, 4, 9, 20, 22, 24, 29, 31
$ NUMBA_OPT=0 python demo_gdb_breakpoints.py
...
Breakpoint 1, 0x00007fb65bb4c830 in numba_gdb_breakpoint () from numba/_helperlib.cpython-39-x86_64-linux-gnu.so
(gdb) step
Single stepping until exit from function numba_gdb_breakpoint,
which has no line number information.
__main__::foo_241[abi:c8tJTC_2fWgEeGLSgydRTQUgiqKEZ6gEoDvQJmaQIA](long long) (a=123) at demo_gdb_breakpoints.py:8
8 c = a * 2.34
(gdb) l
3 @njit(debug=True)
4 def foo(a):
5 gdb_init() # instruct Numba to attach gdb at this location
6 b = a + 1
7 gdb_breakpoint() # instruct gdb to break at this location
8 c = a * 2.34
9 d = (a, b, c)
10 gdb_breakpoint() # and to break again at this location
11 print(a, b, c, d)
12
(gdb) p b
$1 = 124
(gdb) p c
$2 = 0
(gdb) c
Continuing.
Breakpoint 1, 0x00007fb65bb4c830 in numba_gdb_breakpoint ()
from numba/_helperlib.cpython-39-x86_64-linux-gnu.so
(gdb) step
11 print(a, b, c, d)
(gdb) p c
$3 = 287.81999999999999
From the ``gdb`` output it can be seen that execution paused at line 8 as a
breakpoint was hit, and after a ``continue`` was issued, it broke again at line
11 where the next breakpoint was hit.
Debugging in parallel regions
-----------------------------
The following example is quite involved: it executes with ``gdb``
instrumentation from the outset as per the example above, but it also uses
threads and makes use of the breakpoint functionality. Further, the last
iteration of the parallel section calls the function ``work``, which is
actually just a binding to ``glibc``'s ``free(3)`` in this case, but could
equally be some involved function that is presenting a segfault for unknown
reasons.
.. code-block:: python
:linenos:
from numba import njit, prange, gdb_init, gdb_breakpoint
import ctypes
def get_free():
lib = ctypes.cdll.LoadLibrary('libc.so.6')
free_binding = lib.free
free_binding.argtypes = [ctypes.c_void_p,]
free_binding.restype = None
return free_binding
work = get_free()
@njit(debug=True, parallel=True)
def foo():
gdb_init() # instruct Numba to attach gdb at this location, but not to pause execution
counter = 0
n = 9
for i in prange(n):
if i > 3 and i < 8: # iterations 4, 5, 6, 7 will break here
gdb_breakpoint()
if i == 8: # last iteration segfaults
work(0xBADADD)
counter += 1
return counter
r = foo()
print(r)
In the terminal (``...`` on a line by itself indicates output that is not
presented for brevity), note the setting of ``NUMBA_NUM_THREADS`` to 4 to ensure
that there are 4 threads running in the parallel section:
.. code-block:: none
:emphasize-lines: 1, 19, 29, 44, 50, 56, 62, 69
$ NUMBA_NUM_THREADS=4 NUMBA_OPT=0 python demo_gdb_threads.py
Attaching to PID: 21462
...
Attaching to process 21462
[New LWP 21467]
[New LWP 21468]
[New LWP 21469]
[New LWP 21470]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f59ec31756d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
81 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 1 at 0x7f59d631e8f0: file numba/_helperlib.c, line 1090.
Continuing.
[Switching to Thread 0x7f59d1fd1700 (LWP 21470)]
Thread 5 "python" hit Breakpoint 1, numba_gdb_breakpoint () at numba/_helperlib.c:1090
1090 }
(gdb) info threads
Id Target Id Frame
1 Thread 0x7f59eca2f740 (LWP 21462) "python" pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
2 Thread 0x7f59d37d4700 (LWP 21467) "python" pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
3 Thread 0x7f59d2fd3700 (LWP 21468) "python" pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
4 Thread 0x7f59d27d2700 (LWP 21469) "python" numba_gdb_breakpoint () at numba/_helperlib.c:1090
* 5 Thread 0x7f59d1fd1700 (LWP 21470) "python" numba_gdb_breakpoint () at numba/_helperlib.c:1090
(gdb) thread apply 2-5 info locals
Thread 2 (Thread 0x7f59d37d4700 (LWP 21467)):
No locals.
Thread 3 (Thread 0x7f59d2fd3700 (LWP 21468)):
No locals.
Thread 4 (Thread 0x7f59d27d2700 (LWP 21469)):
No locals.
Thread 5 (Thread 0x7f59d1fd1700 (LWP 21470)):
sched$35 = '\000' <repeats 55 times>
counter__arr = '\000' <repeats 16 times>, "\001\000\000\000\000\000\000\000\b\000\000\000\000\000\000\000\370B]\"hU\000\000\001", '\000' <repeats 14 times>
counter = 0
(gdb) continue
Continuing.
[Switching to Thread 0x7f59d27d2700 (LWP 21469)]
Thread 4 "python" hit Breakpoint 1, numba_gdb_breakpoint () at numba/_helperlib.c:1090
1090 }
(gdb) continue
Continuing.
[Switching to Thread 0x7f59d1fd1700 (LWP 21470)]
Thread 5 "python" hit Breakpoint 1, numba_gdb_breakpoint () at numba/_helperlib.c:1090
1090 }
(gdb) continue
Continuing.
[Switching to Thread 0x7f59d27d2700 (LWP 21469)]
Thread 4 "python" hit Breakpoint 1, numba_gdb_breakpoint () at numba/_helperlib.c:1090
1090 }
(gdb) continue
Continuing.
Thread 5 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f59d1fd1700 (LWP 21470)]
__GI___libc_free (mem=0xbadadd) at malloc.c:2935
2935 if (chunk_is_mmapped(p)) /* release mmapped memory. */
(gdb) bt
#0 __GI___libc_free (mem=0xbadadd) at malloc.c:2935
#1 0x00007f59d37ded84 in $3cdynamic$3e::__numba_parfor_gufunc__0x7ffff80a61ae3e31$244(Array<unsigned long long, 1, C, mutable, aligned>, Array<long long, 1, C, mutable, aligned>) () at <string>:24
#2 0x00007f59d17ce326 in __gufunc__._ZN13$3cdynamic$3e45__numba_parfor_gufunc__0x7ffff80a61ae3e31$244E5ArrayIyLi1E1C7mutable7alignedE5ArrayIxLi1E1C7mutable7alignedE ()
#3 0x00007f59d37d7320 in thread_worker ()
from <path>/numba/numba/npyufunc/workqueue.cpython-37m-x86_64-linux-gnu.so
#4 0x00007f59ec626e25 in start_thread (arg=0x7f59d1fd1700) at pthread_create.c:308
#5 0x00007f59ec350bad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
In the output it can be seen that 4 threads are launched and that they all
break at the breakpoint; further, ``Thread 5`` receives a ``SIGSEGV`` signal,
and backtracing shows that it came from ``__GI___libc_free`` with the
invalid address in ``mem``, as expected.
Using the ``gdb`` command language
----------------------------------
Both the :func:`numba.gdb` and :func:`numba.gdb_init` functions accept unlimited
string arguments which will be passed directly to ``gdb`` as command line
arguments when it initializes. This makes it easy to set breakpoints on other
functions and perform repeated debugging tasks without having to manually type
them every time. For example, this code runs with ``gdb`` attached and sets a
breakpoint on ``dgesdd_`` (say, for example, the arguments passed to LAPACK's
double precision divide-and-conquer SVD function need debugging).
.. code-block:: python
:linenos:
from numba import njit, gdb
import numpy as np
@njit(debug=True)
def foo(a):
# instruct Numba to attach gdb at this location and on launch, switch
# breakpoint pending on, and then set a breakpoint on the function
# dgesdd_, continue execution, and once the breakpoint is hit, backtrace
gdb('-ex', 'set breakpoint pending on',
'-ex', 'b dgesdd_',
'-ex','c',
'-ex','bt')
b = a + 10
u, s, vh = np.linalg.svd(b)
return s # just return singular values
z = np.arange(70.).reshape(10, 7)
r = foo(z)
print(r)
In the terminal (``...`` on a line by itself indicates output that is not
presented for brevity), note that no interaction is required to break and
backtrace:
.. code-block:: none
:emphasize-lines: 1
$ NUMBA_OPT=0 python demo_gdb_args.py
Attaching to PID: 22300
GNU gdb (GDB) Red Hat Enterprise Linux 8.0.1-36.el7
...
Attaching to process 22300
Reading symbols from <py_env>/bin/python3.7...done.
0x00007f652305a550 in __nanosleep_nocancel () at ../sysdeps/unix/syscall-template.S:81
81 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 1 at 0x7f650d0618f0: file numba/_helperlib.c, line 1090.
Continuing.
Breakpoint 1, numba_gdb_breakpoint () at numba/_helperlib.c:1090
1090 }
Breakpoint 2 at 0x7f65102322e0 (2 locations)
Continuing.
Breakpoint 2, 0x00007f65182be5f0 in mkl_lapack.dgesdd_ ()
from <py_env>/lib/python3.7/site-packages/numpy/core/../../../../libmkl_rt.so
#0 0x00007f65182be5f0 in mkl_lapack.dgesdd_ ()
from <py_env>/lib/python3.7/site-packages/numpy/core/../../../../libmkl_rt.so
#1 0x00007f650d065b71 in numba_raw_rgesdd (kind=kind@entry=100 'd', jobz=<optimized out>, jobz@entry=65 'A', m=m@entry=10,
n=n@entry=7, a=a@entry=0x561c6fbb20c0, lda=lda@entry=10, s=0x561c6facf3a0, u=0x561c6fb680e0, ldu=10, vt=0x561c6fd375c0,
ldvt=7, work=0x7fff4c926c30, lwork=-1, iwork=0x7fff4c926c40, info=0x7fff4c926c20) at numba/_lapack.c:1277
#2 0x00007f650d06768f in numba_ez_rgesdd (ldvt=7, vt=0x561c6fd375c0, ldu=10, u=0x561c6fb680e0, s=0x561c6facf3a0, lda=10,
a=0x561c6fbb20c0, n=7, m=10, jobz=65 'A', kind=<optimized out>) at numba/_lapack.c:1307
#3 numba_ez_gesdd (kind=<optimized out>, jobz=<optimized out>, m=10, n=7, a=0x561c6fbb20c0, lda=10, s=0x561c6facf3a0,
u=0x561c6fb680e0, ldu=10, vt=0x561c6fd375c0, ldvt=7) at numba/_lapack.c:1477
#4 0x00007f650a3147a3 in numba::targets::linalg::svd_impl::$3clocals$3e::svd_impl$243(Array<double, 2, C, mutable, aligned>, omitted$28default$3d1$29) ()
#5 0x00007f650a1c0489 in __main__::foo$241(Array<double, 2, C, mutable, aligned>) () at demo_gdb_args.py:15
#6 0x00007f650a1c2110 in cpython::__main__::foo$241(Array<double, 2, C, mutable, aligned>) ()
#7 0x00007f650cd096a4 in call_cfunc ()
from <path>/numba/numba/_dispatcher.cpython-37m-x86_64-linux-gnu.so
...
How does the ``gdb`` binding work?
----------------------------------
For advanced users and debuggers of Numba applications it's important to know
some of the internal implementation details of the outlined ``gdb`` bindings.
The :func:`numba.gdb` and :func:`numba.gdb_init` functions work by injecting the
following into the function's LLVM IR:
* At the call site of the function first inject a call to ``getpid(3)`` to get
the PID of the executing process and store this for use later, then inject a
``fork(3)`` call:
* In the parent:
* Inject a call ``sleep(3)`` (hence the pause whilst ``gdb`` loads).
* Inject a call to the ``numba_gdb_breakpoint`` function (only
:func:`numba.gdb` does this).
* In the child:
* Inject a call to ``execl(3)`` with the arguments
``numba.config.GDB_BINARY``, the ``attach`` command and the PID recorded
earlier. Numba has a special ``gdb`` command file that contains
instructions to break on the symbol ``numba_gdb_breakpoint`` and then
``finish``, this is to make sure that the program stops on the
breakpoint but the frame it stops in is the compiled Python frame (or
one ``step`` away from it, depending on optimisation). This command file is
also added to the arguments, and finally any user-specified arguments are
added.
At the call site of a :func:`numba.gdb_breakpoint` a call is injected to the
special ``numba_gdb_breakpoint`` symbol, which is already registered and
instrumented as a place to break and ``finish`` immediately.
As a result of this, a call to e.g. :func:`numba.gdb` will cause a fork in the
program: the parent will sleep whilst the child launches ``gdb``, attaches it
to the parent and tells the parent to continue. The launched ``gdb`` has the
``numba_gdb_breakpoint`` symbol registered as a breakpoint and, when the parent
continues and stops sleeping, it will immediately call ``numba_gdb_breakpoint``,
on which the attached ``gdb`` will break. Additional :func:`numba.gdb_breakpoint`
calls create calls to the registered breakpoint, hence the program will also
break at these locations.
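The mechanism can be approximated in plain Python. The following is only a
conceptual sketch of the fork/attach pattern described above; Numba injects the
equivalent calls at the LLVM IR level, and its real command file and breakpoint
handling are more involved:

.. code-block:: python

   import os
   import time

   def attach_gdb_sketch(extra_args=()):
       pid = os.getpid()  # PID of the process to be debugged
       if os.fork() == 0:
           # Child: replace this process with gdb attached to the parent.
           os.execvp("gdb", ["gdb", "-p", str(pid), *extra_args])
       else:
           # Parent: give gdb time to attach before carrying on.
           time.sleep(10)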
.. _debugging-cuda-python-code:
Debugging CUDA Python code
==========================
Using the simulator
-------------------
CUDA Python code can be run in the Python interpreter using the CUDA Simulator,
allowing it to be debugged with the Python debugger or with print statements. To
enable the CUDA simulator, set the environment variable
:envvar:`NUMBA_ENABLE_CUDASIM` to 1. For more information on the CUDA Simulator,
see :ref:`the CUDA Simulator documentation <simulator>`.
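For example (``my_cuda_script.py`` is a placeholder for your own program):

.. code-block:: none

   $ NUMBA_ENABLE_CUDASIM=1 python my_cuda_script.py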
Debug Info
----------
By setting the ``debug`` argument to ``cuda.jit`` to ``True``
(``@cuda.jit(debug=True)``), Numba will emit source location information in the
compiled CUDA code. Unlike the CPU target, only filename and line information
are available; no variable type information is emitted. This information
is sufficient to debug memory errors with
`cuda-memcheck <http://docs.nvidia.com/cuda/cuda-memcheck/index.html>`_.
For example, given the following CUDA Python code:
.. code-block:: python
:linenos:
import numpy as np
from numba import cuda
@cuda.jit(debug=True)
def foo(arr):
arr[cuda.threadIdx.x] = 1
arr = np.arange(30)
foo[1, 32](arr) # more threads than array elements
We can use ``cuda-memcheck`` to find the memory error:
.. code-block:: none
$ cuda-memcheck python chk_cuda_debug.py
========= CUDA-MEMCHECK
========= Invalid __global__ write of size 8
========= at 0x00000148 in /home/user/chk_cuda_debug.py:6:cudapy::__main__::foo$241(Array<__int64, int=1, C, mutable, aligned>)
========= by thread (31,0,0) in block (0,0,0)
========= Address 0x500a600f8 is out of bounds
...
=========
========= Invalid __global__ write of size 8
========= at 0x00000148 in /home/user/chk_cuda_debug.py:6:cudapy::__main__::foo$241(Array<__int64, int=1, C, mutable, aligned>)
========= by thread (30,0,0) in block (0,0,0)
========= Address 0x500a600f0 is out of bounds
...
==================================
Creating NumPy universal functions
==================================
There are two types of universal functions:
* Those which operate on scalars; these are "universal functions" or *ufuncs*
(see ``@vectorize`` below).
* Those which operate on higher dimensional arrays and scalars; these are
"generalized universal functions" or *gufuncs* (see ``@guvectorize`` below).
.. _vectorize:
The ``@vectorize`` decorator
============================
Numba's vectorize allows Python functions taking scalar input arguments to
be used as NumPy `ufuncs`_. Creating a traditional NumPy ufunc is
not the most straightforward process and involves writing some C code.
Numba makes this easy. Using the :func:`~numba.vectorize` decorator, Numba
can compile a pure Python function into a ufunc that operates over NumPy
arrays as fast as traditional ufuncs written in C.
.. _ufuncs: http://docs.scipy.org/doc/numpy/reference/ufuncs.html
Using :func:`~numba.vectorize`, you write your function as operating over
input scalars, rather than arrays. Numba will generate the surrounding
loop (or *kernel*) allowing efficient iteration over the actual inputs.
The :func:`~numba.vectorize` decorator has two modes of operation:
* Eager, or decoration-time, compilation: If you pass one or more type
signatures to the decorator, you will be building a NumPy universal
function (ufunc). The rest of this subsection describes building
ufuncs using decoration-time compilation.
* Lazy, or call-time, compilation: When not given any signatures, the
decorator will give you a Numba dynamic universal function
(:class:`~numba.DUFunc`) that dynamically compiles a new kernel when
called with a previously unsupported input type. A later
subsection, ":ref:`dynamic-universal-functions`", describes this mode in
more depth.
As described above, if you pass a list of signatures to the
:func:`~numba.vectorize` decorator, your function will be compiled
into a NumPy ufunc. In the basic case, only one signature will be
passed:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_vectorize_one_signature`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_vectorize_one_signature.begin
:end-before: magictoken.ex_vectorize_one_signature.end
:dedent: 12
:linenos:
If you pass several signatures, beware that you have to pass most specific
signatures before least specific ones (e.g., single-precision floats
before double-precision floats), otherwise type-based dispatching will not work
as expected:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_vectorize_multiple_signatures`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_vectorize_multiple_signatures.begin
:end-before: magictoken.ex_vectorize_multiple_signatures.end
:dedent: 12
:linenos:
The function will work as expected over the specified array types:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_vectorize_multiple_signatures`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_vectorize_return_call_one.begin
:end-before: magictoken.ex_vectorize_return_call_one.end
:dedent: 12
:linenos:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_vectorize_multiple_signatures`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_vectorize_return_call_two.begin
:end-before: magictoken.ex_vectorize_return_call_two.end
:dedent: 12
:linenos:
but it will fail working on other types::
>>> a = np.linspace(0, 1+1j, 6)
>>> f(a, a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: ufunc 'ufunc' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
You might ask yourself, "why would I go through this instead of compiling
a simple iteration loop using the :ref:`@jit <jit>` decorator?". The
answer is that NumPy ufuncs automatically get other features such as
reduction, accumulation or broadcasting. Using the example above:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_vectorize_multiple_signatures`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_vectorize_return_call_three.begin
:end-before: magictoken.ex_vectorize_return_call_three.end
:dedent: 12
:linenos:
.. seealso::
`Standard features of ufuncs <http://docs.scipy.org/doc/numpy/reference/ufuncs.html#ufunc>`_ (NumPy documentation).
.. note::
Only the broadcasting features of ufuncs are supported in compiled code.
The :func:`~numba.vectorize` decorator supports multiple ufunc targets:
================= ===============================================================
Target Description
================= ===============================================================
cpu Single-threaded CPU
parallel Multi-core CPU
cuda CUDA GPU
.. NOTE:: This creates a *ufunc-like* object.
See `documentation for CUDA ufunc <../cuda/ufunc.html>`_ for detail.
================= ===============================================================
A general guideline is to choose different targets for different data sizes
and algorithms.
The "cpu" target works well for small data sizes (approx. less than 1KB) and low
compute intensity algorithms. It has the least amount of overhead.
The "parallel" target works well for medium data sizes (approx. less than 1MB).
Threading adds a small delay.
The "cuda" target works well for big data sizes (approx. greater than 1MB) and
high compute intensity algorithms. Transferring memory to and from the GPU adds
significant overhead.
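As a sketch, the target is selected via the ``target`` keyword argument; the
function and signature below are purely illustrative:

.. code-block:: python

   from numba import vectorize

   @vectorize(["float64(float64, float64)"], target="parallel")
   def rel_diff(x, y):
       return 2 * (x - y) / (x + y)

Changing only the ``target`` string moves the same kernel between the backends
listed above (the ``cuda`` target additionally requires a suitable GPU and
driver).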
.. _guvectorize:
The ``@guvectorize`` decorator
==============================
While :func:`~numba.vectorize` allows you to write ufuncs that work on one
element at a time, the :func:`~numba.guvectorize` decorator takes the concept
one step further and allows you to write ufuncs that will work on an
arbitrary number of elements of input arrays, and take and return arrays of
differing dimensions. The typical example is a running median or a
convolution filter.
Contrary to :func:`~numba.vectorize` functions, :func:`~numba.guvectorize`
functions don't return their result value: they take it as an array
argument, which must be filled in by the function. This is because the
array is actually allocated by NumPy's dispatch mechanism, which calls into
the Numba-generated code.
Similar to the :func:`~numba.vectorize` decorator, :func:`~numba.guvectorize`
also has two modes of operation: eager, or decoration-time, compilation and
lazy, or call-time, compilation.
Here is a very simple example:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize.begin
:end-before: magictoken.ex_guvectorize.end
:dedent: 12
:linenos:
The underlying Python function simply adds a given scalar (``y``) to all
elements of a 1-dimensional array. What's more interesting is the declaration.
There are two things there:
* the declaration of input and output *layouts*, in symbolic form:
``(n),()->(n)`` tells NumPy that the function takes an *n*-element one-dimensional
array, a scalar (symbolically denoted by the empty tuple ``()``) and
returns an *n*-element one-dimensional array;
* the list of supported concrete *signatures* as per ``@vectorize``; here,
as in the above example, we demonstrate ``int64`` arrays.
.. note::
A 1D array type can also receive scalar arguments (those with shape ``()``).
In the above example, the second argument also could be declared as
``int64[:]``. In that case, the value must be read by ``y[0]``.
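A sketch of the variant described in the note, with the scalar declared as a
1D array (the function body is illustrative):

.. code-block:: python

   from numba import guvectorize, int64

   @guvectorize([(int64[:], int64[:], int64[:])], "(n),()->(n)")
   def g(x, y, res):
       for i in range(x.shape[0]):
           res[i] = x[i] + y[0]  # the scalar arrives as a 1-element array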
We can now check what the compiled ufunc does, over a simple example:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize_call_one.begin
:end-before: magictoken.ex_guvectorize_call_one.end
:dedent: 12
:linenos:
The nice thing is that NumPy will automatically dispatch over more
complicated inputs, depending on their shapes:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize_call_two.begin
:end-before: magictoken.ex_guvectorize_call_two.end
:dedent: 12
:linenos:
.. note::
Both :func:`~numba.vectorize` and :func:`~numba.guvectorize` support
passing ``nopython=True`` :ref:`as in the @jit decorator <jit-nopython>`.
Use it to ensure the generated code does not fall back to
:term:`object mode`.
.. _scalar-return-values:
Scalar return values
--------------------
Now suppose we want to return a scalar value from
:func:`~numba.guvectorize`. To do this, we need to:
* in the signatures, declare the scalar return with ``[:]`` like
a 1-dimensional array (eg. ``int64[:]``),
* in the layout, declare it as ``()``,
* in the implementation, write to the first element (e.g. ``res[0] = acc``).
The following example function computes the sum of the 1-dimensional
array (``x``) plus the scalar (``y``) and returns it as a scalar:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize_scalar_return`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize_scalar_return.begin
:end-before: magictoken.ex_guvectorize_scalar_return.end
:dedent: 12
:linenos:
Now if we apply the wrapped function over the array, we get a scalar
value as the output:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize_scalar_return`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize_scalar_return_call.begin
:end-before: magictoken.ex_guvectorize_scalar_return_call.end
:dedent: 12
:linenos:
.. _overwriting-input-values:
Overwriting input values
------------------------
In most cases, writing to inputs may also appear to work - however, this
behaviour cannot be relied on. Consider the following example function:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize_overwrite`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize_overwrite.begin
:end-before: magictoken.ex_guvectorize_overwrite.end
:dedent: 12
:linenos:
Calling the `init_values` function with an array of `float64` type results in
visible changes to the input:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize_overwrite`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize_overwrite_call_one.begin
:end-before: magictoken.ex_guvectorize_overwrite_call_one.end
:dedent: 12
:linenos:
This works because NumPy can pass the input data directly into the `init_values`
function as the data `dtype` matches that of the declared argument. However, it
may also create and pass in a temporary array, in which case changes to the
input are lost. For example, this can occur when casting is required. To
demonstrate, we can use an array of `float32` with the `init_values` function:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize_overwrite`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize_overwrite_call_two.begin
:end-before: magictoken.ex_guvectorize_overwrite_call_two.end
:dedent: 12
:linenos:
In this case, there is no change to the `invals` array because the temporary
casted array was mutated instead.
To solve this problem, one needs to tell the GUFunc engine that the ``invals``
argument is writable. This can be achieved by passing ``writable_args=('invals',)``
(specifying by name), or ``writable_args=(0,)`` (specifying by position) to
``@guvectorize``. Now, the code above works as expected:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize_overwrite`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize_overwrite_call_three.begin
:end-before: magictoken.ex_guvectorize_overwrite_call_three.end
:dedent: 12
:linenos:
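The decorator call looks like the following sketch (a generic gufunc, not the
``init_values`` function from the linked example; the signature, layout and
body are illustrative):

.. code-block:: python

   from numba import guvectorize, float64

   # Declare the first argument writable so that mutations propagate back to
   # the caller's array even when NumPy has to cast the input.
   @guvectorize([(float64[:], float64[:])], "(n)->(n)", writable_args=(0,))
   def fill_and_copy(invals, out):
       for i in range(invals.shape[0]):
           invals[i] = 1.0
           out[i] = invals[i]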
.. _dynamic-universal-functions:
Dynamic universal functions
===========================
As described above, if you do not pass any signatures to the
:func:`~numba.vectorize` decorator, your Python function will be used
to build a dynamic universal function, or :class:`~numba.DUFunc`. For
example:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_vectorize_dynamic`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_vectorize_dynamic.begin
:end-before: magictoken.ex_vectorize_dynamic.end
:dedent: 12
:linenos:
The resulting :func:`f` is a :class:`~numba.DUFunc` instance that
starts with no supported input types. As you make calls to :func:`f`,
Numba generates new kernels whenever you pass a previously unsupported
input type. Given the example above, the following set of interpreter
interactions illustrate how dynamic compilation works::
>>> f
<numba._DUFunc 'f'>
>>> f.ufunc
<ufunc 'f'>
>>> f.ufunc.types
[]
The example above shows that :class:`~numba.DUFunc` instances are not
ufuncs. Rather than subclassing ufuncs, :class:`~numba.DUFunc`
instances work by keeping a :attr:`~numba.DUFunc.ufunc` member, and
then delegating ufunc property reads and method calls to this member
(also known as type aggregation). When we look at the initial types
supported by the ufunc, we can verify there are none.
Let's try to make a call to :func:`f`:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_vectorize_dynamic`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_vectorize_dynamic_call_one.begin
:end-before: magictoken.ex_vectorize_dynamic_call_one.end
:dedent: 12
:linenos:
If this was a normal NumPy ufunc, we would have seen an exception
complaining that the ufunc couldn't handle the input types. When we
call :func:`f` with integer arguments, not only do we receive an
answer, but we can verify that Numba created a loop supporting C
:code:`long` integers.
We can add additional loops by calling :func:`f` with different inputs:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_vectorize_dynamic`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_vectorize_dynamic_call_two.begin
:end-before: magictoken.ex_vectorize_dynamic_call_two.end
:dedent: 12
:linenos:
We can now verify that Numba added a second loop for dealing with
floating-point inputs, :code:`"dd->d"`.
If we mix input types to :func:`f`, we can verify that `NumPy ufunc
casting rules`_ are still in effect:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_vectorize_dynamic`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_vectorize_dynamic_call_three.begin
:end-before: magictoken.ex_vectorize_dynamic_call_three.end
:dedent: 12
:linenos:
.. _`NumPy ufunc casting rules`: http://docs.scipy.org/doc/numpy/reference/ufuncs.html#casting-rules
This example demonstrates that calling :func:`f` with mixed types
caused NumPy to select the floating-point loop, and cast the integer
argument to a floating-point value. Thus, Numba did not create a
special :code:`"dl->d"` kernel.
This :class:`~numba.DUFunc` behavior leads us to a point similar to
the warning given above in "`The @vectorize decorator`_" subsection,
but instead of signature declaration order in the decorator, call
order matters. If we had passed in floating-point arguments first,
any calls with integer arguments would be cast to double-precision
floating-point values. For example:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_vectorize_dynamic`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_vectorize_dynamic_call_four.begin
:end-before: magictoken.ex_vectorize_dynamic_call_four.end
:dedent: 12
:linenos:
If you require precise support for various type signatures, you should
specify them in the :func:`~numba.vectorize` decorator, and not rely
on dynamic compilation.
Dynamic generalized universal functions
=======================================
Similar to a dynamic universal function, if you do not specify any types to
the :func:`~numba.guvectorize` decorator, your Python function will be used
to build a dynamic generalized universal function, or :class:`~numba.GUFunc`.
For example:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize_dynamic`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize_dynamic.begin
:end-before: magictoken.ex_guvectorize_dynamic.end
:dedent: 12
:linenos:
We can verify the resulting function :func:`g` is a :class:`~numba.GUFunc`
instance that starts with no supported input types. For instance::
>>> g
<numba._GUFunc 'g'>
>>> g.ufunc
<ufunc 'g'>
>>> g.ufunc.types
[]
Similar to a :class:`~numba.DUFunc`, as one makes calls to :func:`g()`,
Numba generates new kernels for previously unsupported input types. The
following set of interpreter interactions will illustrate how dynamic
compilation works for a :class:`~numba.GUFunc`:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize_dynamic`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize_dynamic_call_one.begin
:end-before: magictoken.ex_guvectorize_dynamic_call_one.end
:dedent: 12
:linenos:
If this was a normal :func:`guvectorize` function, we would have seen an
exception complaining that the ufunc could not handle the given input types.
When we call :func:`g()` with the input arguments, Numba creates a new loop
for the input types.
We can add additional loops by calling :func:`g` with new arguments:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize_dynamic`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize_dynamic_call_two.begin
:end-before: magictoken.ex_guvectorize_dynamic_call_two.end
:dedent: 12
:linenos:
We can now verify that Numba added a second loop for dealing with
floating-point inputs, :code:`"dd->d"`.
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize_dynamic`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize_dynamic_call_three.begin
:end-before: magictoken.ex_guvectorize_dynamic_call_three.end
:dedent: 12
:linenos:
One can also verify that NumPy ufunc casting rules are working as expected:
.. literalinclude:: ../../../numba/tests/doc_examples/test_examples.py
:language: python
:caption: from ``test_guvectorize_dynamic`` of ``numba/tests/doc_examples/test_examples.py``
:start-after: magictoken.ex_guvectorize_dynamic_call_four.begin
:end-before: magictoken.ex_guvectorize_dynamic_call_four.end
:dedent: 12
:linenos:
If you need precise support for various type signatures, you should not rely
on dynamic compilation and should instead specify the types as the first
argument in the :func:`~numba.guvectorize` decorator.
============================================================
Callback into the Python Interpreter from within JIT'ed code
============================================================
There are rare but real cases when a nopython-mode function needs to call back
into the Python interpreter to invoke code that cannot be compiled by Numba.
Such cases include:
- logging progress for long-running JIT'ed functions;
- using data structures that are not currently supported by Numba;
- debugging inside JIT'ed code using the Python debugger.
When Numba calls back into the Python interpreter, the following has to happen:
- acquire the GIL;
- convert values in native representation back into Python objects;
- call back into the Python interpreter;
- convert returned values from the Python code into native representation;
- release the GIL.
These steps can be expensive. Users **should not** rely on the feature
described here on performance-critical paths.
.. _with_objmode:
The ``objmode`` context-manager
===============================
.. warning:: This feature can be easily mis-used. Users should first consider
alternative approaches to achieve their intended goal before using
this feature.
.. autofunction:: numba.objmode
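As a minimal sketch of typical usage (``py_work`` is just a stand-in for
whatever interpreter-only code needs to be invoked, and the type given to
``objmode`` is the type the variable ``y`` will have when execution leaves the
block and re-enters compiled code):

.. code-block:: python

   from numba import njit, objmode

   def py_work(x):
       # arbitrary interpreter-only work, e.g. logging or an unsupported library
       print("called back into the interpreter with", x)
       return x + 1

   @njit
   def f(x):
       y = x * 2
       with objmode(y='int64'):  # 'y' comes back typed as int64
           y = py_work(y)
       return y + 1

   f(10)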
Changelog
=========
This directory contains "news fragments" which are short files that contain a
small **ReST**-formatted text that will be added to the next what's new page.
Make sure to use full sentences with correct case and punctuation, and please
try to use Sphinx intersphinx using backticks. The fragment should have a
header line and an underline using ``""""""""`` followed by a description of
your user-facing changes as they should appear in the release notes.
Each file should be named like ``<PULL REQUEST>.<TYPE>.rst``, where
``<PULL REQUEST>`` is a pull request number, and ``<TYPE>`` is one of:
* ``highlight``: Adds a highlighted bullet point to use as a possible highlight
of the release.
* ``np_support``: Addition of new NumPy functionality.
* ``deprecation``: Changes to existing code that will now emit a deprecation warning.
* ``expired``: Removal of a deprecated part of the API.
* ``compatibility``: A change which requires users to change code and is not
backwards compatible. (Not to be used for removal of deprecated features.)
* ``cuda``: Changes in the CUDA target implementation.
* ``new_feature``: New user facing features like ``kwargs``.
* ``improvement``: General improvements and edge-case changes which are
not new features or compatibility related.
* ``performance``: Performance changes that should not affect other behaviour.
* ``change``: Other changes
* ``doc``: Documentation related changes.
* ``infrastructure``: Infrastructure/CI related changes.
* ``bug_fix``: Bug fixes for existing features/functionality.
If you are unsure what pull request type to use, don't hesitate to ask in your
PR.
You can install ``towncrier`` and run ``towncrier build --draft``
if you want to get a preview of how your change will look in the final release
notes.
.. note::
This README was adapted from the NumPy changelog readme under the terms of
the `BSD-3 licence <https://github.com/numpy/numpy/blob/c1ffdbc0c29d48ece717acb5bfbf811c935b41f6/LICENSE.txt>`_.
{% set title = "Version {} (Release Date)".format(versiondata.version) %}
{{ title }}
{{ "-" * title|length }}
{% for section, _ in sections.items() %}
{% if section %}{{ section }}
{{ "~" * section|length }}
{% endif %}
{% if sections[section] %}
{% for category, val in definitions.items() if category in sections[section] %}
{{ definitions[category]['name'] }}
{{ "~" * definitions[category]['name']|length }}
{% if definitions[category]['showcontent'] %}
{% for text, values in sections[section][category].items() %}
{{ text }}
{{ get_indent(text) }}({{values|join(', ') }})
{% endfor %}
{% else %}
- {{ sections[section][category]['']|join(', ') }}
{% endif %}
{% if sections[section][category]|length == 0 %}
No significant changes.
{% else %}
{% endif %}
{% endfor %}
{% else %}
No significant changes.
{% endif %}
{% endfor %}
#! /usr/bin/env python
import sys
import os
from github3 import login
import github3
def fetch(orgname, reponame, last_num, gh):
    """Collect issues and pull requests from ``orgname/reponame`` newer than ``last_num``."""
repo = gh.repository(orgname, reponame)
issues = repo.issues(state='all')
opened_issues = []
closed_issues = []
opened_prs = []
closed_prs = []
max_iss_num = 0
for issue in issues:
info = issue.as_dict()
iss_num = int(info['number'])
max_iss_num = max(max_iss_num, iss_num)
if iss_num <= last_num:
break
merged = False
if issue.pull_request_urls:
# Is PR?
merged = bool(info['pull_request'].get("merged_at"))
where = {'opened': opened_prs, 'closed': closed_prs}
else:
# Is a plain issue (not a PR)
where = {'opened': opened_issues, 'closed': closed_issues}
line = f"{' - merged ' if merged else ''}- [{reponame}"\
f"#{info['number']}]({info['html_url']}) - {info['title']}"
# Is the issue/PR already closed?
if issue.is_closed():
where['closed'].append(line)
else:
where['opened'].append(line)
return {
'opened_issues': opened_issues,
'closed_issues': closed_issues,
'opened_prs': opened_prs,
'closed_prs': closed_prs,
'max_iss_num': max_iss_num,
}
def display(data):
print("## 1. New Issues")
for line in reversed(data['opened_issues']):
print(line)
print()
print("### Closed Issues")
for line in reversed(data['closed_issues']):
print(line)
print()
print("## 2. New PRs")
for line in reversed(data['opened_prs']):
print(line)
print()
print("### Closed PRs")
for line in reversed(data['closed_prs']):
print(line)
print()
def main(numba_last_num, llvmlite_last_num, user=None, password=None):
if user is not None and password is not None:
gh = login(str(user), password=str(password))
else:
gh = github3
numba_data = fetch("numba", "numba", numba_last_num, gh)
llvmlite_data = fetch("numba", "llvmlite", llvmlite_last_num, gh)
# combine data
data = {
'opened_issues':
llvmlite_data['opened_issues'] +
numba_data['opened_issues'],
'closed_issues':
llvmlite_data['closed_issues'] +
numba_data['closed_issues'],
'opened_prs':
llvmlite_data['opened_prs'] +
numba_data['opened_prs'],
'closed_prs':
llvmlite_data['closed_prs'] +
numba_data['closed_prs'],
}
display(data)
print(f"(last numba: {numba_data['max_iss_num']};"
f"llvmlite {llvmlite_data['max_iss_num']})")
help_msg = """
Usage:
{program_name} <numba_last_num> <llvmlite_last_num>
"""
if __name__ == '__main__':
program_name = sys.argv[0]
try:
[numba_last_num, llvmlite_last_num] = sys.argv[1:]
except ValueError:
print(help_msg.format(program_name=program_name))
else:
main(int(numba_last_num),
int(llvmlite_last_num),
user=os.environ.get("GHUSER"),
password=os.environ.get("GHPASS"))
"""gitlog2changelog.py
Usage:
gitlog2changelog.py (-h | --help)
gitlog2changelog.py --version
gitlog2changelog.py --token=<token> --beginning=<tag> --repo=<repo> \
--digits=<digits> [--summary]
Options:
-h --help Show this screen.
--version Show version.
--beginning=<tag> Where in the history to begin
--repo=<repo> Which repository to look at on GitHub
--token=<token> The GitHub token to talk to the API
--digits=<digits> The number of digits to use in the issue finding regex
--summary Show total count for each section
"""
import re
from git import Repo
from docopt import docopt
from github import Github
ghrepo = None
def get_pr(pr_number):
return ghrepo.get_pull(pr_number)
def hyperlink_user(user_obj):
return "`%s <%s>`_" % (user_obj.login, user_obj.html_url)
if __name__ == '__main__':
arguments = docopt(__doc__, version='1.0')
beginning = arguments['--beginning']
target_ghrepo = arguments['--repo']
github_token = arguments['--token']
regex_digits = arguments['--digits']
summary = arguments["--summary"]
ghrepo = Github(github_token).get_repo(target_ghrepo)
repo = Repo('.')
all_commits = [x for x in repo.iter_commits(f'{beginning}..HEAD')]
merge_commits = [x for x in all_commits
if 'Merge pull request' in x.message]
prmatch = re.compile(
f'^Merge pull request #([0-9]{{{regex_digits}}}) from.*')
ordered = {}
authors = set()
for x in merge_commits:
match = prmatch.match(x.message)
if match:
issue_id = match.groups()[0]
ordered[issue_id] = "%s" % (x.message.splitlines()[2])
print("Pull-Requests:\n")
for k in sorted(ordered.keys()):
pull = get_pr(int(k))
hyperlink = "`#%s <%s>`_" % (k, pull.html_url)
# get all users for all commits
pr_authors = set()
for c in pull.get_commits():
if c.author:
pr_authors.add(c.author)
if c.committer and c.committer.login != "web-flow":
pr_authors.add(c.committer)
print("* PR %s: %s (%s)" % (hyperlink, ordered[k],
" ".join([hyperlink_user(u) for u in
pr_authors])))
for a in pr_authors:
authors.add(a)
if summary:
print("\nTotal PRs: %s\n" % len(ordered))
else:
print()
print("Authors:\n")
for author in sorted(authors, key=lambda x: x.login.lower()):
    print('* %s' % hyperlink_user(author))
if summary:
print("\nTotal authors: %s" % len(authors))
# Global options:
[mypy]
warn_unused_configs = True
follow_imports = silent
show_error_context = True
files = **/numba/core/types/*.py, **/numba/core/datamodel/*.py, **/numba/core/rewrites/*.py, **/numba/core/unsafe/*.py, **/numba/core/rvsdg_frontend/*.py, **/numba/core/rvsdg_frontend/rvsdg/*.py
# Per-module options:
# To classify a given module as Level 1, 2 or 3 it must be added both to "files" (the variable above) and to the relevant list below; an illustrative, commented-out example is shown after the Level 2 settings.
# Level 1 - modules checked on the strictest settings.
;[mypy-]
;warn_return_any = True
;disallow_any_expr = True
;disallow_any_explicit = True
;disallow_any_generics = True
;disallow_subclassing_any = True
;disallow_untyped_calls = True
;disallow_untyped_defs = True
;disallow_incomplete_defs = True
;check_untyped_defs = True
;disallow_untyped_decorators = True
;warn_unused_ignores = True
;follow_imports = normal
;warn_unreachable = True
;strict_equality = True
# Level 2 - modules that pass reasonably strict settings.
# No untyped functions allowed. Imports must be typed or explicitly ignored.
;[mypy-]
;warn_return_any = True
;disallow_untyped_defs = True
;disallow_incomplete_defs = True
;follow_imports = normal
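# For illustration only: a hypothetical Level 2 entry. The module path below is
# an assumption for this sketch; to actually enable it, the matching glob
# (e.g. **/numba/core/typing/*.py) would also need to be added to the "files"
# setting in the global section above.
;[mypy-numba.core.typing.*]
;warn_return_any = True
;disallow_untyped_defs = True
;disallow_incomplete_defs = True
;follow_imports = normal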
# Level 3 - modules that pass mypy default settings (only those in `files` global setting and not in previous levels)
# Function/variables are annotated to avoid mypy errors, but annotations are not complete.
[mypy-numba.core.*]
warn_return_any = True
# Level 4 - modules that do not pass mypy check: they are excluded from "files" setting in global section
# External packages that lack annotations
[mypy-llvmlite.*]
ignore_missing_imports = True
[mypy-numpy.*]
ignore_missing_imports = True
[mypy-winreg.*]
# this can be removed after Mypy 0.78 is out with the latest typeshed
ignore_missing_imports = True
[mypy-numba_rvsdg.*]
ignore_missing_imports = True
[mypy-graphviz.*]
ignore_missing_imports = True