Event API
=========
.. automodule:: numba.core.event
   :members:
.. _arch-generators:
===================
Notes on generators
===================
Numba recently gained support for compiling generator functions. This
document explains some of the implementation choices.
Terminology
===========
For clarity, we distinguish between *generator functions* and
*generators*. A generator function is a function containing one or
several ``yield`` statements. A generator (sometimes also called "generator
iterator") is the return value of a generator function; it resumes
execution inside its frame each time :py:func:`next` is called.
A *yield point* is the place where a ``yield`` statement is called.
A *resumption point* is the place just after a *yield point* where execution
is resumed when :py:func:`next` is called again.
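
For instance, in pure Python (a minimal sketch illustrating the terminology only)::

    def gen_func(x):       # a generator function: it contains ``yield``
        yield x            # yield point; the resumption point is just after it
        yield x + 1

    g = gen_func(10)       # ``g`` is a generator (generator iterator)
    next(g)                # returns 10, running up to the first yield point
    next(g)                # returns 11, resuming at the first resumption point
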
Function analysis
=================
Suppose we have the following simple generator function::

    def gen(x, y):
        yield x + y
        yield x - y

Here is its CPython bytecode, as printed out using :py:func:`dis.dis`::
7 0 LOAD_FAST 0 (x)
3 LOAD_FAST 1 (y)
6 BINARY_ADD
7 YIELD_VALUE
8 POP_TOP
8 9 LOAD_FAST 0 (x)
12 LOAD_FAST 1 (y)
15 BINARY_SUBTRACT
16 YIELD_VALUE
17 POP_TOP
18 LOAD_CONST 0 (None)
21 RETURN_VALUE
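
The listing above can be reproduced with a small script (a sketch; the exact
opcodes vary with the CPython version)::

    import dis

    def gen(x, y):
        yield x + y
        yield x - y

    dis.dis(gen)
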
When compiling this function with :envvar:`NUMBA_DUMP_IR` set to 1, the
following information is printed out::
----------------------------------IR DUMP: gen----------------------------------
label 0:
x = arg(0, name=x) ['x']
y = arg(1, name=y) ['y']
$0.3 = x + y ['$0.3', 'x', 'y']
$0.4 = yield $0.3 ['$0.3', '$0.4']
del $0.4 []
del $0.3 []
$0.7 = x - y ['$0.7', 'x', 'y']
del y []
del x []
$0.8 = yield $0.7 ['$0.7', '$0.8']
del $0.8 []
del $0.7 []
$const0.9 = const(NoneType, None) ['$const0.9']
$0.10 = cast(value=$const0.9) ['$0.10', '$const0.9']
del $const0.9 []
return $0.10 ['$0.10']
------------------------------GENERATOR INFO: gen-------------------------------
generator state variables: ['$0.3', '$0.7', 'x', 'y']
yield point #1: live variables = ['x', 'y'], weak live variables = ['$0.3']
yield point #2: live variables = [], weak live variables = ['$0.7']
What does it mean? The first part is the Numba IR, as already seen in
:ref:`arch_generate_numba_ir`. We can see the two yield points (``yield $0.3``
and ``yield $0.7``).
The second part shows generator-specific information. To understand it
we have to understand what suspending and resuming a generator means.
When suspending a generator, we are not merely returning a value to the
caller (the operand of the ``yield`` statement). We also have to save the
generator's *current state* in order to resume execution. In trivial use
cases, perhaps the CPU's register values or stack slots would be preserved
until the next call to next(). However, any non-trivial case will hopelessly
clobber those values, so we have to save them in a well-defined place.
What are the values we need to save? Well, in the context of the Numba
Intermediate Representation, we must save all *live variables* at each
yield point. These live variables are computed thanks to the control
flow graph.
Once live variables are saved and the generator is suspended, resuming
the generator simply involves the inverse operation: the live variables
are restored from the saved generator state.
.. note::
It is the same analysis which helps insert Numba ``del`` instructions
where appropriate.
Let's go over the generator info again::
generator state variables: ['$0.3', '$0.7', 'x', 'y']
yield point #1: live variables = ['x', 'y'], weak live variables = ['$0.3']
yield point #2: live variables = [], weak live variables = ['$0.7']
Numba has computed the union of all live variables (denoted as "state
variables"). This will help define the layout of the :ref:`generator
structure <generator-structure>`. Also, for each yield point, we have
computed two sets of variables:
* the *live variables* are the variables which are used by code following
the resumption point (i.e. after the ``yield`` statement)
* the *weak live variables* are variables which are del'ed immediately
after the resumption point; they have to be saved in :term:`object mode`,
to ensure proper reference cleanup
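
The dumps above can be reproduced with a short script (a sketch; it assumes
:envvar:`NUMBA_DUMP_IR` is set to ``1`` in the environment before Numba is
imported)::

    from numba import njit

    @njit
    def gen(x, y):
        yield x + y
        yield x - y

    # Compiling for a concrete signature triggers the IR and generator info dumps.
    list(gen(1.0, 2.0))
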
.. _generator-structure:
The generator structure
=======================
Layout
------
Function analysis helps us gather enough information to define the
layout of the generator structure, which will store the entire execution
state of a generator. Here is a sketch of the generator structure's layout,
in pseudo-code::

   struct gen_struct_t {
      int32_t resume_index;
      struct gen_args_t {
         arg_0_t arg0;
         arg_1_t arg1;
         ...
         arg_N_t argN;
      }
      struct gen_state_t {
         state_0_t state_var0;
         state_1_t state_var1;
         ...
         state_N_t state_varN;
      }
   }

Let's describe those fields in order.
* The first member, the *resume index*, is an integer telling the generator
at which resumption point execution must resume. By convention, it can
have two special values: 0 means execution must start at the beginning of
the generator (i.e. the first time :py:func:`next` is called); -1 means
the generator is exhausted and resumption must immediately raise
StopIteration. Other values indicate the yield point's index starting from 1
(corresponding to the indices shown in the generator info above).
* The second member, the *arguments structure*, is read-only after it is first
initialized. It stores the values of the arguments the generator function
was called with. In our example, these are the values of ``x`` and ``y``.
* The third member, the *state structure*, stores the live variables as
computed above.
Concretely, our example's generator structure (assuming the generator
function is called with floating-point numbers) is then::

   struct gen_struct_t {
      int32_t resume_index;
      struct gen_args_t {
         double arg0;
         double arg1;
      }
      struct gen_state_t {
         double $0.3;
         double $0.7;
         double x;
         double y;
      }
   }

Note that here, saving ``x`` and ``y`` is redundant: Numba isn't able to
recognize that the state variables ``x`` and ``y`` have the same value
as ``arg0`` and ``arg1``.
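
For illustration only, a hypothetical :py:mod:`ctypes` mirror of this concrete
layout might look as follows (Numba does not expose the structure this way and
the field names are invented)::

    import ctypes

    class GenArgs(ctypes.Structure):
        _fields_ = [("arg0", ctypes.c_double), ("arg1", ctypes.c_double)]

    class GenState(ctypes.Structure):
        # one slot per generator state variable: $0.3, $0.7, x, y
        _fields_ = [("tmp0", ctypes.c_double), ("tmp1", ctypes.c_double),
                    ("x", ctypes.c_double), ("y", ctypes.c_double)]

    class GenStruct(ctypes.Structure):
        _fields_ = [("resume_index", ctypes.c_int32),
                    ("args", GenArgs),
                    ("state", GenState)]
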
Allocation
----------
How does Numba ensure the generator structure is preserved long enough?
There are two cases:
* When a Numba-compiled generator function is called from a Numba-compiled
function, the structure is allocated on the stack by the callee. In this
case, generator instantiation is practically costless.
* When a Numba-compiled generator function is called from regular Python
code, a CPython-compatible wrapper is instantiated that has the right
amount of allocated space to store the structure, and whose
:c:member:`~PyTypeObject.tp_iternext` slot is a wrapper around the
generator's native code.
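
To illustrate the two cases above (a minimal sketch)::

    from numba import njit

    @njit
    def gen(x, y):
        yield x + y
        yield x - y

    @njit
    def consume(x, y):
        total = 0.0
        # Called from Numba-compiled code: the generator structure can live on
        # the stack, so instantiation is practically free.
        for v in gen(x, y):
            total += v
        return total

    consume(1.0, 2.0)

    # Called from regular Python code: a CPython-compatible wrapper object
    # holding the structure is instantiated instead.
    list(gen(1.0, 2.0))
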
Compiling to native code
========================
When compiling a generator function, three native functions are actually
generated by Numba:
* An initialization function. This is the function corresponding
to the generator function itself: it receives the function arguments and
stores them inside the generator structure (which is passed by pointer).
It also initializes the *resume index* to 0, indicating that the generator
hasn't started yet.
* A next() function. This is the function called to resume execution
inside the generator. Its single argument is a pointer to the generator
structure and it returns the next yielded value (or a special exit code
is used if the generator is exhausted, for quick checking when called
from Numba-compiled functions).
* An optional finalizer. In object mode, this function ensures that all
live variables stored in the generator state are decref'ed, even if the
generator is destroyed without having been exhausted.
The next() function
-------------------
The next() function is the least straightforward of the three native
functions. It starts with a trampoline which dispatches execution to the
right resume point depending on the *resume index* stored in the generator
structure. Here is what the start of the function may look like in our example:
.. code-block:: llvm

   define i32 @"__main__.gen.next"(
      double* nocapture %retptr,
      { i8*, i32 }** nocapture readnone %excinfo,
      i8* nocapture readnone %env,
      { i32, { double, double }, { double, double, double, double } }* nocapture %arg.gen)
   {
     entry:
      %gen.resume_index = getelementptr { i32, { double, double }, { double, double, double, double } }* %arg.gen, i64 0, i32 0
      %.47 = load i32* %gen.resume_index, align 4
      switch i32 %.47, label %stop_iteration [
         i32 0, label %B0
         i32 1, label %generator_resume1
         i32 2, label %generator_resume2
      ]
      ; rest of the function snipped

(uninteresting stuff trimmed from the LLVM IR to make it more readable)
We recognize the pointer to the generator structure in ``%arg.gen``.
The trampoline switch has three targets (one for each *resume index* 0, 1
and 2), and a fallback target label named ``stop_iteration``. Label ``B0``
represents the function's start, ``generator_resume1`` (resp.
``generator_resume2``) is the resumption point after the first
(resp. second) yield point.
After generation by LLVM, the whole native assembly code for this function
may look like this (on x86-64):
.. code-block:: asm
.globl __main__.gen.next
.align 16, 0x90
__main__.gen.next:
movl (%rcx), %eax
cmpl $2, %eax
je .LBB1_5
cmpl $1, %eax
jne .LBB1_2
movsd 40(%rcx), %xmm0
subsd 48(%rcx), %xmm0
movl $2, (%rcx)
movsd %xmm0, (%rdi)
xorl %eax, %eax
retq
.LBB1_5:
movl $-1, (%rcx)
jmp .LBB1_6
.LBB1_2:
testl %eax, %eax
jne .LBB1_6
movsd 8(%rcx), %xmm0
movsd 16(%rcx), %xmm1
movaps %xmm0, %xmm2
addsd %xmm1, %xmm2
movsd %xmm1, 48(%rcx)
movsd %xmm0, 40(%rcx)
movl $1, (%rcx)
movsd %xmm2, (%rdi)
xorl %eax, %eax
retq
.LBB1_6:
movl $-3, %eax
retq
Note that the function returns 0 to indicate a value is yielded and -3 to indicate
StopIteration. ``%rcx`` points to the start of the generator structure,
where the resume index is stored.
================
Notes on Hashing
================
Numba supports the built-in :func:`hash` and does so by simply calling the
:func:`__hash__` member function on the supplied argument. This makes it
trivial to add hash support for new types: all that is required is to use the
extension API :func:`overload_method` decorator to register an overload of the
new type's :func:`__hash__` method that computes the hash value. For example::

    from numba.extending import overload_method

    @overload_method(myType, '__hash__')
    def myType_hash_overload(obj):
        # implementation details

The Implementation
==================
The implementation of the Numba hashing functions strictly follows that of
Python 3. The only exception to this is that for hashing Unicode and bytes (for
content longer than ``sys.hash_info.cutoff``) the only supported algorithm is
``siphash24`` (default in CPython 3). As a result Numba will match Python 3
hash values for all supported types under the default conditions described.
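
As a quick illustration of the built-in support (a minimal sketch; equality
with CPython's values assumes the default hashing configuration described
above)::

    from numba import njit

    @njit
    def hash_it(x):
        # hash() simply resolves to the __hash__ overload for the type of x
        return hash(x)

    assert hash_it((1, 2.0)) == hash((1, 2.0))
    assert hash_it('numba') == hash('numba')
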
Unicode hash cache differences
------------------------------
Both Numba and CPython Unicode string internal representations have a ``hash``
member for the purposes of caching the string's hash value. This member is
always checked ahead of computing a hash value with the view of simply providing
a value from cache as it is considerably cheaper to do so. The Numba Unicode
string hash caching implementation behaves in a similar way to that of
CPython's. The only notable behavioral change (and its only impact is a minor
potential change in performance) is that Numba always computes and caches the
hash for Unicode strings created in ``nopython mode`` at the time they are boxed
for reuse in Python. This is too eager in some cases compared to CPython,
which may delay hashing a new Unicode string depending on how it was created. It
should also be noted that Numba copies in the ``hash`` member of the CPython
internal representation for Unicode strings when unboxing them to its own
representation so as to not recompute the hash of a string that already has a
hash value associated with it.
The accommodation of ``PYTHONHASHSEED``
---------------------------------------
The ``PYTHONHASHSEED`` environment variable can be used to seed the CPython
hashing algorithms for e.g. the purposes of reproducibility. The Numba hashing
implementation directly reads the CPython hashing algorithms' internal state and
as a result the influence of ``PYTHONHASHSEED`` is replicated in Numba's
hashing implementations.
.. _developer-manual:
Developer Manual
================
.. toctree::
   :maxdepth: 2

   contributing.rst
   release.rst
   repomap.rst
   architecture.rst
   dispatching.rst
   generators.rst
   numba-runtime.rst
   rewrites.rst
   live_variable_analysis.rst
   listings.rst
   stencil.rst
   custom_pipeline.rst
   inlining.rst
   environment.rst
   hashing.rst
   caching.rst
   threading_implementation.rst
   literal.rst
   llvm_timings.rst
   debugging.rst
   event_api.rst
   target_extension.rst
   mission.rst
from numba import njit
import numba
from numba.core import ir


@njit(inline='never')
def never_inline():
    return 100


@njit(inline='always')
def always_inline():
    return 200


def sentinel_cost_model(expr, caller_info, callee_info):
    # this cost model will return True (i.e. do inlining) if either:
    # a) the callee IR contains an `ir.Const(37)`
    # b) the caller IR contains an `ir.Const(13)` logically prior to the call
    #    site

    # check the callee
    for blk in callee_info.blocks.values():
        for stmt in blk.body:
            if isinstance(stmt, ir.Assign):
                if isinstance(stmt.value, ir.Const):
                    if stmt.value.value == 37:
                        return True

    # check the caller
    before_expr = True
    for blk in caller_info.blocks.values():
        for stmt in blk.body:
            if isinstance(stmt, ir.Assign):
                if isinstance(stmt.value, ir.Expr):
                    if stmt.value == expr:
                        before_expr = False
                if isinstance(stmt.value, ir.Const):
                    if stmt.value.value == 13:
                        return True & before_expr
    return False


@njit(inline=sentinel_cost_model)
def maybe_inline1():
    # Will not inline based on the callee IR with the declared cost model
    # The following is ir.Const(300).
    return 300


@njit(inline=sentinel_cost_model)
def maybe_inline2():
    # Will inline based on the callee IR with the declared cost model
    # The following is ir.Const(37).
    return 37


@njit
def foo():
    a = never_inline()  # will never inline
    b = always_inline()  # will always inline

    # will not inline as the function does not contain a magic constant known to
    # the cost model, and the IR up to the call site does not contain a magic
    # constant either
    d = maybe_inline1()

    # declare this magic constant to trigger inlining of maybe_inline1 in a
    # subsequent call
    magic_const = 13

    # will inline due to above constant declaration
    e = maybe_inline1()

    # will inline as the maybe_inline2 function contains a magic constant known
    # to the cost model
    c = maybe_inline2()

    return a + b + c + d + e + magic_const


foo()
import numba
from numba.extending import overload
from numba import njit, types


def bar(x):
    """A function stub to overload"""
    pass


@overload(bar, inline='always')
def ol_bar_tuple(x):
    # An overload that will always inline, there is a type guard so that this
    # only applies to UniTuples.
    if isinstance(x, types.UniTuple):
        def impl(x):
            return x[0]
        return impl


def cost_model(expr, caller, callee):
    # Only inline if the type of the argument is an Integer
    return isinstance(caller.typemap[expr.args[0].name], types.Integer)


@overload(bar, inline=cost_model)
def ol_bar_scalar(x):
    # An overload that will inline based on a cost model, it only applies to
    # scalar values in the numerical domain as per the type guard on Number
    if isinstance(x, types.Number):
        def impl(x):
            return x + 1
        return impl


@njit
def foo():
    # This will resolve via `ol_bar_tuple` as the argument is a types.UniTuple
    # instance. It will always be inlined as specified in the decorator for this
    # overload.
    a = bar((1, 2, 3))

    # This will resolve via `ol_bar_scalar` as the argument is a types.Number
    # instance, hence the cost_model will be used to determine whether to
    # inline.
    # The function will be inlined as the value 100 is an IntegerLiteral which
    # is an instance of a types.Integer as required by the cost_model function.
    b = bar(100)

    # This will also resolve via `ol_bar_scalar` as the argument is a
    # types.Number instance, again the cost_model will be used to determine
    # whether to inline.
    # The function will not be inlined as the complex value is not an instance
    # of a types.Integer as required by the cost_model function.
    c = bar(300j)

    return a + b + c


foo()
=================
Notes on Inlining
=================
There are occasions where it is useful to be able to inline a function at its
call site, at the Numba IR level of representation. The decorators such as
:func:`numba.jit`, :func:`numba.extending.overload` and
:func:`register_jitable` support the keyword argument ``inline``, to facilitate
this behaviour.
When attempting to inline at this level, it is important to understand what
purpose this serves and what effect this will have. In contrast to the inlining
performed by LLVM, which is aimed at improving performance, the main reason to
inline at the Numba IR level is to allow type inference to cross function
boundaries.
As an example, consider the following snippet:
.. code:: python

   from numba import njit

   @njit
   def bar(a):
       a.append(10)

   @njit
   def foo():
       z = []
       bar(z)

   foo()

This will fail to compile and run, because the type of ``z`` cannot be inferred
as it will only be refined within ``bar``. If we now add ``inline='always'`` to the
decorator for ``bar`` the snippet will compile and run. This is because inlining
the call to ``a.append(10)`` will mean that ``z`` will be refined to hold integers
and so type inference will succeed.
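
The fixed version might look like this (a sketch)::

    from numba import njit

    @njit(inline='always')
    def bar(a):
        a.append(10)

    @njit
    def foo():
        z = []
        bar(z)      # inlined, so ``z`` is refined to a list of integers

    foo()
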
So, to recap, inlining at the Numba IR level is unlikely to have a performance
benefit, whereas inlining at the LLVM level stands a better chance.
The ``inline`` keyword argument can be one of three values:
* The string ``'never'``, this is the default and results in the function not
being inlined under any circumstances.
* The string ``'always'``, this results in the function being inlined at all
call sites.
* A Python function that takes three arguments. The first argument is always the
``ir.Expr`` node that is the ``call`` requesting the inline; this is present
to allow the function to make decisions that take the call context into account.
The second and third arguments are:
* In the case of an untyped inline, i.e. that which occurs when using the
:func:`numba.jit` family of decorators, both arguments are
``numba.ir.FunctionIR`` instances. The second argument corresponds to the
IR of the caller and the third argument to the IR of the callee.
* In the case of a typed inline, i.e. that which occurs when using
:func:`numba.extending.overload`, both arguments are instances of a
``namedtuple`` with fields (corresponding to their standard use in the
compiler internals):
* ``func_ir`` - the function's Numba IR.
* ``typemap`` - the function's type map.
* ``calltypes`` - the call types of any calls in the function.
* ``signature`` - the function's signature.
The second argument holds the information from the caller, the third holds
the information from the callee.
In all cases the function should return True to inline and return False to not
inline; this essentially permits custom inlining rules (a typical use might
be a cost model).
* Recursive functions with ``inline='always'`` will result in a non-terminating
compilation. If you wish to avoid this, supply a function to limit the
recursion depth (see below).
.. note:: No guarantee is made about the order in which functions are assessed
for inlining or about the order in which they are inlined.
Example using :func:`numba.jit`
===============================
An example of using all three options to ``inline`` in the :func:`numba.njit`
decorator:
.. literalinclude:: inline_example.py
which produces the following when executed (with a print of the IR after the
legalization pass, enabled via the environment variable
``NUMBA_DEBUG_PRINT_AFTER="ir_legalization"``):
.. code-block:: none
:emphasize-lines: 2, 3, 9, 16, 17, 21, 22, 26, 35
label 0:
$0.1 = global(never_inline: CPUDispatcher(<function never_inline at 0x7f890ccf9048>)) ['$0.1']
$0.2 = call $0.1(func=$0.1, args=[], kws=(), vararg=None) ['$0.1', '$0.2']
del $0.1 []
a = $0.2 ['$0.2', 'a']
del $0.2 []
$0.3 = global(always_inline: CPUDispatcher(<function always_inline at 0x7f890ccf9598>)) ['$0.3']
del $0.3 []
$const0.1.0 = const(int, 200) ['$const0.1.0']
$0.2.1 = $const0.1.0 ['$0.2.1', '$const0.1.0']
del $const0.1.0 []
$0.4 = $0.2.1 ['$0.2.1', '$0.4']
del $0.2.1 []
b = $0.4 ['$0.4', 'b']
del $0.4 []
$0.5 = global(maybe_inline1: CPUDispatcher(<function maybe_inline1 at 0x7f890ccf9ae8>)) ['$0.5']
$0.6 = call $0.5(func=$0.5, args=[], kws=(), vararg=None) ['$0.5', '$0.6']
del $0.5 []
d = $0.6 ['$0.6', 'd']
del $0.6 []
$const0.7 = const(int, 13) ['$const0.7']
magic_const = $const0.7 ['$const0.7', 'magic_const']
del $const0.7 []
$0.8 = global(maybe_inline1: CPUDispatcher(<function maybe_inline1 at 0x7f890ccf9ae8>)) ['$0.8']
del $0.8 []
$const0.1.2 = const(int, 300) ['$const0.1.2']
$0.2.3 = $const0.1.2 ['$0.2.3', '$const0.1.2']
del $const0.1.2 []
$0.9 = $0.2.3 ['$0.2.3', '$0.9']
del $0.2.3 []
e = $0.9 ['$0.9', 'e']
del $0.9 []
$0.10 = global(maybe_inline2: CPUDispatcher(<function maybe_inline2 at 0x7f890ccf9b70>)) ['$0.10']
del $0.10 []
$const0.1.4 = const(int, 37) ['$const0.1.4']
$0.2.5 = $const0.1.4 ['$0.2.5', '$const0.1.4']
del $const0.1.4 []
$0.11 = $0.2.5 ['$0.11', '$0.2.5']
del $0.2.5 []
c = $0.11 ['$0.11', 'c']
del $0.11 []
$0.14 = a + b ['$0.14', 'a', 'b']
del b []
del a []
$0.16 = $0.14 + c ['$0.14', '$0.16', 'c']
del c []
del $0.14 []
$0.18 = $0.16 + d ['$0.16', '$0.18', 'd']
del d []
del $0.16 []
$0.20 = $0.18 + e ['$0.18', '$0.20', 'e']
del e []
del $0.18 []
$0.22 = $0.20 + magic_const ['$0.20', '$0.22', 'magic_const']
del magic_const []
del $0.20 []
$0.23 = cast(value=$0.22) ['$0.22', '$0.23']
del $0.22 []
return $0.23 ['$0.23']
Things to note in the above:
1. The call to the function ``never_inline`` remains as a call.
2. The ``always_inline`` function has been inlined, note its
``const(int, 200)`` in the caller body.
3. There is a call to ``maybe_inline1`` before the ``const(int, 13)``
declaration, the cost model prevented this from being inlined.
4. After the ``const(int, 13)`` the subsequent call to ``maybe_inline1`` has
been inlined as shown by the ``const(int, 300)`` in the caller body.
5. The function ``maybe_inline2`` has been inlined as demonstrated by
``const(int, 37)`` in the caller body.
6. That dead code elimination has not been performed and as a result there are
superfluous statements present in the IR.
Example using :func:`numba.extending.overload`
==============================================
An example of using inlining with the :func:`numba.extending.overload`
decorator. It is most interesting to note that if a function is supplied as the
argument to ``inline``, a lot more information is available via the supplied
function arguments for use in decision making. Also note that different
``@overload`` s can have different inlining behaviours, and that there are
multiple ways to achieve this:
.. literalinclude:: inline_overload_example.py
which produces the following when executed (with a print of the IR after the
legalization pass, enabled via the environment variable
``NUMBA_DEBUG_PRINT_AFTER="ir_legalization"``):
.. code-block:: none
:emphasize-lines: 2, 3, 4, 5, 6, 15, 16, 17, 18, 19, 20, 21, 22, 28, 29, 30
label 0:
$const0.2 = const(tuple, (1, 2, 3)) ['$const0.2']
x.0 = $const0.2 ['$const0.2', 'x.0']
del $const0.2 []
$const0.2.2 = const(int, 0) ['$const0.2.2']
$0.3.3 = getitem(value=x.0, index=$const0.2.2) ['$0.3.3', '$const0.2.2', 'x.0']
del x.0 []
del $const0.2.2 []
$0.4.4 = $0.3.3 ['$0.3.3', '$0.4.4']
del $0.3.3 []
$0.3 = $0.4.4 ['$0.3', '$0.4.4']
del $0.4.4 []
a = $0.3 ['$0.3', 'a']
del $0.3 []
$const0.5 = const(int, 100) ['$const0.5']
x.5 = $const0.5 ['$const0.5', 'x.5']
del $const0.5 []
$const0.2.7 = const(int, 1) ['$const0.2.7']
$0.3.8 = x.5 + $const0.2.7 ['$0.3.8', '$const0.2.7', 'x.5']
del x.5 []
del $const0.2.7 []
$0.4.9 = $0.3.8 ['$0.3.8', '$0.4.9']
del $0.3.8 []
$0.6 = $0.4.9 ['$0.4.9', '$0.6']
del $0.4.9 []
b = $0.6 ['$0.6', 'b']
del $0.6 []
$0.7 = global(bar: <function bar at 0x7f6c3710d268>) ['$0.7']
$const0.8 = const(complex, 300j) ['$const0.8']
$0.9 = call $0.7($const0.8, func=$0.7, args=[Var($const0.8, inline_overload_example.py (56))], kws=(), vararg=None) ['$0.7', '$0.9', '$const0.8']
del $const0.8 []
del $0.7 []
c = $0.9 ['$0.9', 'c']
del $0.9 []
$0.12 = a + b ['$0.12', 'a', 'b']
del b []
del a []
$0.14 = $0.12 + c ['$0.12', '$0.14', 'c']
del c []
del $0.12 []
$0.15 = cast(value=$0.14) ['$0.14', '$0.15']
del $0.14 []
return $0.15 ['$0.15']
Things to note in the above:
1. The first highlighted section is the always inlined overload for the
``UniTuple`` argument type.
2. The second highlighted section is the overload for the ``Number`` argument
type that has been inlined as the cost model function decided to do so as the
argument was an ``Integer`` type instance.
3. The third highlighted section is the overload for the ``Number`` argument
type that has not been inlined because the cost model function rejected it, as
the argument was a ``Complex`` type instance.
4. That dead code elimination has not been performed and as a result there are
superfluous statements present in the IR.
Using a function to limit the inlining depth of a recursive function
====================================================================
When using recursive inlines, you can terminate the compilation by using
a cost model.
.. code:: python

   from numba import njit
   import numpy as np

   class CostModel(object):
       def __init__(self, max_inlines):
           self._count = 0
           self._max_inlines = max_inlines

       def __call__(self, expr, caller, callee):
           ret = self._count < self._max_inlines
           self._count += 1
           return ret

   @njit(inline=CostModel(3))
   def factorial(n):
       if n <= 0:
           return 1
       return n * factorial(n - 1)

   factorial(5)

Listings
========
This shows listings from compiler internal registries (e.g. lowering
definitions). The information is provided as developer reference.
When possible, links to source code are provided via github links.
New style listings
------------------
The following listings are generated from ``numba.help.inspector.write_listings()``. Users can run ``python -m numba.help.inspector --format=rst <package>`` to recreate the documentation.
.. toctree::
   :maxdepth: 2

   autogen_builtins_listing.rst
   autogen_math_listing.rst
   autogen_cmath_listing.rst
   autogen_numpy_listing.rst
Old style listings
------------------
.. toctree::
   :maxdepth: 2

   autogen_lower_listing.rst
.. _developer-literally:
======================
Notes on Literal Types
======================
.. note:: This document describes an advanced feature designed to overcome
some limitations of the compilation mechanism relating to types.
Some features need to specialize based on the literal value during
compilation to produce type stable code necessary for successful compilation in
Numba. This can be achieved by propagating the literal value through the type
system. Numba recognizes inline literal values as :class:`numba.types.Literal`.
For example::

    def foo(x):
        a = 123
        return bar(x, a)

Numba will infer the type of ``a`` as ``Literal[int](123)``. The definition of
``bar()`` can subsequently specialize its implementation knowing that the
second argument is an ``int`` with the value ``123``.
``Literal`` Type
----------------
Classes and methods related to the ``Literal`` type.
.. autoclass:: numba.types.Literal
.. autofunction:: numba.types.literal
.. autofunction:: numba.types.unliteral
.. autofunction:: numba.types.maybe_literal
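
For instance, the helpers above behave roughly as follows (a sketch; the exact
``repr`` of the resulting types may differ)::

    from numba import types

    lit = types.literal(123)      # an IntegerLiteral, i.e. Literal[int](123)
    plain = types.unliteral(lit)  # the underlying non-literal integer type
    types.maybe_literal(4.5)      # None: no literal type exists for floats
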
Specifying for Literal Typing
-----------------------------
To specify a value as a ``Literal`` type in code scheduled for JIT compilation,
use the following function:
.. autofunction:: numba.literally
Code Example
~~~~~~~~~~~~
.. literalinclude:: ../../../numba/tests/doc_examples/test_literally_usage.py
   :language: python
   :caption: from ``test_literally_usage`` of ``numba/tests/doc_examples/test_literally_usage.py``
   :start-after: magictoken.ex_literally_usage.begin
   :end-before: magictoken.ex_literally_usage.end
   :dedent: 4
   :linenos:
Internal Details
~~~~~~~~~~~~~~~~
Internally, the compiler raises a ``ForceLiteralArgs`` exception to signal
the dispatcher to wrap specified arguments using the ``Literal`` type.
.. autoclass:: numba.errors.ForceLiteralArg
   :members: __init__, combine, __or__
Inside Extensions
-----------------
``@overload`` extensions can use ``literally`` inside the implementation body
like in normal jit-code.
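
A sketch of this pattern, following the usage example referenced above
(``power`` is a stub invented for illustration)::

    import numba
    from numba import types
    from numba.extending import overload

    def power(x, n):
        raise NotImplementedError("pure-Python stub")

    @overload(power)
    def ol_power(x, n):
        if isinstance(n, types.Literal):
            exponent = n.literal_value
            def impl(x, n):
                return x ** exponent
            return impl
        else:
            # ``n`` is not a literal yet: re-dispatch with it forced to one.
            def impl(x, n):
                return power(x, numba.literally(n))
            return impl

    @numba.njit
    def call_power(x, n):
        return power(x, n)

    call_power(2.0, 3)  # 8.0, compiled with ``n`` typed as Literal[int](3)
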
Explicit handling of literal requirements is possible through use of the
following:
.. autoclass:: numba.extending.SentryLiteralArgs
   :members:
.. autoclass:: numba.extending.BoundLiteralArgs
   :members:
.. autofunction:: numba.extending.sentry_literal_args
.. _live variable analysis:
======================
Live Variable Analysis
======================
(Related issue https://github.com/numba/numba/pull/1611)
Numba uses reference-counting for garbage collection, a technique that
requires cooperation by the compiler. The Numba IR encodes the location
where a decref must be inserted. These locations are determined by live
variable analysis. The corresponding source code is the ``_insert_var_dels()``
method in https://github.com/numba/numba/blob/main/numba/interpreter.py.
In Python semantics, once a variable is defined inside a function, it is alive
until the variable is explicitly deleted or the function scope ends.
However, Numba analyzes the code to determine the minimum bound of the lifetime
of each variable by its definition and usages during compilation.
As soon as a variable is unreachable, a ``del`` instruction is inserted at the
closest basic-block (either at the start of the next block(s) or at the
end of the current block). This means variables can be released earlier than in
regular Python code.
The behavior of the live variable analysis affects memory usage of the compiled
code. Internally, Numba does not differentiate temporary variables and user
variables. Since each operation generates at least one temporary variable,
a function can accumulate a high number of temporary variables if they are not
released as soon as possible.
Our generator implementation can benefit from early releasing of variables,
which reduces the size of the state to suspend at each yield point.
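
For example (a sketch; the comments describe the intended effect of the
analysis rather than a guaranteed placement of the ``del``)::

    import numpy as np
    from numba import njit

    @njit
    def mean_of_squares(n):
        tmp = np.arange(n) ** 2   # a potentially large temporary array
        s = tmp.sum()             # last use of ``tmp``: a ``del`` can be
                                  # inserted here, releasing the array well
                                  # before the function returns
        return s / n
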
Notes on behavior of the live variable analysis
================================================
Variable deleted before definition
-----------------------------------
(Related issue: https://github.com/numba/numba/pull/1738)
When a variable lifetime is confined within the loop body (its definition and
usage does not escape the loop body), like:
.. code-block:: python

   def f(arr):
       # BB 0
       res = 0
       # BB 1
       for i in (0, 1):
           # BB 2
           t = arr[i]
           if t[i] > 1:
               # BB 3
               res += t[i]
       # BB 4
       return res

Variable ``t`` is never referenced outside of the loop.
A ``del`` instruction is emitted for ``t`` at the head of the loop (BB 1),
before the variable is defined. The reason becomes obvious once we know the control
flow graph::

             +------------------------------> BB4
             |
             |
    BB 0 --> BB 1 --> BB 2 ---> BB 3
              ^        |         |
              |        V         V
              +------------------+

Variable ``t`` is defined in BB 2. In BB 2, the evaluation of
``t[i] > 1`` uses ``t``, which is the last use if execution takes the false
branch and goes back to BB 1. In BB 3, ``t`` is only used in ``res += t[i]``, which is
the last use if execution takes the true branch. Because BB 3, an outgoing
branch of BB 2, uses ``t``, ``t`` must be deleted at the common predecessor.
The closest point is BB 1, which does not have ``t`` defined from the incoming
edge of BB 0.
Alternatively, if ``t`` is deleted at BB 4, we will still have to delete the
variable before its definition because BB4 can be executed without executing
the loop body (BB 2 and BB 3), where the variable is defined.
.. _developer-llvm-timings:
====================
Notes on timing LLVM
====================
Getting LLVM Pass Timings
-------------------------
The dispatcher stores LLVM pass timings in the dispatcher object metadata under
the ``llvm_pass_timings`` key when :envvar:`NUMBA_LLVM_PASS_TIMINGS` is
enabled or ``numba.config.LLVM_PASS_TIMINGS`` is set to truthy.
The timings information contains details on how much time
has been spent in each pass. The pass timings are also grouped by their purpose.
For example, there will be pass timings for function-level pre-optimizations,
module-level optimizations, and object code generation.
Code Example
~~~~~~~~~~~~
.. literalinclude:: ../../../numba/tests/doc_examples/test_llvm_pass_timings.py
   :language: python
   :caption: from ``test_pass_timings`` of ``numba/tests/doc_examples/test_llvm_pass_timings.py``
   :start-after: magictoken.ex_llvm_pass_timings.begin
   :end-before: magictoken.ex_llvm_pass_timings.end
   :dedent: 16
   :linenos:
Example output:
.. code-block:: text
Printing pass timings for JITCodeLibrary('DocsLLVMPassTimings.test_pass_timings.<locals>.foo')
Total time: 0.0376
== #0 Function passes on '_ZN5numba5tests12doc_examples22test_llvm_pass_timings19DocsLLVMPassTimings17test_pass_timings12$3clocals$3e7foo$241Ex'
Percent: 4.8%
Total 0.0018s
Top timings:
0.0015s ( 81.6%) SROA #3
0.0002s ( 9.3%) Early CSE #2
0.0001s ( 4.0%) Simplify the CFG #9
0.0000s ( 1.5%) Prune NRT refops #4
0.0000s ( 1.1%) Post-Dominator Tree Construction #5
== #1 Function passes on '_ZN7cpython5numba5tests12doc_examples22test_llvm_pass_timings19DocsLLVMPassTimings17test_pass_timings12$3clocals$3e7foo$241Ex'
Percent: 0.8%
Total 0.0003s
Top timings:
0.0001s ( 30.4%) Simplify the CFG #10
0.0001s ( 24.1%) Early CSE #3
0.0001s ( 17.8%) SROA #4
0.0000s ( 8.8%) Prune NRT refops #5
0.0000s ( 5.6%) Post-Dominator Tree Construction #6
== #2 Function passes on 'cfunc._ZN5numba5tests12doc_examples22test_llvm_pass_timings19DocsLLVMPassTimings17test_pass_timings12$3clocals$3e7foo$241Ex'
Percent: 0.5%
Total 0.0002s
Top timings:
0.0001s ( 27.7%) Early CSE #4
0.0001s ( 26.8%) Simplify the CFG #11
0.0000s ( 13.8%) Prune NRT refops #6
0.0000s ( 7.4%) Post-Dominator Tree Construction #7
0.0000s ( 6.7%) Dominator Tree Construction #29
== #3 Module passes (cheap optimization for refprune)
Percent: 3.7%
Total 0.0014s
Top timings:
0.0007s ( 52.0%) Combine redundant instructions
0.0001s ( 5.4%) Function Integration/Inlining
0.0001s ( 4.9%) Prune NRT refops #2
0.0001s ( 4.8%) Natural Loop Information
0.0001s ( 4.6%) Post-Dominator Tree Construction #2
== #4 Module passes (full optimization)
Percent: 43.9%
Total 0.0165s
Top timings:
0.0032s ( 19.5%) Combine redundant instructions #9
0.0022s ( 13.5%) Combine redundant instructions #7
0.0010s ( 6.1%) Induction Variable Simplification
0.0008s ( 4.8%) Unroll loops #2
0.0007s ( 4.5%) Loop Vectorization
== #5 Finalize object
Percent: 46.3%
Total 0.0174s
Top timings:
0.0060s ( 34.6%) X86 DAG->DAG Instruction Selection #2
0.0019s ( 11.0%) Greedy Register Allocator #2
0.0013s ( 7.4%) Machine Instruction Scheduler #2
0.0012s ( 7.1%) Loop Strength Reduction
0.0004s ( 2.3%) Induction Variable Users
API for custom analysis
~~~~~~~~~~~~~~~~~~~~~~~
It is possible to get more details than the summary text in the above example.
The pass timings are stored in a
:class:`numba.misc.llvm_pass_timings.PassTimingsCollection`, which contains
methods for accessing individual records for each pass.
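
A minimal sketch of programmatic access (it assumes timings were enabled as
described above; ``get_metadata`` is the dispatcher method used in the code
example)::

    import numba
    from numba import njit

    numba.config.LLVM_PASS_TIMINGS = True   # or set NUMBA_LLVM_PASS_TIMINGS=1

    @njit
    def foo(n):
        acc = 0
        for i in range(n):
            acc += i
        return acc

    foo(10)

    timings = foo.get_metadata(foo.signatures[0])['llvm_pass_timings']
    print(len(timings))               # number of recorded pass groups
    print(timings.get_total_time())   # total time in seconds
    print(timings.summary())          # text like the example output above
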
.. autoclass:: numba.misc.llvm_pass_timings.PassTimingsCollection
   :members: get_total_time, list_longest_first, summary, __getitem__, __len__
.. autoclass:: numba.misc.llvm_pass_timings.ProcessedPassTimings
   :members: get_raw_data, get_total_time, list_records, list_top, summary
.. autoclass:: numba.misc.llvm_pass_timings.PassTimingRecord
Numba Mission Statement
=======================
Introduction
------------
This document is the mission statement for the Numba project. It exists to
provide a clear description of the purposes and goals of the project. As such,
this document provides background on Numba's users and use-cases, and outlines
the project's overall goals.
This is a living document:
=========================== =============
The first revision date is: May 2022
The last updated date is:   May 2022
The next review date is:    November 2022
=========================== =============
Background
----------
The Numba project provides tools to improve the performance of Python software.
It comprises numerous facilities including just-in-time (JIT) compilation,
extension points for library authors, and a compiler toolkit on which new
computational acceleration technologies can be explored and built.
The range of use-cases and applications that can be targeted by Numba includes,
but is not limited to:
* Scientific Computing
* Computationally intensive tasks
* Numerically oriented applications
* Data science utilities and programs
The user base of Numba includes anyone needing to perform intensive
computational work, including users from a wide range of disciplines, examples
include:
* The most common use case, a user wanting to JIT compile some numerical
functions.
* Users providing JIT accelerated libraries for domain specific use cases e.g.
scientific researchers.
* Users providing JIT accelerated libraries for use as part of the numerical
Python ecosystem.
* Those writing more advanced JIT accelerated libraries containing their own
domain specific data types etc.
* Compiler engineers who explore new compiler use-cases and/or need a custom
compiler.
* Hardware vendors looking to extend Numba to provide Python support for their
custom silicon or new hardware.
Project Goals
-------------
The primary aims of the Numba project are:
* To make it easier for Python users to write high performance code.
* To have a core package with a well defined and pragmatically selected feature
scope that meets the needs of the user base without being overly complex.
* To provide a compiler toolkit for Python that is extensible and can be
customized to meet the needs of the user base. This comes with the expectation
that users potentially need to invest time and effort to extend and/or
customize the software themselves.
* To support both the Python core language/standard libraries and NumPy.
* To consistently produce high quality software:
* Feature stability across versions.
* Well established and tested public APIs.
* Clearly documented deprecation cycles.
* Internally stable code base.
* Externally tested release candidates.
* Regular releases with a predictable and published release cycle.
* Maintain suitable infrastructure for both testing and releasing, with as
much in public as feasible.
* To make it as easy as possible for people to contribute.
* To have a maintained public roadmap which will also include areas under active
development.
* To have a governance document in place and working in practice.
* To ensure that Numba receives timely updates for its core dependencies: LLVM,
NumPy and Python.
.. _arch-numba-runtime:
======================
Notes on Numba Runtime
======================
The *Numba Runtime (NRT)* provides the language runtime to the *nopython mode*
Python subset. NRT is a standalone C library with a Python binding. This
allows :term:`NPM` runtime features to be used without the GIL. Currently, the
only language feature implemented in NRT is memory management.
Memory Management
=================
NRT implements memory management for :term:`NPM` code. It uses *atomic
reference counting* for thread-safe, deterministic memory management. NRT maintains
a separate ``MemInfo`` structure for storing information about each allocation.
Cooperating with CPython
------------------------
For NRT to cooperate with CPython, the NRT python binding provides adaptors for
converting python objects that export a memory region. When such an
object is used as an argument to a :term:`NPM` function, a new ``MemInfo`` is
created and it acquires a reference to the Python object. When a :term:`NPM`
value is returned to the Python interpreter, the associated ``MemInfo``
(if any) is checked. If the ``MemInfo`` references a Python object, the
underlying Python object is released and returned instead. Otherwise, the
``MemInfo`` is wrapped in a Python object and returned. Additional processing
may be required depending on the type.
The current implementation supports NumPy arrays and any buffer-exporting types.
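
For example (a sketch; the assertion reflects the behaviour described above)::

    import numpy as np
    from numba import njit

    @njit
    def passthrough(a):
        return a

    arr = np.arange(3)
    out = passthrough(arr)
    # The MemInfo created for ``arr`` references the original Python object,
    # so boxing the return value hands back that same object.
    assert out is arr
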
Compiler-side Cooperation
-------------------------
NRT reference counting requires the compiler to emit incref/decref operations
according to the usage. When the reference count drops to zero, the compiler
must call the destructor routine in NRT.
.. _nrt-refct-opt-pass:
Optimizations
-------------
The compiler is allowed to emit incref/decref operations naively. It relies
on an optimization pass to remove redundant reference count operations.
A new optimization pass is implemented in version 0.52.0 to remove reference
count operations that fall into the following four categories of control-flow
structure---per basic-block, diamond, fanout, fanout+raise. See the documentation
for :envvar:`NUMBA_LLVM_REFPRUNE_FLAGS` for their descriptions.
The old optimization pass runs at block level to avoid control flow analysis.
It depends on LLVM function optimization pass to simplify the control flow,
stack-to-register, and simplify instructions. It works by matching and
removing incref and decref pairs within each block. The old pass can be
enabled by setting :envvar:`NUMBA_LLVM_REFPRUNE_PASS` to `0`.
Important assumptions
---------------------
Both the old (pre-0.52.0) and the new (post-0.52.0) optimization passes assume
that the only function that can consume a reference is ``NRT_decref``.
It is important that there are no other functions that will consume references.
Since the passes operate on LLVM IR, the "functions" here are referring to any
callee in a LLVM call instruction.
To summarize, all functions exposed to the refcount optimization pass
**must not** consume counted references unless done so via ``NRT_decref``.
Quirks of the old optimization pass
-----------------------------------
Since the pre-0.52.0 `refcount optimization pass <nrt-refct-opt-pass_>`_
requires the LLVM function optimization pass, the pass works on the LLVM IR as
text. The optimized IR is then materialized again as a new LLVM in-memory
bitcode object.
Debugging Leaks
---------------
To debug reference leaks in NRT MemInfo, each MemInfo python object has a
``.refcount`` attribute for inspection. To get the MemInfo from a ndarray
allocated by NRT, use the ``.base`` attribute.
To debug memory leaks in NRT, the ``numba.core.runtime.rtsys`` defines
``.get_allocation_stats()``. It returns a namedtuple containing the
number of allocations and deallocations since the start of the program.
Checking that the allocation and deallocation counters are matching is the
simplest way to know if the NRT is leaking.
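
A minimal sketch (whether statistics collection needs to be explicitly enabled,
and the exact fields of the returned stats, may vary between Numba versions)::

    import numpy as np
    from numba import njit
    from numba.core.runtime import rtsys

    @njit
    def make(n):
        return np.ones(n)

    arr = make(10)
    print(arr.base.refcount)        # MemInfo backing the NRT allocation

    stats = rtsys.get_allocation_stats()
    print(stats)                    # allocation/deallocation counters;
                                    # matching counts indicate no leak
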
Debugging Leaks in C
--------------------
The start of `numba/core/runtime/nrt.h
<https://github.com/numba/numba/blob/main/numba/core/runtime/nrt.h>`_
has these lines:
.. code-block:: C

   /* Debugging facilities - enabled at compile-time */
   /* #undef NDEBUG */
   #if 0
   #   define NRT_Debug(X) X
   #else
   #   define NRT_Debug(X) if (0) { X; }
   #endif

Undefining NDEBUG (uncomment the ``#undef NDEBUG`` line) enables the assertion
check in NRT.
Enabling the NRT_Debug (replace ``#if 0`` with ``#if 1``) turns on
debug print inside NRT.
Recursion Support
=================
During the compilation of a pair of mutually recursive functions, one of the
functions will contain unresolved symbol references since the compiler handles
one function at a time. The memory for the unresolved symbols is allocated and
initialized to the address of the *unresolved symbol abort* function
(``nrt_unresolved_abort``) just before the machine code is
generated by LLVM. These symbols are tracked and resolved as new functions are
compiled. If a bug prevents the resolution of these symbols,
the abort function will be called, raising a ``RuntimeError`` exception.
The *unresolved symbol abort* function is defined in the NRT with a zero-argument
signature. The caller is safe to call it with an arbitrary number of
arguments. Therefore, it is safe to use in place of the intended callee.
Using the NRT from C code
=========================
Externally compiled C code should use the ``NRT_api_functions`` struct as a
function table to access the NRT API. The struct is defined in
:ghfile:`numba/core/runtime/nrt_external.h`. Users can use the utility function
``numba.extending.include_path()`` to determine the include directory for
Numba provided C headers.
.. literalinclude:: ../../../numba/core/runtime/nrt_external.h
   :language: C
   :caption: `numba/core/runtime/nrt_external.h`
Inside Numba compiled code, the ``numba.core.unsafe.nrt.NRT_get_api()``
intrinsic can be used to obtain a pointer to the ``NRT_api_functions``.
Here is an example that uses the ``nrt_external.h``:
.. code-block:: C

   #include <stdio.h>
   #include <stdlib.h>  /* for malloc and free */
   #include "numba/core/runtime/nrt_external.h"

   void my_dtor(void *ptr) {
       free(ptr);
   }

   NRT_MemInfo* my_allocate(NRT_api_functions *nrt) {
       /* heap allocate some memory */
       void * data = malloc(10);
       /* wrap the allocated memory; yield a new reference */
       NRT_MemInfo *mi = nrt->manage_memory(data, my_dtor);
       /* acquire reference */
       nrt->acquire(mi);
       /* release reference */
       nrt->release(mi);
       return mi;
   }

It is important to ensure that the NRT is initialized prior to making calls to
it; calling ``numba.core.runtime.nrt.rtsys.initialize(context)`` from Python
will have the desired effect. Similarly, the code snippet:
.. code-block:: Python

   from numba.core.registry import cpu_target # Get the CPU target singleton
   cpu_target.target_context # Access the target_context property to initialize

will achieve the same specifically for Numba's CPU target (the default). Failure
to initialize the NRT will result in access violations as function pointers for
various internal atomic operations will be missing in the ``NRT_MemSys`` struct.
Future Plan
===========
The plan for NRT is to make a standalone shared library that can be linked to
Numba compiled code, including use within the Python interpreter and without
the Python interpreter. To make that work, we will be doing some refactoring:
* numba :term:`NPM` code references statically compiled code in "helperlib.c".
Those functions should be moved to NRT.
Numba Release Process
=====================
The goal of the Numba release process -- from a high level perspective -- is to
publish source and binary artifacts that correspond to a given version
number. This usually involves a sequence of individual tasks that must be
performed in the correct order and with diligence. Numba and llvmlite are
commonly released in lockstep since there is usually a one-to-one mapping
between a Numba version and a corresponding llvmlite version.
This section contains various notes and templates that can be used to create a
Numba release checklist on the Numba Github issue tracker. This is an aid for
the maintainers during the release process and helps to ensure that all tasks
are completed in the correct order and that no tasks are accidentally omitted.
If new or additional items do appear during release, please do remember to add
them to the checklist templates. Also note that the release process itself is
always a work in progress. This means that some of the information here may be
outdated. If you notice this please do remember to submit a pull-request to
update this document.
All release checklists are available as GitHub issue templates. To create a new
release checklist simply open a new issue and select the correct template.
Primary Release Candidate Checklist
-----------------------------------
This is for the first/primary release candidate of a minor release, i.e. the
first release of every series. It is special, because during this release, the
release branch will have to be created. Release candidate indexing begins at 1.
.. literalinclude:: ../../../.github/ISSUE_TEMPLATE/first_rc_checklist.md
   :language: md
   :lines: 9-
`Open a primary release checklist <https://github.com/numba/numba/issues/new?template=first_rc_checklist.md>`_.
Subsequent Release Candidates, Final Releases and Patch Releases
----------------------------------------------------------------
Releases subsequent to the first release in a series usually involve a series
of cherry-picks; the recipe is therefore slightly different.
.. literalinclude:: ../../../.github/ISSUE_TEMPLATE/sub_rc_checklist.md
   :language: md
   :lines: 9-
`Open a subsequent release checklist <https://github.com/numba/numba/issues/new?template=sub_rc_checklist.md>`_.
A Map of the Numba Repository
=============================
The Numba repository is quite large, and due to age has functionality spread
around many locations. To help orient developers, this document will try to
summarize where different categories of functionality can be found.
Support Files
-------------
Build and Packaging
'''''''''''''''''''
- :ghfile:`setup.py` - Standard Python distutils/setuptools script
- :ghfile:`MANIFEST.in` - Distutils packaging instructions
- :ghfile:`requirements.txt` - Pip package requirements, not used by conda
- :ghfile:`versioneer.py` - Handles automatic setting of version in
installed package from git tags
- :ghfile:`.flake8` - Preferences for code formatting. Files should be
fixed and removed from the exception list as time allows.
- :ghfile:`.pre-commit-config.yaml` - Configuration file for pre-commit hooks.
- :ghfile:`.readthedocs.yml` - Configuration file for Read the Docs.
- :ghfile:`buildscripts/condarecipe.local` - Conda build recipe
Continuous Integration
''''''''''''''''''''''
- :ghfile:`azure-pipelines.yml` - Azure Pipelines CI config (active:
Win/Mac/Linux)
- :ghfile:`buildscripts/azure/` - Azure Pipeline configuration for specific
platforms
- :ghfile:`buildscripts/incremental/` - Generic scripts for building Numba
on various CI systems
- :ghfile:`codecov.yml` - Codecov.io coverage reporting
Documentation
'''''''''''''
- :ghfile:`LICENSE` - License for Numba
- :ghfile:`LICENSES.third-party` - License for third party code vendored
into Numba
- :ghfile:`README.rst` - README for repo, also uploaded to PyPI
- :ghfile:`CONTRIBUTING.md` - Documentation on how to contribute to project
(out of date, should be updated to point to Sphinx docs)
- :ghfile:`CHANGE_LOG` - History of Numba releases, also directly embedded
into Sphinx documentation
- :ghfile:`docs/` - Documentation source
- :ghfile:`docs/_templates/` - Directory for templates (to override defaults
with Sphinx theme)
- :ghfile:`docs/Makefile` - Used to build Sphinx docs with ``make``
- :ghfile:`docs/source` - ReST source for Numba documentation
- :ghfile:`docs/_static/` - Static CSS and image assets for Numba docs
- :ghfile:`docs/make.bat` - Not used (remove?)
- :ghfile:`docs/requirements.txt` - Pip package requirements for building docs
with Read the Docs.
- :ghfile:`numba/scripts/generate_lower_listing.py` - Dump all registered
implementations decorated with ``@lower*`` for reference
documentation. Currently misses implementations from the higher
level extension API.
Numba Source Code
-----------------
Numba ships with both the source code and tests in one package.
- :ghfile:`numba/` - all of the source code and tests
Public API
''''''''''
These define aspects of the public Numba interface.
- :ghfile:`numba/core/decorators.py` - User-facing decorators for compiling
regular functions on the CPU
- :ghfile:`numba/core/extending.py` - Public decorators for extending Numba
(``overload``, ``intrinsic``, etc)
- :ghfile:`numba/experimental/structref.py` - Public API for defining a mutable struct
- :ghfile:`numba/core/ccallback.py` - ``@cfunc`` decorator for compiling
functions to a fixed C signature. Used to make callbacks.
- :ghfile:`numba/np/ufunc/decorators.py` - ufunc/gufunc compilation
decorators
- :ghfile:`numba/core/config.py` - Numba global config options and environment
variable handling
- :ghfile:`numba/core/annotations` - Gathering and printing type annotations of
Numba IR
- :ghfile:`numba/core/annotations/pretty_annotate.py` - Code highlighting of
Numba functions and types (both ANSI terminal and HTML)
- :ghfile:`numba/core/event.py` - A simple event system for applications to
listen to specific compiler events.
Dispatching
'''''''''''
- :ghfile:`numba/core/dispatcher.py` - Dispatcher objects are compiled functions
produced by ``@jit``. A dispatcher has different implementations
for different type signatures.
- :ghfile:`numba/_dispatcher.cpp` - C++ dispatcher implementation (for speed on
common data types)
- :ghfile:`numba/core/retarget.py` - Support for dispatcher objects to switch
target via a specific with-context.
Compiler Pipeline
'''''''''''''''''
- :ghfile:`numba/core/compiler.py` - Compiler pipelines and flags
- :ghfile:`numba/core/errors.py` - Numba exception and warning classes
- :ghfile:`numba/core/ir.py` - Numba IR data structure objects
- :ghfile:`numba/core/bytecode.py` - Bytecode parsing and function identity (??)
- :ghfile:`numba/core/interpreter.py` - Translate Python interpreter bytecode to
Numba IR
- :ghfile:`numba/core/analysis.py` - Utility functions to analyze Numba IR
(variable lifetime, prune branches, etc)
- :ghfile:`numba/core/controlflow.py` - Control flow analysis of Numba IR and
Python bytecode
- :ghfile:`numba/core/typeinfer.py` - Type inference algorithm
- :ghfile:`numba/core/transforms.py` - Numba IR transformations
- :ghfile:`numba/core/rewrites` - Rewrite passes used by compiler
- :ghfile:`numba/core/rewrites/__init__.py` - Loads all rewrite passes so they
are put into the registry
- :ghfile:`numba/core/rewrites/registry.py` - Registry object for collecting
rewrite passes
- :ghfile:`numba/core/rewrites/ir_print.py` - Write print() calls into special
print nodes in the IR
- :ghfile:`numba/core/rewrites/static_raise.py` - Converts exceptions with
static arguments into a special form that can be lowered
- :ghfile:`numba/core/rewrites/static_getitem.py` - Rewrites getitem and setitem
with constant arguments to allow type inference
- :ghfile:`numba/core/rewrites/static_binop.py` - Rewrites binary operations
(specifically ``**``) with constant arguments so faster code can be
generated
- :ghfile:`numba/core/inline_closurecall.py` - Inlines body of closure functions
to call site. Support for array comprehensions, reduction inlining,
and stencil inlining.
- :ghfile:`numba/core/postproc.py` - Postprocessor for Numba IR that computes
variable lifetime, inserts del operations, and handles generators
- :ghfile:`numba/core/lowering.py` - General implementation of lowering Numba IR
to LLVM
- :ghfile:`numba/core/environment.py` - Runtime environment object
- :ghfile:`numba/core/withcontexts.py` - General scaffolding for implementing
context managers in nopython mode, and the objectmode context
manager
- :ghfile:`numba/core/pylowering.py` - Lowering of Numba IR in object mode
- :ghfile:`numba/core/pythonapi.py` - LLVM IR code generation to interface with
CPython API
- :ghfile:`numba/core/targetconfig.py` - Utils for target configurations such
as compiler flags.
Type Management
'''''''''''''''
- :ghfile:`numba/core/typeconv/` - Implementation of type casting and type
signature matching in both C++ and Python
- :ghfile:`numba/capsulethunk.h` - Used by typeconv
- :ghfile:`numba/core/types/` - definition of the Numba type hierarchy, used
everywhere in compiler to select implementations
- :ghfile:`numba/core/consts.py` - Constant inference (used to make constant
values available during codegen when possible)
- :ghfile:`numba/core/datamodel` - LLVM IR representations of data types in
different contexts
- :ghfile:`numba/core/datamodel/models.py` - Models for most standard types
- :ghfile:`numba/core/datamodel/registry.py` - Decorator to register new data
models
- :ghfile:`numba/core/datamodel/packer.py` - Pack typed values into a data
structure
- :ghfile:`numba/core/datamodel/testing.py` - Data model tests (this should
move??)
- :ghfile:`numba/core/datamodel/manager.py` - Map types to data models
Compiled Extensions
'''''''''''''''''''
Numba uses a small amount of compiled C/C++ code for core
functionality, like dispatching and type matching where performance
matters, and it is more convenient to encapsulate direct interaction
with CPython APIs.
- :ghfile:`numba/_arraystruct.h` - Struct for holding NumPy array
attributes. Used in helperlib and the Numba Runtime.
- :ghfile:`numba/_helperlib.c` - C functions required by Numba compiled code
at runtime. Linked into ahead-of-time compiled modules
- :ghfile:`numba/_helpermod.c` - Python extension module with pointers to
functions from ``_helperlib.c``
- :ghfile:`numba/_dynfuncmod.c` - Python extension module exporting
_dynfunc.c functionality
- :ghfile:`numba/_dynfunc.c` - C level Environment and Closure objects (keep
in sync with numba/target/base.py)
- :ghfile:`numba/mathnames.h` - Macros for defining names of math functions
- :ghfile:`numba/_pymodule.h` - C macros for Python 2/3 portable naming of C
API functions
- :ghfile:`numba/mviewbuf.c` - Handles Python memoryviews
- :ghfile:`numba/_typeof.{h,cpp}` - C++ implementation of type fingerprinting,
used by dispatcher
- :ghfile:`numba/_numba_common.h` - Portable C macro for marking symbols
that can be shared between object files, but not outside the
library.
Misc Support
''''''''''''
- :ghfile:`numba/_version.py` - Updated by versioneer
- :ghfile:`numba/core/runtime` - Language runtime. Currently manages
reference-counted memory allocated on the heap by Numba-compiled
functions
- :ghfile:`numba/core/ir_utils.py` - Utility functions for working with Numba IR
data structures
- :ghfile:`numba/core/cgutils.py` - Utility functions for generating common code
patterns in LLVM IR
- :ghfile:`numba/core/utils.py` - Python 2 backports of Python 3 functionality
(also imports local copy of ``six``)
- :ghfile:`numba/misc/appdirs.py` - Vendored package for determining application
config directories on every platform
- :ghfile:`numba/core/compiler_lock.py` - Global compiler lock because Numba's
usage of LLVM is not thread-safe
- :ghfile:`numba/misc/special.py` - Python stub implementations of special Numba
functions (prange, gdb*)
- :ghfile:`numba/core/itanium_mangler.py` - Python implementation of Itanium C++
name mangling
- :ghfile:`numba/misc/findlib.py` - Helper function for locating shared
libraries on all platforms
- :ghfile:`numba/core/debuginfo.py` - Helper functions to construct LLVM IR
debug info
- :ghfile:`numba/core/unsafe/refcount.py` - Read reference count of object
- :ghfile:`numba/core/unsafe/eh.py` - Exception handling helpers
- :ghfile:`numba/core/unsafe/nrt.py` - Numba runtime (NRT) helpers
- :ghfile:`numba/cpython/unsafe/tuple.py` - Replace a value in a tuple slot
- :ghfile:`numba/np/unsafe/ndarray.py` - NumPy array helpers
- :ghfile:`numba/core/unsafe/bytes.py` - Copying and dereferencing data from
void pointers
- :ghfile:`numba/misc/dummyarray.py` - Used by GPU backends to hold array
information on the host, but not the data.
- :ghfile:`numba/core/callwrapper.py` - Handles argument unboxing and releasing
the GIL when moving from Python to nopython mode
- :ghfile:`numba/np/numpy_support.py` - Helper functions for working with NumPy
and translating Numba types to and from NumPy dtypes.
- :ghfile:`numba/core/tracing.py` - Decorator for tracing Python calls and
emitting log messages
- :ghfile:`numba/core/funcdesc.py` - Classes for describing function metadata
(used in the compiler)
- :ghfile:`numba/core/sigutils.py` - Helper functions for parsing and
normalizing Numba type signatures
- :ghfile:`numba/core/serialize.py` - Support for pickling compiled functions
- :ghfile:`numba/core/caching.py` - Disk cache for compiled functions
- :ghfile:`numba/np/npdatetime.py` - Helper functions for implementing NumPy
datetime64 support
- :ghfile:`numba/misc/llvm_pass_timings.py` - Helper to record timings of
LLVM passes.
- :ghfile:`numba/cloudpickle` - Vendored cloudpickle subpackage
Core Python Data Types
''''''''''''''''''''''
- :ghfile:`numba/_hashtable.{h,cpp}` - Adaptation of the Python 3.7 hash table
implementation
- :ghfile:`numba/cext/dictobject.{h,c}` - C level implementation of typed
dictionary
- :ghfile:`numba/typed/dictobject.py` - Nopython mode wrapper for typed
dictionary
- :ghfile:`numba/cext/listobject.{h,c}` - C level implementation of typed list
- :ghfile:`numba/typed/listobject.py` - Nopython mode wrapper for typed list
- :ghfile:`numba/typed/typedobjectutils.py` - Common utilities for typed
dictionary and list
- :ghfile:`numba/cpython/unicode.py` - Unicode strings (Python 3.5 and later)
- :ghfile:`numba/typed` - Python interfaces to statically typed containers
- :ghfile:`numba/typed/typeddict.py` - Python interface to typed dictionary
- :ghfile:`numba/typed/typedlist.py` - Python interface to typed list
- :ghfile:`numba/experimental/jitclass` - Implementation of experimental JIT
compilation of Python classes
- :ghfile:`numba/core/generators.py` - Support for lowering Python generators
Math
''''
- :ghfile:`numba/_random.c` - Reimplementation of NumPy / CPython random
number generator
- :ghfile:`numba/_lapack.c` - Wrappers for calling BLAS and LAPACK functions
(requires SciPy)
ParallelAccelerator
'''''''''''''''''''
Code transformation passes that extract parallelizable code from
a function and convert it into multithreaded gufunc calls.
- :ghfile:`numba/parfors/parfor.py` - General ParallelAccelerator
- :ghfile:`numba/parfors/parfor_lowering.py` - gufunc lowering for
ParallelAccelerator
- :ghfile:`numba/parfors/array_analysis.py` - Array analysis passes used in
ParallelAccelerator
Stencil
'''''''
Implementation of ``@stencil``:
- :ghfile:`numba/stencils/stencil.py` - Stencil function decorator (implemented
without ParallelAccelerator)
- :ghfile:`numba/stencils/stencilparfor.py` - ParallelAccelerator implementation
of stencil
Debugging Support
'''''''''''''''''
- :ghfile:`numba/misc/gdb_hook.py` - Hooks to jump into GDB from nopython
mode
- :ghfile:`numba/misc/cmdlang.gdb` - Commands to setup GDB for setting
explicit breakpoints from Python
Type Signatures (CPU)
'''''''''''''''''''''
Some (usually older) Numba-supported functionality separates the
declaration of allowed type signatures from the definition of
implementations. This package contains registries of type signatures
that must be matched during type inference.
- :ghfile:`numba/core/typing` - Type signature module
- :ghfile:`numba/core/typing/templates.py` - Base classes for type signature
templates
- :ghfile:`numba/core/typing/cmathdecl.py` - Python complex math (``cmath``)
module
- :ghfile:`numba/core/typing/bufproto.py` - Interpreting objects supporting the
buffer protocol
- :ghfile:`numba/core/typing/mathdecl.py` - Python ``math`` module
- :ghfile:`numba/core/typing/listdecl.py` - Python lists
- :ghfile:`numba/core/typing/builtins.py` - Python builtin global functions and
operators
- :ghfile:`numba/core/typing/setdecl.py` - Python sets
- :ghfile:`numba/core/typing/npydecl.py` - NumPy ndarray (and operators), NumPy
functions
- :ghfile:`numba/core/typing/arraydecl.py` - Python ``array`` module
- :ghfile:`numba/core/typing/context.py` - Implementation of typing context
(class that collects methods used in type inference)
- :ghfile:`numba/core/typing/collections.py` - Generic container operations and
namedtuples
- :ghfile:`numba/core/typing/ctypes_utils.py` - Typing ctypes-wrapped function
pointers
- :ghfile:`numba/core/typing/enumdecl.py` - Enum types
- :ghfile:`numba/core/typing/cffi_utils.py` - Typing of CFFI objects
- :ghfile:`numba/core/typing/typeof.py` - Implementation of typeof operations
(maps Python object to Numba type)
- :ghfile:`numba/core/typing/asnumbatype.py` - Implementation of
``as_numba_type`` operations (maps Python types to Numba type)
- :ghfile:`numba/core/typing/npdatetime.py` - Datetime dtype support for NumPy
arrays
Target Implementations (CPU)
''''''''''''''''''''''''''''
Implementations of Python / NumPy functions and some data models.
These modules are responsible for generating LLVM IR during lowering.
Note that some of these modules do not have counterparts in the typing
package because newer Numba extension APIs (like overload) allow
typing and implementation to be specified together.
- :ghfile:`numba/core/cpu.py` - Context for code gen on CPU
- :ghfile:`numba/core/base.py` - Base class for all target contexts
- :ghfile:`numba/core/codegen.py` - Driver for code generation
- :ghfile:`numba/core/boxing.py` - Boxing and unboxing for most data
types
- :ghfile:`numba/core/intrinsics.py` - Utilities for converting LLVM
intrinsics to other math calls
- :ghfile:`numba/core/callconv.py` - Implements different calling
conventions for Numba-compiled functions
- :ghfile:`numba/core/options.py` - Container for options that control
lowering
- :ghfile:`numba/core/optional.py` - Special type representing value or
``None``
- :ghfile:`numba/core/registry.py` - Registry object for collecting
implementations for a specific target
- :ghfile:`numba/core/imputils.py` - Helper functions for lowering
- :ghfile:`numba/core/externals.py` - Registers external C functions
needed to link generated code
- :ghfile:`numba/core/fastmathpass.py` - Rewrite pass to add fastmath
attributes to function call sites and binary operations
- :ghfile:`numba/core/removerefctpass.py` - Rewrite pass to remove
unnecessary incref/decref pairs
- :ghfile:`numba/core/descriptors.py` - empty base class for all target
descriptors (is this needed?)
- :ghfile:`numba/cpython/builtins.py` - Python builtin functions and
operators
- :ghfile:`numba/cpython/cmathimpl.py` - Python complex math module
- :ghfile:`numba/cpython/enumimpl.py` - Enum objects
- :ghfile:`numba/cpython/hashing.py` - Hashing algorithms
- :ghfile:`numba/cpython/heapq.py` - Python ``heapq`` module
- :ghfile:`numba/cpython/iterators.py` - Iterable data types and iterators
- :ghfile:`numba/cpython/listobj.py` - Python lists
- :ghfile:`numba/cpython/mathimpl.py` - Python ``math`` module
- :ghfile:`numba/cpython/numbers.py` - Numeric values (int, float, etc)
- :ghfile:`numba/cpython/printimpl.py` - Print function
- :ghfile:`numba/cpython/randomimpl.py` - Python and NumPy ``random``
modules
- :ghfile:`numba/cpython/rangeobj.py` - Python ``range`` objects
- :ghfile:`numba/cpython/slicing.py` - Slice objects, and index calculations
used in slicing
- :ghfile:`numba/cpython/setobj.py` - Python set type
- :ghfile:`numba/cpython/tupleobj.py` - Tuples (statically typed as
immutable struct)
- :ghfile:`numba/misc/cffiimpl.py` - CFFI functions
- :ghfile:`numba/misc/quicksort.py` - Quicksort implementation used with
list and array objects
- :ghfile:`numba/misc/mergesort.py` - Mergesort implementation used with
array objects
- :ghfile:`numba/np/arraymath.py` - Math operations on arrays (both
Python and NumPy)
- :ghfile:`numba/np/arrayobj.py` - Array operations (both NumPy and
buffer protocol)
- :ghfile:`numba/np/linalg.py` - NumPy linear algebra operations
- :ghfile:`numba/np/npdatetime.py` - NumPy datetime operations
- :ghfile:`numba/np/npyfuncs.py` - Kernels used in generating some
NumPy ufuncs
- :ghfile:`numba/np/npyimpl.py` - Implementations of most NumPy ufuncs
- :ghfile:`numba/np/polynomial.py` - ``numpy.roots`` function
- :ghfile:`numba/np/ufunc_db.py` - Big table mapping types to ufunc
implementations
Ufunc Compiler and Runtime
''''''''''''''''''''''''''
- :ghfile:`numba/np/ufunc` - ufunc compiler implementation
- :ghfile:`numba/np/ufunc/_internal.{h,c}` - Python extension module with
helper functions that use CPython & NumPy C API
- :ghfile:`numba/np/ufunc/_ufunc.c` - Used by ``_internal.c``
- :ghfile:`numba/np/ufunc/deviceufunc.py` - Custom ufunc dispatch for
non-CPU targets
- :ghfile:`numba/np/ufunc/gufunc_scheduler.{h,cpp}` - Schedule work chunks
to threads
- :ghfile:`numba/np/ufunc/dufunc.py` - Special ufunc that can compile new
implementations at call time
- :ghfile:`numba/np/ufunc/ufuncbuilder.py` - Top-level orchestration of
ufunc/gufunc compiler pipeline
- :ghfile:`numba/np/ufunc/sigparse.py` - Parser for generalized ufunc
indexing signatures
- :ghfile:`numba/np/ufunc/parallel.py` - Codegen for ``parallel`` target
- :ghfile:`numba/np/ufunc/array_exprs.py` - Rewrite pass for turning array
expressions in regular functions into ufuncs
- :ghfile:`numba/np/ufunc/wrappers.py` - Wrap scalar function kernel with
loops
- :ghfile:`numba/np/ufunc/workqueue.{h,c}` - Threading backend based on
pthreads/Windows threads and queues
- :ghfile:`numba/np/ufunc/omppool.cpp` - Threading backend based on OpenMP
- :ghfile:`numba/np/ufunc/tbbpool.cpp` - Threading backend based on TBB
Unit Tests (CPU)
''''''''''''''''
CPU unit tests (GPU target unit tests are listed in later sections).
- :ghfile:`runtests.py` - Convenience script that launches test runner and
turns on full compiler tracebacks
- :ghfile:`.coveragerc` - Coverage.py configuration
- :ghfile:`numba/runtests.py` - Entry point to unittest runner
- :ghfile:`numba/testing/_runtests.py` - Implementation of custom test runner
command line interface
- :ghfile:`numba/tests/test_*` - Test cases
- :ghfile:`numba/tests/*_usecases.py` - Python functions compiled by some
unit tests
- :ghfile:`numba/tests/support.py` - Helper functions for testing and
special TestCase implementation
- :ghfile:`numba/tests/dummy_module.py` - Module used in
``test_dispatcher.py``
- :ghfile:`numba/tests/npyufunc` - ufunc / gufunc compiler tests
- :ghfile:`numba/testing` - Support code for testing
- :ghfile:`numba/testing/loader.py` - Find tests on disk
- :ghfile:`numba/testing/notebook.py` - Support for testing notebooks
- :ghfile:`numba/testing/main.py` - Numba test runner
Command Line Utilities
''''''''''''''''''''''
- :ghfile:`bin/numba` - Command line stub, delegates to main in
``numba_entry.py``
- :ghfile:`numba/misc/numba_entry.py` - Main function for ``numba`` command line
tool
- :ghfile:`numba/pycc` - Ahead of time compilation of functions to shared
library extension
- :ghfile:`numba/pycc/__init__.py` - Main function for ``pycc`` command line
tool
- :ghfile:`numba/pycc/cc.py` - User-facing API for tagging functions to
compile ahead of time
- :ghfile:`numba/pycc/compiler.py` - Compiler pipeline for creating
standalone Python extension modules
- :ghfile:`numba/pycc/llvm_types.py` - Aliases to LLVM data types used by
``compiler.py``
- :ghfile:`numba/pycc/modulemixin.c` - C file compiled into every compiled
extension. Pulls in C source from Numba core that is needed to make
extension standalone
- :ghfile:`numba/pycc/platform.py` - Portable interface to platform-specific
compiler toolchains
- :ghfile:`numba/pycc/decorators.py` - Deprecated decorators for tagging
functions to compile. Use ``cc.py`` instead.
CUDA GPU Target
'''''''''''''''
Note that the CUDA target does reuse some parts of the CPU target.
- :ghfile:`numba/cuda/` - The implementation of the CUDA (NVIDIA GPU) target
and associated unit tests
- :ghfile:`numba/cuda/decorators.py` - Compiler decorators for CUDA kernels
and device functions
- :ghfile:`numba/cuda/dispatcher.py` - Dispatcher for CUDA JIT functions
- :ghfile:`numba/cuda/printimpl.py` - Special implementation of device printing
- :ghfile:`numba/cuda/libdevice.py` - Registers libdevice functions
- :ghfile:`numba/cuda/kernels/` - Custom kernels for reduction and transpose
- :ghfile:`numba/cuda/device_init.py` - Initializes the CUDA target when
imported
- :ghfile:`numba/cuda/compiler.py` - Compiler pipeline for CUDA target
- :ghfile:`numba/cuda/intrinsic_wrapper.py` - CUDA device intrinsics
(shuffle, ballot, etc)
- :ghfile:`numba/cuda/initialize.py` - Deferred initialization of the CUDA
device and subsystem. Called only when user imports ``numba.cuda``
- :ghfile:`numba/cuda/simulator_init.py` - Initializes the CUDA simulator
subsystem (only when user requests it with env var)
- :ghfile:`numba/cuda/random.py` - Implementation of random number generator
- :ghfile:`numba/cuda/api.py` - User facing APIs imported into ``numba.cuda.*``
- :ghfile:`numba/cuda/stubs.py` - Python placeholders for functions that
only can be used in GPU device code
- :ghfile:`numba/cuda/simulator/` - Simulate execution of CUDA kernels in
Python interpreter
- :ghfile:`numba/cuda/vectorizers.py` - Subclasses of ufunc/gufunc compilers
for CUDA
- :ghfile:`numba/cuda/args.py` - Management of kernel arguments, including
host<->device transfers
- :ghfile:`numba/cuda/target.py` - Typing and target contexts for GPU
- :ghfile:`numba/cuda/cudamath.py` - Type signatures for math functions in
CUDA Python
- :ghfile:`numba/cuda/errors.py` - Validation of kernel launch configuration
- :ghfile:`numba/cuda/nvvmutils.py` - Helper functions for generating
NVVM-specific IR
- :ghfile:`numba/cuda/testing.py` - Support code for creating CUDA unit
tests and capturing standard out
- :ghfile:`numba/cuda/cudadecl.py` - Type signatures of CUDA API (threadIdx,
blockIdx, atomics) in Python on GPU
- :ghfile:`numba/cuda/cudaimpl.py` - Implementations of CUDA API functions
on GPU
- :ghfile:`numba/cuda/codegen.py` - Code generator object for CUDA target
- :ghfile:`numba/cuda/cudadrv/` - Wrapper around CUDA driver API
- :ghfile:`numba/cuda/tests/` - CUDA unit tests, skipped when CUDA is not
detected
- :ghfile:`numba/cuda/tests/cudasim/` - Tests of CUDA simulator
- :ghfile:`numba/cuda/tests/nocuda/` - Tests for NVVM functionality when
CUDA not present
- :ghfile:`numba/cuda/tests/cudapy/` - Tests of compiling Python functions
for GPU
- :ghfile:`numba/cuda/tests/cudadrv/` - Tests of Python wrapper around CUDA
API
=====================================================
Using the Numba Rewrite Pass for Fun and Optimization
=====================================================
Overview
========
This section introduces intermediate representation (IR) rewrites, and
how they can be used to implement optimizations.
As discussed earlier in ":ref:`rewrite-typed-ir`", rewriting the Numba
IR allows us to perform optimizations that would be much more
difficult to perform at the lower LLVM level. Similar to the Numba
type and lowering subsystems, the rewrite subsystem is user
extensible. This extensibility affords Numba the possibility of
supporting a wide variety of domain-specific optimizations (DSOs).
The remaining subsections detail the mechanics of implementing a
rewrite, registering a rewrite with the rewrite registry, and provide
examples of adding new rewrites, as well as internals of the array
expression optimization pass. We conclude by reviewing some use cases
exposed in the examples, as well as reviewing any points where
developers should take care.
Rewriting Passes
================
Rewriting passes have a simple :func:`~Rewrite.match` and
:func:`~Rewrite.apply` interface. The division between matching and
rewriting follows how one would define a term rewrite in a declarative
domain-specific language (DSL). In such DSLs, one may write a
rewrite as follows::
    <match> => <replacement>
The ``<match>`` and ``<replacement>`` symbols represent IR term
expressions, where the left-hand side presents a pattern to match, and
the right-hand side an IR term constructor to build upon matching.
Whenever the rewrite matches an IR pattern, any free variables in the
left-hand side are bound within a custom environment. When applied,
the rewrite uses the pattern matching environment to bind any free
variables in the right-hand side.
As Python is not commonly used in a declarative capacity, Numba uses
object state to handle the transfer of information between the
matching and application steps.
The :class:`Rewrite` Base Class
-------------------------------
.. class:: Rewrite
The :class:`Rewrite` class simply defines an abstract base class
for Numba rewrites. Developers should define rewrites as
subclasses of this base type, overloading the
:func:`~Rewrite.match` and :func:`~Rewrite.apply` methods.
.. attribute:: pipeline
The pipeline attribute contains the
:class:`numba.compiler.Pipeline` instance that is currently
compiling the function under consideration for rewriting.
.. method:: __init__(self, pipeline, *args, **kws)
The base constructor for rewrites simply stashes its arguments
into attributes of the same name. Unless being used in
debugging or testing, rewrites should only be constructed by
the :class:`RewriteRegistry` in the
:func:`RewriteRegistry.apply` method, and the construction
interface should remain stable (though the pipeline will
commonly contain just about everything there is to know).
.. method:: match(self, func_ir, block, typemap, callmap)
The :func:`~Rewrite.match` method takes four arguments other
than *self*:
* *func_ir*: This is an instance of :class:`numba.ir.FunctionIR` for the
function being rewritten.
* *block*: This is an instance of :class:`numba.ir.Block`. The
matching method should iterate over the instructions contained
in the :attr:`numba.ir.Block.body` member.
* *typemap*: This is a Python :class:`dict` instance mapping
from symbol names in the IR, represented as strings, to Numba
types.
* *callmap*: This is another :class:`dict` instance mapping from
calls, represented as :class:`numba.ir.Expr` instances, to
their corresponding call site type signatures, represented as
a :class:`numba.typing.templates.Signature` instance.
The :func:`~Rewrite.match` method should return a :class:`bool`
result. A :obj:`True` result should indicate that one or more
matches were found, and the :func:`~Rewrite.apply` method will
return a new replacement :class:`numba.ir.Block` instance. A
:obj:`False` result should indicate that no matches were found, and
subsequent calls to :func:`~Rewrite.apply` will return undefined
or invalid results.
.. method:: apply(self)
The :func:`~Rewrite.apply` method should only be invoked
following a successful call to :func:`~Rewrite.match`. This
method takes no additional parameters other than *self*, and
should return a replacement :class:`numba.ir.Block` instance.
As mentioned above, the behavior of calling
:func:`~Rewrite.apply` is undefined unless
:func:`~Rewrite.match` has already been called and returned
:obj:`True`.
Subclassing :class:`Rewrite`
----------------------------
Before going into the expectations for the overloaded methods any
:class:`Rewrite` subclass must have, let's step back a minute to
review what is taking place here. By providing an extensible
compiler, Numba opens itself to user-defined code generators which may
be incomplete, or worse, incorrect. When a code generator goes awry,
it can cause abnormal program behavior or early termination.
User-defined rewrites add a new level of complexity because they must
not only generate correct code, but the code they generate should
ensure that the compiler does not get stuck in a match/apply loop.
Non-termination by the compiler will directly lead to non-termination
of user function calls.
There are several ways to help ensure that a rewrite terminates:
* *Typing*: A rewrite should generally attempt to decompose composite
types, and avoid composing new types. If the rewrite is matching a
specific type, changing expression types to a lower-level type will
ensure they will no longer match after the rewrite is applied.
* *Special instructions*: A rewrite may synthesize custom operators or
use special functions in the target IR. This technique again
generates code that is no longer within the domain of the original
match, and the rewrite will terminate.
In the ":ref:`case-study-array-expressions`" subsection, below, we'll
see how the array expression rewriter uses both of these techniques.
Overloading :func:`Rewrite.match`
---------------------------------
Every rewrite developer should seek to have their implementation of
:func:`~Rewrite.match` return a :obj:`False` value as quickly as
possible. Numba is a just-in-time compiler, and adding compilation
time ultimately adds to the user's run time. When a rewrite returns
:obj:`False` for a given block, the registry will no longer process that
block with that rewrite, and the compiler is that much closer to
proceeding to lowering.
This need for timeliness has to be balanced against collecting the
necessary information to make a match for a rewrite. Rewrite
developers should be comfortable adding dynamic attributes to their
subclasses, and then having these new attributes guide construction of
the replacement basic block.
Overloading :func:`Rewrite.apply`
-----------------------------------
The :func:`~Rewrite.apply` method should return a replacement
:class:`numba.ir.Block` instance to replace the basic block that
contained a match for the rewrite. As mentioned above, the IR built
by :func:`~Rewrite.apply` methods should preserve the semantics of the
user's code, but also seek to avoid generating another match for the
same rewrite or set of rewrites.
The Rewrite Registry
====================
When you want to include a rewrite in the rewrite pass, you should
register it with the rewrite registry. The :mod:`numba.rewrites`
module provides both the abstract base class and a class decorator for
hooking into the Numba rewrite subsystem. The following illustrates a
stub definition of a new rewrite::
    from numba import rewrites

    @rewrites.register_rewrite
    class MyRewrite(rewrites.Rewrite):

        def match(self, func_ir, block, typemap, calltypes):
            raise NotImplementedError("FIXME")

        def apply(self):
            raise NotImplementedError("FIXME")
Developers should note that using the class decorator as shown above
will register a rewrite at import time. It is the developer's
responsibility to ensure their extensions are loaded before
compilation starts.
.. _`case-study-array-expressions`:
Case study: Array Expressions
=============================
This subsection looks at the array expression rewriter in more depth.
The array expression rewriter, and most of its support functionality,
are found in the :mod:`numba.npyufunc.array_exprs` module. The
rewriting pass itself is implemented in the :class:`RewriteArrayExprs`
class. In addition to the rewriter, the
:mod:`~numba.npyufunc.array_exprs` module includes a function for
lowering array expressions,
:func:`~numba.npyufunc.array_exprs._lower_array_expr`. The overall
optimization process is as follows:
* :func:`RewriteArrayExprs.match`: The rewrite pass looks for one or
more array operations that form an array expression.
* :func:`RewriteArrayExprs.apply`: Once an array expression is found,
the rewriter replaces the individual array operations with a new
kind of IR expression, the ``arrayexpr``.
* :func:`numba.npyufunc.array_exprs._lower_array_expr`: During
lowering, the code generator calls
:func:`~numba.npyufunc.array_exprs._lower_array_expr` whenever it
finds an ``arrayexpr`` IR expression.
More details on each step of the optimization are given below.
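For orientation, the following sketch shows the kind of user code the pass
targets; any elementwise expression over broadcastable arrays, built from
array operators and supported ufunc calls, qualifies (the function below is
illustrative only):

.. code-block:: python

    import numpy as np
    from numba import njit

    @njit
    def expr(a, b):
        # The whole right-hand side is a single array expression: the
        # rewrite pass can fuse these elementwise operations and the
        # ``np.sqrt`` ufunc call into one ``arrayexpr`` node, avoiding
        # intermediate array allocations.
        return a * b + np.sqrt(a)

    expr(np.ones(10), np.ones(10))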
The :func:`RewriteArrayExprs.match` method
------------------------------------------
The array expression optimization pass starts by looking for array
operations, including calls to supported :class:`~numpy.ufunc`\'s and
user-defined :class:`~numba.DUFunc`\'s. Numba IR follows the
conventions of a static single assignment (SSA) language, meaning that
the search for array operators begins with looking for assignment
instructions.
When the rewriting pass calls the :func:`RewriteArrayExprs.match`
method, it first checks to see if it can trivially reject the basic
block. If the method determines the block to be a candidate for
matching, it sets up the following state variables in the rewrite
object:
* *crnt_block*: The current basic block being matched.
* *typemap*: The *typemap* for the function being matched.
* *matches*: A list of variable names that reference array expressions.
* *array_assigns*: A map from assignment variable names to the actual
assignment instructions that define the given variable.
* *const_assigns*: A map from assignment variable names to the
constant valued expression that defines the constant variable.
At this point, the match method iterates over the assignment
instructions in the input basic block. For each assignment
instruction, the matcher looks for one of two things:
* Array operations: If the right-hand side of the assignment
instruction is an expression, and the result of that expression is
an array type, the matcher checks to see if the expression is either
a known array operation, or a call to a universal function. If an
array operator is found, the matcher stores the left-hand variable
name and the whole instruction in the *array_assigns* member.
Finally, the matcher tests to see if any operands of the array
operation have also been identified as targets of other array
operations. If one or more operands are also targets of array
operations, then the matcher will also append the left-hand side
variable name to the *matches* member.
* Constants: Constants (even scalars) can be operands to array
operations. Without worrying about the constant being a part of an
array expression, the matcher stores constant names and values in
the *const_assigns* member.
The end of the matching method simply checks for a non-empty *matches*
list, returning :obj:`True` if there were one or more matches, and
:obj:`False` when *matches* is empty.
The :func:`RewriteArrayExprs.apply` method
------------------------------------------
When one or more matching array expressions are found by
:func:`RewriteArrayExprs.match`, the rewriting pass will call
:func:`RewriteArrayExprs.apply`. The apply method works in two
passes. The first pass iterates over the matches found, and builds a
map from instructions in the old basic block to new instructions in
the new basic block. The second pass iterates over the instructions
in the old basic block, copying instructions that are not changed by
the rewrite, and replacing or deleting instructions that were
identified by the first pass.
The :func:`RewriteArrayExprs._handle_matches` method implements the first
pass of the code generation portion of the rewrite. For each match,
this method builds a special IR expression that contains an expression
tree for the array expression. To compute the leaves of the
expression tree, the :func:`~RewriteArrayExprs._handle_matches` method
iterates over the operands of the identified root operation. If the
operand is another array operation, it is translated into an
expression sub-tree. If the operand is a constant,
:func:`~RewriteArrayExprs._handle_matches` copies the constant value.
Otherwise, the operand is marked as being used by an array expression.
As the method builds array expression nodes, it builds a map from old
instructions to new instructions (*replace_map*), as well as sets of
variables that may have moved (*used_vars*), and variables that should
be removed altogether (*dead_vars*). These three data structures are
returned back to the calling :func:`RewriteArrayExprs.apply` method.
The remaining part of the :func:`RewriteArrayExprs.apply` method
iterates over the instructions in the old basic block. For each
instruction, this method either replaces, deletes, or duplicates that
instruction based on the results of
:func:`RewriteArrayExprs._handle_matches`. The following list
describes how the optimization handles individual instructions:
* When an instruction is an assignment,
:func:`~RewriteArrayExprs.apply` checks to see if it is in the
replacement instruction map. When an assignment instruction is found
in the instruction map, :func:`~RewriteArrayExprs.apply` must then
check to see if the replacement instruction is also in the replacement
map. The optimizer continues this check until it either arrives at a
:obj:`None` value or an instruction that isn't in the replacement map.
Instructions that have a replacement that is :obj:`None` are deleted.
Instructions that have a non-:obj:`None` replacement are replaced.
Assignment instructions not in the replacement map are appended to the
new basic block with no changes made.
* When the instruction is a delete instruction, the rewrite checks to
see if it deletes a variable that may still be used by a later array
expression, or if it deletes a dead variable. Delete instructions for
used variables are added to a map of deferred delete instructions that
:func:`~RewriteArrayExprs.apply` uses to move them past any uses of
that variable. The loop copies delete instructions for non-dead
variables, and ignores delete instructions for dead variables
(effectively removing them from the basic block).
* All other instructions are appended to the new basic block.
Finally, the :func:`~RewriteArrayExprs.apply` method returns the new
basic block for lowering.
The :func:`~numba.npyufunc.array_exprs._lower_array_expr` function
------------------------------------------------------------------
If we left things at just the rewrite, then the lowering stage of the
compiler would fail, complaining it doesn't know how to lower
``arrayexpr`` operations. We start by hooking a lowering function
into the target context whenever the :class:`RewriteArrayExprs` class
is instantiated by the compiler. This hook causes the lowering pass to
call :func:`~numba.npyufunc.array_exprs._lower_array_expr` whenever it
encounters an ``arrayexpr`` operator.
This function has two steps:
* Synthesize a Python function that implements the array expression:
This new Python function essentially behaves like a Numpy
:class:`~numpy.ufunc`, returning the result of the expression on
scalar values in the broadcasted array arguments. The lowering
function accomplishes this by translating from the array expression
tree into a Python AST.
* Compile the synthetic Python function into a kernel: At this point,
the lowering function relies on existing code for lowering ufunc and
DUFunc kernels, calling
:func:`numba.targets.numpyimpl.numpy_ufunc_kernel` after defining
how to lower calls to the synthetic function.
The end result is similar to loop lifting in Numba's object mode.
Conclusions and Caveats
=======================
We have seen how to implement rewrites in Numba, starting with the
interface, and ending with an actual optimization. The key points of
this section are:
* When writing a good plug-in, the matcher should try to get a
go/no-go result as soon as possible.
* The rewrite application portion can be more computationally
expensive, but should still generate code that won't cause infinite
loops in the compiler.
* We use object state to communicate any results of matching to the
rewrite application pass.
.. Copyright (c) 2017 Intel Corporation
SPDX-License-Identifier: BSD-2-Clause
.. _arch-stencil:
=================
Notes on stencils
=================
Numba provides the :ref:`@stencil decorator <numba-stencil>` to
represent stencil computations. This document explains how this
feature is implemented in the several different modes available in
Numba. Currently, calls to the stencil from non-jitted code are
supported, as well as calls from jitted code, either with or without
the :ref:`parallel=True <parallel_jit_option>` option.
The stencil decorator
=====================
The stencil decorator itself just returns a ``StencilFunc`` object.
This object encapsulates the original stencil kernel function
as specified in the program and the options passed to the
stencil decorator. Also of note is that after the first compilation
of the stencil, the computed neighborhood of the stencil is
stored in the ``StencilFunc`` object in the ``neighborhood`` attribute.
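As a quick illustration (a sketch using the public ``@stencil`` API; the
attribute access at the end simply mirrors the caching behaviour described
above):

.. code-block:: python

    import numpy as np
    from numba import stencil

    @stencil
    def smooth(a):
        # Relative indexing into the input array; the inferred
        # neighborhood for this kernel is ((-1, 1),).
        return (a[-1] + a[0] + a[1]) / 3

    smooth(np.arange(10.0))     # first call triggers compilation
    print(smooth.neighborhood)  # populated after the first compilation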
Handling the three modes
========================
As mentioned above, Numba supports the calling of stencils
from inside or outside a ``@jit`` compiled function, with or
without the :ref:`parallel=True <parallel_jit_option>` option.
Outside jit context
-------------------
``StencilFunc`` overrides the ``__call__`` method so that calls
to ``StencilFunc`` objects execute the stencil::
    def __call__(self, *args, **kwargs):
        result = kwargs.get('out')
        new_stencil_func = self._stencil_wrapper(result, None, *args)
        if result is None:
            return new_stencil_func.entry_point(*args)
        else:
            return new_stencil_func.entry_point(*args, result)
First, the presence of the optional :ref:`out <stencil-function-out>`
parameter is checked. If it is present then the output array is
stored in ``result``. Then, the call to ``_stencil_wrapper``
generates the stencil function given the result and argument types
and finally the generated stencil function is executed and its result
returned.
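For example, a minimal sketch of both call paths described above:

.. code-block:: python

    import numpy as np
    from numba import stencil

    @stencil
    def diff(a):
        return a[1] - a[-1]

    a = np.arange(10.0)
    out = np.zeros_like(a)
    diff(a, out=out)   # ``out`` becomes ``result`` in ``__call__`` above
    res = diff(a)      # without ``out``, a new output array is allocated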
Jit without ``parallel=True``
-----------------------------
When constructed, a ``StencilFunc`` inserts itself into the typing
context's set of user functions and provides the ``_type_me``
callback. In this way, the standard Numba compiler is able to
determine the output type and signature of a ``StencilFunc``.
Each ``StencilFunc`` maintains a cache of previously seen combinations
of input argument types and keyword types. If previously seen,
the ``StencilFunc`` returns the computed signature. If not previously
computed, the ``StencilFunc`` computes the return type of the stencil
by running the Numba compiler frontend on the stencil kernel and
then performing type inference on the :term:`Numba IR` (IR) to get the scalar
return type of the kernel. From that, a Numpy array type is constructed
whose element type matches that scalar return type.
After computing the signature of the stencil for a previously
unseen combination of input and keyword types, the ``StencilFunc``
then :ref:`creates the stencil function <arch-stencil-create-function>` itself.
``StencilFunc`` then installs the new stencil function's definition
in the target context so that jitted code is able to call it.
Thus, in this mode, the generated stencil function is a stand-alone
function called like a normal function from within jitted code.
Jit with ``parallel=True``
--------------------------
When calling a ``StencilFunc`` from a jitted context with ``parallel=True``,
a separate stencil function as generated by :ref:`arch-stencil-create-function`
is not used. Instead, `parfors` (:ref:`parallel-accelerator`) are
created within the current function that implements the stencil.
This code again starts with the stencil kernel and does a similar kernel
size computation but then rather than standard Python looping syntax,
corresponding `parfors` are created so that the execution of the stencil
will take place in parallel.
The stencil to `parfor` translations can also be selectively disabled
by setting ``parallel={'stencil': False}``, among other sub-options
described in :ref:`parallel-accelerator`.
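For example, a sketch contrasting the two jitted modes (the sub-option
dictionary form is the one referenced above):

.. code-block:: python

    import numpy as np
    from numba import njit, stencil

    @stencil
    def kernel(a):
        return 0.5 * (a[-1] + a[1])

    @njit(parallel=True)
    def run_parfor(a):
        # The stencil call is translated into parfors in this mode.
        return kernel(a)

    @njit(parallel={'stencil': False})
    def run_standalone(a):
        # Other parallel transforms stay enabled, but the stencil call
        # uses the stand-alone generated stencil function instead.
        return kernel(a)

    a = np.ones(16)
    run_parfor(a)
    run_standalone(a)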
.. _arch-stencil-create-function:
Creating the stencil function
=============================
Conceptually, a stencil function is created from the user-specified
stencil kernel by adding looping code around the kernel, transforming
the relative kernel indices into absolute array indices based on the
loop indices, and replacing the kernel's ``return`` statement with
a statement to assign the computed value into the output array.
To accomplish this transformation, first, a copy of the stencil
kernel IR is created so that subsequent modifications of the IR
for different stencil signatures will not affect each other.
Then, an approach similar to how GUFunc's are created for `parfors`
is employed. In a text buffer, a Python function is created with
a unique name. The input array parameter is added to the function
definition and if the ``out`` argument type is present then an
``out`` parameter is added to the stencil function definition.
If the ``out`` argument is not present then first an output array
is created with ``numpy.zeros`` having the same shape as the
input array.
The kernel is then analyzed to compute the stencil size and the
shape of the boundary (or the ``neighborhood`` stencil decorator
argument is used for this purpose if present).
Then, one ``for`` loop for each dimension of the input array is
added to the stencil function definition. The range of each
loop is controlled by the stencil kernel size previously computed
so that the boundary of the output image is not modified but instead
left as is. The body of the innermost ``for`` loop is a single
``sentinel`` statement that is easily recognized in the IR.
A call to ``exec`` with the text buffer is used to force the
stencil function into existence and an ``eval`` is used to get
access to the corresponding function on which ``run_frontend`` is
used to get the stencil function IR.
Various renaming and relabeling is performed on the stencil function
IR and the kernel IR so that the two can be combined without conflict.
The relative indices in the kernel IR (i.e., ``getitem`` calls) are
replaced with expressions where the corresponding loop index variables
are added to the relative indices. The ``return`` statement in the
kernel IR is replaced with a ``setitem`` for the corresponding element
in the output array.
The stencil function IR is then scanned for the sentinel and the
sentinel replaced with the modified kernel IR.
Next, ``compile_ir`` is used to compile the combined stencil function
IR. The resulting compile result is cached in the ``StencilFunc`` so that
other calls to the same stencil do not need to undertake this process
again.
Exceptions raised
=================
Various checks are performed during stencil compilation to make sure
that user-specified options do not conflict with each other or with
other runtime parameters. For example, if the user has manually
specified a ``neighborhood`` to the stencil decorator, the length of
that neighborhood must match the dimensionality of the input array.
If this is not the case, a ``ValueError`` is raised.
If the neighborhood has not been specified then it must be inferred
and a requirement to infer the kernel is that all indices are constant
integers. If they are not, a ``ValueError`` is raised indicating that
kernel indices may not be non-constant.
Finally, the stencil implementation detects the output array type
by running Numba type inference on the stencil kernel. If the
return type of this kernel does not match the type of the value
passed to the ``cval`` stencil decorator option then a ``ValueError``
is raised.
==========================
Notes on Target Extensions
==========================
.. warning:: All features and APIs described in this page are in-development and
may change at any time without deprecation notices being issued.
Inheriting compiler flags from the caller
=========================================
Compiler flags, i.e. options such as ``fastmath``, ``nrt`` in
``@jit(nrt=True, fastmath=True)`` are specified per-function but their
effects are not well-defined---some flags affect the entire callgraph, some
flags affect only the current function. Sometimes it is necessary for callees
to inherit flags from the caller; for example the ``fastmath`` flag should be
infectious.
To address the problem, the following are needed:
1. Better definitions for the semantics of compiler flags. Preferably, all flags should
limit their effect to the current function. (TODO)
2. Allow compiler flags to be inherited from the caller. (Done)
3. Consider compiler flags in function resolution. (TODO)
:class:`numba.core.targetconfig.ConfigStack` is used to propagate the compiler flags
throughout the compiler. At the start of the compilation, the flags are pushed
into the ``ConfigStack``, which maintains a thread-local stack for the
compilation. Thus, callees can check the flags in the caller.
.. autoclass:: numba.core.targetconfig.ConfigStack
:members:
Compiler flags
--------------
`Compiler flags`_ are defined as a subclass of ``TargetConfig``:
.. _Compiler flags: https://github.com/numba/numba/blob/7e8538140ce3f8d01a5273a39233b5481d8b20b1/numba/core/compiler.py#L39
.. autoclass:: numba.core.targetconfig.TargetConfig
:members:
These are internal compiler flags and they are different from the user-facing
options used in the jit decorators.
Internally, `the user-facing options are mapped to the internal compiler flags <https://github.com/numba/numba/blob/7e8538140ce3f8d01a5273a39233b5481d8b20b1/numba/core/options.py#L72>`_
by :class:`numba.core.options.TargetOptions`. Each target can override the
default compiler flags and control the flag inheritance in
``TargetOptions.finalize``. `The CPU target overrides it.
<https://github.com/numba/numba/blob/7e8538140ce3f8d01a5273a39233b5481d8b20b1/numba/core/cpu.py#L259>`_
.. autoclass:: numba.core.options.TargetOptions
:members: finalize
In :meth:`numba.core.options.TargetOptions.finalize`,
use :meth:`numba.core.targetconfig.TargetConfig.inherit_if_not_set`
to request a compiler flag from the caller if it is not set for the current
function.
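A minimal sketch of such an override (the class name is hypothetical; only
``finalize`` and ``inherit_if_not_set`` are taken from the API described
above):

.. code-block:: python

    from numba.core.options import TargetOptions

    class MyTargetOptions(TargetOptions):
        def finalize(self, flags, options):
            # If the user did not set ``fastmath`` on this function,
            # request the value used by the caller instead of the
            # global default.
            flags.inherit_if_not_set("fastmath")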
=========================================
Notes on Numba's threading implementation
=========================================
The execution of the work presented by the Numba ``parallel`` targets is
undertaken by the Numba threading layer. Practically, the "threading layer"
is a Numba built-in library that can perform the required concurrent execution.
At the time of writing there are three threading layers available, each
implemented via a different lower level native threading library. More
information on the threading layers and appropriate selection of a threading
layer for a given application/system can be found in the
:ref:`threading layer documentation <numba-threading-layer>`.
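For reference, a threading layer can be requested programmatically before the
first parallel execution (a sketch using the public API, assuming the TBB
layer is installed):

.. code-block:: python

    import numpy as np
    from numba import config, njit, prange, threading_layer

    config.THREADING_LAYER = 'tbb'   # request TBB before any parallel execution

    @njit(parallel=True)
    def double(a):
        for i in prange(a.shape[0]):
            a[i] *= 2
        return a

    double(np.ones(100))
    print(threading_layer())         # reports the layer that was actually selected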
The pertinent information to note for the following sections is that the
function in the threading library that performs the parallel execution is the
``parallel_for`` function. The job of this function is to both orchestrate and
execute the parallel tasks.
The relevant source files referenced in this document are
- ``numba/np/ufunc/tbbpool.cpp``
- ``numba/np/ufunc/omppool.cpp``
- ``numba/np/ufunc/workqueue.c``
These files contain the TBB, OpenMP, and workqueue threadpool
implementations, respectively. Each includes the functions
``set_num_threads()``, ``get_num_threads()``, and ``get_thread_id()``, as
well as the relevant logic for thread masking in their respective
schedulers. Note that the basic thread local variable logic is duplicated in
each of these files, and not shared between them.
- ``numba/np/ufunc/parallel.py``
This file contains the Python and JIT compatible wrappers for
``set_num_threads()``, ``get_num_threads()``, and ``get_thread_id()``, as
well as the code that loads the above libraries into Python and launches the
threadpool.
- ``numba/parfors/parfor_lowering.py``
This file contains the main logic for generating code for the parallel
backend. The thread mask is accessed in this file in the code that generates
scheduler code, and passed to the relevant backend scheduler function (see
below).
Thread masking
--------------
As part of its design, Numba never launches new threads beyond the threads
that are launched initially with ``numba.np.ufunc.parallel._launch_threads()``
when the first parallel execution is run. This is due to the way threads were
already implemented in Numba prior to thread masking being implemented. This
restriction was kept to keep the design simple, although it could be removed
in the future. Consequently, it's possible to programmatically set the number
of threads, but only to less than or equal to the total number that have
already been launched. This is done by "masking" out unused threads, causing
them to do no work. For example, on a 16 core machine, if the user were to
call ``set_num_threads(4)``, Numba would always have 16 threads present, but
12 of them would sit idle for parallel computations. A further call to
``set_num_threads(16)`` would cause those same threads to do work in later
computations.
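A sketch of this from user code, using the public masking API (assuming at
least four threads were launched, i.e. ``NUMBA_NUM_THREADS >= 4``):

.. code-block:: python

    import numpy as np
    from numba import config, njit, prange, set_num_threads, get_num_threads

    @njit(parallel=True)
    def total(a):
        s = 0.0
        for i in prange(a.shape[0]):
            s += a[i]
        return s

    a = np.ones(1_000_000)
    set_num_threads(4)                         # mask: only 4 launched threads do work
    total(a)
    set_num_threads(config.NUMBA_NUM_THREADS)  # unmask all launched threads again
    total(a)
    print(get_num_threads())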
:ref:`Thread masking <numba-threading-layer-thread-masking>` was added to make
it possible for a user to programmatically alter the number of threads
performing work in the threading layer. Thread masking proved challenging to
implement as it required the development of a programming model that is suitable
for users, easy to reason about, and could be implemented safely, with
consistent behavior across the various threading layers.
Programming model
~~~~~~~~~~~~~~~~~
The programming model chosen is similar to that found in OpenMP. The reasons
for this choice were that it is familiar to a lot of users, restricted in
scope and also simple. The number of threads in use is specified by calling
``set_num_threads`` and the number of threads in use can be queried by calling
``get_num_threads``. These two functions are synonymous with their OpenMP
counterparts (with the above restriction that the mask must be less than or
equal to the number of launched threads). The execution semantics are also
similar to OpenMP in that once a parallel region is launched, altering the
thread mask has no impact on the currently executing region, but will have an
impact on parallel regions executed subsequently.
The Implementation
~~~~~~~~~~~~~~~~~~
So as to place no further restrictions on user code other than those that
already existed in the threading layer libraries, careful consideration of the
design of thread masking was required. The "thread mask" cannot be stored in a
global value as concurrent use of the threading layer may result in classic
forms of race conditions on the value itself. Numerous designs were discussed
involving various types of mutex on such a global value, all of which were
eventually broken through thought experiment alone. It eventually transpired
that, following some OpenMP implementations, the "thread mask" is best
implemented as a ``thread local``. This means each thread that executes a Numba
parallel function will have a thread local storage (TLS) slot that contains the
value of the thread mask to use when scheduling threads in the ``parallel_for``
function.
The above notion of TLS use for a thread mask is relatively easy to implement,
``get_num_threads`` and ``set_num_threads`` simply need to address the TLS slot
in a given threading layer. This also means that the execution schedule for a
parallel region can be derived from a run time call to ``get_num_threads``. This
is achieved via a well-known and relatively easy to implement pattern:
a ``C`` library function is registered and wrapped in the internal Numba
implementation.
In addition to satisfying the original upfront thread masking requirements, a
few more complicated scenarios needed consideration as follows.
Nested parallelism
******************
In all threading layers a "main thread" will invoke the ``parallel_for``
function and then in the parallel region, depending on the threading layer,
some number of additional threads will assist in doing the actual work.
If the work contains a call to another parallel function (i.e. nested
parallelism) it is necessary for the thread making the call to know what the
"thread mask" of the main thread is so that it can propagate it into the
``parallel_for`` call it makes when executing the nested parallel function.
The implementation of this behavior is threading layer specific but the general
principle is for the "main thread" to always "send" the value of the thread mask
from its TLS slot to all threads in the threading layer that are active in the
parallel region. These active threads then update their TLS slots with this
value prior to performing any work. The net result of this implementation detail
is that:
* thread masks correctly propagate into nested functions
* it's still possible for each thread in a parallel region to safely have a
different mask with which to call nested functions, if it's not set explicitly
then the inherited mask from the "main thread" is used
* threading layers which have dynamic scheduling with threads potentially
joining and leaving the active pool during a ``parallel_for`` execution are
successfully accommodated
* any "main thread" thread mask is entirely decoupled from the in-flux nature
of the thread masks of the threads in the active thread pool
Python threads independently invoking parallel functions
********************************************************
The threading layer launch sequence is heavily guarded to ensure that the
launch is both thread and process safe and run once per process. In a system
with numerous Python ``threading`` module threads all using Numba, the first
thread through the launch sequence will get its thread mask set appropriately,
but no further threads can run the launch sequence. This means that other
threads will need their initial thread mask set some other way. This is
achieved when ``get_num_threads`` is called and no thread mask is present, in
this case the thread mask will be set to the default. In the implementation,
"no thread mask is present" is represented by the value ``-1`` and the "default
thread mask" (unset) is represented by the value ``0``. The implementation also
immediately calls ``set_num_threads(NUMBA_NUM_THREADS)`` after doing this, so
if either ``-1`` or ``0`` is encountered as a result from ``get_num_threads()`` it
indicates a bug in the above processes.
OS ``fork()`` calls
*******************
The use of TLS was also in part driven by Linux (by far the most popular
platform for Numba use) having a ``fork(2, 3P)`` call that will do TLS
propagation into child processes, see ``clone(2)``\ 's ``CLONE_SETTLS``.
Thread ID
*********
A private ``get_thread_id()`` function was added to each threading backend,
which returns a unique ID for each thread. This can be accessed from Python by
``numba.np.ufunc.parallel._get_thread_id()`` (it can also be used inside a
JIT compiled function). The thread ID function is useful for testing that the
thread masking behavior is correct, but it should not be used outside of the
tests. For example, one can call ``set_num_threads(4)`` and then collect all
unique ``_get_thread_id()``\ s in a parallel region to verify that only 4
threads are run.
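A sketch of such a test, based directly on the description above
(``_get_thread_id`` is private and used here purely for illustration; the
example assumes at least four threads were launched):

.. code-block:: python

    import numpy as np
    from numba import njit, prange, set_num_threads
    from numba.np.ufunc.parallel import _get_thread_id

    @njit(parallel=True)
    def thread_ids(n):
        out = np.empty(n, dtype=np.int64)
        for i in prange(n):
            out[i] = _get_thread_id()
        return out

    set_num_threads(4)
    unique = set(thread_ids(100_000))
    assert len(unique) <= 4   # TBB may schedule fewer threads, see the caveats below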
Caveats
~~~~~~~
Some caveats to be aware of when testing thread masking:
- The TBB backend may choose to schedule fewer than the given mask number of
threads. Thus a test such as the one described above may return fewer than 4
unique threads.
- The workqueue backend is not threadsafe, so attempts to do multithreading
nested parallelism with it may result in deadlocks or other undefined
behavior. The workqueue backend will raise a SIGABRT signal if it detects
nested parallelism.
- Certain backends may reuse the main thread for computation, but this
behavior shouldn't be relied upon (for instance, if propagating exceptions).
Use in Code Generation
~~~~~~~~~~~~~~~~~~~~~~
The general pattern for using ``get_num_threads`` in code generation is
.. code:: python

    from llvmlite import ir as llvmir

    get_num_threads = cgutils.get_or_insert_function(
        builder.module,
        llvmir.FunctionType(llvmir.IntType(types.intp.bitwidth), []),
        name="get_num_threads")
    num_threads = builder.call(get_num_threads, [])
    with cgutils.if_unlikely(builder, builder.icmp_signed('<=', num_threads,
                                                          num_threads.type(0))):
        cgutils.printf(builder, "num_threads: %d\n", num_threads)
        context.call_conv.return_user_exc(builder, RuntimeError,
                                          ("Invalid number of threads. "
                                           "This likely indicates a bug in Numba.",))

    # Pass num_threads through to the appropriate backend function here
See the code in ``numba/parfors/parfor_lowering.py``.
The guard against ``num_threads`` being <= 0 is not strictly necessary, but it
can protect against accidentally incorrect behavior in case the thread masking
logic contains a bug.
The ``num_threads`` variable should be passed through to the appropriate
backend function, such as ``do_scheduling`` or ``parallel_for``. If it's used
in some way other than passing it through to the backend function, the above
considerations should be taken into account to ensure the use of the
``num_threads`` variable is safe. It would probably be better to keep such
logic in the threading backends, rather than trying to do it in code
generation.
.. _chunk-details-label:
Parallel Chunksize Details
~~~~~~~~~~~~~~~~~~~~~~~~~~
There are some cases in which the actual parallel work chunk sizes may differ
from the chunk size requested through :func:`numba.set_parallel_chunksize`.
First, if the number of required chunks based on the specified chunk size
is less than the number of configured threads then Numba will use all of the configured
threads to execute the parallel region. In this case, the actual chunk size will be
less than the requested chunk size. Second, due to truncation, in cases where the
iteration count is slightly less than a multiple of the chunk size
(e.g., 14 iterations and a specified chunk size of 5), the actual chunk size will be
larger than the specified chunk size. As in the given example, the number of chunks
would be 2 and the actual chunk size would be 7 (i.e. 14 / 2). Lastly, since Numba
divides an N-dimensional iteration space into N-dimensional (hyper)rectangular chunks,
it may be the case there are not N integer factors whose product is equal to the chunk
size. In this case, some chunks will have an area/volume larger than the chunk size
whereas others will be less than the specified chunk size.
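For example, a sketch using the public chunk size API; the 14-iteration case
mirrors the arithmetic above:

.. code-block:: python

    import numpy as np
    from numba import njit, prange, get_parallel_chunksize, set_parallel_chunksize

    @njit(parallel=True)
    def chunked_sum(a):
        old = get_parallel_chunksize()
        set_parallel_chunksize(5)    # request chunks of 5 iterations
        s = 0.0
        for i in prange(a.shape[0]):
            s += a[i]
        set_parallel_chunksize(old)
        return s

    # 14 iterations with a requested chunk size of 5: truncation gives
    # 14 // 5 == 2 chunks, so the actual chunk size becomes 7.
    chunked_sum(np.ones(14))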
Registering Extensions with Entry Points
========================================
Often, third party packages will have a user-facing API as well as define
extensions to the Numba compiler. In those situations, the new types and
overloads can be registered with Numba when the package is imported by the user.
However, there are situations where a Numba extension would not normally be
imported directly by the user, but must still be registered with the Numba
compiler. An example of this is the `numba-scipy
<https://github.com/numba/numba-scipy>`_ package, which adds support for some
SciPy functions to Numba. The end user does not need to ``import
numba_scipy`` to enable compiler support for SciPy, the extension only needs
to be installed in the Python environment.
Numba discovers extensions using the `entry points
<https://setuptools.readthedocs.io/en/latest/setuptools.html#dynamic-discovery-of-services-and-plugins>`_
feature of ``setuptools``. This allows a Python package to register an
initializer function that will be called before ``numba`` compiles for the
first time. The delay ensures that the cost of importing extensions is
deferred until it is necessary.
Adding Support for the "Init" Entry Point
-----------------------------------------
A package can register an initialization function with Numba by adding the
``entry_points`` argument to the ``setup()`` function call in ``setup.py``:
.. code-block:: python

    setup(
        ...,
        entry_points={
            "numba_extensions": [
                "init = numba_scipy:_init_extension",
            ],
        },
        ...
    )
Numba currently only looks for the ``init`` entry point in the
``numba_extensions`` group. The entry point should be a function (any name,
as long as it matches what is listed in ``setup.py``) that takes no arguments,
and the return value is ignored. This function should register types,
overloads, or call other Numba extension APIs. The order of initialization of
extensions is undefined.
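For illustration, a sketch of what such an initializer might look like (the
package and submodule names are hypothetical; only the entry point mechanics
follow the description above):

.. code-block:: python

    # mypkg/__init__.py -- registered as "init = mypkg:_init_extension"

    def _init_extension():
        """Called once by Numba just before the first compilation.

        Importing the submodule registers its types and ``@overload``
        definitions as a side effect; the return value is ignored.
        """
        from . import _numba_overloads   # hypothetical module of @overload definitions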
Testing your Entry Point
------------------------
Numba loads all entry points when the first function is compiled. To test your
entry point, it is not sufficient to just ``import numba``; you have to define
and run a small function, like this:
.. code-block:: python

    import numba; numba.njit(lambda x: x + 1)(123)
It is not necessary to import your module: entry points are identified by the
``entry_points.txt`` file in your library's ``*.egg-info`` directory.
The ``setup.py build`` command does not create eggs, but ``setup.py sdist``
(for testing in a local directory) and ``setup.py install`` do. All entry points
registered in eggs that are on the Python path are loaded. Be sure to check for
stale ``entry_points.txt`` when debugging.