Event API
=========
.. automodule:: numba.core.event
   :members:
.. _arch-generators:
===================
Notes on generators
===================
Numba recently gained support for compiling generator functions. This
document explains some of the implementation choices.
Terminology
===========
For clarity, we distinguish between *generator functions* and
*generators*. A generator function is a function containing one or
several ``yield`` statements. A generator (sometimes also called "generator
iterator") is the return value of a generator function; it resumes
execution inside its frame each time :py:func:`next` is called.
A *yield point* is the place where a ``yield`` statement is called.
A *resumption point* is the place just after a *yield point* where execution
is resumed when :py:func:`next` is called again.
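
For instance, in pure Python (a minimal sketch illustrating the terminology only)::

    def gen_func(x):       # a generator function: it contains ``yield``
        yield x            # yield point; the resumption point is just after it
        yield x + 1

    g = gen_func(10)       # ``g`` is a generator (generator iterator)
    next(g)                # returns 10, running up to the first yield point
    next(g)                # returns 11, resuming at the first resumption point
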
Function analysis
=================
Suppose we have the following simple generator function::

    def gen(x, y):
        yield x + y
        yield x - y

Here is its CPython bytecode, as printed out using :py:func:`dis.dis`::
7 0 LOAD_FAST 0 (x)
3 LOAD_FAST 1 (y)
6 BINARY_ADD
7 YIELD_VALUE
8 POP_TOP
8 9 LOAD_FAST 0 (x)
12 LOAD_FAST 1 (y)
15 BINARY_SUBTRACT
16 YIELD_VALUE
17 POP_TOP
18 LOAD_CONST 0 (None)
21 RETURN_VALUE
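
The listing above can be reproduced with a small script (a sketch; the exact
opcodes vary with the CPython version)::

    import dis

    def gen(x, y):
        yield x + y
        yield x - y

    dis.dis(gen)
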
When compiling this function with :envvar:`NUMBA_DUMP_IR` set to 1, the
following information is printed out::
----------------------------------IR DUMP: gen----------------------------------
label 0:
x = arg(0, name=x) ['x']
y = arg(1, name=y) ['y']
$0.3 = x + y ['$0.3', 'x', 'y']
$0.4 = yield $0.3 ['$0.3', '$0.4']
del $0.4 []
del $0.3 []
$0.7 = x - y ['$0.7', 'x', 'y']
del y []
del x []
$0.8 = yield $0.7 ['$0.7', '$0.8']
del $0.8 []
del $0.7 []
$const0.9 = const(NoneType, None) ['$const0.9']
$0.10 = cast(value=$const0.9) ['$0.10', '$const0.9']
del $const0.9 []
return $0.10 ['$0.10']
------------------------------GENERATOR INFO: gen-------------------------------
generator state variables: ['$0.3', '$0.7', 'x', 'y']
yield point #1: live variables = ['x', 'y'], weak live variables = ['$0.3']
yield point #2: live variables = [], weak live variables = ['$0.7']
What does it mean? The first part is the Numba IR, as already seen in
:ref:`arch_generate_numba_ir`. We can see the two yield points (``yield $0.3``
and ``yield $0.7``).
The second part shows generator-specific information. To understand it
we have to understand what suspending and resuming a generator means.
When suspending a generator, we are not merely returning a value to the
caller (the operand of the ``yield`` statement). We also have to save the
generator's *current state* in order to resume execution. In trivial use
cases, perhaps the CPU's register values or stack slots would be preserved
until the next call to next(). However, any non-trivial case will hopelessly
clobber those values, so we have to save them in a well-defined place.
What are the values we need to save? Well, in the context of the Numba
Intermediate Representation, we must save all *live variables* at each
yield point. These live variables are computed thanks to the control
flow graph.
Once live variables are saved and the generator is suspended, resuming
the generator simply involves the inverse operation: the live variables
are restored from the saved generator state.
.. note::
It is the same analysis which helps insert Numba ``del`` instructions
where appropriate.
Let's go over the generator info again::
generator state variables: ['$0.3', '$0.7', 'x', 'y']
yield point #1: live variables = ['x', 'y'], weak live variables = ['$0.3']
yield point #2: live variables = [], weak live variables = ['$0.7']
Numba has computed the union of all live variables (denoted as "state
variables"). This will help define the layout of the :ref:`generator
structure <generator-structure>`. Also, for each yield point, we have
computed two sets of variables:
* the *live variables* are the variables which are used by code following
the resumption point (i.e. after the ``yield`` statement)
* the *weak live variables* are variables which are del'ed immediately
after the resumption point; they have to be saved in :term:`object mode`,
to ensure proper reference cleanup
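
The dumps above can be reproduced with a short script (a sketch; it assumes
:envvar:`NUMBA_DUMP_IR` is set to ``1`` in the environment before Numba is
imported)::

    from numba import njit

    @njit
    def gen(x, y):
        yield x + y
        yield x - y

    # Compiling for a concrete signature triggers the IR and generator info dumps.
    list(gen(1.0, 2.0))
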
.. _generator-structure:
The generator structure
=======================
Layout
------
Function analysis helps us gather enough information to define the
layout of the generator structure, which will store the entire execution
state of a generator. Here is a sketch of the generator structure's layout,
in pseudo-code::

   struct gen_struct_t {
      int32_t resume_index;
      struct gen_args_t {
         arg_0_t arg0;
         arg_1_t arg1;
         ...
         arg_N_t argN;
      }
      struct gen_state_t {
         state_0_t state_var0;
         state_1_t state_var1;
         ...
         state_N_t state_varN;
      }
   }

Let's describe those fields in order.
* The first member, the *resume index*, is an integer telling the generator
at which resumption point execution must resume. By convention, it can
have two special values: 0 means execution must start at the beginning of
the generator (i.e. the first time :py:func:`next` is called); -1 means
the generator is exhausted and resumption must immediately raise
StopIteration. Other values indicate the yield point's index starting from 1
(corresponding to the indices shown in the generator info above).
* The second member, the *arguments structure*, is read-only after it is first
initialized. It stores the values of the arguments the generator function
was called with. In our example, these are the values of ``x`` and ``y``.
* The third member, the *state structure*, stores the live variables as
computed above.
Concretely, our example's generator structure (assuming the generator
function is called with floating-point numbers) is then::

   struct gen_struct_t {
      int32_t resume_index;
      struct gen_args_t {
         double arg0;
         double arg1;
      }
      struct gen_state_t {
         double $0.3;
         double $0.7;
         double x;
         double y;
      }
   }

Note that here, saving ``x`` and ``y`` is redundant: Numba isn't able to
recognize that the state variables ``x`` and ``y`` have the same value
as ``arg0`` and ``arg1``.
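
For illustration only, a hypothetical :py:mod:`ctypes` mirror of this concrete
layout might look as follows (Numba does not expose the structure this way and
the field names are invented)::

    import ctypes

    class GenArgs(ctypes.Structure):
        _fields_ = [("arg0", ctypes.c_double), ("arg1", ctypes.c_double)]

    class GenState(ctypes.Structure):
        # one slot per generator state variable: $0.3, $0.7, x, y
        _fields_ = [("tmp0", ctypes.c_double), ("tmp1", ctypes.c_double),
                    ("x", ctypes.c_double), ("y", ctypes.c_double)]

    class GenStruct(ctypes.Structure):
        _fields_ = [("resume_index", ctypes.c_int32),
                    ("args", GenArgs),
                    ("state", GenState)]
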
Allocation
----------
How does Numba ensure the generator structure is preserved long enough?
There are two cases:
* When a Numba-compiled generator function is called from a Numba-compiled
function, the structure is allocated on the stack by the callee. In this
case, generator instantiation is practically costless.
* When a Numba-compiled generator function is called from regular Python
code, a CPython-compatible wrapper is instantiated that has the right
amount of allocated space to store the structure, and whose
:c:member:`~PyTypeObject.tp_iternext` slot is a wrapper around the
generator's native code.
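
To illustrate the two cases above (a minimal sketch)::

    from numba import njit

    @njit
    def gen(x, y):
        yield x + y
        yield x - y

    @njit
    def consume(x, y):
        total = 0.0
        # Called from Numba-compiled code: the generator structure can live on
        # the stack, so instantiation is practically free.
        for v in gen(x, y):
            total += v
        return total

    consume(1.0, 2.0)

    # Called from regular Python code: a CPython-compatible wrapper object
    # holding the structure is instantiated instead.
    list(gen(1.0, 2.0))
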
Compiling to native code
========================
When compiling a generator function, three native functions are actually
generated by Numba:
* An initialization function. This is the function corresponding
to the generator function itself: it receives the function arguments and
stores them inside the generator structure (which is passed by pointer).
It also initializes the *resume index* to 0, indicating that the generator
hasn't started yet.
* A next() function. This is the function called to resume execution
inside the generator. Its single argument is a pointer to the generator
structure and it returns the next yielded value (or a special exit code
is used if the generator is exhausted, for quick checking when called
from Numba-compiled functions).
* An optional finalizer. In object mode, this function ensures that all
live variables stored in the generator state are decref'ed, even if the
generator is destroyed without having been exhausted.
The next() function
-------------------
The next() function is the least straightforward of the three native
functions. It starts with a trampoline which dispatches execution to the
right resume point depending on the *resume index* stored in the generator
structure. Here is what the start of the function may look like in our example:
.. code-block:: llvm

   define i32 @"__main__.gen.next"(
      double* nocapture %retptr,
      { i8*, i32 }** nocapture readnone %excinfo,
      i8* nocapture readnone %env,
      { i32, { double, double }, { double, double, double, double } }* nocapture %arg.gen)
   {
     entry:
      %gen.resume_index = getelementptr { i32, { double, double }, { double, double, double, double } }* %arg.gen, i64 0, i32 0
      %.47 = load i32* %gen.resume_index, align 4
      switch i32 %.47, label %stop_iteration [
         i32 0, label %B0
         i32 1, label %generator_resume1
         i32 2, label %generator_resume2
      ]
      ; rest of the function snipped

(uninteresting stuff trimmed from the LLVM IR to make it more readable)
We recognize the pointer to the generator structure in ``%arg.gen``.
The trampoline switch has three targets (one for each *resume index* 0, 1
and 2), and a fallback target label named ``stop_iteration``. Label ``B0``
represents the function's start, ``generator_resume1`` (resp.
``generator_resume2``) is the resumption point after the first
(resp. second) yield point.
After generation by LLVM, the whole native assembly code for this function
may look like this (on x86-64):
.. code-block:: asm
.globl __main__.gen.next
.align 16, 0x90
__main__.gen.next:
movl (%rcx), %eax
cmpl $2, %eax
je .LBB1_5
cmpl $1, %eax
jne .LBB1_2
movsd 40(%rcx), %xmm0
subsd 48(%rcx), %xmm0
movl $2, (%rcx)
movsd %xmm0, (%rdi)
xorl %eax, %eax
retq
.LBB1_5:
movl $-1, (%rcx)
jmp .LBB1_6
.LBB1_2:
testl %eax, %eax
jne .LBB1_6
movsd 8(%rcx), %xmm0
movsd 16(%rcx), %xmm1
movaps %xmm0, %xmm2
addsd %xmm1, %xmm2
movsd %xmm1, 48(%rcx)
movsd %xmm0, 40(%rcx)
movl $1, (%rcx)
movsd %xmm2, (%rdi)
xorl %eax, %eax
retq
.LBB1_6:
movl $-3, %eax
retq
Note that the function returns 0 to indicate a value is yielded and -3 to indicate
StopIteration. ``%rcx`` points to the start of the generator structure,
where the resume index is stored.
================
Notes on Hashing
================
Numba supports the built-in :func:`hash` and does so by simply calling the
:func:`__hash__` member function on the supplied argument. This makes it
trivial to add hash support for new types: all that is required is to use the
extension API :func:`overload_method` decorator to register an overload of the
new type's :func:`__hash__` method that computes the hash value. For example::

    from numba.extending import overload_method

    @overload_method(myType, '__hash__')
    def myType_hash_overload(obj):
        # implementation details

The Implementation
==================
The implementation of the Numba hashing functions strictly follows that of
Python 3. The only exception to this is that for hashing Unicode and bytes (for
content longer than ``sys.hash_info.cutoff``) the only supported algorithm is
``siphash24`` (default in CPython 3). As a result Numba will match Python 3
hash values for all supported types under the default conditions described.
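
As a quick illustration of the built-in support (a minimal sketch; equality
with CPython's values assumes the default hashing configuration described
above)::

    from numba import njit

    @njit
    def hash_it(x):
        # hash() simply resolves to the __hash__ overload for the type of x
        return hash(x)

    assert hash_it((1, 2.0)) == hash((1, 2.0))
    assert hash_it('numba') == hash('numba')
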
Unicode hash cache differences
------------------------------
Both Numba and CPython Unicode string internal representations have a ``hash``
member for the purposes of caching the string's hash value. This member is
always checked ahead of computing a hash value with the view of simply providing
a value from cache as it is considerably cheaper to do so. The Numba Unicode
string hash caching implementation behaves in a similar way to that of
CPython's. The only notable behavioral change (and its only impact is a minor
potential change in performance) is that Numba always computes and caches the
hash for Unicode strings created in ``nopython mode`` at the time they are boxed
for reuse in Python. This is too eager in some cases compared to CPython,
which may delay hashing a new Unicode string depending on how it was created. It
should also be noted that Numba copies in the ``hash`` member of the CPython
internal representation for Unicode strings when unboxing them to its own
representation so as to not recompute the hash of a string that already has a
hash value associated with it.
The accommodation of ``PYTHONHASHSEED``
---------------------------------------
The ``PYTHONHASHSEED`` environment variable can be used to seed the CPython
hashing algorithms for e.g. the purposes of reproducibility. The Numba hashing
implementation directly reads the CPython hashing algorithms' internal state and
as a result the influence of ``PYTHONHASHSEED`` is replicated in Numba's
hashing implementations.
.. _developer-manual:
Developer Manual
================
.. toctree::
   :maxdepth: 2

   contributing.rst
   release.rst
   repomap.rst
   architecture.rst
   dispatching.rst
   generators.rst
   numba-runtime.rst
   rewrites.rst
   live_variable_analysis.rst
   listings.rst
   stencil.rst
   custom_pipeline.rst
   inlining.rst
   environment.rst
   hashing.rst
   caching.rst
   threading_implementation.rst
   literal.rst
   llvm_timings.rst
   debugging.rst
   event_api.rst
   target_extension.rst
   mission.rst
from numba import njit
import numba
from numba.core import ir


@njit(inline='never')
def never_inline():
    return 100


@njit(inline='always')
def always_inline():
    return 200


def sentinel_cost_model(expr, caller_info, callee_info):
    # this cost model will return True (i.e. do inlining) if either:
    # a) the callee IR contains an `ir.Const(37)`
    # b) the caller IR contains an `ir.Const(13)` logically prior to the call
    #    site

    # check the callee
    for blk in callee_info.blocks.values():
        for stmt in blk.body:
            if isinstance(stmt, ir.Assign):
                if isinstance(stmt.value, ir.Const):
                    if stmt.value.value == 37:
                        return True

    # check the caller
    before_expr = True
    for blk in caller_info.blocks.values():
        for stmt in blk.body:
            if isinstance(stmt, ir.Assign):
                if isinstance(stmt.value, ir.Expr):
                    if stmt.value == expr:
                        before_expr = False
                if isinstance(stmt.value, ir.Const):
                    if stmt.value.value == 13:
                        return True & before_expr
    return False


@njit(inline=sentinel_cost_model)
def maybe_inline1():
    # Will not inline based on the callee IR with the declared cost model
    # The following is ir.Const(300).
    return 300


@njit(inline=sentinel_cost_model)
def maybe_inline2():
    # Will inline based on the callee IR with the declared cost model
    # The following is ir.Const(37).
    return 37


@njit
def foo():
    a = never_inline()  # will never inline
    b = always_inline()  # will always inline

    # will not inline as the function does not contain a magic constant known to
    # the cost model, and the IR up to the call site does not contain a magic
    # constant either
    d = maybe_inline1()

    # declare this magic constant to trigger inlining of maybe_inline1 in a
    # subsequent call
    magic_const = 13

    # will inline due to above constant declaration
    e = maybe_inline1()

    # will inline as the maybe_inline2 function contains a magic constant known
    # to the cost model
    c = maybe_inline2()

    return a + b + c + d + e + magic_const


foo()
import numba
from numba.extending import overload
from numba import njit, types


def bar(x):
    """A function stub to overload"""
    pass


@overload(bar, inline='always')
def ol_bar_tuple(x):
    # An overload that will always inline, there is a type guard so that this
    # only applies to UniTuples.
    if isinstance(x, types.UniTuple):
        def impl(x):
            return x[0]
        return impl


def cost_model(expr, caller, callee):
    # Only inline if the type of the argument is an Integer
    return isinstance(caller.typemap[expr.args[0].name], types.Integer)


@overload(bar, inline=cost_model)
def ol_bar_scalar(x):
    # An overload that will inline based on a cost model, it only applies to
    # scalar values in the numerical domain as per the type guard on Number
    if isinstance(x, types.Number):
        def impl(x):
            return x + 1
        return impl


@njit
def foo():
    # This will resolve via `ol_bar_tuple` as the argument is a types.UniTuple
    # instance. It will always be inlined as specified in the decorator for this
    # overload.
    a = bar((1, 2, 3))

    # This will resolve via `ol_bar_scalar` as the argument is a types.Number
    # instance, hence the cost_model will be used to determine whether to
    # inline.
    # The function will be inlined as the value 100 is an IntegerLiteral which
    # is an instance of a types.Integer as required by the cost_model function.
    b = bar(100)

    # This will also resolve via `ol_bar_scalar` as the argument is a
    # types.Number instance, again the cost_model will be used to determine
    # whether to inline.
    # The function will not be inlined as the complex value is not an instance
    # of a types.Integer as required by the cost_model function.
    c = bar(300j)

    return a + b + c


foo()
=================
Notes on Inlining
=================
There are occasions where it is useful to be able to inline a function at its
call site, at the Numba IR level of representation. The decorators such as
:func:`numba.jit`, :func:`numba.extending.overload` and
:func:`register_jitable` support the keyword argument ``inline``, to facilitate
this behaviour.
When attempting to inline at this level, it is important to understand what
purpose this serves and what effect this will have. In contrast to the inlining
performed by LLVM, which is aimed at improving performance, the main reason to
inline at the Numba IR level is to allow type inference to cross function
boundaries.
As an example, consider the following snippet:
.. code:: python

   from numba import njit

   @njit
   def bar(a):
       a.append(10)

   @njit
   def foo():
       z = []
       bar(z)

   foo()

This will fail to compile and run, because the type of ``z`` cannot be inferred
as it will only be refined within ``bar``. If we now add ``inline='always'`` to the
decorator for ``bar`` the snippet will compile and run. This is because inlining
the call to ``a.append(10)`` will mean that ``z`` will be refined to hold integers
and so type inference will succeed.
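
The fixed version might look like this (a sketch)::

    from numba import njit

    @njit(inline='always')
    def bar(a):
        a.append(10)

    @njit
    def foo():
        z = []
        bar(z)      # inlined, so ``z`` is refined to a list of integers

    foo()
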
So, to recap, inlining at the Numba IR level is unlikely to have a performance
benefit, whereas inlining at the LLVM level stands a better chance.
The ``inline`` keyword argument can be one of three values:
* The string ``'never'``, this is the default and results in the function not
being inlined under any circumstances.
* The string ``'always'``, this results in the function being inlined at all
call sites.
* A Python function that takes three arguments. The first argument is always the
``ir.Expr`` node that is the ``call`` requesting the inline; this is present
to allow the function to make decisions that take the call context into account.
The second and third arguments are:
* In the case of an untyped inline, i.e. that which occurs when using the
:func:`numba.jit` family of decorators, both arguments are
``numba.ir.FunctionIR`` instances. The second argument corresponds to the
IR of the caller and the third argument to the IR of the callee.
* In the case of a typed inline, i.e. that which occurs when using
:func:`numba.extending.overload`, both arguments are instances of a
``namedtuple`` with fields (corresponding to their standard use in the
compiler internals):
* ``func_ir`` - the function's Numba IR.
* ``typemap`` - the function's type map.
* ``calltypes`` - the call types of any calls in the function.
* ``signature`` - the function's signature.
The second argument holds the information from the caller, the third holds
the information from the callee.
In all cases the function should return True to inline and return False to not
inline; this essentially permits custom inlining rules (a typical use might
be a cost model).
* Recursive functions with ``inline='always'`` will result in a non-terminating
compilation. If you wish to avoid this, supply a function to limit the
recursion depth (see below).
.. note:: No guarantee is made about the order in which functions are assessed
for inlining or about the order in which they are inlined.
Example using :func:`numba.jit`
===============================
An example of using all three options to ``inline`` in the :func:`numba.njit`
decorator:
.. literalinclude:: inline_example.py
which produces the following when executed (with a print of the IR after the
legalization pass, enabled via the environment variable
``NUMBA_DEBUG_PRINT_AFTER="ir_legalization"``):
.. code-block:: none
:emphasize-lines: 2, 3, 9, 16, 17, 21, 22, 26, 35
label 0:
$0.1 = global(never_inline: CPUDispatcher(<function never_inline at 0x7f890ccf9048>)) ['$0.1']
$0.2 = call $0.1(func=$0.1, args=[], kws=(), vararg=None) ['$0.1', '$0.2']
del $0.1 []
a = $0.2 ['$0.2', 'a']
del $0.2 []
$0.3 = global(always_inline: CPUDispatcher(<function always_inline at 0x7f890ccf9598>)) ['$0.3']
del $0.3 []
$const0.1.0 = const(int, 200) ['$const0.1.0']
$0.2.1 = $const0.1.0 ['$0.2.1', '$const0.1.0']
del $const0.1.0 []
$0.4 = $0.2.1 ['$0.2.1', '$0.4']
del $0.2.1 []
b = $0.4 ['$0.4', 'b']
del $0.4 []
$0.5 = global(maybe_inline1: CPUDispatcher(<function maybe_inline1 at 0x7f890ccf9ae8>)) ['$0.5']
$0.6 = call $0.5(func=$0.5, args=[], kws=(), vararg=None) ['$0.5', '$0.6']
del $0.5 []
d = $0.6 ['$0.6', 'd']
del $0.6 []
$const0.7 = const(int, 13) ['$const0.7']
magic_const = $const0.7 ['$const0.7', 'magic_const']
del $const0.7 []
$0.8 = global(maybe_inline1: CPUDispatcher(<function maybe_inline1 at 0x7f890ccf9ae8>)) ['$0.8']
del $0.8 []
$const0.1.2 = const(int, 300) ['$const0.1.2']
$0.2.3 = $const0.1.2 ['$0.2.3', '$const0.1.2']
del $const0.1.2 []
$0.9 = $0.2.3 ['$0.2.3', '$0.9']
del $0.2.3 []
e = $0.9 ['$0.9', 'e']
del $0.9 []
$0.10 = global(maybe_inline2: CPUDispatcher(<function maybe_inline2 at 0x7f890ccf9b70>)) ['$0.10']
del $0.10 []
$const0.1.4 = const(int, 37) ['$const0.1.4']
$0.2.5 = $const0.1.4 ['$0.2.5', '$const0.1.4']
del $const0.1.4 []
$0.11 = $0.2.5 ['$0.11', '$0.2.5']
del $0.2.5 []
c = $0.11 ['$0.11', 'c']
del $0.11 []
$0.14 = a + b ['$0.14', 'a', 'b']
del b []
del a []
$0.16 = $0.14 + c ['$0.14', '$0.16', 'c']
del c []
del $0.14 []
$0.18 = $0.16 + d ['$0.16', '$0.18', 'd']
del d []
del $0.16 []
$0.20 = $0.18 + e ['$0.18', '$0.20', 'e']
del e []
del $0.18 []
$0.22 = $0.20 + magic_const ['$0.20', '$0.22', 'magic_const']
del magic_const []
del $0.20 []
$0.23 = cast(value=$0.22) ['$0.22', '$0.23']
del $0.22 []
return $0.23 ['$0.23']
Things to note in the above:
1. The call to the function ``never_inline`` remains as a call.
2. The ``always_inline`` function has been inlined, note its
``const(int, 200)`` in the caller body.
3. There is a call to ``maybe_inline1`` before the ``const(int, 13)``
declaration, the cost model prevented this from being inlined.
4. After the ``const(int, 13)`` the subsequent call to ``maybe_inline1`` has
been inlined as shown by the ``const(int, 300)`` in the caller body.
5. The function ``maybe_inline2`` has been inlined as demonstrated by
``const(int, 37)`` in the caller body.
6. That dead code elimination has not been performed and as a result there are
superfluous statements present in the IR.
Example using :func:`numba.extending.overload`
==============================================
An example of using inlining with the :func:`numba.extending.overload`
decorator. It is most interesting to note that if a function is supplied as the
argument to ``inline``, a lot more information is available via the supplied
function arguments for use in decision making. Also note that different
``@overload`` s can have different inlining behaviours, and that there are
multiple ways to achieve this:
.. literalinclude:: inline_overload_example.py
which produces the following when executed (with a print of the IR after the
legalization pass, enabled via the environment variable
``NUMBA_DEBUG_PRINT_AFTER="ir_legalization"``):
.. code-block:: none
:emphasize-lines: 2, 3, 4, 5, 6, 15, 16, 17, 18, 19, 20, 21, 22, 28, 29, 30
label 0:
$const0.2 = const(tuple, (1, 2, 3)) ['$const0.2']
x.0 = $const0.2 ['$const0.2', 'x.0']
del $const0.2 []
$const0.2.2 = const(int, 0) ['$const0.2.2']
$0.3.3 = getitem(value=x.0, index=$const0.2.2) ['$0.3.3', '$const0.2.2', 'x.0']
del x.0 []
del $const0.2.2 []
$0.4.4 = $0.3.3 ['$0.3.3', '$0.4.4']
del $0.3.3 []
$0.3 = $0.4.4 ['$0.3', '$0.4.4']
del $0.4.4 []
a = $0.3 ['$0.3', 'a']
del $0.3 []
$const0.5 = const(int, 100) ['$const0.5']
x.5 = $const0.5 ['$const0.5', 'x.5']
del $const0.5 []
$const0.2.7 = const(int, 1) ['$const0.2.7']
$0.3.8 = x.5 + $const0.2.7 ['$0.3.8', '$const0.2.7', 'x.5']
del x.5 []
del $const0.2.7 []
$0.4.9 = $0.3.8 ['$0.3.8', '$0.4.9']
del $0.3.8 []
$0.6 = $0.4.9 ['$0.4.9', '$0.6']
del $0.4.9 []
b = $0.6 ['$0.6', 'b']
del $0.6 []
$0.7 = global(bar: <function bar at 0x7f6c3710d268>) ['$0.7']
$const0.8 = const(complex, 300j) ['$const0.8']
$0.9 = call $0.7($const0.8, func=$0.7, args=[Var($const0.8, inline_overload_example.py (56))], kws=(), vararg=None) ['$0.7', '$0.9', '$const0.8']
del $const0.8 []
del $0.7 []
c = $0.9 ['$0.9', 'c']
del $0.9 []
$0.12 = a + b ['$0.12', 'a', 'b']
del b []
del a []
$0.14 = $0.12 + c ['$0.12', '$0.14', 'c']
del c []
del $0.12 []
$0.15 = cast(value=$0.14) ['$0.14', '$0.15']
del $0.14 []
return $0.15 ['$0.15']
Things to note in the above:
1. The first highlighted section is the always inlined overload for the
``UniTuple`` argument type.
2. The second highlighted section is the overload for the ``Number`` argument
type that has been inlined as the cost model function decided to do so as the
argument was an ``Integer`` type instance.
3. The third highlighted section is the overload for the ``Number`` argument
type that has not been inlined because the cost model function rejected it, as
the argument was a ``Complex`` type instance.
4. That dead code elimination has not been performed and as a result there are
superfluous statements present in the IR.
Using a function to limit the inlining depth of a recursive function
====================================================================
When using recursive inlines, you can terminate the compilation by using
a cost model.
.. code:: python

   from numba import njit
   import numpy as np

   class CostModel(object):
       def __init__(self, max_inlines):
           self._count = 0
           self._max_inlines = max_inlines

       def __call__(self, expr, caller, callee):
           ret = self._count < self._max_inlines
           self._count += 1
           return ret

   @njit(inline=CostModel(3))
   def factorial(n):
       if n <= 0:
           return 1
       return n * factorial(n - 1)

   factorial(5)

Listings
========
This shows listings from compiler internal registries (e.g. lowering
definitions). The information is provided as developer reference.
When possible, links to source code are provided via github links.
New style listings
------------------
The following listings are generated from ``numba.help.inspector.write_listings()``. Users can run ``python -m numba.help.inspector --format=rst <package>`` to recreate the documentation.
.. toctree::
   :maxdepth: 2

   autogen_builtins_listing.rst
   autogen_math_listing.rst
   autogen_cmath_listing.rst
   autogen_numpy_listing.rst
Old style listings
------------------
.. toctree::
   :maxdepth: 2

   autogen_lower_listing.rst
.. _developer-literally:
======================
Notes on Literal Types
======================
.. note:: This document describes an advanced feature designed to overcome
some limitations of the compilation mechanism relating to types.
Some features need to specialize based on the literal value during
compilation to produce type stable code necessary for successful compilation in
Numba. This can be achieved by propagating the literal value through the type
system. Numba recognizes inline literal values as :class:`numba.types.Literal`.
For example::

    def foo(x):
        a = 123
        return bar(x, a)

Numba will infer the type of ``a`` as ``Literal[int](123)``. The definition of
``bar()`` can subsequently specialize its implementation knowing that the
second argument is an ``int`` with the value ``123``.
``Literal`` Type
----------------
Classes and methods related to the ``Literal`` type.
.. autoclass:: numba.types.Literal
.. autofunction:: numba.types.literal
.. autofunction:: numba.types.unliteral
.. autofunction:: numba.types.maybe_literal
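
For instance, the helpers above behave roughly as follows (a sketch; the exact
``repr`` of the resulting types may differ)::

    from numba import types

    lit = types.literal(123)      # an IntegerLiteral, i.e. Literal[int](123)
    plain = types.unliteral(lit)  # the underlying non-literal integer type
    types.maybe_literal(4.5)      # None: no literal type exists for floats
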
Specifying for Literal Typing
-----------------------------
To specify a value as a ``Literal`` type in code scheduled for JIT compilation,
use the following function:
.. autofunction:: numba.literally
Code Example
~~~~~~~~~~~~
.. literalinclude:: ../../../numba/tests/doc_examples/test_literally_usage.py
   :language: python
   :caption: from ``test_literally_usage`` of ``numba/tests/doc_examples/test_literally_usage.py``
   :start-after: magictoken.ex_literally_usage.begin
   :end-before: magictoken.ex_literally_usage.end
   :dedent: 4
   :linenos:
Internal Details
~~~~~~~~~~~~~~~~
Internally, the compiler raises a ``ForceLiteralArgs`` exception to signal
the dispatcher to wrap specified arguments using the ``Literal`` type.
.. autoclass:: numba.errors.ForceLiteralArg
   :members: __init__, combine, __or__
Inside Extensions
-----------------
``@overload`` extensions can use ``literally`` inside the implementation body
like in normal jit-code.
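
A sketch of this pattern, following the usage example referenced above
(``power`` is a stub invented for illustration)::

    import numba
    from numba import types
    from numba.extending import overload

    def power(x, n):
        raise NotImplementedError("pure-Python stub")

    @overload(power)
    def ol_power(x, n):
        if isinstance(n, types.Literal):
            exponent = n.literal_value
            def impl(x, n):
                return x ** exponent
            return impl
        else:
            # ``n`` is not a literal yet: re-dispatch with it forced to one.
            def impl(x, n):
                return power(x, numba.literally(n))
            return impl

    @numba.njit
    def call_power(x, n):
        return power(x, n)

    call_power(2.0, 3)  # 8.0, compiled with ``n`` typed as Literal[int](3)
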
Explicit handling of literal requirements is possible through use of the
following:
.. autoclass:: numba.extending.SentryLiteralArgs
   :members:
.. autoclass:: numba.extending.BoundLiteralArgs
   :members:
.. autofunction:: numba.extending.sentry_literal_args
.. _live variable analysis:
======================
Live Variable Analysis
======================
(Related issue https://github.com/numba/numba/pull/1611)
Numba uses reference-counting for garbage collection, a technique that
requires cooperation by the compiler. The Numba IR encodes the location
where a decref must be inserted. These locations are determined by live
variable analysis. The corresponding source code is the ``_insert_var_dels()``
method in https://github.com/numba/numba/blob/main/numba/interpreter.py.
In Python semantics, once a variable is defined inside a function, it is alive
until the variable is explicitly deleted or the function scope ends.
However, Numba analyzes the code to determine the minimum bound of the lifetime
of each variable by its definition and usages during compilation.
As soon as a variable is unreachable, a ``del`` instruction is inserted at the
closest basic-block (either at the start of the next block(s) or at the
end of the current block). This means variables can be released earlier than in
regular Python code.
The behavior of the live variable analysis affects memory usage of the compiled
code. Internally, Numba does not differentiate temporary variables and user
variables. Since each operation generates at least one temporary variable,
a function can accumulate a high number of temporary variables if they are not
released as soon as possible.
Our generator implementation can benefit from early releasing of variables,
which reduces the size of the state to suspend at each yield point.
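
For example (a sketch; the comments describe the intended effect of the
analysis rather than a guaranteed placement of the ``del``)::

    import numpy as np
    from numba import njit

    @njit
    def mean_of_squares(n):
        tmp = np.arange(n) ** 2   # a potentially large temporary array
        s = tmp.sum()             # last use of ``tmp``: a ``del`` can be
                                  # inserted here, releasing the array well
                                  # before the function returns
        return s / n
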
Notes on behavior of the live variable analysis
================================================
Variable deleted before definition
-----------------------------------
(Related issue: https://github.com/numba/numba/pull/1738)
When a variable lifetime is confined within the loop body (its definition and
usage does not escape the loop body), like:
.. code-block:: python

   def f(arr):
       # BB 0
       res = 0
       # BB 1
       for i in (0, 1):
           # BB 2
           t = arr[i]
           if t[i] > 1:
               # BB 3
               res += t[i]
       # BB 4
       return res

Variable ``t`` is never referenced outside of the loop.
A ``del`` instruction is emitted for ``t`` at the head of the loop (BB 1),
before the variable is defined. The reason becomes obvious once we know the control
flow graph::

             +------------------------------> BB4
             |
             |
    BB 0 --> BB 1 --> BB 2 ---> BB 3
              ^        |         |
              |        V         V
              +------------------+

Variable ``t`` is defined in BB 2. In BB 2, the evaluation of
``t[i] > 1`` uses ``t``, which is the last use if execution takes the false
branch and goes back to BB 1. In BB 3, ``t`` is only used in ``res += t[i]``, which is
the last use if execution takes the true branch. Because BB 3, an outgoing
branch of BB 2, uses ``t``, ``t`` must be deleted at the common predecessor.
The closest point is BB 1, which does not have ``t`` defined from the incoming
edge of BB 0.
Alternatively, if ``t`` is deleted at BB 4, we will still have to delete the
variable before its definition because BB4 can be executed without executing
the loop body (BB 2 and BB 3), where the variable is defined.
.. _developer-llvm-timings:
====================
Notes on timing LLVM
====================
Getting LLVM Pass Timings
-------------------------
The dispatcher stores LLVM pass timings in the dispatcher object metadata under
the ``llvm_pass_timings`` key when :envvar:`NUMBA_LLVM_PASS_TIMINGS` is
enabled or ``numba.config.LLVM_PASS_TIMINGS`` is set to truthy.
The timings information contains details on how much time
has been spent in each pass. The pass timings are also grouped by their purpose.
For example, there will be pass timings for function-level pre-optimizations,
module-level optimizations, and object code generation.
Code Example
~~~~~~~~~~~~
.. literalinclude:: ../../../numba/tests/doc_examples/test_llvm_pass_timings.py
   :language: python
   :caption: from ``test_pass_timings`` of ``numba/tests/doc_examples/test_llvm_pass_timings.py``
   :start-after: magictoken.ex_llvm_pass_timings.begin
   :end-before: magictoken.ex_llvm_pass_timings.end
   :dedent: 16
   :linenos:
Example output:
.. code-block:: text
Printing pass timings for JITCodeLibrary('DocsLLVMPassTimings.test_pass_timings.<locals>.foo')
Total time: 0.0376
== #0 Function passes on '_ZN5numba5tests12doc_examples22test_llvm_pass_timings19DocsLLVMPassTimings17test_pass_timings12$3clocals$3e7foo$241Ex'
Percent: 4.8%
Total 0.0018s
Top timings:
0.0015s ( 81.6%) SROA #3
0.0002s ( 9.3%) Early CSE #2
0.0001s ( 4.0%) Simplify the CFG #9
0.0000s ( 1.5%) Prune NRT refops #4
0.0000s ( 1.1%) Post-Dominator Tree Construction #5
== #1 Function passes on '_ZN7cpython5numba5tests12doc_examples22test_llvm_pass_timings19DocsLLVMPassTimings17test_pass_timings12$3clocals$3e7foo$241Ex'
Percent: 0.8%
Total 0.0003s
Top timings:
0.0001s ( 30.4%) Simplify the CFG #10
0.0001s ( 24.1%) Early CSE #3
0.0001s ( 17.8%) SROA #4
0.0000s ( 8.8%) Prune NRT refops #5
0.0000s ( 5.6%) Post-Dominator Tree Construction #6
== #2 Function passes on 'cfunc._ZN5numba5tests12doc_examples22test_llvm_pass_timings19DocsLLVMPassTimings17test_pass_timings12$3clocals$3e7foo$241Ex'
Percent: 0.5%
Total 0.0002s
Top timings:
0.0001s ( 27.7%) Early CSE #4
0.0001s ( 26.8%) Simplify the CFG #11
0.0000s ( 13.8%) Prune NRT refops #6
0.0000s ( 7.4%) Post-Dominator Tree Construction #7
0.0000s ( 6.7%) Dominator Tree Construction #29
== #3 Module passes (cheap optimization for refprune)
Percent: 3.7%
Total 0.0014s
Top timings:
0.0007s ( 52.0%) Combine redundant instructions
0.0001s ( 5.4%) Function Integration/Inlining
0.0001s ( 4.9%) Prune NRT refops #2
0.0001s ( 4.8%) Natural Loop Information
0.0001s ( 4.6%) Post-Dominator Tree Construction #2
== #4 Module passes (full optimization)
Percent: 43.9%
Total 0.0165s
Top timings:
0.0032s ( 19.5%) Combine redundant instructions #9
0.0022s ( 13.5%) Combine redundant instructions #7
0.0010s ( 6.1%) Induction Variable Simplification
0.0008s ( 4.8%) Unroll loops #2
0.0007s ( 4.5%) Loop Vectorization
== #5 Finalize object
Percent: 46.3%
Total 0.0174s
Top timings:
0.0060s ( 34.6%) X86 DAG->DAG Instruction Selection #2
0.0019s ( 11.0%) Greedy Register Allocator #2
0.0013s ( 7.4%) Machine Instruction Scheduler #2
0.0012s ( 7.1%) Loop Strength Reduction
0.0004s ( 2.3%) Induction Variable Users
API for custom analysis
~~~~~~~~~~~~~~~~~~~~~~~
It is possible to get more details than the summary text in the above example.
The pass timings are stored in a
:class:`numba.misc.llvm_pass_timings.PassTimingsCollection`, which contains
methods for accessing individual records for each pass.
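
A minimal sketch of programmatic access (it assumes timings were enabled as
described above; ``get_metadata`` is the dispatcher method used in the code
example)::

    import numba
    from numba import njit

    numba.config.LLVM_PASS_TIMINGS = True   # or set NUMBA_LLVM_PASS_TIMINGS=1

    @njit
    def foo(n):
        acc = 0
        for i in range(n):
            acc += i
        return acc

    foo(10)

    timings = foo.get_metadata(foo.signatures[0])['llvm_pass_timings']
    print(len(timings))               # number of recorded pass groups
    print(timings.get_total_time())   # total time in seconds
    print(timings.summary())          # text like the example output above
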
.. autoclass:: numba.misc.llvm_pass_timings.PassTimingsCollection
   :members: get_total_time, list_longest_first, summary, __getitem__, __len__
.. autoclass:: numba.misc.llvm_pass_timings.ProcessedPassTimings
   :members: get_raw_data, get_total_time, list_records, list_top, summary
.. autoclass:: numba.misc.llvm_pass_timings.PassTimingRecord
Numba Mission Statement
=======================
Introduction
------------
This document is the mission statement for the Numba project. It exists to
provide a clear description of the purposes and goals of the project. As such,
this document provides background on Numba's users and use-cases, and outlines
the project's overall goals.
This is a living document:
=========================== =============
The first revision date is: May 2022
The last updated date is:   May 2022
The next review date is:    November 2022
=========================== =============
Background
----------
The Numba project provides tools to improve the performance of Python software.
It comprises numerous facilities including just-in-time (JIT) compilation,
extension points for library authors, and a compiler toolkit on which new
computational acceleration technologies can be explored and built.
The range of use-cases and applications that can be targeted by Numba includes,
but is not limited to:
* Scientific Computing
* Computationally intensive tasks
* Numerically oriented applications
* Data science utilities and programs
The user base of Numba includes anyone needing to perform intensive
computational work, including users from a wide range of disciplines, examples
include:
* The most common use case, a user wanting to JIT compile some numerical
functions.
* Users providing JIT accelerated libraries for domain specific use cases e.g.
scientific researchers.
* Users providing JIT accelerated libraries for use as part of the numerical
Python ecosystem.
* Those writing more advanced JIT accelerated libraries containing their own
domain specific data types etc.
* Compiler engineers who explore new compiler use-cases and/or need a custom
compiler.
* Hardware vendors looking to extend Numba to provide Python support for their
custom silicon or new hardware.
Project Goals
-------------
The primary aims of the Numba project are:
* To make it easier for Python users to write high performance code.
* To have a core package with a well defined and pragmatically selected feature
scope that meets the needs of the user base without being overly complex.
* To provide a compiler toolkit for Python that is extensible and can be
customized to meet the needs of the user base. This comes with the expectation
that users potentially need to invest time and effort to extend and/or
customize the software themselves.
* To support both the Python core language/standard libraries and NumPy.
* To consistently produce high quality software:
* Feature stability across versions.
* Well established and tested public APIs.
* Clearly documented deprecation cycles.
* Internally stable code base.
* Externally tested release candidates.
* Regular releases with a predictable and published release cycle.
* Maintain suitable infrastructure for both testing and releasing, with as
much in public as feasible.
* To make it as easy as possible for people to contribute.
* To have a maintained public roadmap which will also include areas under active
development.
* To have a governance document in place and working in practice.
* To ensure that Numba receives timely updates for its core dependencies: LLVM,
NumPy and Python.
.. _arch-numba-runtime:
======================
Notes on Numba Runtime
======================
The *Numba Runtime (NRT)* provides the language runtime to the *nopython mode*
Python subset. NRT is a standalone C library with a Python binding. This
allows :term:`NPM` runtime features to be used without the GIL. Currently, the
only language feature implemented in NRT is memory management.
Memory Management
=================
NRT implements memory management for :term:`NPM` code. It uses *atomic
reference counting* for thread-safe, deterministic memory management. NRT maintains
a separate ``MemInfo`` structure for storing information about each allocation.
Cooperating with CPython
------------------------
For NRT to cooperate with CPython, the NRT python binding provides adaptors for
converting python objects that export a memory region. When such an
object is used as an argument to a :term:`NPM` function, a new ``MemInfo`` is
created and it acquires a reference to the Python object. When a :term:`NPM`
value is returned to the Python interpreter, the associated ``MemInfo``
(if any) is checked. If the ``MemInfo`` references a Python object, the
underlying Python object is released and returned instead. Otherwise, the
``MemInfo`` is wrapped in a Python object and returned. Additional processing
may be required depending on the type.
The current implementation supports NumPy arrays and any buffer-exporting types.
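
For example (a sketch; the assertion reflects the behaviour described above)::

    import numpy as np
    from numba import njit

    @njit
    def passthrough(a):
        return a

    arr = np.arange(3)
    out = passthrough(arr)
    # The MemInfo created for ``arr`` references the original Python object,
    # so boxing the return value hands back that same object.
    assert out is arr
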
Compiler-side Cooperation
-------------------------
NRT reference counting requires the compiler to emit incref/decref operations
according to the usage. When the reference count drops to zero, the compiler
must call the destructor routine in NRT.
.. _nrt-refct-opt-pass:
Optimizations
-------------
The compiler is allowed to emit incref/decref operations naively. It relies
on an optimization pass to remove redundant reference count operations.
A new optimization pass is implemented in version 0.52.0 to remove reference
count operations that fall into the following four categories of control-flow
structure---per basic-block, diamond, fanout, fanout+raise. See the documentation
for :envvar:`NUMBA_LLVM_REFPRUNE_FLAGS` for their descriptions.
The old optimization pass runs at block level to avoid control flow analysis.
It depends on LLVM function optimization pass to simplify the control flow,
stack-to-register, and simplify instructions. It works by matching and
removing incref and decref pairs within each block. The old pass can be
enabled by setting :envvar:`NUMBA_LLVM_REFPRUNE_PASS` to `0`.
Important assumptions
---------------------
Both the old (pre-0.52.0) and the new (post-0.52.0) optimization passes assume
that the only function that can consume a reference is ``NRT_decref``.
It is important that there are no other functions that will consume references.
Since the passes operate on LLVM IR, the "functions" here are referring to any
callee in a LLVM call instruction.
To summarize, all functions exposed to the refcount optimization pass
**must not** consume counted references unless done so via ``NRT_decref``.
Quirks of the old optimization pass
-----------------------------------
Since the pre-0.52.0 `refcount optimization pass <nrt-refct-opt-pass_>`_
requires the LLVM function optimization pass, the pass works on the LLVM IR as
text. The optimized IR is then materialized again as a new LLVM in-memory
bitcode object.
Debugging Leaks
---------------
To debug reference leaks in NRT MemInfo, each MemInfo python object has a
``.refcount`` attribute for inspection. To get the MemInfo from a ndarray
allocated by NRT, use the ``.base`` attribute.
To debug memory leaks in NRT, the ``numba.core.runtime.rtsys`` defines
``.get_allocation_stats()``. It returns a namedtuple containing the
number of allocations and deallocations since the start of the program.
Checking that the allocation and deallocation counters are matching is the
simplest way to know if the NRT is leaking.
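
A minimal sketch (whether statistics collection needs to be explicitly enabled,
and the exact fields of the returned stats, may vary between Numba versions)::

    import numpy as np
    from numba import njit
    from numba.core.runtime import rtsys

    @njit
    def make(n):
        return np.ones(n)

    arr = make(10)
    print(arr.base.refcount)        # MemInfo backing the NRT allocation

    stats = rtsys.get_allocation_stats()
    print(stats)                    # allocation/deallocation counters;
                                    # matching counts indicate no leak
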
Debugging Leaks in C
--------------------
The start of `numba/core/runtime/nrt.h
<https://github.com/numba/numba/blob/main/numba/core/runtime/nrt.h>`_
has these lines:
.. code-block:: C

   /* Debugging facilities - enabled at compile-time */
   /* #undef NDEBUG */
   #if 0
   #   define NRT_Debug(X) X
   #else
   #   define NRT_Debug(X) if (0) { X; }
   #endif

Undefining NDEBUG (uncomment the ``#undef NDEBUG`` line) enables the assertion
check in NRT.
Enabling the NRT_Debug (replace ``#if 0`` with ``#if 1``) turns on
debug print inside NRT.
Recursion Support
=================
During the compilation of a pair of mutually recursive functions, one of the
functions will contain unresolved symbol references since the compiler handles
one function at a time. The memory for the unresolved symbols is allocated and
initialized to the address of the *unresolved symbol abort* function
(``nrt_unresolved_abort``) just before the machine code is
generated by LLVM. These symbols are tracked and resolved as new functions are
compiled. If a bug prevents the resolution of these symbols,
the abort function will be called, raising a ``RuntimeError`` exception.
The *unresolved symbol abort* function is defined in the NRT with a zero-argument
signature. The caller is safe to call it with an arbitrary number of
arguments. Therefore, it is safe to use in place of the intended callee.
Using the NRT from C code
=========================
Externally compiled C code should use the ``NRT_api_functions`` struct as a
function table to access the NRT API. The struct is defined in
:ghfile:`numba/core/runtime/nrt_external.h`. Users can use the utility function
``numba.extending.include_path()`` to determine the include directory for
Numba provided C headers.
.. literalinclude:: ../../../numba/core/runtime/nrt_external.h
   :language: C
   :caption: `numba/core/runtime/nrt_external.h`
Inside Numba compiled code, the ``numba.core.unsafe.nrt.NRT_get_api()``
intrinsic can be used to obtain a pointer to the ``NRT_api_functions``.
Here is an example that uses the ``nrt_external.h``:
.. code-block:: C

   #include <stdio.h>
   #include <stdlib.h>  /* for malloc and free */
   #include "numba/core/runtime/nrt_external.h"

   void my_dtor(void *ptr) {
       free(ptr);
   }

   NRT_MemInfo* my_allocate(NRT_api_functions *nrt) {
       /* heap allocate some memory */
       void * data = malloc(10);
       /* wrap the allocated memory; yield a new reference */
       NRT_MemInfo *mi = nrt->manage_memory(data, my_dtor);
       /* acquire reference */
       nrt->acquire(mi);
       /* release reference */
       nrt->release(mi);
       return mi;
   }

It is important to ensure that the NRT is initialized prior to making calls to
it; calling ``numba.core.runtime.nrt.rtsys.initialize(context)`` from Python
will have the desired effect. Similarly, the code snippet:
.. code-block:: Python

   from numba.core.registry import cpu_target # Get the CPU target singleton
   cpu_target.target_context # Access the target_context property to initialize

will achieve the same specifically for Numba's CPU target (the default). Failure
to initialize the NRT will result in access violations as function pointers for
various internal atomic operations will be missing in the ``NRT_MemSys`` struct.
Future Plan
===========
The plan for NRT is to make a standalone shared library that can be linked to
Numba compiled code, including use within the Python interpreter and without
the Python interpreter. To make that work, we will be doing some refactoring:
* numba :term:`NPM` code references statically compiled code in "helperlib.c".
Those functions should be moved to NRT.
Numba Release Process
=====================
The goal of the Numba release process -- from a high level perspective -- is to
publish source and binary artifacts that correspond to a given version
number. This usually involves a sequence of individual tasks that must be
performed in the correct order and with diligence. Numba and llvmlite are
commonly released in lockstep since there is usually a one-to-one mapping
between a Numba version and a corresponding llvmlite version.
This section contains various notes and templates that can be used to create a
Numba release checklist on the Numba Github issue tracker. This is an aid for
the maintainers during the release process and helps to ensure that all tasks
are completed in the correct order and that no tasks are accidentally omitted.
If new or additional items do appear during release, please do remember to add
them to the checklist templates. Also note that the release process itself is
always a work in progress. This means that some of the information here may be
outdated. If you notice this please do remember to submit a pull-request to
update this document.
All release checklists are available as GitHub issue templates. To create a new
release checklist simply open a new issue and select the correct template.
Primary Release Candidate Checklist
-----------------------------------
This is for the first/primary release candidate of a minor release, i.e. the
first release of every series. It is special, because during this release, the
release branch will have to be created. Release candidate indexing begins at 1.
.. literalinclude:: ../../../.github/ISSUE_TEMPLATE/first_rc_checklist.md
   :language: md
   :lines: 9-
`Open a primary release checklist <https://github.com/numba/numba/issues/new?template=first_rc_checklist.md>`_.
Subsequent Release Candidates, Final Releases and Patch Releases
----------------------------------------------------------------
Releases subsequent to the first release in a series usually involve a series
of cherry-picks; the recipe is therefore slightly different.
.. literalinclude:: ../../../.github/ISSUE_TEMPLATE/sub_rc_checklist.md
   :language: md
   :lines: 9-
`Open a subsequent release checklist <https://github.com/numba/numba/issues/new?template=sub_rc_checklist.md>`_.
A Map of the Numba Repository
=============================
The Numba repository is quite large, and due to age has functionality spread
around many locations. To help orient developers, this document will try to
summarize where different categories of functionality can be found.
Support Files
-------------
Build and Packaging
'''''''''''''''''''
- :ghfile:`setup.py` - Standard Python distutils/setuptools script
- :ghfile:`MANIFEST.in` - Distutils packaging instructions
- :ghfile:`requirements.txt` - Pip package requirements, not used by conda
- :ghfile:`versioneer.py` - Handles automatic setting of version in
installed package from git tags
- :ghfile:`.flake8` - Preferences for code formatting. Files should be
fixed and removed from the exception list as time allows.
- :ghfile:`.pre-commit-config.yaml` - Configuration file for pre-commit hooks.
- :ghfile:`.readthedocs.yml` - Configuration file for Read the Docs.
- :ghfile:`buildscripts/condarecipe.local` - Conda build recipe
Continuous Integration
''''''''''''''''''''''
- :ghfile:`azure-pipelines.yml` - Azure Pipelines CI config (active:
Win/Mac/Linux)
- :ghfile:`buildscripts/azure/` - Azure Pipeline configuration for specific
platforms
- :ghfile:`buildscripts/incremental/` - Generic scripts for building Numba
on various CI systems
- :ghfile:`codecov.yml` - Codecov.io coverage reporting
Documentation
'''''''''''''
- :ghfile:`LICENSE` - License for Numba
- :ghfile:`LICENSES.third-party` - License for third party code vendored
into Numba
- :ghfile:`README.rst` - README for repo, also uploaded to PyPI
- :ghfile:`CONTRIBUTING.md` - Documentation on how to contribute to project
(out of date, should be updated to point to Sphinx docs)
- :ghfile:`CHANGE_LOG` - History of Numba releases, also directly embedded
into Sphinx documentation
- :ghfile:`docs/` - Documentation source
- :ghfile:`docs/_templates/` - Directory for templates (to override defaults
with Sphinx theme)
- :ghfile:`docs/Makefile` - Used to build Sphinx docs with ``make``
- :ghfile:`docs/source` - ReST source for Numba documentation
- :ghfile:`docs/_static/` - Static CSS and image assets for Numba docs
- :ghfile:`docs/make.bat` - Not used (remove?)
- :ghfile:`docs/requirements.txt` - Pip package requirements for building docs
with Read the Docs.
- :ghfile:`numba/scripts/generate_lower_listing.py` - Dump all registered
implementations decorated with ``@lower*`` for reference
documentation. Currently misses implementations from the higher
level extension API.
Numba Source Code
-----------------
Numba ships with both the source code and tests in one package.
- :ghfile:`numba/` - all of the source code and tests
Public API
''''''''''
These define aspects of the public Numba interface.
- :ghfile:`numba/core/decorators.py` - User-facing decorators for compiling
regular functions on the CPU
- :ghfile:`numba/core/extending.py` - Public decorators for extending Numba
(``overload``, ``intrinsic``, etc)
- :ghfile:`numba/experimental/structref.py` - Public API for defining a mutable struct
- :ghfile:`numba/core/ccallback.py` - ``@cfunc`` decorator for compiling
functions to a fixed C signature. Used to make callbacks.
- :ghfile:`numba/np/ufunc/decorators.py` - ufunc/gufunc compilation
decorators
- :ghfile:`numba/core/config.py` - Numba global config options and environment
variable handling
- :ghfile:`numba/core/annotations` - Gathering and printing type annotations of
Numba IR
- :ghfile:`numba/core/annotations/pretty_annotate.py` - Code highlighting of
Numba functions and types (both ANSI terminal and HTML)
- :ghfile:`numba/core/event.py` - A simple event system for applications to
listen to specific compiler events.
Dispatching
'''''''''''
- :ghfile:`numba/core/dispatcher.py` - Dispatcher objects are compiled functions
produced by ``@jit``. A dispatcher has different implementations
for different type signatures.
- :ghfile:`numba/_dispatcher.cpp` - C++ dispatcher implementation (for speed on
common data types)
- :ghfile:`numba/core/retarget.py` - Support for dispatcher objects to switch
target via a specific with-context.
Compiler Pipeline
'''''''''''''''''
- :ghfile:`numba/core/compiler.py` - Compiler pipelines and flags
- :ghfile:`numba/core/errors.py` - Numba exception and warning classes
- :ghfile:`numba/core/ir.py` - Numba IR data structure objects
- :ghfile:`numba/core/bytecode.py` - Bytecode parsing and function identity (??)
- :ghfile:`numba/core/interpreter.py` - Translate Python interpreter bytecode to
Numba IR
- :ghfile:`numba/core/analysis.py` - Utility functions to analyze Numba IR
(variable lifetime, prune branches, etc)
- :ghfile:`numba/core/controlflow.py` - Control flow analysis of Numba IR and
Python bytecode
- :ghfile:`numba/core/typeinfer.py` - Type inference algorithm
- :ghfile:`numba/core/transforms.py` - Numba IR transformations
- :ghfile:`numba/core/rewrites` - Rewrite passes used by compiler
- :ghfile:`numba/core/rewrites/__init__.py` - Loads all rewrite passes so they
are put into the registry
- :ghfile:`numba/core/rewrites/registry.py` - Registry object for collecting
rewrite passes
- :ghfile:`numba/core/rewrites/ir_print.py` - Write print() calls into special
print nodes in the IR
- :ghfile:`numba/core/rewrites/static_raise.py` - Converts exceptions with
static arguments into a special form that can be lowered
- :ghfile:`numba/core/rewrites/static_getitem.py` - Rewrites getitem and setitem
with constant arguments to allow type inference
- :ghfile:`numba/core/rewrites/static_binop.py` - Rewrites binary operations
(specifically ``**``) with constant arguments so faster code can be
generated
- :ghfile:`numba/core/inline_closurecall.py` - Inlines body of closure functions
to call site. Support for array comprehensions, reduction inlining,
and stencil inlining.
- :ghfile:`numba/core/postproc.py` - Postprocessor for Numba IR that computes
variable lifetime, inserts del operations, and handles generators
- :ghfile:`numba/core/lowering.py` - General implementation of lowering Numba IR
to LLVM
- :ghfile:`numba/core/environment.py` - Runtime environment object
- :ghfile:`numba/core/withcontexts.py` - General scaffolding for implementing
context managers in nopython mode, and the objectmode context
manager
- :ghfile:`numba/core/pylowering.py` - Lowering of Numba IR in object mode
- :ghfile:`numba/core/pythonapi.py` - LLVM IR code generation to interface with
CPython API
- :ghfile:`numba/core/targetconfig.py` - Utils for target configurations such
as compiler flags.
Type Management
'''''''''''''''
- :ghfile:`numba/core/typeconv/` - Implementation of type casting and type
signature matching in both C++ and Python
- :ghfile:`numba/capsulethunk.h` - Used by typeconv
- :ghfile:`numba/core/types/` - definition of the Numba type hierarchy, used
everywhere in compiler to select implementations
- :ghfile:`numba/core/consts.py` - Constant inference (used to make constant
values available during codegen when possible)
- :ghfile:`numba/core/datamodel` - LLVM IR representations of data types in
different contexts
- :ghfile:`numba/core/datamodel/models.py` - Models for most standard types
- :ghfile:`numba/core/datamodel/registry.py` - Decorator to register new data
models
- :ghfile:`numba/core/datamodel/packer.py` - Pack typed values into a data
structure
- :ghfile:`numba/core/datamodel/testing.py` - Data model tests (this should
move??)
- :ghfile:`numba/core/datamodel/manager.py` - Map types to data models
Compiled Extensions
'''''''''''''''''''
Numba uses a small amount of compiled C/C++ code for core
functionality, like dispatching and type matching where performance
matters, and it is more convenient to encapsulate direct interaction
with CPython APIs.
- :ghfile:`numba/_arraystruct.h` - Struct for holding NumPy array
attributes. Used in helperlib and the Numba Runtime.
- :ghfile:`numba/_helperlib.c` - C functions required by Numba compiled code
at runtime. Linked into ahead-of-time compiled modules
- :ghfile:`numba/_helpermod.c` - Python extension module with pointers to
functions from ``_helperlib.c``
- :ghfile:`numba/_dynfuncmod.c` - Python extension module exporting
_dynfunc.c functionality
- :ghfile:`numba/_dynfunc.c` - C level Environment and Closure objects (keep
in sync with numba/target/base.py)
- :ghfile:`numba/mathnames.h` - Macros for defining names of math functions
- :ghfile:`numba/_pymodule.h` - C macros for Python 2/3 portable naming of C
API functions
- :ghfile:`numba/mviewbuf.c` - Handles Python memoryviews
- :ghfile:`numba/_typeof.{h,cpp}` - C++ implementation of type fingerprinting,
used by dispatcher
- :ghfile:`numba/_numba_common.h` - Portable C macro for marking symbols
that can be shared between object files, but not outside the
library.
Misc Support
''''''''''''
- :ghfile:`numba/_version.py` - Updated by versioneer
- :ghfile:`numba/core/runtime` - Language runtime. Currently manages
reference-counted memory allocated on the heap by Numba-compiled
functions
- :ghfile:`numba/core/ir_utils.py` - Utility functions for working with Numba IR
data structures
- :ghfile:`numba/core/cgutils.py` - Utility functions for generating common code
patterns in LLVM IR
- :ghfile:`numba/core/utils.py` - Python 2 backports of Python 3 functionality
(also imports local copy of ``six``)
- :ghfile:`numba/misc/appdirs.py` - Vendored package for determining application
config directories on every platform
- :ghfile:`numba/core/compiler_lock.py` - Global compiler lock because Numba's
usage of LLVM is not thread-safe
- :ghfile:`numba/misc/special.py` - Python stub implementations of special Numba
functions (prange, gdb*)
- :ghfile:`numba/core/itanium_mangler.py` - Python implementation of Itanium C++
name mangling
- :ghfile:`numba/misc/findlib.py` - Helper function for locating shared
libraries on all platforms
- :ghfile:`numba/core/debuginfo.py` - Helper functions to construct LLVM IR
debug info
- :ghfile:`numba/core/unsafe/refcount.py` - Read reference count of object
- :ghfile:`numba/core/unsafe/eh.py` - Exception handling helpers
- :ghfile:`numba/core/unsafe/nrt.py` - Numba runtime (NRT) helpers
- :ghfile:`numba/cpython/unsafe/tuple.py` - Replace a value in a tuple slot
- :ghfile:`numba/np/unsafe/ndarray.py` - NumPy array helpers
- :ghfile:`numba/core/unsafe/bytes.py` - Copying and dereferencing data from
void pointers
- :ghfile:`numba/misc/dummyarray.py` - Used by GPU backends to hold array
information on the host, but not the data.
- :ghfile:`numba/core/callwrapper.py` - Handles argument unboxing and releasing
the GIL when moving from Python to nopython mode
- :ghfile:`numba/np/numpy_support.py` - Helper functions for working with NumPy
and translating Numba types to and from NumPy dtypes.
- :ghfile:`numba/core/tracing.py` - Decorator for tracing Python calls and
emitting log messages
- :ghfile:`numba/core/funcdesc.py` - Classes for describing function metadata
(used in the compiler)
- :ghfile:`numba/core/sigutils.py` - Helper functions for parsing and
normalizing Numba type signatures
- :ghfile:`numba/core/serialize.py` - Support for pickling compiled functions
- :ghfile:`numba/core/caching.py` - Disk cache for compiled functions
- :ghfile:`numba/np/npdatetime.py` - Helper functions for implementing NumPy
datetime64 support
- :ghfile:`numba/misc/llvm_pass_timings.py` - Helper to record timings of
LLVM passes.
- :ghfile:`numba/cloudpickle` - Vendored cloudpickle subpackage
Core Python Data Types
''''''''''''''''''''''
- :ghfile:`numba/_hashtable.{h,cpp}` - Adaptation of the Python 3.7 hash table
implementation
- :ghfile:`numba/cext/dictobject.{h,c}` - C level implementation of typed
dictionary
- :ghfile:`numba/typed/dictobject.py` - Nopython mode wrapper for typed
dictionary
- :ghfile:`numba/cext/listobject.{h,c}` - C level implementation of typed list
- :ghfile:`numba/typed/listobject.py` - Nopython mode wrapper for typed list
- :ghfile:`numba/typed/typedobjectutils.py` - Common utilities for typed
dictionary and list
- :ghfile:`numba/cpython/unicode.py` - Unicode strings (Python 3.5 and later)
- :ghfile:`numba/typed` - Python interfaces to statically typed containers
- :ghfile:`numba/typed/typeddict.py` - Python interface to typed dictionary
- :ghfile:`numba/typed/typedlist.py` - Python interface to typed list
- :ghfile:`numba/experimental/jitclass` - Implementation of experimental JIT
compilation of Python classes
- :ghfile:`numba/core/generators.py` - Support for lowering Python generators
Math
''''
- :ghfile:`numba/_random.c` - Reimplementation of NumPy / CPython random
number generator
- :ghfile:`numba/_lapack.c` - Wrappers for calling BLAS and LAPACK functions
(requires SciPy)
ParallelAccelerator
'''''''''''''''''''
Code transformation passes that extract parallelizable code from
a function and convert it into multithreaded gufunc calls.
- :ghfile:`numba/parfors/parfor.py` - General ParallelAccelerator
- :ghfile:`numba/parfors/parfor_lowering.py` - gufunc lowering for
ParallelAccelerator
- :ghfile:`numba/parfors/array_analysis.py` - Array analysis passes used in
ParallelAccelerator
Stencil
'''''''
Implementation of ``@stencil``:
- :ghfile:`numba/stencils/stencil.py` - Stencil function decorator (implemented
without ParallelAccelerator)
- :ghfile:`numba/stencils/stencilparfor.py` - ParallelAccelerator implementation
of stencil
Debugging Support
'''''''''''''''''
- :ghfile:`numba/misc/gdb_hook.py` - Hooks to jump into GDB from nopython
mode
- :ghfile:`numba/misc/cmdlang.gdb` - Commands to setup GDB for setting
explicit breakpoints from Python
Type Signatures (CPU)
'''''''''''''''''''''
Some (usually older) Numba-supported functionality separates the
declaration of allowed type signatures from the definition of
implementations. This package contains registries of type signatures
that must be matched during type inference.
- :ghfile:`numba/core/typing` - Type signature module
- :ghfile:`numba/core/typing/templates.py` - Base classes for type signature
templates
- :ghfile:`numba/core/typing/cmathdecl.py` - Python complex math (``cmath``)
module
- :ghfile:`numba/core/typing/bufproto.py` - Interpreting objects supporting the
buffer protocol
- :ghfile:`numba/core/typing/mathdecl.py` - Python ``math`` module
- :ghfile:`numba/core/typing/listdecl.py` - Python lists
- :ghfile:`numba/core/typing/builtins.py` - Python builtin global functions and
operators
- :ghfile:`numba/core/typing/setdecl.py` - Python sets
- :ghfile:`numba/core/typing/npydecl.py` - NumPy ndarray (and operators), NumPy
functions
- :ghfile:`numba/core/typing/arraydecl.py` - Python ``array`` module
- :ghfile:`numba/core/typing/context.py` - Implementation of typing context
(class that collects methods used in type inference)
- :ghfile:`numba/core/typing/collections.py` - Generic container operations and
namedtuples
- :ghfile:`numba/core/typing/ctypes_utils.py` - Typing ctypes-wrapped function
pointers
- :ghfile:`numba/core/typing/enumdecl.py` - Enum types
- :ghfile:`numba/core/typing/cffi_utils.py` - Typing of CFFI objects
- :ghfile:`numba/core/typing/typeof.py` - Implementation of typeof operations
(maps Python object to Numba type)
- :ghfile:`numba/core/typing/asnumbatype.py` - Implementation of
``as_numba_type`` operations (maps Python types to Numba type)
- :ghfile:`numba/core/typing/npdatetime.py` - Datetime dtype support for NumPy
arrays
Target Implementations (CPU)
''''''''''''''''''''''''''''
Implementations of Python / NumPy functions and some data models.
These modules are responsible for generating LLVM IR during lowering.
Note that some of these modules do not have counterparts in the typing
package because newer Numba extension APIs (like overload) allow
typing and implementation to be specified together.
- :ghfile:`numba/core/cpu.py` - Context for code gen on CPU
- :ghfile:`numba/core/base.py` - Base class for all target contexts
- :ghfile:`numba/core/codegen.py` - Driver for code generation
- :ghfile:`numba/core/boxing.py` - Boxing and unboxing for most data
types
- :ghfile:`numba/core/intrinsics.py` - Utilities for converting LLVM
intrinsics to other math calls
- :ghfile:`numba/core/callconv.py` - Implements different calling
conventions for Numba-compiled functions
- :ghfile:`numba/core/options.py` - Container for options that control
lowering
- :ghfile:`numba/core/optional.py` - Special type representing value or
``None``
- :ghfile:`numba/core/registry.py` - Registry object for collecting
implementations for a specific target
- :ghfile:`numba/core/imputils.py` - Helper functions for lowering
- :ghfile:`numba/core/externals.py` - Registers external C functions
needed to link generated code
- :ghfile:`numba/core/fastmathpass.py` - Rewrite pass to add fastmath
attributes to function call sites and binary operations
- :ghfile:`numba/core/removerefctpass.py` - Rewrite pass to remove
unnecessary incref/decref pairs
- :ghfile:`numba/core/descriptors.py` - empty base class for all target
descriptors (is this needed?)
- :ghfile:`numba/cpython/builtins.py` - Python builtin functions and
operators
- :ghfile:`numba/cpython/cmathimpl.py` - Python complex math module
- :ghfile:`numba/cpython/enumimpl.py` - Enum objects
- :ghfile:`numba/cpython/hashing.py` - Hashing algorithms
- :ghfile:`numba/cpython/heapq.py` - Python ``heapq`` module
- :ghfile:`numba/cpython/iterators.py` - Iterable data types and iterators
- :ghfile:`numba/cpython/listobj.py` - Python lists
- :ghfile:`numba/cpython/mathimpl.py` - Python ``math`` module
- :ghfile:`numba/cpython/numbers.py` - Numeric values (int, float, etc)
- :ghfile:`numba/cpython/printimpl.py` - Print function
- :ghfile:`numba/cpython/randomimpl.py` - Python and NumPy ``random``
modules
- :ghfile:`numba/cpython/rangeobj.py` - Python ``range`` objects
- :ghfile:`numba/cpython/slicing.py` - Slice objects, and index calculations
used in slicing
- :ghfile:`numba/cpython/setobj.py` - Python set type
- :ghfile:`numba/cpython/tupleobj.py` - Tuples (statically typed as
immutable struct)
- :ghfile:`numba/misc/cffiimpl.py` - CFFI functions
- :ghfile:`numba/misc/quicksort.py` - Quicksort implementation used with
list and array objects
- :ghfile:`numba/misc/mergesort.py` - Mergesort implementation used with
array objects
- :ghfile:`numba/np/arraymath.py` - Math operations on arrays (both
Python and NumPy)
- :ghfile:`numba/np/arrayobj.py` - Array operations (both NumPy and
buffer protocol)
- :ghfile:`numba/np/linalg.py` - NumPy linear algebra operations
- :ghfile:`numba/np/npdatetime.py` - NumPy datetime operations
- :ghfile:`numba/np/npyfuncs.py` - Kernels used in generating some
NumPy ufuncs
- :ghfile:`numba/np/npyimpl.py` - Implementations of most NumPy ufuncs
- :ghfile:`numba/np/polynomial.py` - ``numpy.roots`` function
- :ghfile:`numba/np/ufunc_db.py` - Big table mapping types to ufunc
implementations
Ufunc Compiler and Runtime
''''''''''''''''''''''''''
- :ghfile:`numba/np/ufunc` - ufunc compiler implementation
- :ghfile:`numba/np/ufunc/_internal.{h,c}` - Python extension module with
helper functions that use CPython & NumPy C API
- :ghfile:`numba/np/ufunc/_ufunc.c` - Used by ``_internal.c``
- :ghfile:`numba/np/ufunc/deviceufunc.py` - Custom ufunc dispatch for
non-CPU targets
- :ghfile:`numba/np/ufunc/gufunc_scheduler.{h,cpp}` - Schedule work chunks
to threads
- :ghfile:`numba/np/ufunc/dufunc.py` - Special ufunc that can compile new
implementations at call time
- :ghfile:`numba/np/ufunc/ufuncbuilder.py` - Top-level orchestration of
ufunc/gufunc compiler pipeline
- :ghfile:`numba/np/ufunc/sigparse.py` - Parser for generalized ufunc
indexing signatures
- :ghfile:`numba/np/ufunc/parallel.py` - Codegen for ``parallel`` target
- :ghfile:`numba/np/ufunc/array_exprs.py` - Rewrite pass for turning array
expressions in regular functions into ufuncs
- :ghfile:`numba/np/ufunc/wrappers.py` - Wrap scalar function kernel with
loops
- :ghfile:`numba/np/ufunc/workqueue.{h,c}` - Threading backend based on
pthreads/Windows threads and queues
- :ghfile:`numba/np/ufunc/omppool.cpp` - Threading backend based on OpenMP
- :ghfile:`numba/np/ufunc/tbbpool.cpp` - Threading backend based on TBB
Unit Tests (CPU)
''''''''''''''''
CPU unit tests (GPU target unit tests are listed in later sections).
- :ghfile:`runtests.py` - Convenience script that launches test runner and
turns on full compiler tracebacks
- :ghfile:`.coveragerc` - Coverage.py configuration
- :ghfile:`numba/runtests.py` - Entry point to unittest runner
- :ghfile:`numba/testing/_runtests.py` - Implementation of custom test runner
command line interface
- :ghfile:`numba/tests/test_*` - Test cases
- :ghfile:`numba/tests/*_usecases.py` - Python functions compiled by some
unit tests
- :ghfile:`numba/tests/support.py` - Helper functions for testing and
special TestCase implementation
- :ghfile:`numba/tests/dummy_module.py` - Module used in
``test_dispatcher.py``
- :ghfile:`numba/tests/npyufunc` - ufunc / gufunc compiler tests
- :ghfile:`numba/testing` - Support code for testing
- :ghfile:`numba/testing/loader.py` - Find tests on disk
- :ghfile:`numba/testing/notebook.py` - Support for testing notebooks
- :ghfile:`numba/testing/main.py` - Numba test runner
Command Line Utilities
''''''''''''''''''''''
- :ghfile:`bin/numba` - Command line stub, delegates to main in
``numba_entry.py``
- :ghfile:`numba/misc/numba_entry.py` - Main function for ``numba`` command line
tool
- :ghfile:`numba/pycc` - Ahead of time compilation of functions to shared
library extension
- :ghfile:`numba/pycc/__init__.py` - Main function for ``pycc`` command line
tool
- :ghfile:`numba/pycc/cc.py` - User-facing API for tagging functions to
compile ahead of time
- :ghfile:`numba/pycc/compiler.py` - Compiler pipeline for creating
standalone Python extension modules
- :ghfile:`numba/pycc/llvm_types.py` - Aliases to LLVM data types used by
``compiler.py``
- :ghfile:`numba/pycc/modulemixin.c` - C file compiled into every compiled
extension. Pulls in C source from Numba core that is needed to make
extension standalone
- :ghfile:`numba/pycc/platform.py` - Portable interface to platform-specific
compiler toolchains
- :ghfile:`numba/pycc/decorators.py` - Deprecated decorators for tagging
functions to compile. Use ``cc.py`` instead.
CUDA GPU Target
'''''''''''''''
Note that the CUDA target does reuse some parts of the CPU target.
- :ghfile:`numba/cuda/` - The implementation of the CUDA (NVIDIA GPU) target
and associated unit tests
- :ghfile:`numba/cuda/decorators.py` - Compiler decorators for CUDA kernels
and device functions
- :ghfile:`numba/cuda/dispatcher.py` - Dispatcher for CUDA JIT functions
- :ghfile:`numba/cuda/printimpl.py` - Special implementation of device printing
- :ghfile:`numba/cuda/libdevice.py` - Registers libdevice functions
- :ghfile:`numba/cuda/kernels/` - Custom kernels for reduction and transpose
- :ghfile:`numba/cuda/device_init.py` - Initializes the CUDA target when
imported
- :ghfile:`numba/cuda/compiler.py` - Compiler pipeline for CUDA target
- :ghfile:`numba/cuda/intrinsic_wrapper.py` - CUDA device intrinsics
(shuffle, ballot, etc)
- :ghfile:`numba/cuda/initialize.py` - Deferred initialization of the CUDA
device and subsystem. Called only when user imports ``numba.cuda``
- :ghfile:`numba/cuda/simulator_init.py` - Initializes the CUDA simulator
subsystem (only when user requests it with env var)
- :ghfile:`numba/cuda/random.py` - Implementation of random number generator
- :ghfile:`numba/cuda/api.py` - User facing APIs imported into ``numba.cuda.*``
- :ghfile:`numba/cuda/stubs.py` - Python placeholders for functions that
only can be used in GPU device code
- :ghfile:`numba/cuda/simulator/` - Simulate execution of CUDA kernels in
Python interpreter
- :ghfile:`numba/cuda/vectorizers.py` - Subclasses of ufunc/gufunc compilers
for CUDA
- :ghfile:`numba/cuda/args.py` - Management of kernel arguments, including
host<->device transfers
- :ghfile:`numba/cuda/target.py` - Typing and target contexts for GPU
- :ghfile:`numba/cuda/cudamath.py` - Type signatures for math functions in
CUDA Python
- :ghfile:`numba/cuda/errors.py` - Validation of kernel launch configuration
- :ghfile:`numba/cuda/nvvmutils.py` - Helper functions for generating
NVVM-specific IR
- :ghfile:`numba/cuda/testing.py` - Support code for creating CUDA unit
tests and capturing standard out
- :ghfile:`numba/cuda/cudadecl.py` - Type signatures of CUDA API (threadIdx,
blockIdx, atomics) in Python on GPU
- :ghfile:`numba/cuda/cudaimpl.py` - Implementations of CUDA API functions
on GPU
- :ghfile:`numba/cuda/codegen.py` - Code generator object for CUDA target
- :ghfile:`numba/cuda/cudadrv/` - Wrapper around CUDA driver API
- :ghfile:`numba/cuda/tests/` - CUDA unit tests, skipped when CUDA is not
detected
- :ghfile:`numba/cuda/tests/cudasim/` - Tests of CUDA simulator
- :ghfile:`numba/cuda/tests/nocuda/` - Tests for NVVM functionality when
CUDA not present
- :ghfile:`numba/cuda/tests/cudapy/` - Tests of compiling Python functions
for GPU
- :ghfile:`numba/cuda/tests/cudadrv/` - Tests of Python wrapper around CUDA
API
=====================================================
Using the Numba Rewrite Pass for Fun and Optimization
=====================================================
Overview
========
This section introduces intermediate representation (IR) rewrites, and
how they can be used to implement optimizations.
As discussed earlier in ":ref:`rewrite-typed-ir`", rewriting the Numba
IR allows us to perform optimizations that would be much more
difficult to perform at the lower LLVM level. Similar to the Numba
type and lowering subsystems, the rewrite subsystem is user
extensible. This extensibility affords Numba the possibility of
supporting a wide variety of domain-specific optimizations (DSOs).
The remaining subsections detail the mechanics of implementing a
rewrite, registering a rewrite with the rewrite registry, and provide
examples of adding new rewrites, as well as internals of the array
expression optimization pass. We conclude by reviewing some use cases
exposed in the examples, as well as reviewing any points where
developers should take care.
Rewriting Passes
================
Rewriting passes have a simple :func:`~Rewrite.match` and
:func:`~Rewrite.apply` interface. The division between matching and
rewriting follows how one would define a term rewrite in a declarative
domain-specific language (DSL). In such DSLs, one may write a
rewrite as follows::
    <match> => <replacement>
The ``<match>`` and ``<replacement>`` symbols represent IR term
expressions, where the left-hand side presents a pattern to match, and
the right-hand side an IR term constructor to build upon matching.
Whenever the rewrite matches an IR pattern, any free variables in the
left-hand side are bound within a custom environment. When applied,
the rewrite uses the pattern matching environment to bind any free
variables in the right-hand side.
As Python is not commonly used in a declarative capacity, Numba uses
object state to handle the transfer of information between the
matching and application steps.
The :class:`Rewrite` Base Class
-------------------------------
.. class:: Rewrite
The :class:`Rewrite` class simply defines an abstract base class
for Numba rewrites. Developers should define rewrites as
subclasses of this base type, overloading the
:func:`~Rewrite.match` and :func:`~Rewrite.apply` methods.
.. attribute:: pipeline
The pipeline attribute contains the
:class:`numba.compiler.Pipeline` instance that is currently
compiling the function under consideration for rewriting.
.. method:: __init__(self, pipeline, *args, **kws)
The base constructor for rewrites simply stashes its arguments
into attributes of the same name. Unless being used in
debugging or testing, rewrites should only be constructed by
the :class:`RewriteRegistry` in the
:func:`RewriteRegistry.apply` method, and the construction
interface should remain stable (though the pipeline will
commonly contain just about everything there is to know).
.. method:: match(self, func_ir, block, typemap, callmap)
The :func:`~Rewrite.match` method takes four arguments other
than *self*:
* *func_ir*: This is an instance of :class:`numba.ir.FunctionIR` for the
function being rewritten.
* *block*: This is an instance of :class:`numba.ir.Block`. The
matching method should iterate over the instructions contained
in the :attr:`numba.ir.Block.body` member.
* *typemap*: This is a Python :class:`dict` instance mapping
from symbol names in the IR, represented as strings, to Numba
types.
* *callmap*: This is another :class:`dict` instance mapping from
calls, represented as :class:`numba.ir.Expr` instances, to
their corresponding call site type signatures, represented as
a :class:`numba.typing.templates.Signature` instance.
The :func:`~Rewrite.match` method should return a :class:`bool`
result. A :obj:`True` result should indicate that one or more
matches were found, and the :func:`~Rewrite.apply` method will
return a new replacement :class:`numba.ir.Block` instance. A
:obj:`False` result should indicate that no matches were found, and
subsequent calls to :func:`~Rewrite.apply` will return undefined
or invalid results.
.. method:: apply(self)
The :func:`~Rewrite.apply` method should only be invoked
following a successful call to :func:`~Rewrite.match`. This
method takes no additional parameters other than *self*, and
should return a replacement :class:`numba.ir.Block` instance.
As mentioned above, the behavior of calling
:func:`~Rewrite.apply` is undefined unless
:func:`~Rewrite.match` has already been called and returned
:obj:`True`.
Subclassing :class:`Rewrite`
----------------------------
Before going into the expectations for the overloaded methods any
:class:`Rewrite` subclass must have, let's step back a minute to
review what is taking place here. By providing an extensible
compiler, Numba opens itself to user-defined code generators which may
be incomplete, or worse, incorrect. When a code generator goes awry,
it can cause abnormal program behavior or early termination.
User-defined rewrites add a new level of complexity because they must
not only generate correct code, but the code they generate should
ensure that the compiler does not get stuck in a match/apply loop.
Non-termination by the compiler will directly lead to non-termination
of user function calls.
There are several ways to help ensure that a rewrite terminates:
* *Typing*: A rewrite should generally attempt to decompose composite
types, and avoid composing new types. If the rewrite is matching a
specific type, changing expression types to a lower-level type will
ensure they will no longer match after the rewrite is applied.
* *Special instructions*: A rewrite may synthesize custom operators or
use special functions in the target IR. This technique again
generates code that is no longer within the domain of the original
match, and the rewrite will terminate.
In the ":ref:`case-study-array-expressions`" subsection, below, we'll
see how the array expression rewriter uses both of these techniques.
Overloading :func:`Rewrite.match`
---------------------------------
Every rewrite developer should seek to have their implementation of
:func:`~Rewrite.match` return a :obj:`False` value as quickly as
possible. Numba is a just-in-time compiler, and adding compilation
time ultimately adds to the user's run time. When a rewrite returns
:obj:`False` for a given block, the registry will no longer process that
block with that rewrite, and the compiler is that much closer to
proceeding to lowering.
This need for timeliness has to be balanced against collecting the
necessary information to make a match for a rewrite. Rewrite
developers should be comfortable adding dynamic attributes to their
subclasses, and then having these new attributes guide construction of
the replacement basic block.
Overloading :func:`Rewrite.apply`
-----------------------------------
The :func:`~Rewrite.apply` method should return a replacement
:class:`numba.ir.Block` instance to replace the basic block that
contained a match for the rewrite. As mentioned above, the IR built
by :func:`~Rewrite.apply` methods should preserve the semantics of the
user's code, but also seek to avoid generating another match for the
same rewrite or set of rewrites.
The Rewrite Registry
====================
When you want to include a rewrite in the rewrite pass, you should
register it with the rewrite registry. The :mod:`numba.rewrites`
module provides both the abstract base class and a class decorator for
hooking into the Numba rewrite subsystem. The following illustrates a
stub definition of a new rewrite::
    from numba import rewrites

    @rewrites.register_rewrite
    class MyRewrite(rewrites.Rewrite):

        def match(self, func_ir, block, typemap, calltypes):
            raise NotImplementedError("FIXME")

        def apply(self):
            raise NotImplementedError("FIXME")
Developers should note that using the class decorator as shown above
will register a rewrite at import time. It is the developer's
responsibility to ensure their extensions are loaded before
compilation starts.
.. _`case-study-array-expressions`:
Case study: Array Expressions
=============================
This subsection looks at the array expression rewriter in more depth.
The array expression rewriter, and most of its support functionality,
are found in the :mod:`numba.npyufunc.array_exprs` module. The
rewriting pass itself is implemented in the :class:`RewriteArrayExprs`
class. In addition to the rewriter, the
:mod:`~numba.npyufunc.array_exprs` module includes a function for
lowering array expressions,
:func:`~numba.npyufunc.array_exprs._lower_array_expr`. The overall
optimization process is as follows:
* :func:`RewriteArrayExprs.match`: The rewrite pass looks for one or
more array operations that form an array expression.
* :func:`RewriteArrayExprs.apply`: Once an array expression is found,
the rewriter replaces the individual array operations with a new
kind of IR expression, the ``arrayexpr``.
* :func:`numba.npyufunc.array_exprs._lower_array_expr`: During
lowering, the code generator calls
:func:`~numba.npyufunc.array_exprs._lower_array_expr` whenever it
finds an ``arrayexpr`` IR expression.
More details on each step of the optimization are given below.
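For orientation, the following sketch shows the kind of user code the pass
targets; any elementwise expression over broadcastable arrays, built from
array operators and supported ufunc calls, qualifies (the function below is
illustrative only):

.. code-block:: python

    import numpy as np
    from numba import njit

    @njit
    def expr(a, b):
        # The whole right-hand side is a single array expression: the
        # rewrite pass can fuse these elementwise operations and the
        # ``np.sqrt`` ufunc call into one ``arrayexpr`` node, avoiding
        # intermediate array allocations.
        return a * b + np.sqrt(a)

    expr(np.ones(10), np.ones(10))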
The :func:`RewriteArrayExprs.match` method
------------------------------------------
The array expression optimization pass starts by looking for array
operations, including calls to supported :class:`~numpy.ufunc`\'s and
user-defined :class:`~numba.DUFunc`\'s. Numba IR follows the
conventions of a static single assignment (SSA) language, meaning that
the search for array operators begins with looking for assignment
instructions.
When the rewriting pass calls the :func:`RewriteArrayExprs.match`
method, it first checks to see if it can trivially reject the basic
block. If the method determines the block to be a candidate for
matching, it sets up the following state variables in the rewrite
object:
* *crnt_block*: The current basic block being matched.
* *typemap*: The *typemap* for the function being matched.
* *matches*: A list of variable names that reference array expressions.
* *array_assigns*: A map from assignment variable names to the actual
assignment instructions that define the given variable.
* *const_assigns*: A map from assignment variable names to the
constant valued expression that defines the constant variable.
At this point, the match method iterates over the assignment
instructions in the input basic block. For each assignment
instruction, the matcher looks for one of two things:
* Array operations: If the right-hand side of the assignment
instruction is an expression, and the result of that expression is
an array type, the matcher checks to see if the expression is either
a known array operation, or a call to a universal function. If an
array operator is found, the matcher stores the left-hand variable
name and the whole instruction in the *array_assigns* member.
Finally, the matcher tests to see if any operands of the array
operation have also been identified as targets of other array
operations. If one or more operands are also targets of array
operations, then the matcher will also append the left-hand side
variable name to the *matches* member.
* Constants: Constants (even scalars) can be operands to array
operations. Without worrying about the constant being a part of an
array expression, the matcher stores constant names and values in
the *const_assigns* member.
The end of the matching method simply checks for a non-empty *matches*
list, returning :obj:`True` if there were one or more matches, and
:obj:`False` when *matches* is empty.
The :func:`RewriteArrayExprs.apply` method
------------------------------------------
When one or more matching array expressions are found by
:func:`RewriteArrayExprs.match`, the rewriting pass will call
:func:`RewriteArrayExprs.apply`. The apply method works in two
passes. The first pass iterates over the matches found, and builds a
map from instructions in the old basic block to new instructions in
the new basic block. The second pass iterates over the instructions
in the old basic block, copying instructions that are not changed by
the rewrite, and replacing or deleting instructions that were
identified by the first pass.
The :func:`RewriteArrayExprs._handle_matches` method implements the first
pass of the code generation portion of the rewrite. For each match,
this method builds a special IR expression that contains an expression
tree for the array expression. To compute the leaves of the
expression tree, the :func:`~RewriteArrayExprs._handle_matches` method
iterates over the operands of the identified root operation. If the
operand is another array operation, it is translated into an
expression sub-tree. If the operand is a constant,
:func:`~RewriteArrayExprs._handle_matches` copies the constant value.
Otherwise, the operand is marked as being used by an array expression.
As the method builds array expression nodes, it builds a map from old
instructions to new instructions (*replace_map*), as well as sets of
variables that may have moved (*used_vars*), and variables that should
be removed altogether (*dead_vars*). These three data structures are
returned back to the calling :func:`RewriteArrayExprs.apply` method.
The remaining part of the :func:`RewriteArrayExprs.apply` method
iterates over the instructions in the old basic block. For each
instruction, this method either replaces, deletes, or duplicates that
instruction based on the results of
:func:`RewriteArrayExprs._handle_matches`. The following list
describes how the optimization handles individual instructions:
* When an instruction is an assignment,
:func:`~RewriteArrayExprs.apply` checks to see if it is in the
replacement instruction map. When an assignment instruction is found
in the instruction map, :func:`~RewriteArrayExprs.apply` must then
check to see if the replacement instruction is also in the replacement
map. The optimizer continues this check until it either arrives at a
:obj:`None` value or an instruction that isn't in the replacement map.
Instructions that have a replacement that is :obj:`None` are deleted.
Instructions that have a non-:obj:`None` replacement are replaced.
Assignment instructions not in the replacement map are appended to the
new basic block with no changes made.
* When the instruction is a delete instruction, the rewrite checks to
see if it deletes a variable that may still be used by a later array
expression, or if it deletes a dead variable. Delete instructions for
used variables are added to a map of deferred delete instructions that
:func:`~RewriteArrayExprs.apply` uses to move them past any uses of
that variable. The loop copies delete instructions for non-dead
variables, and ignores delete instructions for dead variables
(effectively removing them from the basic block).
* All other instructions are appended to the new basic block.
Finally, the :func:`~RewriteArrayExprs.apply` method returns the new
basic block for lowering.
The :func:`~numba.npyufunc.array_exprs._lower_array_expr` function
------------------------------------------------------------------
If we left things at just the rewrite, then the lowering stage of the
compiler would fail, complaining it doesn't know how to lower
``arrayexpr`` operations. We start by hooking a lowering function
into the target context whenever the :class:`RewriteArrayExprs` class
is instantiated by the compiler. This hook causes the lowering pass to
call :func:`~numba.npyufunc.array_exprs._lower_array_expr` whenever it
encounters an ``arrayexpr`` operator.
This function has two steps:
* Synthesize a Python function that implements the array expression:
This new Python function essentially behaves like a Numpy
:class:`~numpy.ufunc`, returning the result of the expression on
scalar values in the broadcasted array arguments. The lowering
function accomplishes this by translating from the array expression
tree into a Python AST.
* Compile the synthetic Python function into a kernel: At this point,
the lowering function relies on existing code for lowering ufunc and
DUFunc kernels, calling
:func:`numba.targets.numpyimpl.numpy_ufunc_kernel` after defining
how to lower calls to the synthetic function.
The end result is similar to loop lifting in Numba's object mode.
Conclusions and Caveats
=======================
We have seen how to implement rewrites in Numba, starting with the
interface, and ending with an actual optimization. The key points of
this section are:
* When writing a good plug-in, the matcher should try to get a
go/no-go result as soon as possible.
* The rewrite application portion can be more computationally
expensive, but should still generate code that won't cause infinite
loops in the compiler.
* We use object state to communicate any results of matching to the
rewrite application pass.
.. Copyright (c) 2017 Intel Corporation
SPDX-License-Identifier: BSD-2-Clause
.. _arch-stencil:
=================
Notes on stencils
=================
Numba provides the :ref:`@stencil decorator <numba-stencil>` to
represent stencil computations. This document explains how this
feature is implemented in the several different modes available in
Numba. Currently, calls to the stencil from non-jitted code are
supported, as well as calls from jitted code, either with or without
the :ref:`parallel=True <parallel_jit_option>` option.
The stencil decorator
=====================
The stencil decorator itself just returns a ``StencilFunc`` object.
This object encapsulates the original stencil kernel function
as specified in the program and the options passed to the
stencil decorator. Also of note is that after the first compilation
of the stencil, the computed neighborhood of the stencil is
stored in the ``StencilFunc`` object in the ``neighborhood`` attribute.
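As a quick illustration (a sketch using the public ``@stencil`` API; the
attribute access at the end simply mirrors the caching behaviour described
above):

.. code-block:: python

    import numpy as np
    from numba import stencil

    @stencil
    def smooth(a):
        # Relative indexing into the input array; the inferred
        # neighborhood for this kernel is ((-1, 1),).
        return (a[-1] + a[0] + a[1]) / 3

    smooth(np.arange(10.0))     # first call triggers compilation
    print(smooth.neighborhood)  # populated after the first compilation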
Handling the three modes
========================
As mentioned above, Numba supports the calling of stencils
from inside or outside a ``@jit`` compiled function, with or
without the :ref:`parallel=True <parallel_jit_option>` option.
Outside jit context
-------------------
``StencilFunc`` overrides the ``__call__`` method so that calls
to ``StencilFunc`` objects execute the stencil::
    def __call__(self, *args, **kwargs):
        result = kwargs.get('out')
        new_stencil_func = self._stencil_wrapper(result, None, *args)
        if result is None:
            return new_stencil_func.entry_point(*args)
        else:
            return new_stencil_func.entry_point(*args, result)
First, the presence of the optional :ref:`out <stencil-function-out>`
parameter is checked. If it is present then the output array is
stored in ``result``. Then, the call to ``_stencil_wrapper``
generates the stencil function given the result and argument types
and finally the generated stencil function is executed and its result
returned.
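For example, a minimal sketch of both call paths described above:

.. code-block:: python

    import numpy as np
    from numba import stencil

    @stencil
    def diff(a):
        return a[1] - a[-1]

    a = np.arange(10.0)
    out = np.zeros_like(a)
    diff(a, out=out)   # ``out`` becomes ``result`` in ``__call__`` above
    res = diff(a)      # without ``out``, a new output array is allocated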
Jit without ``parallel=True``
-----------------------------
When constructed, a ``StencilFunc`` inserts itself into the typing
context's set of user functions and provides the ``_type_me``
callback. In this way, the standard Numba compiler is able to
determine the output type and signature of a ``StencilFunc``.
Each ``StencilFunc`` maintains a cache of previously seen combinations
of input argument types and keyword types. If previously seen,
the ``StencilFunc`` returns the computed signature. If not previously
computed, the ``StencilFunc`` computes the return type of the stencil
by running the Numba compiler frontend on the stencil kernel and
then performing type inference on the :term:`Numba IR` (IR) to get the scalar
return type of the kernel. From that, a Numpy array type is constructed
whose element type matches that scalar return type.
After computing the signature of the stencil for a previously
unseen combination of input and keyword types, the ``StencilFunc``
then :ref:`creates the stencil function <arch-stencil-create-function>` itself.
``StencilFunc`` then installs the new stencil function's definition
in the target context so that jitted code is able to call it.
Thus, in this mode, the generated stencil function is a stand-alone
function called like a normal function from within jitted code.
Jit with ``parallel=True``
--------------------------
When calling a ``StencilFunc`` from a jitted context with ``parallel=True``,
a separate stencil function as generated by :ref:`arch-stencil-create-function`
is not used. Instead, `parfors` (:ref:`parallel-accelerator`) are
created within the current function that implements the stencil.
This code again starts with the stencil kernel and does a similar kernel
size computation but then rather than standard Python looping syntax,
corresponding `parfors` are created so that the execution of the stencil
will take place in parallel.
The stencil to `parfor` translations can also be selectively disabled
by setting ``parallel={'stencil': False}``, among other sub-options
described in :ref:`parallel-accelerator`.
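For example, a sketch contrasting the two jitted modes (the sub-option
dictionary form is the one referenced above):

.. code-block:: python

    import numpy as np
    from numba import njit, stencil

    @stencil
    def kernel(a):
        return 0.5 * (a[-1] + a[1])

    @njit(parallel=True)
    def run_parfor(a):
        # The stencil call is translated into parfors in this mode.
        return kernel(a)

    @njit(parallel={'stencil': False})
    def run_standalone(a):
        # Other parallel transforms stay enabled, but the stencil call
        # uses the stand-alone generated stencil function instead.
        return kernel(a)

    a = np.ones(16)
    run_parfor(a)
    run_standalone(a)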
.. _arch-stencil-create-function:
Creating the stencil function
=============================
Conceptually, a stencil function is created from the user-specified
stencil kernel by adding looping code around the kernel, transforming
the relative kernel indices into absolute array indices based on the
loop indices, and replacing the kernel's ``return`` statement with
a statement to assign the computed value into the output array.
To accomplish this transformation, first, a copy of the stencil
kernel IR is created so that subsequent modifications of the IR
for different stencil signatures will not affect each other.
Then, an approach similar to how GUFunc's are created for `parfors`
is employed. In a text buffer, a Python function is created with
a unique name. The input array parameter is added to the function
definition and if the ``out`` argument type is present then an
``out`` parameter is added to the stencil function definition.
If the ``out`` argument is not present then first an output array
is created with ``numpy.zeros`` having the same shape as the
input array.
The kernel is then analyzed to compute the stencil size and the
shape of the boundary (or the ``neighborhood`` stencil decorator
argument is used for this purpose if present).
Then, one ``for`` loop for each dimension of the input array is
added to the stencil function definition. The range of each
loop is controlled by the stencil kernel size previously computed
so that the boundary of the output image is not modified but instead
left as is. The body of the innermost ``for`` loop is a single
``sentinel`` statement that is easily recognized in the IR.
A call to ``exec`` with the text buffer is used to force the
stencil function into existence and an ``eval`` is used to get
access to the corresponding function on which ``run_frontend`` is
used to get the stencil function IR.
Various renaming and relabeling is performed on the stencil function
IR and the kernel IR so that the two can be combined without conflict.
The relative indices in the kernel IR (i.e., ``getitem`` calls) are
replaced with expressions where the corresponding loop index variables
are added to the relative indices. The ``return`` statement in the
kernel IR is replaced with a ``setitem`` for the corresponding element
in the output array.
The stencil function IR is then scanned for the sentinel and the
sentinel replaced with the modified kernel IR.
Next, ``compile_ir`` is used to compile the combined stencil function
IR. The resulting compile result is cached in the ``StencilFunc`` so that
other calls to the same stencil do not need to undertake this process
again.
Exceptions raised
=================
Various checks are performed during stencil compilation to make sure
that user-specified options do not conflict with each other or with
other runtime parameters. For example, if the user has manually
specified a ``neighborhood`` to the stencil decorator, the length of
that neighborhood must match the dimensionality of the input array.
If this is not the case, a ``ValueError`` is raised.
If the neighborhood has not been specified then it must be inferred
and a requirement to infer the kernel is that all indices are constant
integers. If they are not, a ``ValueError`` is raised indicating that
kernel indices may not be non-constant.
Finally, the stencil implementation detects the output array type
by running Numba type inference on the stencil kernel. If the
return type of this kernel does not match the type of the value
passed to the ``cval`` stencil decorator option then a ``ValueError``
is raised.
==========================
Notes on Target Extensions
==========================
.. warning:: All features and APIs described in this page are in-development and
may change at any time without deprecation notices being issued.
Inheriting compiler flags from the caller
=========================================
Compiler flags, i.e. options such as ``fastmath``, ``nrt`` in
``@jit(nrt=True, fastmath=True)`` are specified per-function but their
effects are not well-defined---some flags affect the entire callgraph, some
flags affect only the current function. Sometimes it is necessary for callees
to inherit flags from the caller; for example the ``fastmath`` flag should be
infectious.
To address the problem, the following are needed:
1. Better definitions for the semantics of compiler flags. Preferably, all flags should
limit their effect to the current function. (TODO)
2. Allow compiler flags to be inherited from the caller. (Done)
3. Consider compiler flags in function resolution. (TODO)
:class:`numba.core.targetconfig.ConfigStack` is used to propagate the compiler flags
throughout the compiler. At the start of the compilation, the flags are pushed
into the ``ConfigStack``, which maintains a thread-local stack for the
compilation. Thus, callees can check the flags in the caller.
.. autoclass:: numba.core.targetconfig.ConfigStack
:members:
Compiler flags
--------------
`Compiler flags`_ are defined as a subclass of ``TargetConfig``:
.. _Compiler flags: https://github.com/numba/numba/blob/7e8538140ce3f8d01a5273a39233b5481d8b20b1/numba/core/compiler.py#L39
.. autoclass:: numba.core.targetconfig.TargetConfig
:members:
These are internal compiler flags and they are different from the user-facing
options used in the jit decorators.
Internally, `the user-facing options are mapped to the internal compiler flags <https://github.com/numba/numba/blob/7e8538140ce3f8d01a5273a39233b5481d8b20b1/numba/core/options.py#L72>`_
by :class:`numba.core.options.TargetOptions`. Each target can override the
default compiler flags and control the flag inheritance in
``TargetOptions.finalize``. `The CPU target overrides it.
<https://github.com/numba/numba/blob/7e8538140ce3f8d01a5273a39233b5481d8b20b1/numba/core/cpu.py#L259>`_
.. autoclass:: numba.core.options.TargetOptions
:members: finalize
In :meth:`numba.core.options.TargetOptions.finalize`,
use :meth:`numba.core.targetconfig.TargetConfig.inherit_if_not_set`
to request a compiler flag from the caller if it is not set for the current
function.
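A minimal sketch of such an override (the class name is hypothetical; only
``finalize`` and ``inherit_if_not_set`` are taken from the API described
above):

.. code-block:: python

    from numba.core.options import TargetOptions

    class MyTargetOptions(TargetOptions):
        def finalize(self, flags, options):
            # If the user did not set ``fastmath`` on this function,
            # request the value used by the caller instead of the
            # global default.
            flags.inherit_if_not_set("fastmath")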
=========================================
Notes on Numba's threading implementation
=========================================
The execution of the work presented by the Numba ``parallel`` targets is
undertaken by the Numba threading layer. Practically, the "threading layer"
is a Numba built-in library that can perform the required concurrent execution.
At the time of writing there are three threading layers available, each
implemented via a different lower level native threading library. More
information on the threading layers and appropriate selection of a threading
layer for a given application/system can be found in the
:ref:`threading layer documentation <numba-threading-layer>`.
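For reference, a threading layer can be requested programmatically before the
first parallel execution (a sketch using the public API, assuming the TBB
layer is installed):

.. code-block:: python

    import numpy as np
    from numba import config, njit, prange, threading_layer

    config.THREADING_LAYER = 'tbb'   # request TBB before any parallel execution

    @njit(parallel=True)
    def double(a):
        for i in prange(a.shape[0]):
            a[i] *= 2
        return a

    double(np.ones(100))
    print(threading_layer())         # reports the layer that was actually selected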
The pertinent information to note for the following sections is that the
function in the threading library that performs the parallel execution is the
``parallel_for`` function. The job of this function is to both orchestrate and
execute the parallel tasks.
The relevant source files referenced in this document are
- ``numba/np/ufunc/tbbpool.cpp``
- ``numba/np/ufunc/omppool.cpp``
- ``numba/np/ufunc/workqueue.c``
These files contain the TBB, OpenMP, and workqueue threadpool
implementations, respectively. Each includes the functions
``set_num_threads()``, ``get_num_threads()``, and ``get_thread_id()``, as
well as the relevant logic for thread masking in their respective
schedulers. Note that the basic thread local variable logic is duplicated in
each of these files, and not shared between them.
- ``numba/np/ufunc/parallel.py``
This file contains the Python and JIT compatible wrappers for
``set_num_threads()``, ``get_num_threads()``, and ``get_thread_id()``, as
well as the code that loads the above libraries into Python and launches the
threadpool.
- ``numba/parfors/parfor_lowering.py``
This file contains the main logic for generating code for the parallel
backend. The thread mask is accessed in this file in the code that generates
scheduler code, and passed to the relevant backend scheduler function (see
below).
Thread masking
--------------
As part of its design, Numba never launches new threads beyond the threads
that are launched initially with ``numba.np.ufunc.parallel._launch_threads()``
when the first parallel execution is run. This is due to the way threads were
already implemented in Numba prior to thread masking being implemented. This
restriction was kept to keep the design simple, although it could be removed
in the future. Consequently, it's possible to programmatically set the number
of threads, but only to less than or equal to the total number that have
already been launched. This is done by "masking" out unused threads, causing
them to do no work. For example, on a 16 core machine, if the user were to
call ``set_num_threads(4)``, Numba would always have 16 threads present, but
12 of them would sit idle for parallel computations. A further call to
``set_num_threads(16)`` would cause those same threads to do work in later
computations.
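A sketch of this from user code, using the public masking API (assuming at
least four threads were launched, i.e. ``NUMBA_NUM_THREADS >= 4``):

.. code-block:: python

    import numpy as np
    from numba import config, njit, prange, set_num_threads, get_num_threads

    @njit(parallel=True)
    def total(a):
        s = 0.0
        for i in prange(a.shape[0]):
            s += a[i]
        return s

    a = np.ones(1_000_000)
    set_num_threads(4)                         # mask: only 4 launched threads do work
    total(a)
    set_num_threads(config.NUMBA_NUM_THREADS)  # unmask all launched threads again
    total(a)
    print(get_num_threads())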
:ref:`Thread masking <numba-threading-layer-thread-masking>` was added to make
it possible for a user to programmatically alter the number of threads
performing work in the threading layer. Thread masking proved challenging to
implement as it required the development of a programming model that is suitable
for users, easy to reason about, and could be implemented safely, with
consistent behavior across the various threading layers.
Programming model
~~~~~~~~~~~~~~~~~
The programming model chosen is similar to that found in OpenMP. The reasons
for this choice were that it is familiar to a lot of users, restricted in
scope and also simple. The number of threads in use is specified by calling
``set_num_threads`` and the number of threads in use can be queried by calling
``get_num_threads``. These two functions are synonymous with their OpenMP
counterparts (with the above restriction that the mask must be less than or
equal to the number of launched threads). The execution semantics are also
similar to OpenMP in that once a parallel region is launched, altering the
thread mask has no impact on the currently executing region, but will have an
impact on parallel regions executed subsequently.
The Implementation
~~~~~~~~~~~~~~~~~~
So as to place no further restrictions on user code other than those that
already existed in the threading layer libraries, careful consideration of the
design of thread masking was required. The "thread mask" cannot be stored in a
global value as concurrent use of the threading layer may result in classic
forms of race conditions on the value itself. Numerous designs were discussed
involving various types of mutex on such a global value, all of which were
eventually broken through thought experiment alone. It eventually transpired
that, following some OpenMP implementations, the "thread mask" is best
implemented as a ``thread local``. This means each thread that executes a Numba
parallel function will have a thread local storage (TLS) slot that contains the
value of the thread mask to use when scheduling threads in the ``parallel_for``
function.
The above notion of TLS use for a thread mask is relatively easy to implement,
``get_num_threads`` and ``set_num_threads`` simply need to address the TLS slot
in a given threading layer. This also means that the execution schedule for a
parallel region can be derived from a run time call to ``get_num_threads``. This
is achieved via a well-known and relatively easy to implement pattern:
a ``C`` library function is registered and wrapped in the internal Numba
implementation.
In addition to satisfying the original upfront thread masking requirements, a
few more complicated scenarios needed consideration as follows.
Nested parallelism
******************
In all threading layers a "main thread" will invoke the ``parallel_for``
function and then in the parallel region, depending on the threading layer,
some number of additional threads will assist in doing the actual work.
If the work contains a call to another parallel function (i.e. nested
parallelism) it is necessary for the thread making the call to know what the
"thread mask" of the main thread is so that it can propagate it into the
``parallel_for`` call it makes when executing the nested parallel function.
The implementation of this behavior is threading layer specific but the general
principle is for the "main thread" to always "send" the value of the thread mask
from its TLS slot to all threads in the threading layer that are active in the
parallel region. These active threads then update their TLS slots with this
value prior to performing any work. The net result of this implementation detail
is that:
* thread masks correctly propagate into nested functions
* it's still possible for each thread in a parallel region to safely have a
different mask with which to call nested functions, if it's not set explicitly
then the inherited mask from the "main thread" is used
* threading layers which have dynamic scheduling with threads potentially
joining and leaving the active pool during a ``parallel_for`` execution are
successfully accommodated
* any "main thread" thread mask is entirely decoupled from the in-flux nature
of the thread masks of the threads in the active thread pool
Python threads independently invoking parallel functions
********************************************************
The threading layer launch sequence is heavily guarded to ensure that the
launch is both thread and process safe and run once per process. In a system
with numerous Python ``threading`` module threads all using Numba, the first
thread through the launch sequence will get its thread mask set appropriately,
but no further threads can run the launch sequence. This means that other
threads will need their initial thread mask set some other way. This is
achieved when ``get_num_threads`` is called and no thread mask is present, in
this case the thread mask will be set to the default. In the implementation,
"no thread mask is present" is represented by the value ``-1`` and the "default
thread mask" (unset) is represented by the value ``0``. The implementation also
immediately calls ``set_num_threads(NUMBA_NUM_THREADS)`` after doing this, so
if either ``-1`` or ``0`` is encountered as a result from ``get_num_threads()`` it
indicates a bug in the above processes.
OS ``fork()`` calls
*******************
The use of TLS was also in part driven by Linux (by far the most popular
platform for Numba use) having a ``fork(2, 3P)`` call that will do TLS
propagation into child processes, see ``clone(2)``\ 's ``CLONE_SETTLS``.
Thread ID
*********
A private ``get_thread_id()`` function was added to each threading backend,
which returns a unique ID for each thread. This can be accessed from Python by
``numba.np.ufunc.parallel._get_thread_id()`` (it can also be used inside a
JIT compiled function). The thread ID function is useful for testing that the
thread masking behavior is correct, but it should not be used outside of the
tests. For example, one can call ``set_num_threads(4)`` and then collect all
unique ``_get_thread_id()``\ s in a parallel region to verify that only 4
threads are run.
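A sketch of such a test, based directly on the description above
(``_get_thread_id`` is private and used here purely for illustration; the
example assumes at least four threads were launched):

.. code-block:: python

    import numpy as np
    from numba import njit, prange, set_num_threads
    from numba.np.ufunc.parallel import _get_thread_id

    @njit(parallel=True)
    def thread_ids(n):
        out = np.empty(n, dtype=np.int64)
        for i in prange(n):
            out[i] = _get_thread_id()
        return out

    set_num_threads(4)
    unique = set(thread_ids(100_000))
    assert len(unique) <= 4   # TBB may schedule fewer threads, see the caveats below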
Caveats
~~~~~~~
Some caveats to be aware of when testing thread masking:
- The TBB backend may choose to schedule fewer than the given mask number of
threads. Thus a test such as the one described above may return fewer than 4
unique threads.
- The workqueue backend is not threadsafe, so attempts to do multithreading
nested parallelism with it may result in deadlocks or other undefined
behavior. The workqueue backend will raise a SIGABRT signal if it detects
nested parallelism.
- Certain backends may reuse the main thread for computation, but this
behavior shouldn't be relied upon (for instance, if propagating exceptions).
Use in Code Generation
~~~~~~~~~~~~~~~~~~~~~~
The general pattern for using ``get_num_threads`` in code generation is
.. code:: python

    from llvmlite import ir as llvmir

    get_num_threads = cgutils.get_or_insert_function(
        builder.module,
        llvmir.FunctionType(llvmir.IntType(types.intp.bitwidth), []),
        name="get_num_threads")
    num_threads = builder.call(get_num_threads, [])
    with cgutils.if_unlikely(builder, builder.icmp_signed('<=', num_threads,
                                                          num_threads.type(0))):
        cgutils.printf(builder, "num_threads: %d\n", num_threads)
        context.call_conv.return_user_exc(builder, RuntimeError,
                                          ("Invalid number of threads. "
                                           "This likely indicates a bug in Numba.",))

    # Pass num_threads through to the appropriate backend function here
See the code in ``numba/parfors/parfor_lowering.py``.
The guard against ``num_threads`` being <= 0 is not strictly necessary, but it
can protect against accidentally incorrect behavior in case the thread masking
logic contains a bug.
The ``num_threads`` variable should be passed through to the appropriate
backend function, such as ``do_scheduling`` or ``parallel_for``. If it's used
in some way other than passing it through to the backend function, the above
considerations should be taken into account to ensure the use of the
``num_threads`` variable is safe. It would probably be better to keep such
logic in the threading backends, rather than trying to do it in code
generation.
.. _chunk-details-label:
Parallel Chunksize Details
~~~~~~~~~~~~~~~~~~~~~~~~~~
There are some cases in which the actual parallel work chunk sizes may differ
from the chunk size requested through :func:`numba.set_parallel_chunksize`.
First, if the number of required chunks based on the specified chunk size
is less than the number of configured threads then Numba will use all of the configured
threads to execute the parallel region. In this case, the actual chunk size will be
less than the requested chunk size. Second, due to truncation, in cases where the
iteration count is slightly less than a multiple of the chunk size
(e.g., 14 iterations and a specified chunk size of 5), the actual chunk size will be
larger than the specified chunk size. As in the given example, the number of chunks
would be 2 and the actual chunk size would be 7 (i.e. 14 / 2). Lastly, since Numba
divides an N-dimensional iteration space into N-dimensional (hyper)rectangular chunks,
it may be the case there are not N integer factors whose product is equal to the chunk
size. In this case, some chunks will have an area/volume larger than the chunk size
whereas others will be less than the specified chunk size.
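For example, a sketch using the public chunk size API; the 14-iteration case
mirrors the arithmetic above:

.. code-block:: python

    import numpy as np
    from numba import njit, prange, get_parallel_chunksize, set_parallel_chunksize

    @njit(parallel=True)
    def chunked_sum(a):
        old = get_parallel_chunksize()
        set_parallel_chunksize(5)    # request chunks of 5 iterations
        s = 0.0
        for i in prange(a.shape[0]):
            s += a[i]
        set_parallel_chunksize(old)
        return s

    # 14 iterations with a requested chunk size of 5: truncation gives
    # 14 // 5 == 2 chunks, so the actual chunk size becomes 7.
    chunked_sum(np.ones(14))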
Registering Extensions with Entry Points
========================================
Often, third party packages will have a user-facing API as well as define
extensions to the Numba compiler. In those situations, the new types and
overloads can be registered with Numba when the package is imported by the user.
However, there are situations where a Numba extension would not normally be
imported directly by the user, but must still be registered with the Numba
compiler. An example of this is the `numba-scipy
<https://github.com/numba/numba-scipy>`_ package, which adds support for some
SciPy functions to Numba. The end user does not need to ``import
numba_scipy`` to enable compiler support for SciPy, the extension only needs
to be installed in the Python environment.
Numba discovers extensions using the `entry points
<https://setuptools.readthedocs.io/en/latest/setuptools.html#dynamic-discovery-of-services-and-plugins>`_
feature of ``setuptools``. This allows a Python package to register an
initializer function that will be called before ``numba`` compiles for the
first time. The delay ensures that the cost of importing extensions is
deferred until it is necessary.
Adding Support for the "Init" Entry Point
-----------------------------------------
A package can register an initialization function with Numba by adding the
``entry_points`` argument to the ``setup()`` function call in ``setup.py``:
.. code-block:: python

    setup(
        ...,
        entry_points={
            "numba_extensions": [
                "init = numba_scipy:_init_extension",
            ],
        },
        ...
    )
Numba currently only looks for the ``init`` entry point in the
``numba_extensions`` group. The entry point should be a function (any name,
as long as it matches what is listed in ``setup.py``) that takes no arguments,
and the return value is ignored. This function should register types,
overloads, or call other Numba extension APIs. The order of initialization of
extensions is undefined.
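For illustration, a sketch of what such an initializer might look like (the
package and submodule names are hypothetical; only the entry point mechanics
follow the description above):

.. code-block:: python

    # mypkg/__init__.py -- registered as "init = mypkg:_init_extension"

    def _init_extension():
        """Called once by Numba just before the first compilation.

        Importing the submodule registers its types and ``@overload``
        definitions as a side effect; the return value is ignored.
        """
        from . import _numba_overloads   # hypothetical module of @overload definitions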
Testing your Entry Point
------------------------
Numba loads all entry points when the first function is compiled. To test your
entry point, it is not sufficient to just ``import numba``; you have to define
and run a small function, like this:
.. code-block:: python

    import numba; numba.njit(lambda x: x + 1)(123)
It is not necessary to import your module: entry points are identified by the
``entry_points.txt`` file in your library's ``*.egg-info`` directory.
The ``setup.py build`` command does not create eggs, but ``setup.py sdist``
(for testing in a local directory) and ``setup.py install`` do. All entry points
registered in eggs that are on the Python path are loaded. Be sure to check for
stale ``entry_points.txt`` when debugging.