CUDA Ufuncs and Generalized Ufuncs
==================================

This page describes the CUDA ufunc-like object.

To support the programming patterns of CUDA programs, CUDA Vectorize and
GUVectorize cannot produce a conventional ufunc.  Instead, a ufunc-like
object is returned.  This object is a close analog of, but not fully
compatible with, a regular NumPy ufunc.  The CUDA ufunc adds support for
passing intra-device arrays (arrays already on the GPU) to reduce
traffic over the PCI-express bus.  It also accepts a ``stream`` keyword
for launching kernels asynchronously.
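
For example, device arrays and a stream may be passed directly to the
ufunc call.  A minimal sketch (the ``cu_add`` ufunc below is hypothetical,
defined only for illustration)::

    import numpy as np
    from numba import vectorize, cuda

    @vectorize(['float32(float32, float32)'], target='cuda')
    def cu_add(x, y):
        return x + y

    # transfer the inputs once; subsequent calls reuse the device copies
    # without crossing the PCI-express bus again
    d_x = cuda.to_device(np.arange(10, dtype=np.float32))
    d_y = cuda.to_device(np.arange(10, dtype=np.float32))

    stream = cuda.stream()
    d_out = cu_add(d_x, d_y, stream=stream)  # device arrays in, device array out
    stream.synchronize()
    print(d_out.copy_to_host())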

Example: Basic Usage
--------------------

::

    import math
    from numba import vectorize, cuda
    import numpy as np

    @vectorize(['float32(float32, float32, float32)',
                'float64(float64, float64, float64)'],
               target='cuda')
    def cu_discriminant(a, b, c):
        return math.sqrt(b ** 2 - 4 * a * c)

    N = 10000
    dtype = np.float32

    # prepare the input
    A = np.array(np.random.sample(N), dtype=dtype)
    B = np.array(np.random.sample(N) + 10, dtype=dtype)
    C = np.array(np.random.sample(N), dtype=dtype)

    D = cu_discriminant(A, B, C)

    print(D)  # print result

Example: Calling Device Functions
----------------------------------

All CUDA ufunc kernels can call other CUDA device functions::

    from numba import vectorize, cuda

    # define a device function
    @cuda.jit('float32(float32, float32, float32)', device=True, inline=True)
    def cu_device_fn(x, y, z):
        return x ** y / z

    # define a ufunc that calls our device function
    @vectorize(['float32(float32, float32, float32)'], target='cuda')
    def cu_ufunc(x, y, z):
        return cu_device_fn(x, y, z)


Generalized CUDA ufuncs
-----------------------

Generalized ufuncs may be executed on the GPU using CUDA, analogous to
the CUDA ufunc functionality.  This may be accomplished as follows::

    from numba import guvectorize

    @guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],
                 '(m,n),(n,p)->(m,p)', target='cuda')
    def matmulcore(A, B, C):
        # naive matrix multiply: each core computes one (m, p) product
        for i in range(A.shape[0]):
            for j in range(B.shape[1]):
                C[i, j] = 0
                for k in range(A.shape[1]):
                    C[i, j] += A[i, k] * B[k, j]
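
When called with host arrays, the inputs are transferred to the device
and the result is copied back automatically; omitting the output argument
lets the gufunc allocate it.  A minimal usage sketch::

    import numpy as np

    A = np.random.rand(64, 32).astype(np.float32)
    B = np.random.rand(32, 16).astype(np.float32)

    C = matmulcore(A, B)  # C is allocated for us, with shape (64, 16)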


.. comment

    Example: A Chunk at a Time
    ---------------------------

    Partitioning your data into chunks allows computation and memory transfer
    to be overlapped.  This can increase the throughput of your ufunc and
    enables your ufunc to operate on data that is larger than the memory
    capacity of your GPU.  For example:

    ::

        import math
        from numba import vectorize, cuda
        import numpy as np

        # the ufunc kernel
        def discriminant(a, b, c):
            return math.sqrt(b ** 2 - 4 * a * c)

        cu_discriminant = vectorize(['float32(float32, float32, float32)',
                                     'float64(float64, float64, float64)'],
                                    target='cuda')(discriminant)

        N = int(1e+8)
        dtype = np.float32

        # prepare the input
        A = np.array(np.random.sample(N), dtype=dtype)
        B = np.array(np.random.sample(N) + 10, dtype=dtype)
        C = np.array(np.random.sample(N), dtype=dtype)
        D = np.empty(A.shape, dtype=A.dtype)

        # create a CUDA stream
        stream = cuda.stream()

        chunksize = 10 ** 6  # must be an int: np.split takes an integer section count
        chunkcount = N // chunksize

        # partition NumPy arrays into chunks
        # no copying is performed
        sA = np.split(A, chunkcount)
        sB = np.split(B, chunkcount)
        sC = np.split(C, chunkcount)
        sD = np.split(D, chunkcount)

        device_ptrs = []

        with stream.auto_synchronize():
            # every operation in this context will be launched asynchronously
            # by using the CUDA stream

            # for each chunk
            for a, b, c, d in zip(sA, sB, sC, sD):
                # transfer to device
                dA = cuda.to_device(a, stream)
                dB = cuda.to_device(b, stream)
                dC = cuda.to_device(c, stream)
                dD = cuda.to_device(d, stream, copy=False) # no copying
                # launch kernel
                cu_discriminant(dA, dB, dC, out=dD, stream=stream)
                # retrieve result
                dD.copy_to_host(d, stream)
                # store device pointers to prevent them from freeing before
                # the kernel is scheduled
                device_ptrs.extend([dA, dB, dC, dD])

        # data is ready at this point inside D