.. _nbep-7:

===============================================
NBEP 7: CUDA External Memory Management Plugins
===============================================

:Author: Graham Markall, NVIDIA
:Contributors: Thomson Comer, Peter Entschev, Leo Fang, John Kirkham, Keith Kraus
:Date: March 2020
:Status: Final

Background and goals
--------------------

The :ref:`CUDA Array Interface <cuda-array-interface>` enables sharing of data
between different Python libraries that access CUDA devices. However, each
library manages its own memory distinctly from the others. For example:


* `Numba <https://numba.pydata.org/>`_ internally manages memory for the creation
  of device and mapped host arrays.
* `The RAPIDS libraries <https://rapids.ai/>`_ (cuDF, cuML, etc.) use the `Rapids
  Memory Manager <https://github.com/rapidsai/rmm>`_ for allocating device
  memory.
* `CuPy <https://cupy.chainer.org/>`_ includes a `memory pool
  implementation <https://docs-cupy.chainer.org/en/stable/reference/memory.html>`_
  for both device and pinned memory.

The goal of this NBEP is to describe a plugin interface that enables Numba's
internal memory management to be replaced with an external memory manager by the
user. When the plugin interface is in use, Numba no longer directly allocates or
frees any memory when creating arrays, but instead requests allocations and
frees through the external manager.

Requirements
------------

Provide an *External Memory Manager (EMM)* interface in Numba.


* When the EMM is in use, Numba will make all memory allocations using the EMM.
  It will never directly call functions such as ``cuMemAlloc``\ , ``cuMemFree``\ , etc.
* When not using an *External Memory Manager (EMM)*\ , Numba's present behaviour
  is unchanged (at the time of writing, the current version is the 0.48
  release).

If an EMM is to be used, it will entirely replace Numba's internal memory
management for the duration of program execution. An interface for setting the
memory manager will be provided.

Device vs. Host memory
^^^^^^^^^^^^^^^^^^^^^^^

An EMM will always take responsibility for the management of device memory.
However, not all CUDA memory management libraries also support managing host
memory, so a facility for Numba to continue the management of host memory 
whilst ceding control of device memory to the EMM will be provided.

Deallocation strategies
^^^^^^^^^^^^^^^^^^^^^^^

Numba's internal memory management uses a :ref:`deallocation strategy
<deallocation-behavior>` designed to increase efficiency by deferring
deallocations until a significant quantity are pending. It also provides a
mechanism for preventing deallocations entirely during critical sections, using
the :func:`~numba.cuda.defer_cleanup` context manager.


* When the EMM is not in use, the deallocation strategy and operation of
  ``defer_cleanup`` remain unchanged.
* When the EMM is in use, the deallocation strategy is implemented by the EMM,
  and Numba's internal deallocation mechanism is not used. For example:

  * A similar strategy to Numba's could be implemented by the EMM, or
  * Deallocated memory might immediately be returned to a memory pool.

* The ``defer_cleanup`` context manager may behave differently with an EMM - an
  EMM should be accompanied by documentation of the behaviour of the
  ``defer_cleanup`` context manager when it is in use.

  * For example, a pool allocator could always immediately return memory to a
    pool even when the context manager is in use, but could choose not to free
    empty pools until ``defer_cleanup`` is no longer in use - a sketch of this
    policy follows this list.
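
For illustration, the following sketch shows how a hypothetical pool-based EMM
plugin might implement such a policy. The class and the
``_release_empty_blocks`` helper are assumptions made purely for this example
and are not part of any real library:

.. code-block:: python

   from contextlib import contextmanager


   class PoolPluginSketch:  # a real plugin would inherit BaseCUDAMemoryManager
       """Hypothetical pool-based plugin; only the cleanup policy is shown."""

       def __init__(self):
           # Nesting depth of currently-active defer_cleanup contexts.
           self._defer_depth = 0

       @contextmanager
       def defer_cleanup(self):
           # Freed allocations still go straight back to the pool (via their
           # finalizers); only the release of empty pool blocks back to the
           # driver is deferred whilst any defer_cleanup context is active.
           self._defer_depth += 1
           try:
               yield
           finally:
               self._defer_depth -= 1
               if self._defer_depth == 0:
                   self._release_empty_blocks()

       def _release_empty_blocks(self):
           """Hypothetical helper: hand unused pool blocks back to the driver."""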

Management of other objects
^^^^^^^^^^^^^^^^^^^^^^^^^^^

In addition to memory, Numba manages the allocation and deallocation of
:ref:`events <events>`, :ref:`streams <streams>`, and modules (a module is a
compiled object, which is generated from ``@cuda.jit``\ -ted functions). The
management of streams, events, and modules should be unchanged by the presence
or absence of an EMM.

Asynchronous allocation / deallocation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

An asynchronous memory manager might provide the facility for an allocation or
free to take a CUDA stream and execute asynchronously. For freeing, this is
unlikely to cause issues since it operates at a layer beneath Python, but for
allocations this could be problematic if the user tries to then launch a kernel
on the default stream from this asynchronous memory allocation.

The interface described in this proposal will not be required to support
asynchronous allocation and deallocation, and as such these use cases will not
be considered further. However, nothing in this proposal should preclude the
straightforward addition of asynchronous operations in future versions of the
interface.

Non-requirements
^^^^^^^^^^^^^^^^

In order to minimise complexity and constrain this proposal to a reasonable
scope, the following will not be supported:


* Using different memory manager implementations for different contexts. All
  contexts will use the same memory manager implementation - either the Numba
  internal implementation or an external implementation.
* Changing the memory manager once execution has begun. It is not practical to
  change the memory manager and retain all allocations. Cleaning up the entire
  state and then changing to a different memory allocator (rather than starting
  a new process) appears to be a rather niche use case.
* Any changes to the ``__cuda_array_interface__`` to further define its semantics,
  e.g. for acquiring / releasing memory as discussed in `Numba Issue
  #4886 <https://github.com/numba/numba/issues/4886>`_ - these are independent,
  and can be addressed as part of separate proposals.
* Managed memory / UVM is not supported. At present Numba does not support UVM -
  see `Numba Issue #4362 <https://github.com/numba/numba/issues/4362>`_ for
  discussion of support.

Interface for Plugin developers
-------------------------------

New classes and functions will be added to ``numba.cuda.cudadrv.driver``:

* ``BaseCUDAMemoryManager`` and ``HostOnlyCUDAMemoryManager``\ : base classes for
  EMM plugin implementations.
* ``set_memory_manager``: a function for registering an external memory manager
  with Numba.

These will be exposed through the public API, in the ``numba.cuda`` module.
Additionally, some classes that are already part of the ``driver`` module will be
exposed as part of the public API:

* ``MemoryPointer``: used to encapsulate information about a pointer to device
  memory.
* ``MappedMemory``: used to hold information about host memory that is mapped into
  the device address space (a subclass of ``MemoryPointer``\ ).
* ``PinnedMemory``: used to hold information about host memory that is pinned (a
  subclass of ``mviewbuf.MemAlloc``\ , a class internal to Numba).

As an alternative to calling the ``set_memory_manager`` function, an environment
variable can be used to set the memory manager. The value of the environment
variable should be the name of a module that exposes the memory manager in its
global scope under the name ``_numba_memory_manager``\ :

.. code-block::

   export NUMBA_CUDA_MEMORY_MANAGER="<module>"

When this variable is set, Numba will automatically use the memory manager from
the specified module. Calls to ``set_memory_manager`` will issue a warning, but
otherwise be ignored.
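
For example, a hypothetical library ``mylib`` could expose its plugin through a
module ``mylib.numba_emm`` similar to the sketch below (the module and class
names are assumptions for illustration only, and it is assumed that the base
class is exposed in ``numba.cuda`` as described above):

.. code-block:: python

   # mylib/numba_emm.py - used via NUMBA_CUDA_MEMORY_MANAGER="mylib.numba_emm"
   from numba import cuda


   class MyLibMemoryManager(cuda.HostOnlyCUDAMemoryManager):
       # Implementations of memalloc, get_memory_info, get_ipc_handle, etc.
       # would be provided here.
       ...


   # Numba looks for this name in the module's global scope:
   _numba_memory_manager = MyLibMemoryManager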

Plugin Base Classes
^^^^^^^^^^^^^^^^^^^

An EMM plugin is implemented by inheriting from the ``BaseCUDAMemoryManager``
class, which is defined as:

.. code-block:: python

   class BaseCUDAMemoryManager(object, metaclass=ABCMeta):
       @abstractmethod
       def memalloc(self, size):
           """
           Allocate on-device memory in the current context. Arguments:

           - `size`: Size of allocation in bytes

           Returns: a `MemoryPointer` to the allocated memory.
           """

       @abstractmethod
       def memhostalloc(self, size, mapped, portable, wc):
           """
           Allocate pinned host memory. Arguments:

           - `size`: Size of the allocation in bytes
           - `mapped`: Whether the allocated memory should be mapped into the CUDA
                       address space.
           - `portable`: Whether the memory will be considered pinned by all
                         contexts, and not just the calling context.
           - `wc`: Whether to allocate the memory as write-combined.

           Returns a `MappedMemory` or `PinnedMemory` instance that owns the
           allocated memory, depending on whether the region was mapped into
           device memory.
           """

       @abstractmethod
       def mempin(self, owner, pointer, size, mapped):
           """
           Pin a region of host memory that is already allocated. Arguments:

           - `owner`: An object owning the memory - e.g. a `DeviceNDArray`.
           - `pointer`: The pointer to the beginning of the region to pin.
           - `size`: The size of the region to pin.
           - `mapped`: Whether the region should also be mapped into device memory.

           Returns a `MappedMemory` or `PinnedMemory` instance that refers to the
           allocated memory, depending on whether the region was mapped into device
           memory.
           """

       @abstractmethod
       def initialize(self):
           """
           Perform any initialization required for the EMM plugin to be ready to
           use.
           """

       @abstractmethod
       def get_memory_info(self):
           """
           Returns (free, total) memory in bytes in the context
           """

       @abstractmethod
       def get_ipc_handle(self, memory):
           """
           Return an `IpcHandle` from a GPU allocation. Arguments:

           - `memory`: A `MemoryPointer` for which the IPC handle should be created.
           """

       @abstractmethod
       def reset(self):
           """
           Clear up all memory allocated in this context.
           """

       @abstractmethod
       def defer_cleanup(self):
           """
           Returns a context manager that ensures the implementation of deferred
           cleanup whilst it is active.
           """

       @property
       @abstractmethod
       def interface_version(self):
           """
           Returns an integer specifying the version of the EMM Plugin interface
           supported by the plugin implementation. Should always return 1 for
           implementations described in this proposal.
           """

All of the methods of an EMM plugin are called from within Numba - they never
need to be invoked directly by a Numba user.

The ``initialize`` method is called by Numba prior to any memory allocations
being requested. This gives the EMM an opportunity to initialize any data
structures, etc., that it needs for its normal operations. The method may be
called multiple times during the lifetime of the program - subsequent calls
should not invalidate or reset the state of the EMM.
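
For example, a plugin that performs one-time setup could guard it as in the
following sketch (the ``_create_pool`` helper is a hypothetical placeholder for
whatever setup the plugin requires):

.. code-block:: python

   class MyPluginSketch:  # a real plugin would inherit BaseCUDAMemoryManager
       def initialize(self):
           # May be called multiple times - only perform setup once, so that
           # later calls do not reset existing state.
           if getattr(self, "_initialized", False):
               return
           self._create_pool()  # hypothetical one-time setup
           self._initialized = True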

The ``memalloc``\ , ``memhostalloc``\ , and ``mempin`` methods are called when Numba
requires an allocation of device or host memory, or pinning of host memory.
Device memory should always be allocated in the current context.

``get_ipc_handle`` is called when an IPC handle for an array is required. Note
that there is no method for closing an IPC handle - this is because the
``IpcHandle`` object constructed by ``get_ipc_handle`` contains a ``close()`` method
as part of its definition in Numba, which closes the handle by calling
``cuIpcCloseMemHandle``. It is expected that this is sufficient for general use
cases, so no facility for customising the closing of IPC handles is provided by
the EMM Plugin interface.

``get_memory_info`` may be called at any time after ``initialize``.

``reset`` is called as part of resetting a context. Numba does not normally call
reset spontaneously, but it may be called at the behest of the user. Calls to
``reset`` may even occur before ``initialize`` is called, so the plugin should be
robust against this occurrence.

``defer_cleanup`` is called when the ``numba.cuda.defer_cleanup`` context manager
is used from user code.

``interface_version`` is called by Numba when the memory manager is set, to
ensure that the version of the interface implemented by the plugin is
compatible with the version of Numba in use.

Representing pointers
^^^^^^^^^^^^^^^^^^^^^

Device Memory
~~~~~~~~~~~~~

The ``MemoryPointer`` class is used to represent a pointer to memory. Whilst there
are various details of its implementation, the only aspect relevant to EMM
plugin development is its initialization. The ``__init__`` method has the
following interface:

.. code-block:: python

   class MemoryPointer:
       def __init__(self, context, pointer, size, owner=None, finalizer=None):


* ``context``\ : The context in which the pointer was allocated.
* ``pointer``\ : A ``ctypes`` pointer (e.g. ``ctypes.c_uint64``\ ) holding the address of
  the memory.
* ``size``\ : The size of the allocation in bytes.
* ``owner``\ : The owner is sometimes set by the internals of the class, or used for
  Numba's internal memory management, but need not be provided by the writer of
  an EMM plugin - the default of ``None`` should always suffice.
* ``finalizer``\ : A method that is called when the last reference to the
  ``MemoryPointer`` object is released. Usually this will make a call to the
  external memory management library to inform it that the memory is no longer
  required, and that it could potentially be freed (though the EMM is not
  required to free it immediately). A sketch of a ``memalloc`` implementation
  that constructs a ``MemoryPointer`` follows below.
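
The following sketch shows how a plugin's ``memalloc`` implementation might
construct a ``MemoryPointer``. The ``my_pool`` module and its ``allocate`` and
``release`` functions are hypothetical stand-ins for whatever library the
plugin wraps, and it is assumed that the relevant classes are importable from
``numba.cuda`` as described above:

.. code-block:: python

   import ctypes

   from numba import cuda
   from numba.cuda import HostOnlyCUDAMemoryManager, MemoryPointer

   import my_pool  # hypothetical allocation library, for illustration only


   class MyPoolPlugin(HostOnlyCUDAMemoryManager):
       # Other required methods (get_memory_info, etc.) are elided from this
       # sketch.

       def memalloc(self, size):
           # Ask the external library for device memory in the current context.
           addr = my_pool.allocate(size)  # hypothetical; returns an int address
           ptr = ctypes.c_uint64(int(addr))

           def finalizer():
               # Runs when the last reference to the MemoryPointer is released;
               # the external library may defer the underlying cuMemFree.
               my_pool.release(addr)

           ctx = cuda.current_context()
           return MemoryPointer(ctx, ptr, size, finalizer=finalizer)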

Host Memory
~~~~~~~~~~~

Memory mapped into the CUDA address space (which is created when the
``memhostalloc`` or ``mempin`` methods are called with ``mapped=True``\ ) is managed
using the ``MappedMemory`` class:

.. code-block:: python

   class MappedMemory(AutoFreePointer):
       def __init__(self, context, pointer, size, owner, finalizer=None):


* ``context``\ : The context in which the pointer was allocated.
* ``pointer``\ : A ``ctypes`` pointer (e.g. ``ctypes.c_void_p``\ ) holding the address of
  the allocated memory.
* ``size``\ : The size of the allocated memory in bytes.
* ``owner``\ : A Python object that owns the memory, e.g. a ``DeviceNDArray``
  instance.
* ``finalizer``\ : A method that is called when the last reference to the
  ``MappedMemory`` object is released. For example, this method could call
  ``cuMemFreeHost`` on the pointer to deallocate the memory immediately.

Note that the inheritance from ``AutoFreePointer`` is an implementation detail and
need not concern the developer of an EMM plugin - ``MemoryPointer`` is higher in
the MRO of ``MappedMemory``.

Memory that is only in the host address space and has been pinned is represented
with the ``PinnedMemory`` class:

.. code-block:: python

   class PinnedMemory(mviewbuf.MemAlloc):
       def __init__(self, context, pointer, size, owner, finalizer=None):


* ``context``\ : The context in which the pointer was allocated.
* ``pointer``\ : A ``ctypes`` pointer (e.g. ``ctypes.c_void_p``\ ) holding the address of
  the pinned memory.
* ``size``\ : The size of the pinned region in bytes.
* ``owner``\ : A Python object that owns the memory, e.g. a ``DeviceNDArray``
  instance.
* ``finalizer``\ : A method that is called when the last reference to the
  ``PinnedMemory`` object is released. This method could e.g. call
  ``cuMemHostUnregister`` on the pointer to unpin the memory immediately. A
  sketch of a ``memhostalloc`` implementation that constructs these classes
  follows below.
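
For a plugin that also manages host memory, a ``memhostalloc`` implementation
might construct one of these classes depending on whether the region is to be
mapped, as in the sketch below. The ``my_host_lib`` module and its functions
are hypothetical placeholders, and it is assumed that ``MappedMemory`` and
``PinnedMemory`` are importable from ``numba.cuda`` as described above:

.. code-block:: python

   import ctypes

   from numba import cuda
   from numba.cuda import MappedMemory, PinnedMemory

   import my_host_lib  # hypothetical host-memory library, illustration only


   class MyHostPluginSketch:  # would inherit BaseCUDAMemoryManager
       def memhostalloc(self, size, mapped, portable, wc):
           # Hypothetical call into the wrapped library to allocate (and
           # optionally map) pinned host memory:
           addr = my_host_lib.host_alloc(size, mapped, portable, wc)
           ptr = ctypes.c_void_p(int(addr))
           ctx = cuda.current_context()

           def finalizer():
               # Called when the last reference to the returned object is
               # released - frees / unpins the host memory.
               my_host_lib.host_free(addr)

           if mapped:
               # The region is also visible in the device address space.
               return MappedMemory(ctx, ptr, size, owner=None,
                                   finalizer=finalizer)
           else:
               # Pinned host memory only.
               return PinnedMemory(ctx, ptr, size, owner=None,
                                   finalizer=finalizer)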

Providing device memory management only
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Some external memory managers will support management of on-device memory but
not host memory. To make it easy to implement an EMM plugin using one of these
managers, Numba will provide a memory manager class with implementations of the
``memhostalloc`` and ``mempin`` methods. An abridged definition of this class
follows:

.. code-block:: python

   class HostOnlyCUDAMemoryManager(BaseCUDAMemoryManager):
       # Unimplemented methods:
       #
       # - memalloc
       # - get_memory_info

       def memhostalloc(self, size, mapped, portable, wc):
           # Implemented.

       def mempin(self, owner, pointer, size, mapped):
           # Implemented.

       def initialize(self):
           # Implemented.
           #
           # Must be called by any subclass when its initialize() method is
           # called.

       def reset(self):
           # Implemented.
           #
           # Must be called by any subclass when its reset() method is
           # called.

       def defer_cleanup(self):
           # Implemented.
           #
           # Must be called by any subclass when its defer_cleanup() method is
           # called.

A class can subclass ``HostOnlyCUDAMemoryManager`` and then only needs to add
implementations of methods for managing on-device memory. Any subclass must
observe the following rules (a sketch illustrating them follows this list):


* If the subclass implements ``__init__``\ , then it must also call
  ``HostOnlyCUDAMemoryManager.__init__``\ , as this is used to initialize some of
  its data structures (\ ``self.allocations`` and ``self.deallocations``\ ).
* The subclass must implement ``memalloc`` and ``get_memory_info``.
* The ``initialize`` and ``reset`` methods perform initialisation of structures
  used by the ``HostOnlyCUDAMemoryManager``.

  * If the subclass has nothing to do on initialisation (possibly) or reset
    (unlikely) then it need not implement these methods. 
  * However, if it does implement these methods then it must also call the
    methods from ``HostOnlyCUDAMemoryManager`` in its own implementations.

* Similarly, if ``defer_cleanup`` is implemented, it should enter the context
  provided by ``HostOnlyCUDAMemoryManager.defer_cleanup()`` prior to ``yield``\ ing
  (or in the ``__enter__`` method) and release it prior to exiting (or in the
  ``__exit__`` method).
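
The following sketch illustrates these rules for a hypothetical subclass; the
device memory handling itself is elided, since it depends entirely on the
library being wrapped:

.. code-block:: python

   from contextlib import contextmanager

   from numba.cuda import HostOnlyCUDAMemoryManager


   class MyDeviceOnlyPlugin(HostOnlyCUDAMemoryManager):
       def __init__(self, *args, **kwargs):
           # Required so that self.allocations / self.deallocations are set up.
           super().__init__(*args, **kwargs)

       def memalloc(self, size):
           # Allocate device memory using the wrapped library (elided).
           ...

       def get_memory_info(self):
           # Return (free, total) memory in bytes from the wrapped library
           # (elided).
           ...

       def initialize(self):
           # Must call the superclass method so that its structures are
           # initialised; any plugin-specific setup could also happen here.
           super().initialize()

       def reset(self):
           # Must also call the superclass method.
           super().reset()

       @contextmanager
       def defer_cleanup(self):
           # Enter the superclass context prior to yielding, and release it on
           # exit.
           with super().defer_cleanup():
               yield

       @property
       def interface_version(self):
           return 1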

Import order
^^^^^^^^^^^^

The order in which Numba and the library implementing an EMM Plugin are imported
should not matter. For example, if ``rmm`` were to implement and register an EMM
Plugin, then:

.. code-block:: python

   from numba import cuda
   import rmm

and

.. code-block:: python

   import rmm
   from numba import cuda

are equivalent - this is because Numba does not initialize CUDA or allocate any
memory until the first call to a CUDA function - neither instantiating and
registering an EMM plugin, nor importing ``numba.cuda`` causes a call to a CUDA
function.

Numba as a Dependency
^^^^^^^^^^^^^^^^^^^^^

Adding the implementation of an EMM Plugin to a library naturally makes Numba a
dependency of the library where it may not have been previously. In order to
keep the dependency optional, if this is desired, the EMM Plugin can be
registered conditionally, for example:

.. code-block:: python

   try:
       import numba
       from mylib.numba_utils import MyNumbaMemoryManager
       numba.cuda.cudadrv.driver.set_memory_manager(MyNumbaMemoryManager)
   except ImportError:
       print("Numba not importable - not registering EMM Plugin")

so that ``mylib.numba_utils``\ , which contains the implementation of the EMM
Plugin, is only imported if Numba is already present. If Numba is not available,
then ``mylib.numba_utils`` (which necessarily imports ``numba``\ ), will never be
imported.

It is recommended that any library providing an EMM Plugin is tested in at least
one environment with Numba installed (so that the EMM Plugin is exercised) and
one environment without Numba, to avoid introducing an accidental Numba
dependency.

Example implementation - A RAPIDS Memory Manager (RMM) Plugin
-------------------------------------------------------------

An implementation of an EMM plugin within the `Rapids Memory Manager
(RMM) <https://github.com/rapidsai/rmm>`_ is sketched out in this section. This is
intended to show an overview of the implementation in order to support the
descriptions above and to illustrate how the plugin interface can be used -
different choices may be made for a production-ready implementation.

The plugin implementation consists of additions to `python/rmm/rmm.py
<https://github.com/rapidsai/rmm/blob/d5831ac5ebb5408ee83f63b7c7d03d8870ecb361/python/rmm/rmm.py>`_:

.. code-block:: python

   # New imports:
   from contextlib import contextmanager
   # RMM already has Numba as a dependency, so these imports need not be guarded
   # by a check for the presence of numba.
   from numba.cuda import (HostOnlyCUDAMemoryManager, MemoryPointer, IpcHandle,
                           set_memory_manager)


   # New class implementing the EMM Plugin:
   class RMMNumbaManager(HostOnlyCUDAMemoryManager):
       def memalloc(self, size):
           # Allocates device memory using RMM functions. The finalizer for the
           # allocated memory calls back to RMM to free the memory.
           addr = librmm.rmm_alloc(size, 0)
           ctx = cuda.current_context()
           ptr = ctypes.c_uint64(int(addr))
           finalizer = _make_finalizer(addr, 0)  # free on the default stream
           return MemoryPointer(ctx, ptr, size, finalizer=finalizer)

       def get_ipc_handle(self, memory):
           """ 
           Get an IPC handle for the memory with offset modified by the RMM memory
           pool.
           """
           # This implementation provides a functional implementation and illustrates
           # what get_ipc_handle needs to do, but it is not a very "clean"
           # implementation, and it relies on borrowing bits of Numba internals to
           # initialise ipchandle. 
           #
           # A more polished implementation might make use of additional functions in
           # the RMM C++ layer for initialising IPC handles, and not use any Numba
           # internals.
           ipchandle = (ctypes.c_byte * 64)()  # IPC handle is 64 bytes
           cuda.cudadrv.memory.driver_funcs.cuIpcGetMemHandle(
               ctypes.byref(ipchandle),
               memory.owner.handle,
           )
           source_info = cuda.current_context().device.get_device_identity()
           ptr = memory.device_ctypes_pointer.value
           offset = librmm.rmm_getallocationoffset(ptr, 0)
           return IpcHandle(memory, ipchandle, memory.size, source_info,
                            offset=offset)

       def get_memory_info(self):
           # Returns a tuple of (free, total) using RMM functionality.
           return get_info() # Function defined in rmm.py

       def initialize(self):
           # Nothing required to initialize RMM here, but this method is added
           # to illustrate that the super() method should also be called.
           super().initialize()

       @contextmanager
       def defer_cleanup(self):
           # Does nothing to defer cleanup - a full implementation may choose to
           # implement a different policy.
           with super().defer_cleanup():
               yield

       @property
       def interface_version(self):
           # As required by the specification
           return 1

   # The existing _make_finalizer function is used by RMMNumbaManager:
   def _make_finalizer(handle, stream):
       """
       Factory to make the finalizer function.
       We need to bind *handle* and *stream* into the actual finalizer, which
       takes no args.
       """

       def finalizer():
           """
           Invoked when the MemoryPointer is freed
           """
           librmm.rmm_free(handle, stream)

       return finalizer

   # Utility function to register `RMMNumbaManager` as an EMM:
   def use_rmm_for_numba():
       set_memory_manager(RMMNumbaManager)

   # To support `NUMBA_CUDA_MEMORY_MANAGER=rmm`:
   _numba_memory_manager = RMMNumbaManager

Example usage
^^^^^^^^^^^^^

A simple example that configures Numba to use RMM for memory management and
creates a device array is as follows:

.. code-block:: python

   # example.py
   import rmm 
   import numpy as np

   from numba import cuda

   rmm.use_rmm_for_numba()

   a = np.zeros(10)
   d_a = cuda.to_device(a)
   del(d_a)
   print(rmm.csv_log())

Running this should result in output similar to the following:

.. code-block::

   Event Type,Device ID,Address,Stream,Size (bytes),Free Memory,Total Memory,Current Allocs,Start,End,Elapsed,Location
   Alloc,0,0x7fae06600000,0,80,0,0,1,1.10549,1.1074,0.00191666,<path>/numba/numba/cuda/cudadrv/driver.py:683
   Free,0,0x7fae06600000,0,0,0,0,0,1.10798,1.10921,0.00122238,<path>/numba/numba/utils.py:678

Note that there is some scope for improvement in RMM for detecting the line
number at which the allocation / free occurred, but this is outside the scope of
the example in this proposal.

Setting the memory manager through the environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Rather than calling ``rmm.use_rmm_for_numba()`` in the example above, the memory
manager could also be set globally to RMM with an environment variable, so that
the Python interpreter would be invoked to run the example as:

.. code-block::

   NUMBA_CUDA_MEMORY_MANAGER="rmm.RMMNumbaManager" python example.py

Numba internal changes
----------------------

This section is intended primarily for Numba developers - those with an interest
in the external interface for implementing EMM plugins may choose to skip over
this section.

Current model / implementation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

At present, memory management is implemented in the
:class:`~numba.cuda.cudadrv.driver.Context` class. It maintains lists of
allocations and deallocations:

* ``allocations`` is a ``numba.core.utils.UniqueDict``, created at context
  creation time.
* ``deallocations`` is an instance of the ``_PendingDeallocs`` class, and is created
  when ``Context.prepare_for_use()`` is called.

These are used to track allocations and deallocations of:

* Device memory
* Pinned memory
* Mapped memory
* Streams
* Events
* Modules

The ``_PendingDeallocs`` class implements the deferred deallocation strategy -
cleanup functions (such as ``cuMemFree``\ ) for the items above are added to its
list of pending deallocations by the finalizers of objects representing
allocations. These finalizers are run when the objects owning them are
garbage-collected by the Python interpreter. When the addition of a new
cleanup function to the deallocation list causes the number or size of pending
deallocations to exceed a configured ratio, the ``_PendingDeallocs`` object runs
deallocators for all items it knows about and then clears its internal pending
list.

See :ref:`deallocation-behavior` for more details of this implementation.
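
A much-simplified sketch of this idea (not Numba's actual ``_PendingDeallocs``
code - the ratio-based trigger is reduced to a simple count threshold here) is:

.. code-block:: python

   class PendingDeallocsSketch:
       """Illustrative only: batch up cleanup functions until a threshold."""

       def __init__(self, max_pending=10):
           self._max_pending = max_pending
           self._pending = []

       def add_item(self, dtor, handle):
           # Called from a finalizer: record the cleanup function for later.
           self._pending.append((dtor, handle))
           if len(self._pending) >= self._max_pending:
               self.clear()

       def clear(self):
           # Run all pending cleanup functions (e.g. cuMemFree) and empty the
           # pending list.
           for dtor, handle in self._pending:
               dtor(handle)
           self._pending.clear()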

Proposed changes
^^^^^^^^^^^^^^^^

This section outlines the major changes that will be made to support the EMM
plugin interface - there will be various small changes to other parts of Numba
that will be required in order to adapt to these changes; an exhaustive list of
these is not provided.

Context changes
~~~~~~~~~~~~~~~

The ``numba.cuda.cudadrv.driver.Context`` class will no longer directly allocate
and free memory. Instead, the context will hold a reference to a memory manager
instance, and its memory allocation methods will call into the memory manager,
e.g.:

.. code-block:: python

       def memalloc(self, size):
           return self.memory_manager.memalloc(size)

       def memhostalloc(self, size, mapped=False, portable=False, wc=False):
           return self.memory_manager.memhostalloc(size, mapped, portable, wc)

       def mempin(self, owner, pointer, size, mapped=False):
           if mapped and not self.device.CAN_MAP_HOST_MEMORY:
               raise CudaDriverError("%s cannot map host memory" % self.device)
           return self.memory_manager.mempin(owner, pointer, size, mapped)

       def prepare_for_use(self):
           self.memory_manager.initialize()

       def get_memory_info(self):
           return self.memory_manager.get_memory_info()

       def get_ipc_handle(self, memory):
           return self.memory_manager.get_ipc_handle(memory)

       def reset(self):
           # ... Already-extant reset logic, plus:
           self.memory_manager.reset()

The ``memory_manager`` member is initialised when the context is created.

The ``memunpin`` method (not shown above, but currently present in the ``Context``
class) has never been implemented - it presently raises a ``NotImplementedError``.
This method is arguably unneeded - pinned memory is immediately unpinned by its
finalizer, and unpinning before a finalizer runs would invalidate the state of
``PinnedMemory`` objects for which references are still held. It is proposed that
this method is removed when making the other changes to the ``Context`` class.

The ``Context`` class will still instantiate ``self.allocations`` and
``self.deallocations`` as before - these will still be used by the context to
manage the allocations and deallocations of events, streams, and modules, which
are not handled by the EMM plugin.

New components of the ``driver`` module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


* ``BaseCUDAMemoryManager``\ : An abstract class, as defined in the plugin interface
  above.
* ``HostOnlyCUDAMemoryManager``\ : A subclass of ``BaseCUDAMemoryManager``\ , with the
  logic from ``Context.memhostalloc`` and ``Context.mempin`` moved into it. This
  class will also create its own ``allocations`` and ``deallocations`` members,
  similarly to how the ``Context`` class creates them. These are used to manage
  the allocations and deallocations of pinned and mapped host memory.
* ``NumbaCUDAMemoryManager``\ : A subclass of ``HostOnlyCUDAMemoryManager``\ , which
  also contains an implementation of ``memalloc`` based on that presently existing
  in the ``Context`` class. This is the default memory manager, and its use
  preserves the behaviour of Numba prior to the addition of the EMM plugin
  interface - that is, all memory allocation and deallocation for Numba arrays
  is handled within Numba.

  * This class shares the ``allocations`` and ``deallocations`` members with its
    parent class ``HostOnlyCUDAMemoryManager``\ , and it uses these for the
    management of device memory that it allocates.

* The ``set_memory_manager`` function, which sets a global pointing to the memory
  manager class. This global initially holds ``NumbaCUDAMemoryManager`` (the
  default). A sketch of this mechanism follows below.
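
A minimal sketch of this mechanism follows; the global's name and the docstring
are assumptions for illustration rather than the exact implementation:

.. code-block:: python

   # Sketch, inside numba.cuda.cudadrv.driver:

   _memory_manager = None  # set to NumbaCUDAMemoryManager (the default) at import


   def set_memory_manager(mm_plugin):
       """Configure Numba to use ``mm_plugin`` as the memory manager class.

       Newly-created contexts instantiate whatever class this global refers to.
       """
       global _memory_manager
       _memory_manager = mm_plugin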

Staged IPC
~~~~~~~~~~

Staged IPC should not take ownership of the memory that it allocates. When the
default internal memory manager is in use, the memory allocated for the staging
array is already owned. When an EMM plugin is in use, it is not legitimate to
take ownership of the memory.

This change can be made by applying the following small patch, which has been
tested to have no effect on the CUDA test suite:

.. code-block:: diff

   diff --git a/numba/cuda/cudadrv/driver.py b/numba/cuda/cudadrv/driver.py
   index 7832955..f2c1352 100644
   --- a/numba/cuda/cudadrv/driver.py
   +++ b/numba/cuda/cudadrv/driver.py
   @@ -922,7 +922,11 @@ class _StagedIpcImpl(object):
            with cuda.gpus[srcdev.id]:
                impl.close()

   -        return newmem.own()
   +        return newmem

Testing
~~~~~~~

Alongside the addition of appropriate tests for new functionality, there will be
some refactoring of existing tests required, but these changes are not
substantial. Tests of the deallocation strategy (e.g. ``TestDeallocation``\ ,
``TestDeferCleanup``\ ) will need to be modified to ensure that they are
examining the correct set of deallocations. When an EMM plugin is in use, they
will need to be skipped.

Prototyping / experimental implementation
-----------------------------------------

Some prototype / experimental implementations have been produced to guide the
designs presented in this document. The current implementations can be found in:


* Numba branch: https://github.com/gmarkall/numba/tree/grm-numba-nbep-7.
* RMM branch: https://github.com/gmarkall/rmm/tree/grm-numba-nbep-7.
* CuPy implementation:
  https://github.com/gmarkall/nbep-7/blob/master/nbep7/cupy_mempool.py - uses
  an unmodified CuPy.

  * See `CuPy memory management
    docs <https://docs-cupy.chainer.org/en/stable/reference/memory.html>`_.

Current implementation status
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

RMM Plugin
~~~~~~~~~~

As a minimal example, a simple allocation and free using RMM works as expected,
using the following example code (similar to the example usage shown above):

.. code-block:: python

   import rmm
   import numpy as np

   from numba import cuda

   rmm.use_rmm_for_numba()

   a = np.zeros(10)
   d_a = cuda.to_device(a)
   del(d_a)
   print(rmm.csv_log())

We see the following output:

.. code-block::

   Event Type,Device ID,Address,Stream,Size (bytes),Free Memory,Total Memory,Current Allocs,Start,End,Elapsed,Location
   Alloc,0,0x7f96c7400000,0,80,0,0,1,1.13396,1.13576,0.00180059,<path>/numba/numba/cuda/cudadrv/driver.py:686
   Free,0,0x7f96c7400000,0,0,0,0,0,1.13628,1.13723,0.000956004,<path>/numba/numba/utils.py:678

This output is similar to the expected output from the example usage presented
above (though note that the pointer addresses and timestamps vary compared to
the example), and provides some validation of the example use case.

CuPy Plugin
~~~~~~~~~~~

.. code-block:: python

   from nbep7.cupy_mempool import use_cupy_mm_for_numba
   import numpy as np

   from numba import cuda

   use_cupy_mm_for_numba()

   a = np.zeros(10)
   d_a = cuda.to_device(a)
   del(d_a)

The prototype CuPy plugin has somewhat primitive logging, so we see the output:

.. code-block::

   Allocated 80 bytes at 7f004d400000
   Freeing 80 bytes at 7f004d400000

Numba CUDA Unit tests
^^^^^^^^^^^^^^^^^^^^^

As well as providing correct execution of a simple example, all relevant Numba
CUDA unit tests also pass with the prototype branch, for both the internal memory
manager and the RMM EMM Plugin.

RMM
~~~

The unit test suite can be run with the RMM EMM Plugin with:

.. code-block::

   NUMBA_CUDA_MEMORY_MANAGER=rmm python -m numba.runtests numba.cuda.tests

A summary of the unit test suite output is:

.. code-block::

   Ran 564 tests in 142.211s

   OK (skipped=11)

When running with the built-in Numba memory management, the output is:

.. code-block::

   Ran 564 tests in 133.396s

   OK (skipped=5)

i.e. the changes for using an external memory manager do not break the built-in
Numba memory management. The additional 6 skipped tests are from:


* ``TestDeallocation``\ : skipped as it specifically tests Numba's internal
  deallocation strategy.
* ``TestDeferCleanup``\ : skipped as it specifically tests Numba's implementation of
  deferred cleanup.
* ``TestCudaArrayInterface.test_ownership``\ : skipped as Numba does not own memory
  when an EMM Plugin is used, but ownership is assumed by this test case.

CuPy
~~~~

The test suite can be run with the CuPy plugin using:

.. code-block::

   NUMBA_CUDA_MEMORY_MANAGER=nbep7.cupy_mempool python -m numba.runtests numba.cuda.tests

This plugin implementation is presently more primitive than the RMM
implementation, and results in some errors with the unit test suite:

.. code-block::

   Ran 564 tests in 111.699s

   FAILED (errors=8, skipped=11)

The 8 errors are due to a lack of implementation of ``get_ipc_handle`` in the
CuPy EMM Plugin implementation. It is expected that this implementation will be
re-visited and completed so that CuPy can be used stably as an allocator for
Numba in the future.