"# Build a local ray cluster. The head node and worker node are on this machine\n",
"ray.init()"
]
},
{
"cell_type": "markdown",
"id": "a127e4e4",
"metadata": {},
"source": [
"Implement an Accumulator class."
]
},
{
"cell_type": "code",
"execution_count": 147,
"id": "20e7b9a3",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"@ray.remote\n",
"class Accumulator:\n",
" def __init__(self):\n",
" self.value = 0\n",
"\n",
" def add(self, x):\n",
" self.value += x\n",
"\n",
" def get_value(self):\n",
" return self.value"
]
},
{
"cell_type": "code",
"execution_count": 148,
"id": "3b80098c",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Instantiate an accumulator. Accumulator can be viewed as a process, acting as an RPC service.\n",
"accumulator = Accumulator.remote()"
]
},
{
"cell_type": "code",
"execution_count": 149,
"id": "b14b1009",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0\n"
]
}
],
"source": [
"value_ref = accumulator.get_value.remote() # Check the current value. Note that this function returns immediately and does not actually wait for the remote execution to complete.\n",
"# Get the value\n",
"value = ray.get(value_ref)\n",
"print(value)"
]
},
{
"cell_type": "code",
"execution_count": 150,
"id": "513a84b3",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"10\n"
]
}
],
"source": [
"# Accumulate, then check the result.\n",
"accumulator.add.remote(10) # Similarly, the 'add' here will return immediately.\n",
"## Chapter 2: Resource Pool and RayWorkerGroup\n",
"In the previous example, it was a simple single-process worker. \n",
"In this example, we implement a worker with a GPU and form a RayWorkerGroup. Within this RayWorkerGroup, we implement a simple operation of an accumulator."
"The principle of parameter passing: The input parameter is a list of length world_size, where each element in the list is dispatched respectively to each worker in the RayWorkerGroup. \n",
"The return parameter is also a list, corresponding to the return value of each worker."
]
},
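{
"cell_type": "markdown",
"id": "b7e1a2c0",
"metadata": {},
"source": [
"To make this dispatch/collect principle concrete before introducing the RayWorkerGroup, the next cell is a minimal plain-Ray sketch (an illustration only, not the RayWorkerGroup implementation): a list with one element per worker is dispatched element-wise to a group of actors, and the results are collected back into a list. The `SketchWorker` actor is a hypothetical stand-in."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7e1a2c1",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Illustration only: mimic the dispatch/collect principle with plain Ray actors.\n",
"@ray.remote\n",
"class SketchWorker:  # hypothetical actor, not part of this tutorial's worker classes\n",
"    def __init__(self, rank):\n",
"        self.rank = rank\n",
"        self.value = 0\n",
"\n",
"    def add(self, x):\n",
"        self.value += x\n",
"        return self.value\n",
"\n",
"world_size = 4\n",
"sketch_workers = [SketchWorker.remote(rank) for rank in range(world_size)]\n",
"\n",
"# Dispatch: the i-th element of the input list goes to the i-th worker.\n",
"xs = [1, 2, 3, 4]\n",
"refs = [worker.add.remote(x) for worker, x in zip(sketch_workers, xs)]\n",
"\n",
"# Collect: the result is a list with one entry per worker.\n",
"print(ray.get(refs))"
]
},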
{
"cell_type": "markdown",
"id": "d25c2412",
"metadata": {},
"source": [
"### GPU Resource Sharing"
]
},
{
"cell_type": "markdown",
"id": "f74f6d24",
"metadata": {},
"source": [
"RayWorkerGroups mapped to the same resource pool share the GPU. In this example, we implement three resource pools: the first occupies 4 GPUs, the second also occupies 4 GPUs, and the last occupies all 8 GPUs. Among them, the first resource pool reuses the resource pool mentioned above."
]
},
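{
"cell_type": "markdown",
"id": "c3d4e5f6",
"metadata": {},
"source": [
"As a plain-Ray illustration of GPU sharing (this is Ray's underlying fractional-GPU mechanism, not the resource-pool API itself), the next cell schedules two actors that each request `num_gpus=0.5`, so Ray can place them on the same physical GPU. The `HalfGpuSketchWorker` name is hypothetical, and the cell requires at least one GPU in the cluster."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c3d4e5f7",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Illustration only: two Ray actors sharing a GPU via fractional num_gpus requests.\n",
"@ray.remote(num_gpus=0.5)\n",
"class HalfGpuSketchWorker:  # hypothetical actor for illustration\n",
"    def gpu_ids(self):\n",
"        # Return the GPU ids that Ray assigned to this actor.\n",
"        return ray.get_gpu_ids()\n",
"\n",
"first, second = HalfGpuSketchWorker.remote(), HalfGpuSketchWorker.remote()\n",
"# With fractional requests, Ray can pack both actors onto the same physical GPU.\n",
"print(ray.get([first.gpu_ids.remote(), second.gpu_ids.remote()]))"
]
},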
{
"cell_type": "code",
"execution_count": 155,
"id": "49f9c06f",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Create a new resource pool and then merge the newly created resource pool with the previous one.\n",
"## Chapter 3: Data Dispatch, Execution and Collection"
]
},
{
"cell_type": "markdown",
"id": "acb22d9d",
"metadata": {},
"source": [
"In the above example, we used the `execute_all_sync` function in the RayWorkerGroup to dispatch data from the driver to each worker. This is very inconvenient for coding. \n",
"In this chapter, we use the form of function decorators to allow RayWorkerGroup to directly call functions written in the Worker, and to greatly simplify parameter passing."
"# As we can see, 10 is automatically dispatched to each Worker in this RayWorkerGroup.\n",
"print(gpu_accumulator_decorator.add(x=10))"
]
},
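{
"cell_type": "markdown",
"id": "d5e6f7a8",
"metadata": {},
"source": [
"The cell below is a conceptual, pure-Python sketch of how such a decorator can work; it is not verl's actual implementation. A `register`-style decorator attaches a `dispatch_fn` and a `collect_fn` to a worker method, and a small driver-side helper uses that metadata to fan arguments out to the workers and gather the results. The names `register`, `one_to_all`, `collect_to_list`, `call_on_group`, and `SketchDecoratedWorker` are illustrative stand-ins."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d5e6f7a9",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Conceptual sketch only -- not verl's implementation of its decorators.\n",
"def register(dispatch_fn, collect_fn):\n",
"    # Attach dispatch/collect metadata to a worker method.\n",
"    def decorator(func):\n",
"        func._dispatch_fn = dispatch_fn\n",
"        func._collect_fn = collect_fn\n",
"        return func\n",
"    return decorator\n",
"\n",
"def one_to_all(world_size, x):\n",
"    # Replicate the same argument to every worker.\n",
"    return [x] * world_size\n",
"\n",
"def collect_to_list(outputs):\n",
"    # Return one entry per worker.\n",
"    return outputs\n",
"\n",
"class SketchDecoratedWorker:  # hypothetical single-process worker for illustration\n",
"    def __init__(self, rank):\n",
"        self.rank = rank\n",
"        self.value = 0\n",
"\n",
"    @register(dispatch_fn=one_to_all, collect_fn=collect_to_list)\n",
"    def add(self, x):\n",
"        self.value += x\n",
"        return self.value\n",
"\n",
"def call_on_group(workers, method_name, x):\n",
"    # Driver-side helper: dispatch the argument, execute on every worker, collect.\n",
"    method = getattr(type(workers[0]), method_name)\n",
"    per_worker_args = method._dispatch_fn(len(workers), x)\n",
"    outputs = [getattr(w, method_name)(arg) for w, arg in zip(workers, per_worker_args)]\n",
"    return method._collect_fn(outputs)\n",
"\n",
"decorated_workers = [SketchDecoratedWorker(rank) for rank in range(4)]\n",
"print(call_on_group(decorated_workers, 'add', 10))  # [10, 10, 10, 10]"
]
},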
{
"cell_type": "markdown",
"id": "540ee6ad",
"metadata": {},
"source": [
"### Custom Dispatch, Collection\n",
"Users can customize `dispatch` and `collection` function. You only need to write the `dispatch_fn` and `collect_fn` functions yourself. We also support executing RPC only on rank_zero, with specific examples provided below."
"Due to the Ray issue, we can only support max_colocate_count=1 in RayResourcePool for now. \n",
"This means that each GPU can only have one process.\n",
"We can support max_colocate > 1 when applying this pull request: https://github.com/ray-project/ray/pull/44385"
]
},
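{
"cell_type": "markdown",
"id": "e7f8a9b0",
"metadata": {},
"source": [
"As a conceptual continuation of the decorator sketch from Chapter 3 (again illustrative only, not verl's actual custom-dispatch API), the next cell defines a user-written `dispatch_fn`/`collect_fn` pair that effectively runs the update only on rank zero: the other ranks receive a no-op argument, and only rank 0's output is returned. The names `dispatch_rank_zero`, `collect_rank_zero`, and `SketchRankZeroWorker` are hypothetical, and the cell reuses `register` and `call_on_group` from the sketch above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e7f8a9b1",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Conceptual sketch only: custom dispatch_fn/collect_fn on top of the register()\n",
"# and call_on_group() helpers defined in the sketch above (not verl's actual API).\n",
"def dispatch_rank_zero(world_size, x):\n",
"    # Rank 0 receives the real argument; the other ranks receive a no-op value.\n",
"    return [x] + [0] * (world_size - 1)\n",
"\n",
"def collect_rank_zero(outputs):\n",
"    # Only rank 0's output is returned to the driver.\n",
"    return outputs[0]\n",
"\n",
"class SketchRankZeroWorker:  # hypothetical worker for illustration\n",
"    def __init__(self, rank):\n",
"        self.rank = rank\n",
"        self.value = 0\n",
"\n",
"    @register(dispatch_fn=dispatch_rank_zero, collect_fn=collect_rank_zero)\n",
"    def add(self, x):\n",
"        self.value += x\n",
"        return self.value\n",
"\n",
"rank_zero_workers = [SketchRankZeroWorker(rank) for rank in range(4)]\n",
"print(call_on_group(rank_zero_workers, 'add', 10))  # 10: only rank 0's result is collected"
]
},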
{
"cell_type": "markdown",
"id": "92724419",
"metadata": {},
"source": [
"Therefore, we need to restart the ray and initialize a new resource_pool to demonstrate the **NVMegatronRayWorkerGroup**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9b038538",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Build a local ray cluster. The head node and worker node are on this machine\n",
"ray.init()"
]
},
{
"cell_type": "markdown",
"id": "ebfd8798",
"metadata": {},
"source": [
"Finally, we implement a `NVMegatronRayWorkerGroup`, within which we create a Megatron and then run a tensor parallel (tp) split Llama mlp layer. Here, we use a complex dispatch mode, `Megatron_COMPUTE`. This dispatch mode assumes that user passes the data partitioned by DP dimension. The data is dispatched to all tp/pp ranks within the same dp group, and ultimately only collects output data from tp=0 and the last pp. In this way, for users that only write code on the driver, the Megatron behind the RPC becomes transparent."