"vscode:/vscode.git/clone" did not exist on "9fb51534877d16597cfd94c18890d87af0879d65"
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Diffusion LM
---

# Running Diffusion LMs with SGLang

Diffusion language models (Diffusion LMs) are generative models that produce text through a diffusion process. Unlike autoregressive models, which emit one token at a time from left to right, they iteratively refine the generated text over several steps. This guide shows how to deploy a Diffusion LM such as LLaDA2.0 with Dynamo, using SGLang as the backend.

## Launch the Deployment

### Using the Launch Script (Recommended)

The easiest way to start the Diffusion LM service is to use the provided launch script:

```bash
bash examples/backends/sglang/launch/diffusion_llada.sh
```

### Manual Launch Steps

If you prefer to launch components manually:

**Start frontend**
```bash
python -m dynamo.frontend --http-port 8001 &
```

**Run diffusion worker**
```bash
export CUDA_VISIBLE_DEVICES=0,1
python -m dynamo.sglang \
  --model-path inclusionAI/LLaDA2.0-mini-preview \
  --tp-size 2 \
  --skip-tokenizer-init \
  --trust-remote-code \
  --endpoint dyn://dynamo.backend.generate \
  --enable-metrics \
  --disable-cuda-graph \
  --disable-overlap-schedule \
  --attention-backend triton \
  --dllm-algorithm LowConfidence
```

## Diffusion Algorithms

The diffusion worker uses the **LowConfidence** algorithm, selected in the worker command with `--dllm-algorithm LowConfidence`, for iterative refinement. The algorithm targets tokens with low confidence scores, progressively replacing masked tokens with the model's predictions until the confidence thresholds are met.
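
As an illustration of the refinement loop described above, the following Python sketch shows one plausible reading of confidence-based unmasking. It is a toy, not SGLang's implementation: `low_confidence_decode`, `toy_predict`, the `threshold` value, and the fallback rule that accepts the single most confident guess when nothing clears the threshold are all assumptions made for this example.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder for a position that is still masked

# A predictor returns one (token, confidence) guess per position.
Predictor = Callable[[List[Optional[str]]], List[Tuple[str, float]]]


def low_confidence_decode(
    length: int,
    predict: Predictor,
    threshold: float = 0.9,
    max_steps: int = 32,
) -> Tuple[List[Optional[str]], int]:
    """Unmask `length` positions step by step; return (tokens, steps used)."""
    seq: List[Optional[str]] = [MASK] * length
    steps = 0
    while MASK in seq and steps < max_steps:
        guesses = predict(seq)
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        # Accept every masked position whose confidence clears the threshold;
        # low-confidence positions stay masked for a later step.
        accepted = [i for i in masked if guesses[i][1] >= threshold]
        if not accepted:
            # Assumed fallback: always make progress by accepting the best guess.
            accepted = [max(masked, key=lambda i: guesses[i][1])]
        for i in accepted:
            seq[i] = guesses[i][0]
        steps += 1
    return seq, steps


# Hypothetical stand-in for a model: confidence rises as more context is revealed.
TARGET = ["diffusion", "models", "refine", "text"]
BASE_CONFIDENCE = [0.95, 0.45, 0.92, 0.6]


def toy_predict(seq: List[Optional[str]]) -> List[Tuple[str, float]]:
    revealed = sum(tok is not MASK for tok in seq)
    return [
        (TARGET[i], min(1.0, BASE_CONFIDENCE[i] + 0.2 * revealed))
        for i in range(len(seq))
    ]


if __name__ == "__main__":
    tokens, steps = low_confidence_decode(len(TARGET), toy_predict)
    print(steps, " ".join(tokens))
```

With the toy predictor, the first step fills two positions at once and the remaining two follow one per step, so four tokens take three steps rather than four. Raising `threshold` trades more steps for more cautious acceptance.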

For more details on diffusion algorithms and configuration options, refer to the [SGLang Diffusion Language Models documentation](https://github.com/sgl-project/sglang/blob/main/docs/supported_models/text_generation/diffusion_language_models.md).


## Testing the Deployment

Once the frontend and worker are running, test the service with curl:

```bash
curl -X POST http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "inclusionAI/LLaDA2.0-mini-preview",
    "messages": [
      {
        "role": "user",
        "content": "Hello! How are you?"
      }
    ],
    "temperature": 0.7,
    "max_tokens": 512
  }'
```

Or use the completions endpoint:

```bash
curl -X POST http://localhost:8001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "inclusionAI/LLaDA2.0-mini-preview",
    "prompt": "Once upon a time",
    "max_tokens": 256
  }'
```
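
The chat request above can also be sent from Python. The sketch below uses only the standard library. The helper names `build_chat_request` and `chat` are hypothetical, and reading the reply from `choices[0].message.content` assumes the frontend returns the usual OpenAI-style chat completion body, which this guide does not show.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8001"  # matches --http-port 8001 used for the frontend
MODEL = "inclusionAI/LLaDA2.0-mini-preview"


def build_chat_request(
    prompt: str, max_tokens: int = 512, temperature: float = 0.7
) -> urllib.request.Request:
    """Build the same POST request as the chat completions curl example."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the reply text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumption: OpenAI-style response shape.
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Hello! How are you?")` against a running deployment should return the generated reply as a string. Keeping request construction in its own function makes the payload easy to inspect without a live server.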