# Parameter Breakdown

`SenseNova-U1-8B-MoT` contains roughly **8B understanding parameters** and
**8B generation parameters**. To avoid confusion caused by the naming and to
present the architecture more accurately, we provide a small inspection
script that parses parameter names of the loaded checkpoint and reports a
detailed parameter breakdown.

## Run the script

```bash
python scripts/inspect_model_params.py \
    --model_path sensenova/SenseNova-U1-8B-MoT
```

Useful arguments:

- `--dtype {float32,float16,bfloat16}` (default: `bfloat16`) — load dtype. It
  does **not** affect parameter counts; it only affects the reported `memory`
  column, since each `bf16/fp16` element occupies 2 bytes versus 4 bytes for
  `fp32`.
- `--show_groups <name1,name2>` (default: `shared`) — list member parameters
  of the specified groups. Use `all` for every group, or an empty string to
  disable.
- `--custom_groups_json <path>` — override the default grouping rules with a
  JSON file of the form `{"group_name": ["prefix1", "prefix2"]}`.
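
The relationship between `--dtype` and the `memory` column can be sketched as below. This is a minimal illustration, not the script's actual code: the helper name `memory_gb` is hypothetical, and it assumes 1 GB = 10^9 bytes, which is consistent with the example output (35.105GB ≈ 2 bytes × 17.552B parameters).

```python
# Bytes per element for the dtypes accepted by --dtype.
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "bfloat16": 2}


def memory_gb(num_params: int, dtype: str) -> float:
    """Weight footprint in GB (assumption: 1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_ELEMENT[dtype] / 1e9


# Rounded total taken from the example output; the real count is not rounded,
# so the last digit of the reported memory may differ slightly.
total_params = 17_552_000_000

for dtype in BYTES_PER_ELEMENT:
    print(f"{dtype:>8}: {memory_gb(total_params, dtype):.3f} GB")
```

The parameter count is the same in every row; only the bytes-per-element factor changes.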

## Example output

```text
Model: sensenova/SenseNova-U1-8B-MoT
Load dtype:   bfloat16
Total params: 17.552B
Total memory: 35.105GB (bfloat16)
---------------------------------------------------------------------
group                              params memory (bfloat16)      ratio
---------------------------------------------------------------------
generation_transformer             8.186B         16.373GB     46.64%
understanding_transformer          8.121B         16.243GB     46.27%
shared                             1.245B          2.489GB      7.09%
---------------------------------------------------------------------
Pathway breakdown (shared counted in both):
---------------------------------------------------------------------
pathway                            params memory (bfloat16)      ratio
---------------------------------------------------------------------
understanding pathway              9.366B         18.732GB     53.36%
generation pathway                 9.431B         18.862GB     53.73%

---------------------------------------------------------------------
Members of group 'shared' (2 params, 1.245B total, 2.489GB @ bfloat16)
---------------------------------------------------------------------
param name                                                  numel    dtype
---------------------------------------------------------------------
language_model.model.embed_tokens.weight                 622.330M bfloat16
language_model.lm_head.weight                            622.330M bfloat16
```

## How to read it

### 1. Parameters (mutually exclusive, ratios sum to 100%)

Each parameter is counted exactly once and assigned to one of three groups
based on its module path:

- `understanding_transformer` ≈ **8.12B (46%)**: the vision understanding
  modules (`vision_model.*`) plus the LLM expert without the `_mot_gen` suffix
  (`language_model.*` minus the generation expert and the shared text I/O).
- `generation_transformer` ≈ **8.19B (47%)**: generation-side modules
  (`fm_modules.*`: vision generation modules, flow-matching head, timestep /
  noise embedders) plus the LLM expert with the `_mot_gen` suffix
  (`language_model.*` containing `_mot_gen`).
- `shared` ≈ **1.25B (7%)**: the text-token I/O reused by both pathways, namely
  `language_model.model.embed_tokens` and `language_model.lm_head`.
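
The grouping rules above can be sketched as a small classifier over parameter names. This is an illustrative reconstruction from the descriptions in this section, not the script's actual implementation: the function name `classify`, the rule ordering, the `other` fallback, and the two layer-level parameter names in the usage loop are all assumptions.

```python
SHARED_PREFIXES = (
    "language_model.model.embed_tokens",
    "language_model.lm_head",
)


def classify(param_name: str) -> str:
    """Assign a parameter name to exactly one group (assumed rule order)."""
    # Shared text I/O is checked first so it is never counted as an expert.
    if param_name.startswith(SHARED_PREFIXES):
        return "shared"
    # Generation side: everything under fm_modules, plus LLM weights
    # whose name contains the _mot_gen marker.
    if param_name.startswith("fm_modules.") or "_mot_gen" in param_name:
        return "generation_transformer"
    # Understanding side: the vision model and the remaining LLM weights.
    if param_name.startswith(("vision_model.", "language_model.")):
        return "understanding_transformer"
    return "other"  # assumption: fallback for names matching no rule


names = [
    "language_model.model.embed_tokens.weight",
    "language_model.lm_head.weight",
    "fm_modules.fm_head.weight",
    # The two names below are hypothetical, for illustration only.
    "language_model.model.layers.0.mlp_mot_gen.up_proj.weight",
    "language_model.model.layers.0.mlp.up_proj.weight",
]
for name in names:
    print(f"{classify(name):<26} {name}")
```

Because each name returns from the first matching rule, every parameter lands in exactly one group, which is why the three ratios sum to 100%.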

### 2. Pathway coverage (forward activations, ratios sum to >100%)

A *pathway* sums the parameters that are actually traversed during the forward pass of one task.
Because both tasks reuse the `shared` group, the ratios overlap and add up to more than 100%.

- **Understanding pathway** ≈ `understanding_transformer + shared` ≈ **9.37B (53%)**.
  The image goes through `vision_model` → tokens go through `embed_tokens` →
  the LLM runs on the non-`_mot_gen` expert → `lm_head` produces text logits.

- **Generation pathway** (single-turn thinking interleave) ≈
  `generation_transformer + shared` ≈ **9.43B (54%)**.
  The condition image goes through `fm_modules.vision_model_mot_gen`, while
  the text prompt goes through `embed_tokens` → the LLM runs on the
  `_mot_gen` expert → text reasoning is produced via `lm_head` and
  the image is decoded via `fm_modules.fm_head`.
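
The pathway figures follow directly from the group totals. The sketch below reproduces the arithmetic using the rounded group sizes from the example output; the helper name `pathway` is hypothetical.

```python
# Rounded group sizes (in parameters) copied from the example output.
groups = {
    "generation_transformer": 8_186_000_000,
    "understanding_transformer": 8_121_000_000,
    "shared": 1_245_000_000,
}
total = sum(groups.values())  # each parameter counted exactly once


def pathway(expert_group: str) -> tuple[int, float]:
    """Return (params, share of total) for one expert plus the shared group."""
    params = groups[expert_group] + groups["shared"]
    return params, params / total


for label, group in [
    ("understanding pathway", "understanding_transformer"),
    ("generation pathway", "generation_transformer"),
]:
    params, ratio = pathway(group)
    print(f"{label:<22} {params / 1e9:.3f}B  {ratio:.2%}")
```

The two ratios add up to more than 100% by exactly the share of `shared`, since that group is counted once in each pathway.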

### Why `embed_tokens` and `lm_head` are "shared", not "understanding-only"

`embed_tokens` is needed by every text token and is therefore clearly shared.
`lm_head` is also exercised by the generation pathway in some scenarios: for
example, t2i-reasoning runs a thinking phase that emits text tokens **before**
any image token is produced. `lm_head` is therefore on the critical path of
both pathways, hence the "shared" label.