Commit 876a36a4 authored by raojy

first

parent eda2afb8
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
# SenseNova-SI
## Paper
[SenseNova-SI](https://arxiv.org/abs/2511.13719)
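## Model Overview
SenseNova-SI is a family of open-source multimodal spatial intelligence models that aims to close the gap traditional multimodal models have in 3D spatial perception and geometric reasoning. The models are built on three base models (InternVL3, Qwen3-VL, and BAGEL) and come in mainstream sizes such as 2B and 8B parameters. The 1.3 series has the strongest overall spatial capability and reaches SOTA among open-source models of the same scale on several benchmarks; version 1.4 strengthens grounding and depth estimation; version 1.5 excels at solid-geometry problems. The models handle spatial tasks such as orientation judgment and 3D analysis, lead open-source models of comparable size in overall performance, and match mainstream closed-source models in some capabilities. The whole family is open source, supports single-image and multi-image input, and ships with complete inference and fine-tuning recipes.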
<div align=center>
<img src="./doc/1.png"/>
</div>
## Environment Dependencies
| Software | Version |
| :------: |:-----------------------------------------:|
| DTK | 26.04 |
| Python | 3.11.9 |
| Transformers | 4.57.1 |
| Torch | 2.5.1+das.opt1.dtk2604 |
| Flash_attn | 2.8.3+das.opt1.dtk2604.torch251 |
Recommended image: harbor.sourcefind.cn:5443/dcu/admin/base/custom:vllm011-ubuntu22.04-dtk26.04-SenseNova
```bash
docker run -it \
--shm-size 256g \
--network=host \
--name nova \
--privileged \
--device=/dev/kfd \
--device=/dev/dri \
--device=/dev/mkfd \
--group-add video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-u root \
-v /opt/hyhal/:/opt/hyhal/:ro \
-v /path/your_code_data/:/path/your_code_data/ \
harbor.sourcefind.cn:5443/dcu/admin/base/custom:vllm011-ubuntu22.04-dtk26.04-SenseNova bash
```
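Once inside the container, it can help to confirm that the installed packages match the versions in the table above. The following is a minimal sketch using only the Python standard library; the distribution names (`transformers`, `torch`, `flash_attn`) are assumptions about how the packages are registered in the image, and the version strings are copied from the table.

```python
from importlib import metadata

# Versions copied from the environment table; the distribution names are
# assumptions and may differ in the actual image.
REQUIRED = {
    "transformers": "4.57.1",
    "torch": "2.5.1+das.opt1.dtk2604",
    "flash_attn": "2.8.3+das.opt1.dtk2604.torch251",
}


def installed_versions(names):
    """Return {name: version or None} for each distribution name."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(installed, required=REQUIRED):
    """List (name, expected, actual) for every package that differs."""
    return [
        (name, expected, installed.get(name))
        for name, expected in required.items()
        if installed.get(name) != expected
    ]


if __name__ == "__main__":
    for name, expected, actual in find_mismatches(installed_versions(REQUIRED)):
        print(f"{name}: expected {expected}, found {actual}")
```

An empty result from `find_mismatches` means every listed package reports exactly the version in the table; a `None` in the third position means the package was not found at all.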
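More images can be downloaded from [光源](https://sourcefind.cn/#/service-list).
The special deep learning libraries this project needs for DCU cards can be downloaded and installed from the [光合](https://developer.sourcefind.cn/tool/) developer community.
## Pretrained Weights
| Model name | Weight size | Data type | Supported DCU models | Minimum number of cards | Download link |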
|:------:|:----:|:----:|:----------:|:------:|:---------------------:|
| SenseNova-SI-1.1-InternVL3-8B | 8B | BF16 | BW1000 | 1 | [Modelscope](https://modelscope.cn/models/SenseNova/SenseNova-SI-1.1-InternVL3-8B) |
| SenseNova-SI-1.1-BAGEL-7B-MoT | 8B | BF16 | BW1000 | 1 | [Modelscope](https://modelscope.cn/models/SenseNova/SenseNova-SI-1.1-BAGEL-7B-MoT) |
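The weights in the table can also be fetched programmatically. The sketch below derives the ModelScope model ID from a download link in the table and passes it to `snapshot_download` from the `modelscope` package. The helper names and the `./pretrained_models` cache directory are illustrative assumptions, and the repository itself may expect the weights in a different location.

```python
from urllib.parse import urlparse


def model_id_from_url(url):
    """Extract the '<owner>/<name>' model ID from a modelscope.cn model URL."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) < 3 or parts[0] != "models":
        raise ValueError(f"not a ModelScope model URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def download_weights(url, cache_dir="./pretrained_models"):
    """Download a model snapshot and return its local path.

    Requires the `modelscope` package and network access; the cache
    directory is a hypothetical default, not one named by this project.
    """
    from modelscope import snapshot_download  # imported lazily on purpose

    return snapshot_download(model_id_from_url(url), cache_dir=cache_dir)
```

For example, `model_id_from_url("https://modelscope.cn/models/SenseNova/SenseNova-SI-1.1-InternVL3-8B")` yields `SenseNova/SenseNova-SI-1.1-InternVL3-8B`, and the path returned by `download_weights` could then be supplied as `--model_path`.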
## Dataset
None yet
## Training
None yet
## Inference
### Transformers
#### Single-node inference
##### Example for BAGEL generation
```
cd sensenova-si
python example_bagel.py \
--model_path sensenova/SenseNova-SI-1.1-BAGEL-7B-MoT \
--prompt "A chubby cat made of 3D point clouds, stretching its body, translucent with a soft glow." \
--mode generate
```
##### Example 1
```
python example.py \
--image_paths examples/Q1_1.png \
--question "Question: Consider the real-world 3D locations of the objects. Which is closer to the sink, the toilet paper or the towel?\nOptions: \nA. toilet paper\nB. towel\nGive me the answer letter directly. The best answer is:" \
--model_path sensenova/SenseNova-SI-1.5-InternVL3-8B
```
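The multiple-choice question in Example 1 follows a fixed layout: the question, an options list with letter labels, and an instruction to answer with a letter. A minimal sketch of building such a prompt and reading the letter back from a reply is shown below. Both helper names are hypothetical, the prompt uses real newline characters (the source does not say whether `example.py` expects those or the literal `\n` sequences seen in the shell command), and the reply format is an assumption.

```python
import re
import string


def build_mcq_prompt(question, options):
    """Format a question and its options in the layout used by Example 1."""
    lines = [f"Question: {question}", "Options: "]
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"{letter}. {option}")
    lines.append("Give me the answer letter directly. The best answer is:")
    return "\n".join(lines)


def extract_choice(reply, num_options):
    """Return the first standalone uppercase option letter in the reply, or None."""
    letters = string.ascii_uppercase[:num_options]
    match = re.search(rf"\b([{letters}])\b", reply.strip())
    return match.group(1) if match else None
```

The letter extraction is deliberately simple: a reply that starts with the English article "A" would be misread as option A, so stricter parsing may be needed for free-form replies.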
##### Example 2
```
python example.py \
--image_paths examples/Q2_1.png examples/Q2_2.png \
--question "If the landscape painting is on the east side of the bedroom, where is the window located in the bedroom?\nOptions: A. North side, B. South side, C. West side, D. East side\nAnswer with the option's letter from the given choices directly. Enclose the option's letter within ``." \
--model_path sensenova/SenseNova-SI-1.5-InternVL3-8B
```
##### Example 3
```
python example.py \
--image_paths examples/Q3_1.png examples/Q3_2.png examples/Q3_3.png \
--question "The robot is making tea. What is the order in which the pictures were taken?" \
--model_path sensenova/SenseNova-SI-1.3-InternVL3-8B
```
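When running many questions, the commands above can be assembled from Python instead of typed by hand. The sketch below builds the argument list using only the three flags shown in the examples (`--image_paths`, `--question`, `--model_path`); the `build_command` helper is hypothetical, and any other options `example.py` may accept are not covered.

```python
import shlex


def build_command(image_paths, question, model_path, script="example.py"):
    """Build the argv list for one example.py run (flags as shown above)."""
    if not image_paths:
        raise ValueError("at least one image path is required")
    return [
        "python", script,
        "--image_paths", *image_paths,
        "--question", question,
        "--model_path", model_path,
    ]


cmd = build_command(
    ["examples/Q3_1.png", "examples/Q3_2.png", "examples/Q3_3.png"],
    "The robot is making tea. What is the order in which the pictures were taken?",
    "sensenova/SenseNova-SI-1.3-InternVL3-8B",
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
# To execute: subprocess.run(cmd, check=True) from the sensenova-si directory.
```

Passing the list to `subprocess.run` avoids shell quoting problems with questions that contain quotes or backticks.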
##### Example 4
```
python example.py \
--image_paths examples/Q4.png \
--question "Please provide the bounding box coordinate of the region this sentence describes: <ref>blue shirt lady</ref>" \
--model_path sensenova/SenseNova-SI-1.4-InternVL3-8B
```
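A grounding reply usually needs to be mapped back onto the image before it can be drawn or evaluated. The sketch below assumes the model answers with boxes of the form `[x1, y1, x2, y2]` whose integer coordinates are normalized to the range 0 to 1000, a convention used by InternVL-style models; the source does not state the output format of SenseNova-SI, so both the format and the `parse_boxes` helper are assumptions to verify against real output.

```python
import re

# Matches one [x1, y1, x2, y2] group of non-negative integers.
BOX_PATTERN = re.compile(
    r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]"
)


def parse_boxes(reply, width, height, scale=1000):
    """Convert normalized boxes found in `reply` to pixel coordinates.

    Assumes coordinates are integers normalized to [0, scale].
    """
    boxes = []
    for match in BOX_PATTERN.finditer(reply):
        x1, y1, x2, y2 = (int(v) for v in match.groups())
        boxes.append((
            round(x1 * width / scale),
            round(y1 * height / scale),
            round(x2 * width / scale),
            round(y2 * height / scale),
        ))
    return boxes
```

For a 640x480 image, a reply containing `[[100, 200, 500, 800]]` would map to the pixel box `(64, 96, 320, 384)` under these assumptions.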
## Results
<div align=center>
<img src="./doc/2.jpg"/>
</div>
<div align=center>
<img src="./doc/3.png"/>
</div>
<div align=center>
<img src="./doc/4.png"/>
</div>
<div align=center>
<img src="./doc/5.png"/>
</div>
### Accuracy
Accuracy on DCU is consistent with GPU. Inference framework: PyTorch.
## Source Repository and Issue Feedback
- https://developer.sourcefind.cn/codes/modelzoo/sensenova-si
## References
- https://github.com/OpenSenseNova/SenseNova-SI
# File created using '.gitignore Generator' for Visual Studio Code: https://bit.ly/vscode-gig
# Created by https://www.toptal.com/developers/gitignore/api/visualstudiocode,linux,python
# Edit at https://www.toptal.com/developers/gitignore?templates=visualstudiocode,linux,python
### Linux ###
*~
# temporary files which can be created if a process still has a handle open of a deleted file
.fuse_hidden*
# KDE directory preferences
.directory
# Linux trash folder which might appear on any partition or disk
.Trash-*
# .nfs files are created when an open file is removed but is still being accessed
.nfs*
### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
### Python Patch ###
# Poetry local configuration file - https://python-poetry.org/docs/configuration/#local-configuration
poetry.toml
# ruff
.ruff_cache/
# LSP config files
pyrightconfig.json
### VisualStudioCode ###
.vscode/*
!.vscode/settings.json
!.vscode/tasks.json
!.vscode/launch.json
!.vscode/extensions.json
!.vscode/*.code-snippets
# Local History for Visual Studio Code
.history/
# Built Visual Studio Code Extensions
*.vsix
### VisualStudioCode Patch ###
# Ignore all local history of files
.history
.ionide
# End of https://www.toptal.com/developers/gitignore/api/visualstudiocode,linux,python
# Custom rules (everything added below won't be overridden by 'Generate .gitignore File' if you use 'Update' option)
### Examples ###
examples/*.jsonl
examples/*.png
### Training data and results ###
training/pretrained_models/
training/data/
training/results/
[submodule "training/lmms-engine"]
path = training/lmms-engine
url = https://github.com/EvolvingLMMs-Lab/lmms-engine
repos:
- repo: https://github.com/astral-sh/ruff-pre-commit
# Ruff version.
rev: v0.14.4
hooks:
# Run the linter.
- id: ruff-check
args: [ --fix, --select, I ]
# Run the formatter.
- id: ruff-format
<div align="center">
# SenseNova-SI: Scaling Spatial Intelligence with Multimodal Foundation Models
</div>
<div align="center">
English | [简体中文](README_CN.md)
<p align="center">
<a href="https://arxiv.org/abs/2511.13719" target="_blank">
<img alt="arXiv" src="https://img.shields.io/badge/arXiv-SenseNova_SI-red?logo=arxiv" height="20" />
</a>
<a href="https://huggingface.co/collections/sensenova/sensenova-si" target="_blank">
<img alt="SenseNova-SI" src="https://img.shields.io/badge/%F0%9F%A4%97%20_SenseNova_SI-Models-ffc107?color=ffc107&logoColor=white" height="20" />
</a>
<a href="https://huggingface.co/datasets/sensenova/SenseNova-SI-8M" target="_blank">
<img alt="SenseNova-SI-8M" src="https://img.shields.io/badge/%F0%9F%A4%97%20_SenseNova_SI_8M-Data-ffc107?color=ffc107&logoColor=white" height="20" />
</a>
<a href="https://modelscope.cn/collections/SenseNova-SI-a1d78333be8d42" target="_blank">
<img alt="SenseNova-SI" src="https://img.shields.io/badge/🤖 ModelScope-Models-blue" height="20" />
</a>
<a href="https://easi.lmms-lab.com/leaderboard/" target="_blank">
<img alt="Leaderboard" src="https://img.shields.io/badge/%F0%9F%A4%97%20_EASI-Leaderboard-ffc107?color=ffc107&logoColor=white" height="20" />
</a>
<a href="https://github.com/EvolvingLMMs-Lab/EASI" target="_blank">
<img alt="Code" src="https://img.shields.io/badge/EASI-Code-100000?style=flat-square&logo=github&logoColor=white" height="20" />
</a>
<a href="https://github.com/OpenSenseNova/SenseNova-SI/blob/main/LICENSE"><img src="https://img.shields.io/github/license/OpenSenseNova/SenseNova-SI"></a>
</p>
</div>
## Overview
Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence.
In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the [**SenseNova-SI family**](https://huggingface.co/collections/sensenova/sensenova-si),
built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel).
We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M:
eight million diverse data samples under a rigorous taxonomy of spatial capabilities.
SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks, while maintaining strong general multimodal understanding.
More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training,
analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate potential downstream applications. SenseNova-SI is an ongoing project, and this report will be updated continuously.
All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.
*In the future, SenseNova-SI will be integrated with larger-scale in-house models.*
## News
- [2026-05-12] We have released the official full-scale training dataset of the SenseNova-SI series, [**SenseNova-SI-8M**](https://huggingface.co/datasets/sensenova/SenseNova-SI-8M). SenseNova-SI-8M contains ~8.16 million carefully curated training samples spanning ~2.72 million unique images.
- [2026-04-13] We have released [**SenseNova-SI-1.3-Qwen3-VL-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.3-Qwen3-VL-8B), built on **Qwen3-VL** with **14M** SI training data, demonstrating strong spatial intelligence across benchmarks and improved **open-ended spatial question-answering** compared to earlier SenseNova-SI Qwen variants.
- [2026-04-01] We have released [**SenseNova-SI-1.5-InternVL3-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.5-InternVL3-8B), which significantly improves question answering and analysis for **solid geometry** problems, achieving an accuracy of **63.5** on SolidGeo MCQ.
- [2026-03-27] We have released [**SenseNova-SI-1.4-InternVL3-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.4-InternVL3-8B), which significantly improves **grounding** and **depth estimation** capabilities, achieving **89.21** on RefCOCO avg and **78.64** on CountBench.
- [2026-02-21] Our work was accepted to CVPR 2026! A paper is just a step. What truly matters is continuing to push the boundaries of spatial intelligence models and sharing our work with the community.
- [2026-01-09] We have released [**SenseNova-SI-1.3-InternVL3-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.3-InternVL3-8B), which improves open-ended spatial question-answering capabilities.
- [2025-12-06] As a first step, we have released a highly effective data subset, [**SenseNova-SI-800K**](https://huggingface.co/datasets/sensenova/SenseNova-SI-800K), as well as [**SenseNova-SI-1.1-InternVL3-8B-800K**](https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-8B-800K), a model trained exclusively on the **SenseNova-SI-800K** subset.
- [2025-12-06] We present models built on additional base models, namely [**SenseNova-SI-1.2-InternVL3-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.2-InternVL3-8B), [**SenseNova-SI-1.1-Qwen2.5-VL-3B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-Qwen2.5-VL-3B), [**SenseNova-SI-1.1-Qwen2.5-VL-7B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-Qwen2.5-VL-7B), and [**SenseNova-SI-1.1-Qwen3-VL-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-Qwen3-VL-8B). **SenseNova-SI-1.2-InternVL3-8B** achieves SOTA across eight recent spatial intelligence benchmarks.
- [2025-11-15] We have released [**SenseNova-SI-1.1-InternVL3-2B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-2B) and
[**SenseNova-SI-1.1-InternVL3-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-8B),
which achieve state-of-the-art (SOTA) performance among open-source models of comparable size across five recent spatial intelligence benchmarks:
**VSI**, **MMSI**, **MindCube**, **ViewSpatial** and **SITE**.
## Model Zoo
<table>
<thead>
<tr>
<th>Model</th>
<th>Base Architecture</th>
<th>SI Dataset Scale</th>
<th>EASI-8</th>
<th>Other Remarks</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<a href="https://huggingface.co/sensenova/SenseNova-SI-1.5-InternVL3-8B/">
SenseNova-SI-1.5-InternVL3-8B
</a>
</td>
<td>SenseNova-SI-1.4-InternVL3-8B</td>
<td>1.5M</td>
<td>64.4</td>
<td>Enhanced capability in solid geometry</td>
</tr>
<tr>
<td>
<a href="https://huggingface.co/sensenova/SenseNova-SI-1.4-InternVL3-8B/">
SenseNova-SI-1.4-InternVL3-8B
</a>
</td>
<td>InternVL3</td>
<td>29M</td>
<td>63.7</td>
<td>Enhanced capability in grounding and depth estimation</td>
</tr>
<tr>
<td>
<a href="https://huggingface.co/sensenova/SenseNova-SI-1.3-InternVL3-8B/">
SenseNova-SI-1.3-InternVL3-8B
</a>
</td>
<td>InternVL3</td>
<td>14M</td>
<td>65.2</td>
<td>Best in spatial intelligence, with enhanced capabilities for open-ended short QA</td>
</tr>
<tr>
<td>
<a href="https://huggingface.co/sensenova/SenseNova-SI-1.3-Qwen3-VL-8B/">
SenseNova-SI-1.3-Qwen3-VL-8B
</a>
</td>
<td>Qwen3-VL</td>
<td>14M</td>
<td>61.4</td>
<td>Enhanced capability for open-ended short QA</td>
</tr>
<tr>
<td>
<a href="https://huggingface.co/sensenova/SenseNova-SI-1.2-InternVL3-8B/">
SenseNova-SI-1.2-InternVL3-8B
</a>
</td>
<td>InternVL3</td>
<td>10M</td>
<td>64.5</td>
<td>-</td>
</tr>
<tr>
<td>
<a href="https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-8B/">
SenseNova-SI-1.1-InternVL3-8B
</a>
</td>
<td>InternVL3</td>
<td>8M</td>
<td>61.5</td>
<td>-</td>
</tr>
<tr>
<td>
<a href="https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-2B/">
SenseNova-SI-1.1-InternVL3-2B
</a>
</td>
<td>InternVL3</td>
<td>8M</td>
<td>49.4</td>
<td>-</td>
</tr>
<tr>
<td>
<a href="https://huggingface.co/sensenova/SenseNova-SI-1.1-Qwen3-VL-8B/">
SenseNova-SI-1.1-Qwen3-VL-8B
</a>
</td>
<td>Qwen3-VL</td>
<td>8M</td>
<td>58.1</td>
<td>-</td>
</tr>
<tr>
<td>
<a href="https://huggingface.co/sensenova/SenseNova-SI-1.1-Qwen2.5-VL-7B">
SenseNova-SI-1.1-Qwen2.5-VL-7B
</a>
</td>
<td>Qwen2.5-VL</td>
<td>8M</td>
<td>51.0</td>
<td>-</td>
</tr>
<tr>
<td>
<a href="https://huggingface.co/sensenova/SenseNova-SI-1.1-Qwen2.5-VL-3B/">
SenseNova-SI-1.1-Qwen2.5-VL-3B
</a>
</td>
<td>Qwen2.5-VL</td>
<td>8M</td>
<td>45.7</td>
<td>-</td>
</tr>
<tr>
<td>
<a href="https://huggingface.co/sensenova/SenseNova-SI-1.1-BAGEL-7B-MoT">
SenseNova-SI-1.1-BAGEL-7B-MoT
</a>
</td>
<td>BAGEL</td>
<td>8M</td>
<td>48.6</td>
<td>Unified understanding and generation model</td>
</tr>
</tbody>
</table>
## Release Information
### Models
Currently, we build SenseNova-SI upon popular open-source foundation models to maximize compatibility with existing research pipelines.
In this release, we present
[**SenseNova-SI-1.5-InternVL3-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.5-InternVL3-8B),
[**SenseNova-SI-1.4-InternVL3-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.4-InternVL3-8B),
[**SenseNova-SI-1.3-InternVL3-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.3-InternVL3-8B),
[**SenseNova-SI-1.3-Qwen3-VL-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.3-Qwen3-VL-8B),
[**SenseNova-SI-1.2-InternVL3-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.2-InternVL3-8B),
[**SenseNova-SI-1.1-InternVL3-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-8B),
[**SenseNova-SI-1.1-Qwen3-VL-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-Qwen3-VL-8B),
[**SenseNova-SI-1.1-Qwen2.5-VL-7B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-Qwen2.5-VL-7B),
[**SenseNova-SI-1.1-Qwen2.5-VL-3B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-Qwen2.5-VL-3B), and
[**SenseNova-SI-1.1-InternVL3-2B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-2B), of which **SenseNova-SI-1.3-InternVL3-8B** achieves state-of-the-art performance among open-source models of comparable size across eight recent spatial intelligence benchmarks: VSI, MMSI, MindCube, ViewSpatial, SITE, BLINK, 3DSRBench, and EmbSpatial-Bench. It also improves open-ended spatial question-answering capabilities compared to previous versions.
**SenseNova-SI-1.4-InternVL3-8B** demonstrates strong spatial intelligence across a wide range of benchmarks, with improved **grounding** performance, achieving an average score of **89.21** across all RefCOCO splits and **78.64** on CountBench. On our depth estimation task constructed from the Ibims dataset, it reaches **95.56** in relative depth and **80.31** in absolute depth.
**SenseNova-SI-1.5-InternVL3-8B** exhibits strong spatial intelligence as well as notable improvements in analyzing and solving **solid geometry** problems, achieving an accuracy of **63.5** on SolidGeo MCQ. On SolidMath and Math3D, our internal benchmarks constructed from K12 question banks, it reaches accuracies of **72.7** and **68.9**, respectively.
<table>
<thead>
<tr>
<th>Model</th>
<th>VSI</th>
<th>MMSI</th>
<th>MindCube-Tiny</th>
<th>ViewSpatial</th>
<th>SITE</th>
<th>BLINK</th>
<th>3DSRBench</th>
<th>EmbSpatial-Bench</th>
</tr>
</thead>
<tbody>
<tr style="background:#F2F0EF;font-weight:700;text-align:center;">
<td colspan="9"><em>Open-source Models (~2B)</em></td>
</tr>
<tr>
<td>InternVL3-2B</td><td>32.9</td><td>26.5</td><td>37.5</td><td>32.5</td><td>30.0</td><td>50.8</td><td>47.7</td><td>60.1</td>
</tr>
<tr>
<td>Qwen3-VL-2B-Instruct</td><td>50.3</td><td>28.9</td><td>34.5</td><td>36.9</td><td>35.6</td><td>53.2</td><td>47.5</td><td>70.1</td>
</tr>
<tr>
<td>MindCube-3B-RawQA-SFT</td><td>17.2</td><td>1.7</td><td>51.7</td><td>24.1</td><td>6.3</td><td>35.1</td><td>2.8</td><td>37.0</td>
</tr>
<tr>
<td>SpatialLadder-3B</td><td>44.8</td><td>27.4</td><td>43.4</td><td>39.8</td><td>27.9</td><td>43.0</td><td>42.8</td><td>58.2</td>
</tr>
<tr>
<td>SpatialMLLM-4B</td><td>46.3</td><td>26.1</td><td>33.4</td><td>34.6</td><td>18.0</td><td>40.5</td><td>36.2</td><td>50.0</td>
</tr>
<tr>
<td>VST-3B-SFT</td><td>57.9</td><td>30.2</td><td>35.9</td><td>52.8</td><td>35.8</td><td>58.8</td><td>54.1</td><td>69.0</td>
</tr>
<tr>
<td>Cambrian-S-3B</td><td>57.3</td><td>25.2</td><td>32.5</td><td>39.0</td><td>28.3</td><td>37.7</td><td>50.9</td><td>63.5</td>
</tr>
<tr style="background:#F2F0EF;font-weight:700;text-align:center;">
<td colspan="9"><em>Open-source Models (~8B)</em></td>
</tr>
<tr>
<td>InternVL3-8B</td><td>42.1</td><td>28.0</td><td>41.5</td><td>38.6</td><td>41.1</td><td>53.5</td><td>44.3</td><td>76.4</td>
</tr>
<tr>
<td>Qwen3-VL-8B-Instruct</td><td>57.9</td><td>31.1</td><td>29.4</td><td>42.2</td><td>45.8</td><td>66.7</td><td>53.9</td><td>77.7</td>
</tr>
<tr>
<td>BAGEL-7B-MoT</td><td>31.4</td><td>31.0</td><td>34.7</td><td>41.3</td><td>37.0</td><td>63.7</td><td>50.2</td><td>73.1</td>
</tr>
<tr>
<td>SpaceR-7B</td><td>41.5</td><td>27.4</td><td>37.9</td><td>35.8</td><td>34.2</td><td>49.6</td><td>40.5</td><td>66.9</td>
</tr>
<tr>
<td>ViLaSR-7B</td><td>44.6</td><td>30.2</td><td>35.1</td><td>35.7</td><td>38.7</td><td>51.4</td><td>46.6</td><td>67.3</td>
</tr>
<tr>
<td>VST-7B-SFT</td><td>60.6</td><td>32.0</td><td>39.7</td><td>50.5</td><td>39.6</td><td>61.9</td><td>54.6</td><td>73.7</td>
</tr>
<tr>
<td>Cambrian-S-7B</td><td>67.5</td><td>25.8</td><td>39.6</td><td>40.9</td><td>33.0</td><td>37.9</td><td>54.8</td><td>72.8</td>
</tr>
<tr>
<td><strong>SenseNova-SI-1.3-InternVL3-8B</strong></td>
<td><strong>68.6</strong></td>
<td><strong>42.5</strong></td>
<td><strong>89.9</strong></td>
<td><strong>61.3</strong></td>
<td><strong>47.5</strong></td>
<td><strong>68.0</strong></td>
<td><strong>62.4</strong></td>
<td><strong>81.0</strong></td>
</tr>
<tr>
<td><strong>SenseNova-SI-1.3-Qwen3-VL-8B</strong></td>
<td><strong>67.8</strong></td>
<td><strong>39.5</strong></td>
<td><strong>68.3</strong></td>
<td><strong>55.8</strong></td>
<td><strong>57.5</strong></td>
<td><strong>63.0</strong></td>
<td><strong>57.3</strong></td>
<td><strong>82.1</strong></td>
</tr>
<tr>
<td><strong>SenseNova-SI-1.4-InternVL3-8B</strong></td>
<td><strong>66.6</strong></td>
<td><strong>40.1</strong></td>
<td><strong>88.8</strong></td>
<td><strong>55.7</strong></td>
<td><strong>47.9</strong></td>
<td><strong>68.1</strong></td>
<td><strong>60.4</strong></td>
<td><strong>81.7</strong></td>
</tr>
<tr>
<td><strong>SenseNova-SI-1.5-InternVL3-8B</strong></td>
<td><strong>67.3</strong></td>
<td><strong>38.3</strong></td>
<td><strong>92.1</strong></td>
<td><strong>59.0</strong></td>
<td><strong>47.5</strong></td>
<td><strong>69.5</strong></td>
<td><strong>61.3</strong></td>
<td><strong>80.3</strong></td>
</tr>
<tr style="background:#F2F0EF;color:#6b7280;font-weight:600;text-align:center;">
<td colspan="9"><em>Proprietary Models</em></td>
</tr>
<tr style="color:#6b7280;">
<td>Gemini-2.5-pro-2025-06</td><td>53.5</td><td>38.0</td><td>57.6</td><td>46.0</td><td>57.0</td><td>73.5</td><td>59.3</td><td>78.9</td>
</tr>
<tr style="color:#6b7280;">
<td>Grok-4-2025-07-09</td><td>47.9</td><td>37.8</td><td>63.5</td><td>43.2</td><td>47.0</td><td>56.4</td><td>54.9</td><td>75.7</td>
</tr>
<tr style="color:#6b7280;">
<td>GPT-5-2025-08-07</td><td>55.0</td><td>41.8</td><td>56.3</td><td>45.5</td><td>61.8</td><td>68.0</td><td>60.3</td><td>81.6</td>
</tr>
</tbody>
</table>
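The EASI-8 column in the model table near the top of this README is not defined here. The reported values are consistent with an unweighted mean of the eight benchmark scores in the table above, rounded to one decimal place. That reading is an assumption; refer to EASI for the official protocol. A minimal Python check, using scores copied from the table (`mean_score` is an illustrative helper, not part of the repository):

```python
from decimal import Decimal, ROUND_HALF_UP

# Scores copied from the table above, in column order:
# VSI, MMSI, MindCube-Tiny, ViewSpatial, SITE, BLINK, 3DSRBench, EmbSpatial-Bench
SCORES = {
    "SenseNova-SI-1.3-InternVL3-8B": ["68.6", "42.5", "89.9", "61.3", "47.5", "68.0", "62.4", "81.0"],
    "SenseNova-SI-1.3-Qwen3-VL-8B": ["67.8", "39.5", "68.3", "55.8", "57.5", "63.0", "57.3", "82.1"],
    "SenseNova-SI-1.4-InternVL3-8B": ["66.6", "40.1", "88.8", "55.7", "47.9", "68.1", "60.4", "81.7"],
    "SenseNova-SI-1.5-InternVL3-8B": ["67.3", "38.3", "92.1", "59.0", "47.5", "69.5", "61.3", "80.3"],
}


def mean_score(scores):
    """Unweighted mean, rounded half-up to one decimal (assumed aggregation)."""
    total = sum(Decimal(s) for s in scores)
    return (total / len(scores)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


for model, scores in SCORES.items():
    print(model, mean_score(scores))
```

Decimal arithmetic is used so that a mean such as 65.15 rounds the same way on every platform.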
For grounding and depth estimation benchmarks, we report the following results.
RefCOCO and CountBench are reproduced using [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), while the depth estimation results are evaluated on our internally constructed test set.
<table>
<thead>
<tr>
<th>Model</th>
<th>RefCOCO avg</th>
<th>CountBench</th>
<th>Ibims Relative Depth</th>
<th>Ibims Absolute Depth</th>
</tr>
</thead>
<tbody>
<tr>
<td>InternVL3-8B</td><td>89.01</td><td>81.31</td><td>52.22</td><td>13.45</td>
</tr>
<tr>
<td>SenseNova-SI-1.3-InternVL3-8B</td><td>83.85</td><td>73.92</td><td>68.60</td><td>59.23</td>
</tr>
<tr>
<td><strong>SenseNova-SI-1.4-InternVL3-8B</strong></td>
<td><strong>89.21</strong></td>
<td><strong>78.64</strong></td>
<td><strong>95.56</strong></td>
<td><strong>80.31</strong></td>
</tr>
</tbody>
</table>
For solid geometry benchmarks, we report the following results.
SolidGeo MCQ contains multiple choice questions extracted from [SolidGeo](https://huggingface.co/datasets/SolidGeo/SolidGeo).
SolidMath and Math3D are internal benchmarks constructed from K12 question banks, containing multiple-choice problems in Chinese on solid geometry. SolidMath is built from in-domain data and Math3D is derived from out-of-domain data.
<table>
<thead>
<tr>
<th>Model</th>
<th>SolidGeo MCQ</th>
<th>SpatialViz-Bench</th>
<th>SolidMath</th>
<th>Math3D</th>
</tr>
</thead>
<tbody>
<tr>
<td>InternVL3-8B</td><td>36.4</td><td>32.0</td><td>42.5</td><td>43.7</td>
</tr>
<tr>
<td>SenseNova-SI-1.3-InternVL3-8B</td><td>36.5</td><td>29.6</td><td>39.6</td><td>40.3</td>
</tr>
<tr>
<td><strong>SenseNova-SI-1.5-InternVL3-8B</strong></td>
<td><strong>63.5</strong></td>
<td><strong>33.0</strong></td>
<td><strong>72.7</strong></td>
<td><strong>68.9</strong></td>
</tr>
</tbody>
</table>
### Datasets
To further facilitate the research in spatial intelligence, we have released a highly effective subset, [SenseNova-SI-800K](https://huggingface.co/datasets/sensenova/SenseNova-SI-800K).
SenseNova-SI is designed to study scaling laws, and we observe that this initial subset already captures a substantial portion of the gains obtained with the full dataset.
<table>
<thead>
<tr>
<th>Model</th>
<th>SI Dataset</th>
<th>VSI</th>
<th>MMSI</th>
<th>MindCube-Tiny</th>
<th>ViewSpatial</th>
<th>SITE</th>
</tr>
</thead>
<tbody>
<tr>
<td>InternVL3-8B</td><td>-</td><td>42.1</td><td>28.0</td><td>41.5</td><td>38.6</td><td>41.1</td>
</tr>
<tr>
<td>VST-7B-SFT</td><td>VST-P-4.1M</td><td>60.6</td><td>32.0</td><td>39.7</td><td>50.5</td><td>39.6</td>
</tr>
<tr>
<td>Cambrian-S-7B</td><td>VSI-590K</td><td>67.5</td><td>25.8</td><td>39.6</td><td>40.9</td><td>33.0</td>
</tr>
<tr>
<td><strong><a href="https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-8B-800K/">*SenseNova-SI-1.1-InternVL3-8B-800K</a></strong></td>
<td><strong><a href="https://huggingface.co/datasets/sensenova/SenseNova-SI-800K">SenseNova-SI-800K</a></strong></td>
<td><strong>60.9</strong></td>
<td><strong>36.4</strong></td>
<td><strong>56.9</strong></td>
<td><strong>52.5</strong></td>
<td><strong>47.7</strong></td>
</tr>
<tr>
<td><strong><a href="https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-8B/">SenseNova-SI-1.1-InternVL3-8B</a></strong></td>
<td><strong>SenseNova-SI-8M</strong></td>
<td><strong>68.7</strong></td>
<td><strong>43.3</strong></td>
<td><strong>85.6</strong></td>
<td><strong>54.6</strong></td>
<td><strong>47.7</strong></td>
</tr>
</tbody>
</table>
Note that *SenseNova-SI-1.1-InternVL3-8B-800K is trained on the SenseNova-SI-800K subset to provide a reference for researchers working with the 800K-scale dataset. It is released exclusively for scaling-law analysis and research validation, and is not intended to serve as a primary recommended model of the SenseNova-SI series.
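To make "a substantial portion of the gains" concrete, one can compute, for each benchmark, the share of the improvement over InternVL3-8B that the 800K subset recovers relative to the full 8M run. The sketch below uses only the numbers from the table above; `gain_share` is an illustrative helper, not part of the repository.

```python
# Scores copied from the table above.
BENCHMARKS = ["VSI", "MMSI", "MindCube-Tiny", "ViewSpatial", "SITE"]
BASE = [42.1, 28.0, 41.5, 38.6, 41.1]          # InternVL3-8B
SUBSET_800K = [60.9, 36.4, 56.9, 52.5, 47.7]   # SenseNova-SI-1.1-InternVL3-8B-800K
FULL_8M = [68.7, 43.3, 85.6, 54.6, 47.7]       # SenseNova-SI-1.1-InternVL3-8B


def gain_share(base, subset, full):
    """Fraction of the full-data gain over the base model that the subset recovers."""
    return (subset - base) / (full - base)


for name, b, s, f in zip(BENCHMARKS, BASE, SUBSET_800K, FULL_8M):
    print(f"{name}: {gain_share(b, s, f):.0%}")
```

On these numbers the share ranges from roughly a third of the gain (MindCube-Tiny) to all of it (SITE).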
#### Data Format
Our data is stored in the **SenseNova-SI-800K.jsonl** file using the JSONL (JSON Lines) format, where each line represents an independent data entry. Each entry is a dictionary containing three main fields: **`id`**, **`conversations`**, and **`image`**.
- The `id` serves as a unique identifier for each data sample.
- The `image` field is a list of strings specifying image paths, all given as paths relative to the root data directory.
- The `conversations` field is a list of dialogue turns, where each turn is a dictionary with two key-value pairs: `from`, indicating the speaker identity (e.g., human or gpt), and `value`, indicating the textual content. Within `value`, the `<image>` placeholder marks where images are inserted, and the number of `<image>` placeholders matches the number of images listed in the `image` field.
```json
{
"id": 0,
"conversations": [
{"from": "human", "value": "<image>\nuser input <image>\nuser input"},
{"from": "gpt", "value": "assistant output"},
{"from": "human", "value": "<image>\nuser input"},
{"from": "gpt", "value": "assistant output"}
],
"image": ["path/to/image1.jpg", "path/to/image2.jpg", "path/to/image3.jpg"]
}
```
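The placeholder rule described above can be checked mechanically before training. The following is a minimal sketch, not the repository's loader: `check_entry` and `iter_entries` are hypothetical helpers, and treating a missing `image` field as an empty list is an assumption about text-only entries.

```python
import json


def check_entry(entry):
    """Return the image count if <image> placeholders and image paths agree, else raise."""
    n_placeholders = sum(turn["value"].count("<image>") for turn in entry["conversations"])
    n_images = len(entry.get("image", []))  # assumption: text-only entries may omit "image"
    if n_placeholders != n_images:
        raise ValueError(
            f"entry {entry['id']}: {n_placeholders} placeholders vs {n_images} images"
        )
    return n_images


def iter_entries(jsonl_path):
    """Yield one dictionary per non-empty line of a JSONL file."""
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# The entry from the format example above, written as a Python dictionary.
sample = {
    "id": 0,
    "conversations": [
        {"from": "human", "value": "<image>\nuser input <image>\nuser input"},
        {"from": "gpt", "value": "assistant output"},
        {"from": "human", "value": "<image>\nuser input"},
        {"from": "gpt", "value": "assistant output"},
    ],
    "image": ["path/to/image1.jpg", "path/to/image2.jpg", "path/to/image3.jpg"],
}
print(check_entry(sample))
```

To scan a whole file, iterate with `for entry in iter_entries(path): check_entry(entry)`.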
## 🛠️ QuickStart
### Inference Installation
We recommend using [uv](https://docs.astral.sh/uv/) to manage the environment.
> uv installation guide: <https://docs.astral.sh/uv/getting-started/installation/#installing-uv>
```bash
git clone git@github.com:OpenSenseNova/SenseNova-SI.git
cd SenseNova-SI/
uv sync --extra cu124 # or one of [cu118|cu121|cu124|cu126|cu128|cu129], depending on your CUDA version
source .venv/bin/activate
```
#### Hello World
A simple image-free test to verify environment setup and download the model.
```bash
python example.py \
--question "Hello" \
--model_path sensenova/SenseNova-SI-1.4-InternVL3-8B
```
#### Switching Between Supported Models
We fully support multiple model architectures.
To use a different model, simply change the value of the `--model_path` argument; no other code changes are required.
To use BAGEL-MoT:
```bash
--model_path sensenova/SenseNova-SI-1.1-BAGEL-7B-MoT
```
To use Qwen3-VL:
```bash
--model_path sensenova/SenseNova-SI-1.3-Qwen3-VL-8B
```
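When comparing several checkpoints on the same question, it can help to build the `example.py` command line programmatically. The sketch below uses only flags that appear in this README (`--model_path`, `--question`, `--image_paths`); `build_command` is a hypothetical helper, not part of the repository, and the `subprocess.run` call is left commented out because it would download and load each model.

```python
import shlex

MODELS = [
    "sensenova/SenseNova-SI-1.4-InternVL3-8B",
    "sensenova/SenseNova-SI-1.3-Qwen3-VL-8B",
    "sensenova/SenseNova-SI-1.1-BAGEL-7B-MoT",
]


def build_command(model_path, question, image_paths=()):
    """Assemble the argument list for one example.py invocation."""
    cmd = ["python", "example.py", "--question", question, "--model_path", model_path]
    if image_paths:
        cmd += ["--image_paths", *image_paths]
    return cmd


for model in MODELS:
    cmd = build_command(model, "Hello")
    print(shlex.join(cmd))  # shell-quoted form, for copy and paste
    # import subprocess; subprocess.run(cmd, check=True)  # uncomment to actually run
```

Passing the arguments as a list avoids shell-quoting problems with long questions.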
### Examples
For more examples, see [example](docs/en/example.md).
#### Example for BAGEL generation
To run the image generation example, which is specific to the BAGEL-7B-MoT architecture, use the following command:
```bash
python example_bagel.py \
--model_path sensenova/SenseNova-SI-1.1-BAGEL-7B-MoT \
--prompt "A chubby cat made of 3D point clouds, stretching its body, translucent with a soft glow." \
--mode generate
```
Use `--mode think_generate` to enable thinking before generation. Below is a comparison of the two modes for the same prompt:
<table>
<tr>
<th>mode=generate</th>
<th>mode=think_generate</th>
</tr>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="./examples/bagel-generate-example.jpg" alt="First image" width="100%">
</td>
<td align="center" width="50%" style="padding:4px;">
<img src="./examples/bagel-think_generate-example.jpg" alt="Second image" width="100%">
</td>
</tr>
</table>
#### Example 1
This example is from [SITE-Bench](https://github.com/wenqi-wang20/SITE-Bench):
```bash
python example.py \
--image_paths examples/Q1_1.png \
--question "Question: Consider the real-world 3D locations of the objects. Which is closer to the sink, the toilet paper or the towel?\nOptions: \nA. toilet paper\nB. towel\nGive me the answer letter directly. The best answer is:" \
--model_path sensenova/SenseNova-SI-1.5-InternVL3-8B
# --model_path sensenova/SenseNova-SI-1.3-Qwen3-VL-8B
```
<!-- Example 1 -->
<details open>
<summary><strong>Details of Example 1</strong></summary>
<p><strong>Q: </strong>Question: Consider the real-world 3D locations of the objects. Which is closer to the sink, the toilet paper or the towel?\nOptions: \nA. toilet paper\nB. towel\nGive me the answer letter directly. The best answer is:</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="./examples/Q1_1.png" alt="First image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: A</strong></p>
</details>
#### Example 2
This example is from [MMSI-Bench](https://github.com/InternRobotics/MMSI-Bench):
```bash
python example.py \
--image_paths examples/Q2_1.png examples/Q2_2.png \
--question "If the landscape painting is on the east side of the bedroom, where is the window located in the bedroom?\nOptions: A. North side, B. South side, C. West side, D. East side\nAnswer with the option's letter from the given choices directly. Enclose the option's letter within ``." \
--model_path sensenova/SenseNova-SI-1.5-InternVL3-8B
# --model_path sensenova/SenseNova-SI-1.3-Qwen3-VL-8B
```
<!-- Example 2 -->
<details open>
<summary><strong>Details of Example 2</strong></summary>
<p><strong>Q: </strong>If the landscape painting is on the east side of the bedroom, where is the window located in the bedroom?\nOptions: A. North side, B. South side, C. West side, D. East side\nAnswer with the option's letter from the given choices directly. Enclose the option's letter within ``.</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="./examples/Q2_1.png" alt="First image" width="100%">
</td>
<td align="center" width="50%" style="padding:4px;">
<img src="./examples/Q2_2.png" alt="Second image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: C</strong></p>
</details>
#### Example 3
This example is from [MMSI-Bench](https://github.com/InternRobotics/MMSI-Bench) and tests the model's capability in open-ended short-answer questions:
```bash
python example.py \
--image_paths examples/Q3_1.png examples/Q3_2.png examples/Q3_3.png \
--question "The robot is making tea. What is the order in which the pictures were taken?" \
--model_path sensenova/SenseNova-SI-1.3-InternVL3-8B
```
<!-- Example 3 -->
<details open>
<summary><strong>Details of Example 3</strong></summary>
<p><strong>Q: </strong>The robot is making tea. What is the order in which the pictures were taken?</p>
<table>
<tr>
<td align="center" width="33%" style="padding:4px;">
<img src="./examples/Q3_1.png" alt="First image" width="100%">
</td>
<td align="center" width="33%" style="padding:4px;">
<img src="./examples/Q3_2.png" alt="Second image" width="100%">
</td>
<td align="center" width="33%" style="padding:4px;">
<img src="./examples/Q3_3.png" alt="Third image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: Second, first, third</strong></p>
</details>
#### Example 4
This example demonstrates the model's **grounding** capability, from [RefCOCO](https://github.com/lichengunc/refer):
```bash
python example.py \
--image_paths examples/Q4.png \
--question "Please provide the bounding box coordinate of the region this sentence describes: <ref>blue shirt lady</ref>" \
--model_path sensenova/SenseNova-SI-1.4-InternVL3-8B
```
<!-- Example 4 -->
<details open>
<summary><strong>Details of Example 4</strong></summary>
<p><strong>Q: </strong>Please provide the bounding box coordinate of the region this sentence describes: &lt;ref&gt;blue shirt lady&lt;/ref&gt;</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="./examples/Q4.png" alt="First image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: [0.096234, 0.161229, 0.436516, 1.000000]</strong></p>
</details>
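The ground-truth box above appears to be given as normalized `[x1, y1, x2, y2]` coordinates in the range [0, 1]. That reading is an assumption, since the output format is not specified here. Under that assumption, a box can be mapped to pixel coordinates as sketched below; the 640×480 image size is a made-up example and `to_pixels` is a hypothetical helper.

```python
def to_pixels(box, width, height):
    """Scale a normalized [x1, y1, x2, y2] box to integer pixel coordinates."""
    x1, y1, x2, y2 = box
    return [round(x1 * width), round(y1 * height), round(x2 * width), round(y2 * height)]


gt = [0.096234, 0.161229, 0.436516, 1.000000]  # ground truth from Example 4
print(to_pixels(gt, width=640, height=480))    # image size is illustrative only
```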
#### Example 5
This example demonstrates the model's **depth estimation** capability:
```bash
python example.py \
--image_paths examples/Q5.png \
--question "Identify the minimal distance between the point and the camera, in meters." \
--model_path sensenova/SenseNova-SI-1.4-InternVL3-8B
```
<!-- Example 5 -->
<details open>
<summary><strong>Details of Example 5</strong></summary>
<p><strong>Q: </strong>Identify the minimal distance between the point and the camera, in meters.</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="./examples/Q5.png" alt="First image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: 4.4</strong></p>
</details>
#### Example 6
This example demonstrates the model's capability in **solid geometry (three views)**:
```bash
python example.py \
--image_paths examples/Q6.png \
--question "Enclose your thinking process in <think> </think> tags and your final answer in <answer> </answer>" \
--model_path sensenova/SenseNova-SI-1.5-InternVL3-8B
```
<!-- Example 6 -->
<details open>
<summary><strong>Details of Example 6</strong></summary>
<p><strong>Q: </strong>Enclose your thinking process in &lt;think> &lt;/think> tags and your final answer in &lt;answer> &lt;/answer></p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="./examples/Q6.png" alt="First image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: D</strong></p>
</details>
#### Example 7
This example demonstrates the model's capability in **solid geometry (nets of 3D shapes)**:
```bash
python example.py \
--image_paths examples/Q7.png \
--question "请将你的思考过程放在<think></think>标签内,并将你的最终答案放在<answer></answer>标签内。" \
--model_path sensenova/SenseNova-SI-1.5-InternVL3-8B
```
<!-- Example 7 -->
<details open>
<summary><strong>Details of Example 7</strong></summary>
<p><strong>Q: </strong>Enclose your thinking process in &lt;think> &lt;/think> tags and your final answer in &lt;answer> &lt;/answer></p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="./examples/Q7.png" alt="First image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: D</strong></p>
</details>
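Examples 6 and 7 ask the model to wrap its reasoning in `<think>` tags and its final answer in `<answer>` tags. A minimal sketch for separating the two parts of a reply is shown below; `parse_response` is a hypothetical helper, and the sample reply is made up for illustration rather than taken from an actual model run.

```python
import re


def parse_response(text):
    """Split a reply into (thinking, answer); a part is None when its tag is missing."""

    def grab(tag):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        return match.group(1).strip() if match else None

    return grab("think"), grab("answer")


reply = "<think>The top view is a circle, so the solid is round.</think><answer>D</answer>"
thinking, answer = parse_response(reply)
print(answer)
```

Returning `None` for a missing tag lets the caller decide how to score replies that ignore the requested format.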
#### Test Multiple Questions in a Single Run
Prepare a file similar to [examples/examples.jsonl](examples/examples.jsonl), where each line represents a single question.
The model is loaded once and processes questions sequentially. The questions remain independent of each other.
> For more details on the `jsonl` format, refer to the documentation for [Single-Image Data](https://internvl.readthedocs.io/en/latest/get_started/chat_data_format.html#single-image-data) and [Multi-Image Data](https://internvl.readthedocs.io/en/latest/get_started/chat_data_format.html#multi-image-data).
```bash
python example.py \
--jsonl_path examples/examples.jsonl \
--model_path sensenova/SenseNova-SI-1.3-InternVL3-8B
```
### Training
#### 1. Download Dataset
Users may choose to download [SenseNova-SI-800K](https://huggingface.co/datasets/sensenova/SenseNova-SI-800K) (a downsampled subset of SenseNova-SI-8M, specifically designed for studying scaling laws) or [SenseNova-SI-8M](https://huggingface.co/datasets/sensenova/SenseNova-SI-8M) (the official full-scale training dataset).
Download [SenseNova-SI-800K](https://huggingface.co/datasets/sensenova/SenseNova-SI-800K) into `training/data/`:
```bash
pip install huggingface_hub
huggingface-cli download sensenova/SenseNova-SI-800K --repo-type dataset --local-dir training/data/SenseNova-SI-800K
```
Download [SenseNova-SI-8M](https://huggingface.co/datasets/sensenova/SenseNova-SI-8M) into `training/data/`:
```bash
pip install huggingface_hub
huggingface-cli download sensenova/SenseNova-SI-8M --repo-type dataset --local-dir training/data/SenseNova-SI-8M
```
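The same downloads can be scripted from Python with `snapshot_download` from the `huggingface_hub` package. This is a sketch under the directory layout used above; `download_args` and `download` are hypothetical helpers, and the actual call is commented out because it transfers the full dataset.

```python
def download_args(name, root="training/data"):
    """Build snapshot_download keyword arguments for a SenseNova-SI dataset."""
    return {
        "repo_id": f"sensenova/{name}",
        "repo_type": "dataset",
        "local_dir": f"{root}/{name}",
    }


def download(name):
    """Download one dataset; huggingface_hub is imported lazily so the helper above stays dependency-free."""
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_args(name))


print(download_args("SenseNova-SI-800K"))
# download("SenseNova-SI-800K")  # uncomment to fetch the subset
# download("SenseNova-SI-8M")    # uncomment to fetch the full dataset
```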
#### 2(a). Training with InternVL
**Download pretrained model**
Download [InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B) into `training/pretrained_models/`:
```bash
huggingface-cli download OpenGVLab/InternVL3-8B --local-dir training/pretrained_models/OpenGVLab/InternVL3-8B
```
**Install dependencies**
```bash
conda create -n internvl python=3.10 -y
conda activate internvl
pip install uv
uv pip install -r training/intern_vl/requirements.txt
uv pip install flash-attn==2.3.6
```
**Run training**
```bash
bash training/intern_vl/internvl_chat/shell/sensenova_si_800K_internvl3_8b.sh # Train with SenseNova-SI-800K data
bash training/intern_vl/internvl_chat/shell/sensenova_si_8M_internvl3_8b.sh # Or train with SenseNova-SI-8M data
```
#### 2(b). Training with Qwen3-VL
The training framework is [lmms-engine](https://github.com/EvolvingLMMs-Lab/lmms-engine), included as a git submodule under `training/lmms-engine/`.
**Download pretrained model**
Download [Qwen3-VL-8B-Instruct](https://github.com/QwenLM/Qwen3-VL) into `training/pretrained_models/`:
```bash
huggingface-cli download Qwen/Qwen3-VL-8B-Instruct --local-dir training/pretrained_models/Qwen/Qwen3-VL-8B-Instruct
```
**Install dependencies**
```bash
# Initialize the lmms-engine submodule (first time only)
git submodule update --init --recursive
conda create -n qwen3vl python=3.10 -y
conda activate qwen3vl
pip install uv
uv pip install -e training/lmms-engine
# Optional: Performance optimizations
uv pip install flash-attn --no-build-isolation
uv pip install liger-kernel
```
**Preprocess dataset**
Convert `SenseNova-SI-800K.jsonl` and `SenseNova-SI-8M.jsonl` to Qwen3-VL training format:
```bash
python training/qwen3_vl/preprocess_sensenova_si_dataset.py \
--src data/SenseNova-SI-800K/SenseNova-SI-800K.jsonl \
--dst data/SenseNova-SI-800K/SenseNova-SI-800K_qwen3vl_format.jsonl # Preprocess SenseNova-SI-800K data
python training/qwen3_vl/preprocess_sensenova_si_dataset.py \
--src data/SenseNova-SI-8M/SenseNova-SI-8M.jsonl \
--dst data/SenseNova-SI-8M/SenseNova-SI-8M_qwen3vl_format.jsonl # Preprocess SenseNova-SI-8M data
```
**Prepare dataset YAML**
See [training/qwen3_vl/data_800K.yaml](training/qwen3_vl/data_800K.yaml) and [training/qwen3_vl/data_8M.yaml](training/qwen3_vl/data_8M.yaml).
**Configure training**
See [training/qwen3_vl/train_config_800K.yaml](training/qwen3_vl/train_config_800K.yaml) and [training/qwen3_vl/train_config_8M.yaml](training/qwen3_vl/train_config_8M.yaml)
**Run training**
```bash
# Single node, 8 GPUs (default)
bash training/qwen3_vl/run.sh 800K # Train with SenseNova-SI-800K data
bash training/qwen3_vl/run.sh 8M # Or train with SenseNova-SI-8M data
```
#### 2(c). Training with Bagel
**Download pretrained model**
Download [BAGEL-7B-MoT](https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT) into `training/pretrained_models/`:
```bash
huggingface-cli download ByteDance-Seed/BAGEL-7B-MoT --local-dir training/pretrained_models/BAGEL-7B-MoT
```
**Install dependencies**
```bash
conda create -n bagel python=3.10 -y
conda activate bagel
pip install uv
uv pip install -r training/bagel/requirements.txt
uv pip install flash_attn==2.5.8 --no-build-isolation
```
**Run training**
```bash
bash training/bagel/scripts/train_sensenova_si_800K.sh # Train with SenseNova-SI-800K data
bash training/bagel/scripts/train_sensenova_si_8M.sh # Or train with SenseNova-SI-8M data
```
For details on training hyperparameters (learning rate, batch size, FSDP config, etc.), refer to [training/bagel/TRAIN.md](training/bagel/TRAIN.md).
### Evaluation
To reproduce the benchmark results above, please refer to [EASI](https://github.com/EvolvingLMMs-Lab/EASI) to evaluate SenseNova-SI on mainstream spatial intelligence benchmarks.
EASI supports over 20 spatial intelligence models and more than 20 spatial benchmarks, offering Docker for one-click spatial intelligence evaluation.
## Acknowledgements
This project includes code modified from the original code by the BAGEL, InternVL, and lmms-engine teams.
* Source repositories: [BAGEL](https://github.com/bytedance-seed/BAGEL), [InternVL](https://github.com/opengvlab/internvl), [lmms-engine](https://github.com/EvolvingLMMs-Lab/lmms-engine)
We gratefully acknowledge the authors and contributors for their work.
Please refer to the original repositories for full details, updates, and licensing information.
## 🖊️ Citation
```bib
@InProceedings{sensenova-si,
title = {Scaling Spatial Intelligence with Multimodal Foundation Models},
author = {Cai, Zhongang and Wang, Ruisi and Gu, Chenyang and Pu, Fanyi and Xu, Junxiang and Wang, Yubo and Yin, Wanqi and Yang, Zhitao and Wei, Chen and Sun, Qingping and Zhou, Tongxi and Li, Jiaqi and Pang, Hui En and Qian, Oscar and Wei, Yukun and Lin, Zhiqian and Shi, Xuanke and Deng, Kewang and Han, Xiaoyang and Chen, Zukai and Fan, Xiangyu and Deng, Hanming and Lu, Lewei and Pan, Liang and Li, Bo and Liu, Ziwei and Wang, Quan and Lin, Dahua and Yang, Lei},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}
```
<div align="center">
# SenseNova-SI: Scaling Spatial Intelligence with Multimodal Foundation Models
</div>
<div align="center">
[English](README.md) | 简体中文
<p align="center">
<a href="https://arxiv.org/abs/2511.13719" target="_blank">
<img alt="arXiv" src="https://img.shields.io/badge/arXiv-SenseNova_SI-red?logo=arxiv" height="20" />
</a>
<a href="https://huggingface.co/collections/sensenova/sensenova-si" target="_blank">
<img alt="SenseNova-SI" src="https://img.shields.io/badge/%F0%9F%A4%97%20_SenseNova_SI-Models-ffc107?color=ffc107&logoColor=white" height="20" />
</a>
<a href="https://huggingface.co/datasets/sensenova/SenseNova-SI-8M" target="_blank">
<img alt="SenseNova-SI-8M" src="https://img.shields.io/badge/%F0%9F%A4%97%20_SenseNova_SI_8M-Data-ffc107?color=ffc107&logoColor=white" height="20" />
</a>
<a href="https://modelscope.cn/collections/SenseNova-SI-a1d78333be8d42" target="_blank">
<img alt="SenseNova-SI" src="https://img.shields.io/badge/🤖 ModelScope-Models-blue" height="20" />
</a>
<a href="https://easi.lmms-lab.com/leaderboard/" target="_blank">
<img alt="Leaderboard" src="https://img.shields.io/badge/%F0%9F%A4%97%20_EASI-Leaderboard-ffc107?color=ffc107&logoColor=white" height="20" />
</a>
<a href="https://github.com/EvolvingLMMs-Lab/EASI" target="_blank">
<img alt="Code" src="https://img.shields.io/badge/EASI-Code-100000?style=flat-square&logo=github&logoColor=white" height="20" />
</a>
<a href="https://github.com/OpenSenseNova/SenseNova-SI/blob/main/LICENSE"><img src="https://img.shields.io/github/license/OpenSenseNova/SenseNova-SI"></a>
</p>
</div>
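## Overview
Despite remarkable progress, multimodal foundation models still show clear deficiencies in spatial intelligence.
Building on established multimodal foundations, including visual understanding models (e.g., Qwen3-VL, InternVL3) and unified understanding-and-generation models (e.g., Bagel), this work takes a scaling perspective to construct the [**SenseNova-SI family of models**](https://huggingface.co/collections/sensenova/sensenova-si).
We systematically curate SenseNova-SI-8M, a dataset of eight million samples organized under a rigorous taxonomy of spatial capabilities, to cultivate high-performing and robust spatial intelligence.
The models achieve breakthrough performance across multiple spatial intelligence benchmarks while maintaining strong general multimodal understanding.
We further analyze the impact of data scale, reveal the emergent generalization brought by training on diverse data, examine the risks of overfitting and language shortcuts, present a preliminary study of spatial chain-of-thought reasoning, and validate the potential for downstream applications.
SenseNova-SI is an ongoing project, and all newly trained multimodal spatial intelligence foundation models will be open-sourced over time to advance research in the field.
*In the future, SenseNova-SI will be integrated with larger-scale in-house models.*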
## News
- [2026-05-12] We released the official full training data of the SenseNova-SI series, [**SenseNova-SI-8M**](https://huggingface.co/datasets/sensenova/SenseNova-SI-8M). SenseNova-SI-8M contains about 8.12 million curated training samples covering about 2.76 million unique images.
- [2026-04-13] We released [**SenseNova-SI-1.3-Qwen3-VL-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.3-Qwen3-VL-8B), trained from **Qwen3-VL** on **14M**-scale SI data. It scores **61.4** on EASI-8, performs strongly across a broad range of spatial intelligence benchmarks, and further improves **open-ended spatial short-answer** ability over earlier Qwen-based SenseNova-SI releases.
- [2026-04-01] We released [**SenseNova-SI-1.5-InternVL3-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.5-InternVL3-8B). It remains strong on multiple spatial intelligence benchmarks and, compared with earlier versions, significantly improves the analysis and solving of **solid geometry** problems, reaching **63.5** accuracy on SolidGeo MCQ.
- [2026-03-27] We released [**SenseNova-SI-1.4-InternVL3-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.4-InternVL3-8B). Its training data is scaled to **29M**. It remains strong on multiple spatial intelligence benchmarks and, compared with earlier versions, significantly improves **grounding** and **depth estimation**, reaching **89.21** on RefCOCO avg and **78.64** on CountBench.
- [2026-02-21] Our work was accepted to CVPR 2026! A paper is only a milestone; what matters more is to keep pushing the boundary of spatial intelligence models and to share our results with the community.
- [2026-01-09] We released [**SenseNova-SI-1.3-InternVL3-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.3-InternVL3-8B), which improves open-ended spatial short-answer ability.
- [2025-12-06] To advance research in spatial intelligence, we first release an efficient data subset, [**SenseNova-SI-800K**](https://huggingface.co/datasets/sensenova/SenseNova-SI-800K), together with the model [**SenseNova-SI-1.1-InternVL3-8B-800K**](https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-8B-800K). This model is trained only on the SenseNova-SI-800K subset and serves as a reference for researchers experimenting with 800K-scale data.
- [2025-12-06] In this release, we introduce [**SenseNova-SI-1.2-InternVL3-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.2-InternVL3-8B), [**SenseNova-SI-1.1-Qwen2.5-VL-3B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-Qwen2.5-VL-3B), [**SenseNova-SI-1.1-Qwen2.5-VL-7B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-Qwen2.5-VL-7B), and [**SenseNova-SI-1.1-Qwen3-VL-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-Qwen3-VL-8B). **SenseNova-SI-1.2-InternVL3-8B** achieves state-of-the-art performance among open-source models of comparable size on eight recently released spatial intelligence benchmarks (VSI, MMSI, MindCube, ViewSpatial, SITE, BLINK, 3DSRBench, EmbSpatial-Bench).
- [2025-11-15] We released [**SenseNova-SI-1.1-InternVL3-2B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-2B) and [**SenseNova-SI-1.1-InternVL3-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-8B), which achieve state-of-the-art performance among open-source models of comparable size on five recently released spatial intelligence benchmarks (VSI, MMSI, MindCube, ViewSpatial, SITE).
## Model Zoo
<table>
<thead>
<tr>
<th>Model</th>
<th>Base Architecture</th>
<th>Dataset Scale</th>
<th>EASI-8</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<a href="https://huggingface.co/sensenova/SenseNova-SI-1.5-InternVL3-8B/">
SenseNova-SI-1.5-InternVL3-8B
</a>
</td>
<td>SenseNova-SI-1.4-InternVL3-8B</td>
<td>1.5M</td>
<td>64.4</td>
<td>Enhanced solid geometry ability</td>
</tr>
<tr>
<td>
<a href="https://huggingface.co/sensenova/SenseNova-SI-1.4-InternVL3-8B/">
SenseNova-SI-1.4-InternVL3-8B
</a>
</td>
<td>InternVL3</td>
<td>29M</td>
<td>63.7</td>
<td>Enhanced grounding and depth estimation ability</td>
</tr>
<tr>
<td>
<a href="https://huggingface.co/sensenova/SenseNova-SI-1.3-InternVL3-8B/">
SenseNova-SI-1.3-InternVL3-8B
</a>
</td>
<td>InternVL3</td>
<td>14M</td>
<td>65.2</td>
<td>Best model for spatial intelligence; enhanced open-ended short-answer ability</td>
</tr>
<tr>
<td>
<a href="https://huggingface.co/sensenova/SenseNova-SI-1.3-Qwen3-VL-8B/">
SenseNova-SI-1.3-Qwen3-VL-8B
</a>
</td>
<td>Qwen3-VL</td>
<td>14M</td>
<td>61.4</td>
<td>Enhanced open-ended short-answer ability</td>
</tr>
<tr>
<td>
<a href="https://huggingface.co/sensenova/SenseNova-SI-1.2-InternVL3-8B/">
SenseNova-SI-1.2-InternVL3-8B
</a>
</td>
<td>InternVL3</td>
<td>10M</td>
<td>64.5</td>
<td>-</td>
</tr>
<tr>
<td>
<a href="https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-8B/">
SenseNova-SI-1.1-InternVL3-8B
</a>
</td>
<td>InternVL3</td>
<td>8M</td>
<td>61.5</td>
<td>-</td>
</tr>
<tr>
<td>
<a href="https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-2B/">
SenseNova-SI-1.1-InternVL3-2B
</a>
</td>
<td>InternVL3</td>
<td>8M</td>
<td>49.4</td>
<td>-</td>
</tr>
<tr>
<td>
<a href="https://huggingface.co/sensenova/SenseNova-SI-1.1-Qwen3-VL-8B/">
SenseNova-SI-1.1-Qwen3-VL-8B
</a>
</td>
<td>Qwen3-VL</td>
<td>8M</td>
<td>58.1</td>
<td>-</td>
</tr>
<tr>
<td>
<a href="https://huggingface.co/sensenova/SenseNova-SI-1.1-Qwen2.5-VL-7B">
SenseNova-SI-1.1-Qwen2.5-VL-7B
</a>
</td>
<td>Qwen2.5-VL</td>
<td>8M</td>
<td>51.0</td>
<td>-</td>
</tr>
<tr>
<td>
<a href="https://huggingface.co/sensenova/SenseNova-SI-1.1-Qwen2.5-VL-3B/">
SenseNova-SI-1.1-Qwen2.5-VL-3B
</a>
</td>
<td>Qwen2.5-VL</td>
<td>8M</td>
<td>45.7</td>
<td>-</td>
</tr>
<tr>
<td>
<a href="https://huggingface.co/sensenova/SenseNova-SI-1.1-BAGEL-7B-MoT">
SenseNova-SI-1.1-BAGEL-7B-MoT
</a>
</td>
<td>BAGEL</td>
<td>8M</td>
<td>48.6</td>
<td>Unified understanding and generation model</td>
</tr>
</tbody>
</table>
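## Release Information
### Models
Currently, we build SenseNova-SI on popular open-source foundation models to maximize compatibility with existing research pipelines.
In this release, we present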
[**SenseNova-SI-1.5-InternVL3-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.5-InternVL3-8B),
[**SenseNova-SI-1.4-InternVL3-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.4-InternVL3-8B),
[**SenseNova-SI-1.3-InternVL3-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.3-InternVL3-8B),
[**SenseNova-SI-1.3-Qwen3-VL-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.3-Qwen3-VL-8B),
[**SenseNova-SI-1.2-InternVL3-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.2-InternVL3-8B),
[**SenseNova-SI-1.1-InternVL3-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-8B),
[**SenseNova-SI-1.1-Qwen3-VL-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-Qwen3-VL-8B),
[**SenseNova-SI-1.1-Qwen2.5-VL-7B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-Qwen2.5-VL-7B),
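[**SenseNova-SI-1.1-Qwen2.5-VL-3B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-Qwen2.5-VL-3B), and
[**SenseNova-SI-1.1-InternVL3-2B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-2B).
Among them, **SenseNova-SI-1.4-InternVL3-8B** performs strongly across a broad range of spatial intelligence benchmarks and further improves on **grounding** tasks, reaching an average of **89.21** over all RefCOCO splits and **78.64** on CountBench. On the depth estimation tasks we constructed from the Ibims dataset, it reaches **95.56** on relative depth and **80.31** on absolute depth.
**SenseNova-SI-1.5-InternVL3-8B** shows strong spatial intelligence while significantly improving its ability to analyze and solve **solid geometry** problems. It reaches **63.5** accuracy on SolidGeo MCQ, and **72.7** and **68.9** on SolidMath and Math3D, two internal benchmarks built from a K12 question bank.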
<table>
<thead>
<tr>
<th>Model</th>
<th>VSI</th>
<th>MMSI</th>
<th>MindCube-Tiny</th>
<th>ViewSpatial</th>
<th>SITE</th>
<th>BLINK</th>
<th>3DSRBench</th>
<th>EmbSpatial-Bench</th>
</tr>
</thead>
<tbody>
<tr style="background:#F2F0EF;font-weight:700;text-align:center;">
<td colspan="9"><em>Open-source Models (~2B)</em></td>
</tr>
<tr>
<td>InternVL3-2B</td><td>32.9</td><td>26.5</td><td>37.5</td><td>32.5</td><td>30.0</td><td>50.8</td><td>47.7</td><td>60.1</td>
</tr>
<tr>
<td>Qwen3-VL-2B-Instruct</td><td>50.3</td><td>28.9</td><td>34.5</td><td>36.9</td><td>35.6</td><td>53.2</td><td>47.5</td><td>70.1</td>
</tr>
<tr>
<td>MindCube-3B-RawQA-SFT</td><td>17.2</td><td>1.7</td><td>51.7</td><td>24.1</td><td>6.3</td><td>35.1</td><td>2.8</td><td>37.0</td>
</tr>
<tr>
<td>SpatialLadder-3B</td><td>44.8</td><td>27.4</td><td>43.4</td><td>39.8</td><td>27.9</td><td>43.0</td><td>42.8</td><td>58.2</td>
</tr>
<tr>
<td>SpatialMLLM-4B</td><td>46.3</td><td>26.1</td><td>33.4</td><td>34.6</td><td>18.0</td><td>40.5</td><td>36.2</td><td>50.0</td>
</tr>
<tr>
<td>VST-3B-SFT</td><td>57.9</td><td>30.2</td><td>35.9</td><td>52.8</td><td>35.8</td><td>58.8</td><td>54.1</td><td>69.0</td>
</tr>
<tr>
<td>Cambrian-S-3B</td><td>57.3</td><td>25.2</td><td>32.5</td><td>39.0</td><td>28.3</td><td>37.7</td><td>50.9</td><td>63.5</td>
</tr>
<tr style="background:#F2F0EF;font-weight:700;text-align:center;">
<td colspan="9"><em>Open-source Models (~8B)</em></td>
</tr>
<tr>
<td>InternVL3-8B</td><td>42.1</td><td>28.0</td><td>41.5</td><td>38.6</td><td>41.1</td><td>53.5</td><td>44.3</td><td>76.4</td>
</tr>
<tr>
<td>Qwen3-VL-8B-Instruct</td><td>57.9</td><td>31.1</td><td>29.4</td><td>42.2</td><td>45.8</td><td>66.7</td><td>53.9</td><td>77.7</td>
</tr>
<tr>
<td>BAGEL-7B-MoT</td><td>31.4</td><td>31.0</td><td>34.7</td><td>41.3</td><td>37.0</td><td>63.7</td><td>50.2</td><td>73.1</td>
</tr>
<tr>
<td>SpaceR-7B</td><td>41.5</td><td>27.4</td><td>37.9</td><td>35.8</td><td>34.2</td><td>49.6</td><td>40.5</td><td>66.9</td>
</tr>
<tr>
<td>ViLaSR-7B</td><td>44.6</td><td>30.2</td><td>35.1</td><td>35.7</td><td>38.7</td><td>51.4</td><td>46.6</td><td>67.3</td>
</tr>
<tr>
<td>VST-7B-SFT</td><td>60.6</td><td>32.0</td><td>39.7</td><td>50.5</td><td>39.6</td><td>61.9</td><td>54.6</td><td>73.7</td>
</tr>
<tr>
<td>Cambrian-S-7B</td><td><strong>67.5</strong></td><td>25.8</td><td>39.6</td><td>40.9</td><td>33.0</td><td>37.9</td><td>54.8</td><td>72.8</td>
</tr>
<tr>
<td><strong>SenseNova-SI-1.3-InternVL3-8B</strong></td>
<td><strong>68.6</strong></td>
<td><strong>42.5</strong></td>
<td><strong>89.9</strong></td>
<td><strong>61.3</strong></td>
<td><strong>47.5</strong></td>
<td><strong>68.0</strong></td>
<td><strong>62.4</strong></td>
<td><strong>81.0</strong></td>
</tr>
<tr>
<td><strong>SenseNova-SI-1.3-Qwen3-VL-8B</strong></td>
<td><strong>67.8</strong></td>
<td><strong>39.5</strong></td>
<td><strong>68.3</strong></td>
<td><strong>55.8</strong></td>
<td><strong>57.5</strong></td>
<td><strong>63.0</strong></td>
<td><strong>57.3</strong></td>
<td><strong>82.1</strong></td>
</tr>
<tr>
<td><strong>SenseNova-SI-1.4-InternVL3-8B</strong></td>
<td>66.6</td>
<td><strong>40.1</strong></td>
<td><strong>88.8</strong></td>
<td><strong>55.7</strong></td>
<td><strong>47.9</strong></td>
<td><strong>68.1</strong></td>
<td><strong>60.4</strong></td>
<td><strong>81.7</strong></td>
</tr>
<tr>
<td><strong>SenseNova-SI-1.5-InternVL3-8B</strong></td>
<td><strong>67.3</strong></td>
<td><strong>38.3</strong></td>
<td><strong>92.1</strong></td>
<td><strong>59.0</strong></td>
<td><strong>47.5</strong></td>
<td><strong>69.5</strong></td>
<td><strong>61.3</strong></td>
<td><strong>80.3</strong></td>
</tr>
<tr style="background:#F2F0EF;color:#6b7280;font-weight:600;text-align:center;">
<td colspan="9"><em>Proprietary Models</em></td>
</tr>
<tr style="color:#6b7280;">
<td>Gemini-2.5-pro-2025-06</td><td>53.5</td><td>38.0</td><td>57.6</td><td>46.0</td><td>57.0</td><td>73.5</td><td>59.3</td><td>78.9</td>
</tr>
<tr style="color:#6b7280;">
<td>Grok-4-2025-07-09</td><td>47.9</td><td>37.8</td><td>63.5</td><td>43.2</td><td>47.0</td><td>56.4</td><td>54.9</td><td>75.7</td>
</tr>
<tr style="color:#6b7280;">
<td>GPT-5-2025-08-07</td><td>55.0</td><td>41.8</td><td>56.3</td><td>45.5</td><td>61.8</td><td>68.0</td><td>60.3</td><td>81.6</td>
</tr>
</tbody>
</table>
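The EASI-8 values in the model zoo are numerically consistent with an unweighted mean of the eight benchmark scores in the table above. That reading is inferred from the reported numbers rather than stated as a definition here, so the Python sketch below (with a hypothetical helper, `easi8_mean`) is only a sanity check, not the official metric implementation.

```python
from statistics import mean

# Scores copied from the benchmark table above, in column order:
# VSI, MMSI, MindCube-Tiny, ViewSpatial, SITE, BLINK, 3DSRBench, EmbSpatial-Bench.
SCORES = {
    "SenseNova-SI-1.3-InternVL3-8B": [68.6, 42.5, 89.9, 61.3, 47.5, 68.0, 62.4, 81.0],
    "SenseNova-SI-1.3-Qwen3-VL-8B": [67.8, 39.5, 68.3, 55.8, 57.5, 63.0, 57.3, 82.1],
    "SenseNova-SI-1.4-InternVL3-8B": [66.6, 40.1, 88.8, 55.7, 47.9, 68.1, 60.4, 81.7],
    "SenseNova-SI-1.5-InternVL3-8B": [67.3, 38.3, 92.1, 59.0, 47.5, 69.5, 61.3, 80.3],
}

# EASI-8 values copied from the model zoo table.
REPORTED_EASI8 = {
    "SenseNova-SI-1.3-InternVL3-8B": 65.2,
    "SenseNova-SI-1.3-Qwen3-VL-8B": 61.4,
    "SenseNova-SI-1.4-InternVL3-8B": 63.7,
    "SenseNova-SI-1.5-InternVL3-8B": 64.4,
}


def easi8_mean(scores):
    """Assumed aggregation: plain average of the eight benchmark scores."""
    if len(scores) != 8:
        raise ValueError("expected exactly eight benchmark scores")
    return mean(scores)


if __name__ == "__main__":
    for name, scores in SCORES.items():
        print(f"{name}: mean={easi8_mean(scores):.2f}, reported={REPORTED_EASI8[name]}")
```

If a future release reports an EASI-8 value that drifts from this average by more than rounding error, the aggregation assumed here no longer holds and the EASI documentation should be consulted.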
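On grounding and depth estimation benchmarks, we report the following results. To reproduce the RefCOCO and CountBench results, please refer to [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval); the depth estimation results are evaluated on our internally constructed test set.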
<table>
<thead>
<tr>
<th>Model</th>
<th>RefCOCO avg</th>
<th>CountBench</th>
<th>Ibims Relative Depth</th>
<th>Ibims Absolute Depth</th>
</tr>
</thead>
<tbody>
<tr>
<td>InternVL3-8B</td><td>89.01</td><td>81.31</td><td>52.22</td><td>13.45</td>
</tr>
<tr>
<td>SenseNova-SI-1.3-InternVL3-8B</td><td>83.85</td><td>73.92</td><td>68.60</td><td>59.23</td>
</tr>
<tr>
<td><strong>SenseNova-SI-1.4-InternVL3-8B</strong></td>
<td><strong>89.21</strong></td>
<td><strong>78.64</strong></td>
<td><strong>95.56</strong></td>
<td><strong>80.31</strong></td>
</tr>
</tbody>
</table>
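Results on solid geometry benchmarks are as follows.
SolidGeo MCQ consists of the single-choice questions in [SolidGeo](https://huggingface.co/datasets/SolidGeo/SolidGeo).
The SolidMath and Math3D benchmarks are built from a K12 question bank and contain Chinese solid geometry multiple-choice questions. SolidMath is built from same-source data, while Math3D is built from different-source data.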
<table>
<thead>
<tr>
<th>Model</th>
<th>SolidGeo MCQ</th>
<th>SpatialViz-Bench</th>
<th>SolidMath</th>
<th>Math3D</th>
</tr>
</thead>
<tbody>
<tr>
<td>InternVL3-8B</td><td>36.4</td><td>32.0</td><td>42.5</td><td>43.7</td>
</tr>
<tr>
<td>SenseNova-SI-1.3-InternVL3-8B</td><td>36.5</td><td>29.6</td><td>39.6</td><td>40.3</td>
</tr>
<tr>
<td><strong>SenseNova-SI-1.5-InternVL3-8B</strong></td>
<td><strong>63.5</strong></td>
<td><strong>33.0</strong></td>
<td><strong>72.7</strong></td>
<td><strong>68.9</strong></td>
</tr>
</tbody>
</table>
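### Dataset
To advance research in spatial intelligence, we first release an efficient subset, [SenseNova-SI-800K](https://huggingface.co/datasets/sensenova/SenseNova-SI-800K).
Because SenseNova-SI is designed to study scaling laws, we observe that even this subset already yields substantial performance gains.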
<table>
<thead>
<tr>
<th>Model</th>
<th>SI Dataset</th>
<th>VSI</th>
<th>MMSI</th>
<th>MindCube-Tiny</th>
<th>ViewSpatial</th>
<th>SITE</th>
</tr>
</thead>
<tbody>
<tr>
<td>InternVL3-8B</td><td>-</td><td>42.1</td><td>28.0</td><td>41.5</td><td>38.6</td><td>41.1</td>
</tr>
<tr>
<td>VST-7B-SFT</td><td>VST-P-4.1M</td><td>60.6</td><td>32.0</td><td>39.7</td><td>50.5</td><td>39.6</td>
</tr>
<tr>
<td>Cambrian-S-7B</td><td>VSI-590K</td><td>67.5</td><td>25.8</td><td>39.6</td><td>40.9</td><td>33.0</td>
</tr>
<tr>
<td><strong><a href="https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-8B-800K/">*SenseNova-SI-1.1-InternVL3-8B-800K</a></strong></td>
<td><strong><a href="https://huggingface.co/datasets/sensenova/SenseNova-SI-800K">SenseNova-SI-800K</a></strong></td>
<td><strong>60.9</strong></td>
<td><strong>36.4</strong></td>
<td><strong>56.9</strong></td>
<td><strong>52.5</strong></td>
<td><strong>47.7</strong></td>
</tr>
<tr>
<td><strong><a href="https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-8B/">SenseNova-SI-1.1-InternVL3-8B</a></strong></td>
<td><strong>SenseNova-SI-8M</strong></td>
<td><strong>68.7</strong></td>
<td><strong>43.3</strong></td>
<td><strong>85.6</strong></td>
<td><strong>54.6</strong></td>
<td><strong>47.7</strong></td>
</tr>
</tbody>
</table>
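Note that *SenseNova-SI-1.1-InternVL3-8B-800K is trained on the SenseNova-SI-800K subset to give researchers a performance reference for 800K-scale training data. It is intended only for scaling-law analysis and research validation, and is not the primary recommended model of the SenseNova-SI series.
#### Data Format
Our data is stored in the **SenseNova-SI-800K.jsonl** file in JSONL (JSON Lines) format, where each line is an independent data entry. Each entry is a dictionary with three main fields: **`id`**, **`conversations`**, and **`image`**.
- `id`: the unique identifier of the entry.
- `image`: a list of strings specifying image paths, relative to the data root directory.
- `conversations`: a list of conversation turns; each turn is a dictionary with two key-value pairs:
  - `from`: the identity of the speaker (e.g., human or gpt).
  - `value`: the text content. Within `value`, each `<image>` placeholder marks where an image is inserted, and the number of `<image>` placeholders matches the number of images listed in the `image` field.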
```json
{
  "id": 0,
  "conversations": [
    {"from": "human", "value": "<image>\nuser input <image>\nuser input"},
    {"from": "gpt", "value": "assistant output"},
    {"from": "human", "value": "<image>\nuser input"},
    {"from": "gpt", "value": "assistant output"}
  ],
  "image": ["path/to/image1.jpg", "path/to/image2.jpg", "path/to/image3.jpg"]
}
```
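The placeholder rule above (one `<image>` per listed image) is easy to check mechanically before training. The following Python sketch is one way to do that; the helper names `check_entry` and `check_jsonl_lines` are hypothetical and not part of the repository, and the checks cover only the three fields described above.

```python
import json


def check_entry(entry):
    """Return a list of problems found in one record (empty list means OK)."""
    problems = [f"missing field: {key}"
                for key in ("id", "conversations", "image")
                if key not in entry]
    if problems:
        return problems
    if not isinstance(entry["image"], list):
        return ["'image' should be a list of path strings"]
    # Count <image> placeholders across every turn of the conversation.
    placeholders = sum(turn.get("value", "").count("<image>")
                       for turn in entry["conversations"])
    if placeholders != len(entry["image"]):
        problems.append(
            f"{placeholders} <image> placeholder(s) but "
            f"{len(entry['image'])} image path(s)")
    return problems


def check_jsonl_lines(lines):
    """Yield (line_number, problems) for each non-empty line that fails."""
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as err:
            yield number, [f"invalid JSON: {err.msg}"]
            continue
        problems = check_entry(entry)
        if problems:
            yield number, problems


if __name__ == "__main__":
    sample = json.dumps({
        "id": 0,
        "conversations": [
            {"from": "human", "value": "<image>\nuser input"},
            {"from": "gpt", "value": "assistant output"},
        ],
        "image": ["path/to/image1.jpg"],
    })
    print(list(check_jsonl_lines([sample])))
```

To check a real file, pass an open file handle (for example `open("SenseNova-SI-800K.jsonl", encoding="utf-8")`) to `check_jsonl_lines`, since it iterates line by line and does not load the whole file into memory.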
## 🛠️ Quick Start
### Installation for Inference
We recommend using [uv](https://docs.astral.sh/uv/) to manage the environment.
> uv installation guide: <https://docs.astral.sh/uv/getting-started/installation/#installing-uv>
```bash
git clone git@github.com:OpenSenseNova/SenseNova-SI.git
cd SenseNova-SI/
uv sync --extra cu124 # or one of [cu118|cu121|cu124|cu126|cu128|cu129], depending on your CUDA version
source .venv/bin/activate
```
#### Hello World
A simple test without images that verifies the environment is set up correctly and downloads the model.
```bash
python example.py \
--question "Hello" \
--model_path sensenova/SenseNova-SI-1.4-InternVL3-8B
```
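#### Switching Between Supported Models
We **fully support multiple model architectures**. To use a different model, just change the `--model_path` argument; no other code changes are needed.
To use the **BAGEL-MoT** model: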
```bash
--model_path sensenova/SenseNova-SI-1.1-BAGEL-7B-MoT
```
To use the **Qwen3-VL** model:
```bash
--model_path sensenova/SenseNova-SI-1.3-Qwen3-VL-8B
```
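### Examples
For more examples, see [Examples](docs/zh/example.md).
#### BAGEL Image Generation Example
To run the image generation example for the BAGEL-7B-MoT architecture, use the following command: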
```bash
python example_bagel.py \
--model_path sensenova/SenseNova-SI-1.1-BAGEL-7B-MoT \
--prompt "A chubby cat made of 3D point clouds, stretching its body, translucent with a soft glow." \
--mode generate
```
To generate with thinking mode enabled, use `--mode think_generate`. The results for the same prompt compare as follows:
<table>
<tr>
<th>mode=generate</th>
<th>mode=think_generate</th>
</tr>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="./examples/bagel-generate-example.jpg" alt="First image" width="100%">
</td>
<td align="center" width="50%" style="padding:4px;">
<img src="./examples/bagel-think_generate-example.jpg" alt="Second image" width="100%">
</td>
</tr>
</table>
#### Example 1
This example is from [SITE-Bench](https://github.com/wenqi-wang20/SITE-Bench):
```bash
python example.py \
--image_paths examples/Q1_1.png \
--question "Question: Consider the real-world 3D locations of the objects. Which is closer to the sink, the toilet paper or the towel?\nOptions: \nA. toilet paper\nB. towel\nGive me the answer letter directly. The best answer is:" \
--model_path sensenova/SenseNova-SI-1.5-InternVL3-8B
# --model_path sensenova/SenseNova-SI-1.3-Qwen3-VL-8B
```
<!-- Example 1 -->
<details open>
<summary><strong>Details of Example 1</strong></summary>
<p><strong>Q: </strong>Question: Consider the real-world 3D locations of the objects. Which is closer to the sink, the toilet paper or the towel?\nOptions: \nA. toilet paper\nB. towel\nGive me the answer letter directly. The best answer is:</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="./examples/Q1_1.png" alt="First image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: A</strong></p>
</details>
#### Example 2
This example is from [MMSI-Bench](https://github.com/InternRobotics/MMSI-Bench):
```bash
python example.py \
--image_paths examples/Q2_1.png examples/Q2_2.png \
--question "If the landscape painting is on the east side of the bedroom, where is the window located in the bedroom?\nOptions: A. North side, B. South side, C. West side, D. East side\nAnswer with the option's letter from the given choices directly. Enclose the option's letter within \`\`." \
--model_path sensenova/SenseNova-SI-1.5-InternVL3-8B
# --model_path sensenova/SenseNova-SI-1.3-Qwen3-VL-8B
```
<!-- Example 2 -->
<details open>
<summary><strong>Details of Example 2</strong></summary>
<p><strong>Q: </strong>If the landscape painting is on the east side of the bedroom, where is the window located in the bedroom?\nOptions: A. North side, B. South side, C. West side, D. East side\nAnswer with the option's letter from the given choices directly. Enclose the option's letter within ``.</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="./examples/Q2_1.png" alt="First image" width="100%">
</td>
<td align="center" width="50%" style="padding:4px;">
<img src="./examples/Q2_2.png" alt="Second image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: C</strong></p>
</details>
#### Example 3
This example is from [MMSI-Bench](https://github.com/InternRobotics/MMSI-Bench) and tests the model's ability on open-ended short-answer questions:
```bash
python example.py \
--image_paths examples/Q3_1.png examples/Q3_2.png examples/Q3_3.png \
--question "The robot is making tea. What is the order in which the pictures were taken?" \
--model_path sensenova/SenseNova-SI-1.3-InternVL3-8B
```
<!-- Example 3 -->
<details open>
<summary><strong>Details of Example 3</strong></summary>
<p><strong>Q: </strong>The robot is making tea. What is the order in which the pictures were taken?</p>
<table>
<tr>
<td align="center" width="33%" style="padding:4px;">
<img src="./examples/Q3_1.png" alt="First image" width="100%">
</td>
<td align="center" width="33%" style="padding:4px;">
<img src="./examples/Q3_2.png" alt="Second image" width="100%">
</td>
<td align="center" width="33%" style="padding:4px;">
<img src="./examples/Q3_3.png" alt="Third image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: Second, first, third</strong></p>
</details>
#### Example 4
This example demonstrates the model's **grounding** ability; the data comes from [RefCOCO](https://github.com/lichengunc/refer):
```bash
python example.py \
--image_paths examples/Q4.png \
--question "Please provide the bounding box coordinate of the region this sentence describes: <ref>blue shirt lady</ref>" \
--model_path sensenova/SenseNova-SI-1.4-InternVL3-8B
```
<!-- Example 4 -->
<details open>
<summary><strong>Details of Example 4</strong></summary>
<p><strong>Q: </strong>Please provide the bounding box coordinate of the region this sentence describes: &lt;ref&gt;blue shirt lady&lt;/ref&gt;</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="./examples/Q4.png" alt="First image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: [0.096234, 0.161229, 0.436516, 1.000000]</strong></p>
</details>
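The ground-truth box in Example 4 has four values in the range 0 to 1, which suggests coordinates normalized by image size. Assuming the order is `[x1, y1, x2, y2]` with x scaled by width and y by height (an assumption, since the coordinate convention is not stated here), a box can be mapped to pixels with the Python sketch below. The helper name `to_pixel_box` is hypothetical.

```python
def to_pixel_box(box, width, height):
    """Convert a normalized [x1, y1, x2, y2] box to integer pixel coordinates.

    Assumes x values are fractions of the image width and y values are
    fractions of the image height; values are clamped to [0, 1] first.
    """
    if len(box) != 4:
        raise ValueError("expected four values: x1, y1, x2, y2")
    x1, y1, x2, y2 = (min(max(float(v), 0.0), 1.0) for v in box)
    return (round(x1 * width), round(y1 * height),
            round(x2 * width), round(y2 * height))


if __name__ == "__main__":
    # Illustrative values only; the real image size comes from the image file.
    print(to_pixel_box([0.25, 0.5, 0.75, 1.0], width=640, height=480))
```

Before drawing boxes this way, it is worth confirming the convention against a known sample, because some grounding formats use a 0 to 1000 scale or `[x, y, w, h]` order instead.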
#### Example 5
This example demonstrates the model's **depth estimation** ability:
```bash
python example.py \
--image_paths examples/Q5.png \
--question "Identify the minimal distance between the point and the camera, in meters." \
--model_path sensenova/SenseNova-SI-1.4-InternVL3-8B
```
<!-- Example 5 -->
<details open>
<summary><strong>Details of Example 5</strong></summary>
<p><strong>Q: </strong>Identify the minimal distance between the point and the camera, in meters.</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="./examples/Q5.png" alt="First image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: 4.4</strong></p>
</details>
#### Example 6
This example demonstrates the model's **solid geometry (three-view drawing)** ability:
```bash
python example.py \
--image_paths examples/Q6.png \
--question "Enclose your thinking process in <think> </think> tags and your final answer in <answer> </answer>" \
--model_path sensenova/SenseNova-SI-1.5-InternVL3-8B
```
<!-- Example 6 -->
<details open>
<summary><strong>Details of Example 6</strong></summary>
<p><strong>Q: </strong>Enclose your thinking process in &lt;think> &lt;/think> tags and your final answer in &lt;answer> &lt;/answer></p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="./examples/Q6.png" alt="First image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: D</strong></p>
</details>
#### Example 7
This example demonstrates the model's **solid geometry (unfolded net)** ability:
```bash
python example.py \
--image_paths examples/Q7.png \
--question "请将你的思考过程放在<think></think>标签内,并将你的最终答案放在<answer></answer>标签内。" \
--model_path sensenova/SenseNova-SI-1.5-InternVL3-8B
```
<!-- Example 7 -->
<details open>
<summary><strong>Details of Example 7</strong></summary>
<p><strong>Q: </strong>请将你的思考过程放在&lt;think> &lt;/think>标签内,并将你的最终答案放在&lt;answer> &lt;/answer>标签内。</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="./examples/Q7.png" alt="First image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: D</strong></p>
</details>
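Examples 6 and 7 ask the model to wrap its reasoning in `<think>` tags and its final answer in `<answer>` tags. When scoring such outputs, the two parts usually need to be separated. The Python sketch below shows one way to do that with regular expressions; `split_think_answer` is a hypothetical helper, and it assumes at most one well-formed pair of each tag in the output.

```python
import re

# Non-greedy matches; DOTALL lets the reasoning span multiple lines.
_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
_ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def split_think_answer(text):
    """Return (thinking, answer); each is None when its tag pair is absent."""
    think = _THINK.search(text)
    answer = _ANSWER.search(text)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )


if __name__ == "__main__":
    output = "<think>Compare the three views.\nOnly D fits.</think><answer> D </answer>"
    thinking, answer = split_think_answer(output)
    print(repr(thinking))
    print(repr(answer))
```

Returning `None` for a missing tag pair, rather than raising, makes it simple to count malformed responses separately from wrong answers.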
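#### Test Multiple Questions in a Single Run
Create a file similar to [examples/examples.jsonl](examples/examples.jsonl), where each line represents one question.
The model is loaded only once and answers the questions one by one in line order; the questions do not interfere with each other.
> For more details on the `jsonl` format, see [Single-Image Data](https://internvl.readthedocs.io/en/latest/get_started/chat_data_format.html#single-image-data) and [Multi-Image Data](https://internvl.readthedocs.io/en/latest/get_started/chat_data_format.html#multi-image-data).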
```bash
python example.py \
--jsonl_path examples/examples.jsonl \
--model_path sensenova/SenseNova-SI-1.3-InternVL3-8B
```
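A question file for a batch run can be generated programmatically. The Python sketch below writes one JSON object per line using the `id`, `image`, and `conversations` fields from the data format described earlier. The exact fields that `example.py` reads are not spelled out here, so treat the field layout as an assumption and compare the output with `examples/examples.jsonl` before use. `make_entry` and `to_jsonl` are hypothetical helper names.

```python
import json


def make_entry(entry_id, image_paths, question):
    """Build one record: one <image> placeholder per image, then the question."""
    placeholders = "".join("<image>\n" for _ in image_paths)
    return {
        "id": entry_id,
        "image": list(image_paths),
        "conversations": [
            {"from": "human", "value": placeholders + question},
        ],
    }


def to_jsonl(entries):
    """Serialize records as JSON Lines: one compact JSON object per line."""
    return "".join(json.dumps(e, ensure_ascii=False) + "\n" for e in entries)


if __name__ == "__main__":
    entries = [
        make_entry(0, ["examples/Q1_1.png"],
                   "Which is closer to the sink, the toilet paper or the towel?"),
        make_entry(1, ["examples/Q2_1.png", "examples/Q2_2.png"],
                   "Where is the window located in the bedroom?"),
    ]
    print(to_jsonl(entries), end="")
```

`json.dumps` never emits raw newlines inside a string, so each record is guaranteed to stay on a single line, which is what the JSON Lines format requires.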
### Training
#### 1. Download the Dataset
You can download either [SenseNova-SI-800K](https://huggingface.co/datasets/sensenova/SenseNova-SI-800K) (a downsampled subset intended specifically for studying scaling effects) or [SenseNova-SI-8M](https://huggingface.co/datasets/sensenova/SenseNova-SI-8M) (the official full training dataset).
Download [SenseNova-SI-800K](https://huggingface.co/datasets/sensenova/SenseNova-SI-800K) to the `training/data/` directory:
```bash
pip install huggingface_hub
huggingface-cli download sensenova/SenseNova-SI-800K --repo-type dataset --local-dir training/data/SenseNova-SI-800K
```
Download [SenseNova-SI-8M](https://huggingface.co/datasets/sensenova/SenseNova-SI-8M) to the `training/data/` directory:
```bash
pip install huggingface_hub
huggingface-cli download sensenova/SenseNova-SI-8M --repo-type dataset --local-dir training/data/SenseNova-SI-8M
```
#### 2(a). Train InternVL-Architecture Models
**Download the pretrained model**
Download [InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B) to `training/pretrained_models/`:
```bash
huggingface-cli download OpenGVLab/InternVL3-8B --local-dir training/pretrained_models/OpenGVLab/InternVL3-8B
```
**Install dependencies**
```bash
conda create -n internvl python=3.10 -y
conda activate internvl
pip install uv
uv pip install -r training/InternVL/requirements.txt
uv pip install flash-attn==2.3.6
```
**Start training**
```bash
bash training/InternVL/internvl_chat/shell/sensenova_si_800K_internvl3_8b.sh # Train with SenseNova-SI-800K data
bash training/InternVL/internvl_chat/shell/sensenova_si_8M_internvl3_8b.sh # Or train with SenseNova-SI-8M data
```
#### 2(b). Train Qwen3-VL-Architecture Models
The training framework is [lmms-engine](https://github.com/EvolvingLMMs-Lab/lmms-engine), included as a Git submodule at `training/lmms-engine`.
**Download the pretrained model**
Download [Qwen3VL-8B](https://github.com/QwenLM/Qwen3-VL) to `training/pretrained_models/`:
```bash
huggingface-cli download Qwen/Qwen3-VL-8B-Instruct --local-dir training/pretrained_models/Qwen/Qwen3-VL-8B-Instruct
```
**Install dependencies**
```bash
# Initialize the lmms-engine submodule (first time only)
git submodule update --init --recursive
conda create -n qwen3vl python=3.10 -y
conda activate qwen3vl
pip install uv
uv pip install -e training/lmms-engine
# Optional: Performance optimizations
uv pip install flash-attn --no-build-isolation
uv pip install liger-kernel
```
**Data preprocessing**
First convert `SenseNova-SI-800K.jsonl` and `SenseNova-SI-8M.jsonl` to the Qwen3-VL training data format:
```bash
python training/qwen3_vl/preprocess_sensenova_si_dataset.py \
--src data/SenseNova-SI-800K.jsonl \
--dst data/SenseNova-SI-800K_qwen3vl_format.jsonl # Preprocess the SenseNova-SI-800K data
python training/qwen3_vl/preprocess_sensenova_si_dataset.py \
--src data/SenseNova-SI-8M.jsonl \
--dst data/SenseNova-SI-8M_qwen3vl_format.jsonl # Preprocess the SenseNova-SI-8M data
```
**Prepare the data YAML**
See [training/qwen3_vl/data_800K.yaml](training/qwen3_vl/data_800K.yaml) and [training/qwen3_vl/data_8M.yaml](training/qwen3_vl/data_8M.yaml).
**Configure training parameters**
See [training/qwen3_vl/train_config_800K.yaml](training/qwen3_vl/train_config_800K.yaml) and [training/qwen3_vl/train_config_8M.yaml](training/qwen3_vl/train_config_8M.yaml).
**Start training**
```bash
# Single node, 8 GPUs (default)
bash training/qwen3_vl/run.sh 800K # Train with SenseNova-SI-800K data
bash training/qwen3_vl/run.sh 8M # Or train with SenseNova-SI-8M data
```
#### 2(c). Train BAGEL-Architecture Models
**Download the pretrained model**
Download [BAGEL-7B-MoT](https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT) to `training/pretrained_models/`:
```bash
huggingface-cli download ByteDance-Seed/BAGEL-7B-MoT --local-dir training/pretrained_models/BAGEL-7B-MoT
```
**Install dependencies**
```bash
conda create -n bagel python=3.10 -y
conda activate bagel
pip install uv
uv pip install -r training/bagel/requirements.txt
uv pip install flash_attn==2.5.8 --no-build-isolation
```
**Start training**
```bash
bash training/bagel/scripts/train_sensenova_si_800K.sh # Train with SenseNova-SI-800K data
bash training/bagel/scripts/train_sensenova_si_8M.sh # Or train with SenseNova-SI-8M data
```
For details on training hyperparameters (learning rate, batch size, FSDP config, etc.), refer to [training/bagel/TRAIN.md](training/bagel/TRAIN.md).
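### Evaluation
To reproduce the benchmark results above, please refer to [EASI](https://github.com/EvolvingLMMs-Lab/EASI) to evaluate SenseNova-SI on mainstream spatial intelligence benchmarks.
EASI supports over 20 spatial intelligence models and more than 20 spatial benchmarks, and provides Docker for one-click spatial intelligence evaluation.
## Acknowledgements
This project includes code modified from the original code by the BAGEL, InternVL, and lmms-engine teams.
* Source repositories: [BAGEL](https://github.com/bytedance-seed/BAGEL), [InternVL](https://github.com/opengvlab/internvl), [lmms-engine](https://github.com/EvolvingLMMs-Lab/lmms-engine)
We gratefully acknowledge the original authors and contributors for their work.
Please refer to the original repositories for full details, updates, and licensing information.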
## 🖊️ Citation
```bib
@InProceedings{sensenova-si,
title = {Scaling Spatial Intelligence with Multimodal Foundation Models},
author = {Cai, Zhongang and Wang, Ruisi and Gu, Chenyang and Pu, Fanyi and Xu, Junxiang and Wang, Yubo and Yin, Wanqi and Yang, Zhitao and Wei, Chen and Sun, Qingping and Zhou, Tongxi and Li, Jiaqi and Pang, Hui En and Qian, Oscar and Wei, Yukun and Lin, Zhiqian and Shi, Xuanke and Deng, Kewang and Han, Xiaoyang and Chen, Zukai and Fan, Xiangyu and Deng, Hanming and Lu, Lewei and Pan, Liang and Li, Bo and Liu, Ziwei and Wang, Quan and Lin, Dahua and Yang, Lei},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}
```
default_generation_config:
  do_sample: False
  max_new_tokens: 8192
  top_p: 1.0
  temperature: 0.0
  repetition_penalty: 1
  num_beams: 1
# More Examples
This document lists more examples beyond those in the main [README](../../README.md). To run all of them in one go, use [examples/examples.jsonl](../../examples/examples.jsonl) with the `--jsonl_path` option (see the README section [Test Multiple Questions in a Single Run](../../README.md#test-multiple-questions-in-a-single-run)).
---
#### Example 8
This example is from [MindCube](https://github.com/mll-lab-nu/MindCube):
```bash
python example.py \
--image_paths examples/Q8_1.jpg examples/Q8_2.jpg examples/Q8_3.jpg examples/Q8_4.jpg \
--question "Based on these four images (image 1, 2, 3, and 4) showing the pink bottle from different viewpoints (front, left, back, and right), with each camera aligned with room walls and partially capturing the surroundings: From the viewpoint presented in image 4, what is to the left of the pink bottle?\nOptions: A. Pink plush toy and headboard B. Window and blue curtain C. Closet and door D. White wall\nAnswer with the option's letter from the given choices directly." \
--model_path sensenova/SenseNova-SI-1.3-InternVL3-8B
```
<details open>
<summary><strong>Details of Example 8</strong></summary>
<p><strong>Q: </strong>Based on these four images (image 1, 2, 3, and 4) showing the pink bottle from different viewpoints (front, left, back, and right), with each camera aligned with room walls and partially capturing the surroundings: From the viewpoint presented in image 4, what is to the left of the pink bottle?\nOptions: A. Pink plush toy and headboard B. Window and blue curtain C. Closet and door D. White wall\nAnswer with the option's letter from the given choices directly.</p>
<table>
<tr>
<td align="center" width="25%" style="padding:4px;">
<img src="../../examples/Q8_1.jpg" alt="Image 1" width="100%">
</td>
<td align="center" width="25%" style="padding:4px;">
<img src="../../examples/Q8_2.jpg" alt="Image 2" width="100%">
</td>
<td align="center" width="25%" style="padding:4px;">
<img src="../../examples/Q8_3.jpg" alt="Image 3" width="100%">
</td>
<td align="center" width="25%" style="padding:4px;">
<img src="../../examples/Q8_4.jpg" alt="Image 4" width="100%">
</td>
</tr>
</table>
<p><strong>GT: C</strong></p>
</details>
---
#### Example 9
This example is from [SITE-Bench](https://github.com/wenqi-wang20/SITE-Bench):
```bash
python example.py \
--image_paths examples/Q9.jpg \
--question "Question: Consider the real-world 3D locations and orientations of the objects. Which side of the bus in the center is facing the bus stop?\nOptions: \nA. front\nB. left\nC. back\nD. right\nGive me the answer letter directly. The best answer is:" \
--model_path sensenova/SenseNova-SI-1.3-InternVL3-8B
```
<details open>
<summary><strong>Details of Example 9</strong></summary>
<p><strong>Q: </strong>Question: Consider the real-world 3D locations and orientations of the objects. Which side of the bus in the center is facing the bus stop?\nOptions: \nA. front\nB. left\nC. back\nD. right\nGive me the answer letter directly. The best answer is:</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="../../examples/Q9.jpg" alt="Image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: D</strong></p>
</details>
---
#### Example 10
This example is from [SITE-Bench](https://github.com/wenqi-wang20/SITE-Bench):
```bash
python example.py \
--image_paths examples/Q10.jpg \
--question "Question: Consider the real-world 3D orientations of the objects. Are the arrow on street sign and the taxi facing same or similar directions, or very different directions?\nOptions: \nA. same or similar directions\nB. very different directions\nGive me the answer letter directly. The best answer is:" \
--model_path sensenova/SenseNova-SI-1.3-InternVL3-8B
```
<details open>
<summary><strong>Details of Example 10</strong></summary>
<p><strong>Q: </strong>Question: Consider the real-world 3D orientations of the objects. Are the arrow on street sign and the taxi facing same or similar directions, or very different directions? Options: A. same or similar directions, B. very different directions. Give me the answer letter directly. The best answer is:</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="../../examples/Q10.jpg" alt="Image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: A</strong></p>
</details>
---
#### Example 11
This example is from [SITE-Bench](https://github.com/wenqi-wang20/SITE-Bench):
```bash
python example.py \
--image_paths examples/Q11.jpg \
--question "Question: What shape are all the men standing in?\nOptions: A. circle B. rectangle C. triangle D. square\nGive me the answer letter directly. The best answer is:" \
--model_path sensenova/SenseNova-SI-1.3-InternVL3-8B
```
<details open>
<summary><strong>Details of Example 11</strong></summary>
<p><strong>Q: </strong>Question: What shape are all the men standing in? Options: A. circle, B. rectangle, C. triangle, D. square. Give me the answer letter directly. The best answer is:</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="../../examples/Q11.jpg" alt="Image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: A</strong></p>
</details>
---
#### Example 12
This example is from [ViewSpatial-Bench](https://github.com/ZJU-REAL/ViewSpatial-Bench):
```bash
python example.py \
--image_paths examples/Q12.jpg \
--question "From the perspective of this man who doesn't wear glasses, where is the man wearing glasses located beside him?\nOptions: A. left B. back-right C. front D. right\nAnswer with the option's letter from the given choices directly." \
--model_path sensenova/SenseNova-SI-1.3-InternVL3-8B
```
<details open>
<summary><strong>Details of Example 12</strong></summary>
<p><strong>Q: </strong>From the perspective of this man who doesn't wear glasses, where is the man wearing glasses located beside him? Options: A. left, B. back-right, C. front, D. right. Answer with the option's letter from the given choices directly.</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="../../examples/Q12.jpg" alt="Image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: A</strong></p>
</details>
---
#### Example 13
This example is from [MMSI-Bench](https://github.com/InternRobotics/MMSI-Bench) and tests the model's capability on open-ended short-answer questions:
```bash
python example.py \
--image_paths examples/Q13_1.png examples/Q13_2.png \
--question "The iMac is in the northern part of the room. In which direction is the area where students do their homework?" \
--model_path sensenova/SenseNova-SI-1.3-InternVL3-8B
```
<details open>
<summary><strong>Details of Example 13</strong></summary>
<p><strong>Q: </strong>The iMac is in the northern part of the room. In which direction is the area where students do their homework?</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="../../examples/Q13_1.png" alt="First image" width="100%">
</td>
<td align="center" width="50%" style="padding:4px;">
<img src="../../examples/Q13_2.png" alt="Second image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: Northwest corner</strong></p>
</details>
---
#### Example 14
This example is from [MMSI-Bench](https://github.com/InternRobotics/MMSI-Bench) and tests the model's capability on open-ended short-answer questions:
```bash
python example.py \
--image_paths examples/Q14_1.png examples/Q14_2.png \
--question "How many building models are captured in total in these two pictures?" \
--model_path sensenova/SenseNova-SI-1.3-InternVL3-8B
```
<details open>
<summary><strong>Details of Example 14</strong></summary>
<p><strong>Q: </strong>How many building models are captured in total in these two pictures?</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="../../examples/Q14_1.png" alt="First image" width="100%">
</td>
<td align="center" width="50%" style="padding:4px;">
<img src="../../examples/Q14_2.png" alt="Second image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: 4</strong></p>
</details>
---
#### Example 15
This example demonstrates the model's capability in **solid geometry (three views)**:
```bash
python example.py \
--image_paths examples/Q15.png \
--question "请将你的思考过程放在<think></think>标签内,并将你的最终答案放在<answer></answer>标签内。" \
--model_path sensenova/SenseNova-SI-1.5-InternVL3-8B
```
<!-- Example 15 -->
<details open>
<summary><strong>Details of Example 15</strong></summary>
<p><strong>Q:</strong> Enclose your thinking process in &lt;think> &lt;/think> tags and your final answer in &lt;answer> &lt;/answer> tags.</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="../../examples/Q15.png" alt="First image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: B</strong></p>
</details>
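
Since the prompt asks the model to wrap its reasoning in `<think>` tags and its final answer in `<answer>` tags, the final answer can be separated from the reasoning before it is compared with the GT. The following is a minimal sketch under the assumption that the response is a plain string following the requested format; `extract_answer` is a hypothetical helper, not a function provided by this repository, and the sample response is made up for illustration.

```python
import re
from typing import Optional

# Matches the content of an <answer>...</answer> pair, across line breaks.
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)


def extract_answer(response: str) -> Optional[str]:
    """Return the text of the last <answer> block, or None if there is none."""
    matches = _ANSWER_RE.findall(response)
    if not matches:
        return None
    return matches[-1].strip()


# Made-up response in the requested format.
sample = "<think>The top view shows two columns of cubes.</think><answer> B </answer>"
print(extract_answer(sample))
```

Taking the last match rather than the first is a design choice that guards against the reasoning text quoting an `<answer>` tag before the final one.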
---
#### Example 16
This example demonstrates the model's capability in **solid geometry (three views)**:
```bash
python example.py \
--image_paths examples/Q16.png \
--question "请将你的思考过程放在<think></think>标签内,并将你的最终答案放在<answer></answer>标签内。" \
--model_path sensenova/SenseNova-SI-1.5-InternVL3-8B
```
<!-- Example 16 -->
<details open>
<summary><strong>Details of Example 16</strong></summary>
<p><strong>Q:</strong> Enclose your thinking process in &lt;think> &lt;/think> tags and your final answer in &lt;answer> &lt;/answer> tags.</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="../../examples/Q16.png" alt="First image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: C</strong></p>
</details>
---
#### Example 17
This example demonstrates the model's capability in **solid geometry (3D graphic reasoning)**:
```bash
python example.py \
--image_paths examples/Q17.png \
--question "请将你的思考过程放在<think></think>标签内,并将你的最终答案放在<answer></answer>标签内。" \
--model_path sensenova/SenseNova-SI-1.5-InternVL3-8B
```
<!-- Example 17 -->
<details open>
<summary><strong>Details of Example 17</strong></summary>
<p><strong>Q:</strong> Enclose your thinking process in &lt;think> &lt;/think> tags and your final answer in &lt;answer> &lt;/answer> tags.</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="../../examples/Q17.png" alt="First image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: C</strong></p>
</details>
---
#### Example 18
This example demonstrates the model's capability in **solid geometry (three views)**:
```bash
python example.py \
--image_paths examples/Q18.png \
--question "请将你的思考过程放在<think></think>标签内,并将你的最终答案放在<answer></answer>标签内。" \
--model_path sensenova/SenseNova-SI-1.5-InternVL3-8B
```
<!-- Example 18 -->
<details open>
<summary><strong>Details of Example 18</strong></summary>
<p><strong>Q:</strong> Enclose your thinking process in &lt;think> &lt;/think> tags and your final answer in &lt;answer> &lt;/answer> tags.</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="../../examples/Q18.png" alt="First image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: A</strong></p>
</details>
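# More Examples
This document presents additional examples beyond those in the [README](../../README_CN.md). To run all examples at once, use [examples/examples.jsonl](../../examples/examples.jsonl) with the `--jsonl_path` argument (see the ["Test multiple questions at once"](../../README_CN.md#一次测试多个问题) section of the README).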
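
To make the batch mode concrete, the sketch below writes one JSONL entry and reads it back. The key names (`id`, `image`, `conversations`, `GT`) are the ones the `--jsonl_path` loader in this commit looks up; everything else is an assumption for illustration, including the `read_entries` helper, the sample id, and the shortened question, and the real `examples/examples.jsonl` may carry additional fields.

```python
import json
import os
import tempfile

# Illustrative entry; only the key names are taken from the loader,
# the values (including the id) are hypothetical.
entry = {
    "id": "Q9",
    "image": ["examples/Q9.jpg"],
    "conversations": [
        {"value": "Which side of the bus in the center is facing the bus stop?"}
    ],
    "GT": "D",
}


def read_entries(path):
    """Yield (id, image_paths, question, gt) tuples, skipping incomplete lines."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            item = json.loads(line)
            conversations = item.get("conversations", [])
            question = conversations[0].get("value", "") if conversations else ""
            images = item.get("image", [])
            if not images or not question:
                continue
            yield item.get("id", ""), images, question, item.get("GT", "")


# Round-trip the entry through a temporary JSONL file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
    parsed = list(read_entries(path))

print(parsed)
```

Each line of the file is one self-contained JSON object, so entries can be appended or removed without touching the rest of the file.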
---
#### Example 8
This example is from [MindCube](https://github.com/mll-lab-nu/MindCube):
```bash
python example.py \
--image_paths examples/Q8_1.jpg examples/Q8_2.jpg examples/Q8_3.jpg examples/Q8_4.jpg \
--question "Based on these four images (image 1, 2, 3, and 4) showing the pink bottle from different viewpoints (front, left, back, and right), with each camera aligned with room walls and partially capturing the surroundings: From the viewpoint presented in image 4, what is to the left of the pink bottle?\nOptions: A. Pink plush toy and headboard B. Window and blue curtain C. Closet and door D. White wall\nAnswer with the option's letter from the given choices directly." \
--model_path sensenova/SenseNova-SI-1.3-InternVL3-8B
```
<details open>
<summary><strong>Details of Example 8</strong></summary>
<p><strong>Q: </strong>Based on these four images (image 1, 2, 3, and 4) showing the pink bottle from different viewpoints (front, left, back, and right), with each camera aligned with room walls and partially capturing the surroundings: From the viewpoint presented in image 4, what is to the left of the pink bottle? Options: A. Pink plush toy and headboard, B. Window and blue curtain, C. Closet and door, D. White wall. Answer with the option's letter from the given choices directly.</p>
<table>
<tr>
<td align="center" width="25%" style="padding:4px;">
<img src="../../examples/Q8_1.jpg" alt="Image 1" width="100%">
</td>
<td align="center" width="25%" style="padding:4px;">
<img src="../../examples/Q8_2.jpg" alt="Image 2" width="100%">
</td>
<td align="center" width="25%" style="padding:4px;">
<img src="../../examples/Q8_3.jpg" alt="Image 3" width="100%">
</td>
<td align="center" width="25%" style="padding:4px;">
<img src="../../examples/Q8_4.jpg" alt="Image 4" width="100%">
</td>
</tr>
</table>
<p><strong>GT: C</strong></p>
</details>
---
#### Example 9
This example is from [SITE-Bench](https://github.com/wenqi-wang20/SITE-Bench):
```bash
python example.py \
--image_paths examples/Q9.jpg \
--question "Question: Consider the real-world 3D locations and orientations of the objects. Which side of the bus in the center is facing the bus stop?\nOptions: \nA. front\nB. left\nC. back\nD. right\nGive me the answer letter directly. The best answer is:" \
--model_path sensenova/SenseNova-SI-1.3-InternVL3-8B
```
<details open>
<summary><strong>Details of Example 9</strong></summary>
<p><strong>Q: </strong>Question: Consider the real-world 3D locations and orientations of the objects. Which side of the bus in the center is facing the bus stop? Options: A. front, B. left, C. back, D. right. Give me the answer letter directly. The best answer is:</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="../../examples/Q9.jpg" alt="Image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: D</strong></p>
</details>
---
#### Example 10
This example is from [SITE-Bench](https://github.com/wenqi-wang20/SITE-Bench):
```bash
python example.py \
--image_paths examples/Q10.jpg \
--question "Question: Consider the real-world 3D orientations of the objects. Are the arrow on street sign and the taxi facing same or similar directions, or very different directions?\nOptions: \nA. same or similar directions\nB. very different directions\nGive me the answer letter directly. The best answer is:" \
--model_path sensenova/SenseNova-SI-1.3-InternVL3-8B
```
<details open>
<summary><strong>Details of Example 10</strong></summary>
<p><strong>Q: </strong>Question: Consider the real-world 3D orientations of the objects. Are the arrow on street sign and the taxi facing same or similar directions, or very different directions? Options: A. same or similar directions, B. very different directions. Give me the answer letter directly. The best answer is:</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="../../examples/Q10.jpg" alt="Image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: A</strong></p>
</details>
---
#### Example 11
This example is from [SITE-Bench](https://github.com/wenqi-wang20/SITE-Bench):
```bash
python example.py \
--image_paths examples/Q11.jpg \
--question "Question: What shape are all the men standing in?\nOptions: A. circle B. rectangle C. triangle D. square\nGive me the answer letter directly. The best answer is:" \
--model_path sensenova/SenseNova-SI-1.3-InternVL3-8B
```
<details open>
<summary><strong>Details of Example 11</strong></summary>
<p><strong>Q: </strong>Question: What shape are all the men standing in? Options: A. circle, B. rectangle, C. triangle, D. square. Give me the answer letter directly. The best answer is:</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="../../examples/Q11.jpg" alt="Image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: A</strong></p>
</details>
---
#### Example 12
This example is from [ViewSpatial-Bench](https://github.com/ZJU-REAL/ViewSpatial-Bench):
```bash
python example.py \
--image_paths examples/Q12.jpg \
--question "From the perspective of this man who doesn't wear glasses, where is the man wearing glasses located beside him?\nOptions: A. left B. back-right C. front D. right\nAnswer with the option's letter from the given choices directly." \
--model_path sensenova/SenseNova-SI-1.3-InternVL3-8B
```
<details open>
<summary><strong>Details of Example 12</strong></summary>
<p><strong>Q: </strong>From the perspective of this man who doesn't wear glasses, where is the man wearing glasses located beside him? Options: A. left, B. back-right, C. front, D. right. Answer with the option's letter from the given choices directly.</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="../../examples/Q12.jpg" alt="Image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: A</strong></p>
</details>
---
#### Example 13
This example is from [MMSI-Bench](https://github.com/InternRobotics/MMSI-Bench) and tests the model's capability on open-ended short-answer questions:
```bash
python example.py \
--image_paths examples/Q13_1.png examples/Q13_2.png \
--question "The iMac is in the northern part of the room. In which direction is the area where students do their homework?" \
--model_path sensenova/SenseNova-SI-1.3-InternVL3-8B
```
<details open>
<summary><strong>Details of Example 13</strong></summary>
<p><strong>Q: </strong>The iMac is in the northern part of the room. In which direction is the area where students do their homework?</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="../../examples/Q13_1.png" alt="First image" width="100%">
</td>
<td align="center" width="50%" style="padding:4px;">
<img src="../../examples/Q13_2.png" alt="Second image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: Northwest corner</strong></p>
</details>
---
#### Example 14
This example is from [MMSI-Bench](https://github.com/InternRobotics/MMSI-Bench) and tests the model's capability on open-ended short-answer questions:
```bash
python example.py \
--image_paths examples/Q14_1.png examples/Q14_2.png \
--question "How many building models are captured in total in these two pictures?" \
--model_path sensenova/SenseNova-SI-1.3-InternVL3-8B
```
<details open>
<summary><strong>Details of Example 14</strong></summary>
<p><strong>Q: </strong>How many building models are captured in total in these two pictures?</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="../../examples/Q14_1.png" alt="First image" width="100%">
</td>
<td align="center" width="50%" style="padding:4px;">
<img src="../../examples/Q14_2.png" alt="Second image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: 4</strong></p>
</details>
---
#### Example 15
This example demonstrates the model's capability in **solid geometry (three views)**:
```bash
python example.py \
--image_paths examples/Q15.png \
--question "请将你的思考过程放在<think></think>标签内,并将你的最终答案放在<answer></answer>标签内。" \
--model_path sensenova/SenseNova-SI-1.5-InternVL3-8B
```
<!-- Example 15 -->
<details open>
<summary><strong>Details of Example 15</strong></summary>
<p><strong>Q:</strong> Enclose your thinking process in &lt;think> &lt;/think> tags and your final answer in &lt;answer> &lt;/answer> tags.</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="../../examples/Q15.png" alt="First image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: B</strong></p>
</details>
---
#### Example 16
This example demonstrates the model's capability in **solid geometry (three views)**:
```bash
python example.py \
--image_paths examples/Q16.png \
--question "请将你的思考过程放在<think></think>标签内,并将你的最终答案放在<answer></answer>标签内。" \
--model_path sensenova/SenseNova-SI-1.5-InternVL3-8B
```
<!-- Example 16 -->
<details open>
<summary><strong>Details of Example 16</strong></summary>
<p><strong>Q:</strong> Enclose your thinking process in &lt;think> &lt;/think> tags and your final answer in &lt;answer> &lt;/answer> tags.</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="../../examples/Q16.png" alt="First image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: C</strong></p>
</details>
---
#### Example 17
This example demonstrates the model's capability in **solid geometry (3D graphic reasoning)**:
```bash
python example.py \
--image_paths examples/Q17.png \
--question "请将你的思考过程放在<think></think>标签内,并将你的最终答案放在<answer></answer>标签内。" \
--model_path sensenova/SenseNova-SI-1.5-InternVL3-8B
```
<!-- Example 17 -->
<details open>
<summary><strong>Details of Example 17</strong></summary>
<p><strong>Q:</strong> Enclose your thinking process in &lt;think> &lt;/think> tags and your final answer in &lt;answer> &lt;/answer> tags.</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="../../examples/Q17.png" alt="First image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: C</strong></p>
</details>
---
#### Example 18
This example demonstrates the model's capability in **solid geometry (three views)**:
```bash
python example.py \
--image_paths examples/Q18.png \
--question "请将你的思考过程放在<think></think>标签内,并将你的最终答案放在<answer></answer>标签内。" \
--model_path sensenova/SenseNova-SI-1.5-InternVL3-8B
```
<!-- Example 18 -->
<details open>
<summary><strong>Details of Example 18</strong></summary>
<p><strong>Q:</strong> Enclose your thinking process in &lt;think> &lt;/think> tags and your final answer in &lt;answer> &lt;/answer> tags.</p>
<table>
<tr>
<td align="center" width="50%" style="padding:4px;">
<img src="../../examples/Q18.png" alt="First image" width="100%">
</td>
</tr>
</table>
<p><strong>GT: A</strong></p>
</details>
"""SenseNova-SI example: answer a single question or a JSONL batch of questions."""
import argparse
import json

import torch

from sensenova_si import get_model


def set_seed(seed=42):
    """Seed PyTorch (CPU and all CUDA devices) for reproducible runs."""
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)


if __name__ == "__main__":
    set_seed()
    parser = argparse.ArgumentParser(
        description="Examples for SenseNova-SI single-run MCQ"
    )
    parser.add_argument(
        "--model_path",
        type=str,
        default="sensenova/SenseNova-SI-1.3-InternVL3-8B",
        help="Model path",
    )
    parser.add_argument(
        "--image_paths",
        type=str,
        nargs="+",
        default=[],
        help="Path to image files, can specify multiple",
    )
    parser.add_argument(
        "--question",
        type=str,
        default="Please describe the image in detail.",
        help="Question to ask the model",
    )
    parser.add_argument(
        "--jsonl_path",
        type=str,
        default=None,
        help="Path to jsonl file containing examples",
    )
    parser.add_argument(
        "--model_type",
        type=str,
        default="auto",
        choices=["qwen", "internvl", "auto"],
        help="Model type",
    )
    args = parser.parse_args()

    model_path = args.model_path
    print(f"Model path: {model_path}")
    model = get_model(model_path, model_type=args.model_type)

    if args.jsonl_path:
        # Batch mode: one JSON object per line.
        with open(args.jsonl_path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    # Skip blank lines instead of failing in json.loads.
                    continue
                entry = json.loads(line)
                image_paths = entry.get("image", [])
                conversations = entry.get("conversations", [])
                # The question is the first turn of the conversation.
                if conversations:
                    question = conversations[0].get("value", "")
                else:
                    question = ""
                id_ = entry.get("id", "")
                gt = entry.get("GT", "")
                if not image_paths or not question:
                    print(f"Skipping invalid entry id {id_}")
                    continue
                print(f"Processing question id: {id_}")
                response = model.generate(question, images=image_paths)
                print(f"User: {question}")
                print(f"Assistant: {response}")
                print(f"Ground Truth: {gt}")
                print("-" * 50)
    else:
        # Single-question mode: use the command-line arguments directly.
        question = args.question
        response = model.generate(question, images=args.image_paths)
        print(f"User: {question}")
        print(f"Assistant: {response}")

"""BAGEL image generation example: generate an image from a text prompt."""
import argparse

import torch

from sensenova_si import SenseNovaSIBagelModel


def set_seed(seed=42):
    """Seed PyTorch (CPU and all CUDA devices) for reproducible runs."""
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)


def main():
    parser = argparse.ArgumentParser(
        description="BAGEL image generation example - generate image from text prompt"
    )
    parser.add_argument(
        "--model_path",
        type=str,
        default="sensenova/SenseNova-SI-1.1-BAGEL-7B-MoT",
        help="BAGEL model path",
    )
    parser.add_argument(
        "--prompt",
        type=str,
        default="A chubby cat made of 3D point clouds, stretching its body, translucent with a soft glow.",
        help="Text prompt used to generate an image",
    )
    parser.add_argument(
        "--mode",
        type=str,
        default="generate",
        choices=["generate", "think_generate"],
        help="BAGEL mode: generate or think_generate",
    )
    parser.add_argument(
        "--out_img_dir",
        type=str,
        default="./output_images/test_bagel/",
        help="Directory to save generated images",
    )
    parser.add_argument(
        "--dtype",
        type=str,
        default="bf16",
        choices=["bf16"],
        help="Model precision type",
    )
    args = parser.parse_args()

    # Set random seed for reproducibility
    set_seed()

    print(f"Model path: {args.model_path}")
    print(f"Mode: {args.mode}")
    print(f"Prompt: {args.prompt}")
    print("-" * 50)

    # Initialize the BAGEL model in the mode selected with --mode
    print("Loading model...")
    model = SenseNovaSIBagelModel(
        model_path=args.model_path,
        mode=args.mode,
        out_img_dir=args.out_img_dir,
        dtype=args.dtype,
    )

    print("Generating image...")
    # Text-to-image generation takes only the prompt, so no input images are passed
    generated_image_path = model.generate(question=args.prompt, images=None)

    print("-" * 50)
    print("Done!")
    print(f"Image saved to: {generated_image_path}")


if __name__ == "__main__":
    main()