中文 | English

MiniCPM Paper | Technical Blog | MiniCPM Wiki (in Chinese) | MiniCPM-V Repo | Join our discord and WeChat | Join Us

## Changelog🔥

- [2025.06.06] Released [**MiniCPM4**](https://huggingface.co/collections/openbmb/minicpm-4-6841ab29d180257e940baa9b)! This model achieves ultimate efficiency improvements while maintaining optimal performance at the same scale, delivering over 5x generation acceleration on typical end-side chips!
- [2024.09.28] **[LLMxMapReduce](https://github.com/thunlp/LLMxMapReduce) is open source and enables MiniCPM3-4B to process text of any length.**
- [2024.09.18] **[SGLang](https://github.com/sgl-project/sglang) now supports MiniCPM3-4B. Thanks to inference optimizations made to the MLA structure (used in MiniCPM3) in SGLang v0.3, throughput has improved by 70% compared to vLLM!** [[Usage](#sglang-recommended)]
- [2024.09.16] [llama.cpp](https://github.com/ggerganov/llama.cpp/releases/tag/b3765) now officially supports MiniCPM3-4B! [[GGUF Model](https://huggingface.co/openbmb/MiniCPM3-4B-GGUF) | [Usage](#llamacpp)]
- [2024.09.05] We release [**MiniCPM3-4B**](https://huggingface.co/openbmb/MiniCPM3-4B)! This model outperforms Phi-3.5-mini-instruct and GPT-3.5-Turbo-0125 and is comparable to several models with 7B-9B parameters, such as Llama3.1-8B-Instruct, Qwen2-7B-Instruct, and GLM-4-9B-Chat.
- [2024.07.09] MiniCPM-2B is now supported by [SGLang](#sglang-inference)!
- [2024.07.05] Released [MiniCPM-S-1B](https://huggingface.co/openbmb/MiniCPM-S-1B-sft)! This model achieves an average sparsity of 87.89% in the FFN layer, reducing FFN FLOPs by 84%, while maintaining downstream task performance.
- [2024.04.11] Released [MiniCPM-2B-128k](https://huggingface.co/openbmb/MiniCPM-2B-128k), [MiniCPM-MoE-8x2B](https://huggingface.co/openbmb/MiniCPM-MoE-8x2B) and [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16)! Click [here](https://openbmb.vercel.app/) to read our technical blog.
- [2024.03.16] Intermediate checkpoints of MiniCPM-2B were released [here](https://huggingface.co/openbmb/MiniCPM-2B-history)!
- [2024.02.01] Released [**MiniCPM-2B**](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16)! This model performs similarly to Mistral-7B on public benchmarks (with better performance in Chinese, math, and code abilities) and overall outperforms models like Llama2-13B, MPT-30B, and Falcon-40B.
## Quick Links

- [Changelog🔥](#changelog)
- [Quick Links](#quick-links)
- [Model Downloads](#model-downloads)
- [MiniCPM 4.0](#minicpm-40)
- [Evaluation Results](#evaluation-results)
- [Efficiency Evaluation](#efficiency-evaluation)
- [Comprehensive Evaluation](#comprehensive-evaluation)
- [Long Text Evaluation](#long-text-evaluation)
- [BitCPM4: Quantization](#bitcpm4-quantization)
- [BitCPM4 Evaluation](#bitcpm4-evaluation)
- [BitCPM4 Inference](#bitcpm4-inference)
- [MiniCPM4 Application](#minicpm4-application)
- [MiniCPM4-Survey: Trustworthy Survey Generation](#minicpm4-survey-trustworthy-survey-generation)
- [MiniCPM4-MCP: Tool Use with Model Context Protocol](#minicpm4-mcp-tool-use-with-model-context-protocol)
- [MiniCPM Intel AIPC Client: A New Edge Large Model Powerhouse](#minicpm-intel-aipc-client-a-new-edge-large-model-powerhouse)
- [Inference](#inference)
- [CPM.cu](#cpmcu)
- [HuggingFace](#huggingface)
- [vLLM](#vllm)
- [SGLang](#sglang)
- [MiniCPM 3.0](#minicpm-30)
- [MiniCPM 2.0](#minicpm-20)
- [MiniCPM 1.0](#minicpm-10)
- [LICENSE](#license)
- [Institutions](#institutions)
- [Citation](#citation)

## Model Downloads

| HuggingFace | ModelScope |
|-------------|------------|
| [MiniCPM4-8B](https://huggingface.co/openbmb/MiniCPM4-8B) | [MiniCPM4-8B](https://www.modelscope.cn/models/OpenBMB/MiniCPM4-8B) |
| [MiniCPM4-0.5B](https://huggingface.co/openbmb/MiniCPM4-0.5B) | [MiniCPM4-0.5B](https://www.modelscope.cn/models/OpenBMB/MiniCPM4-0.5B) |
| [BitCPM4-1B](https://huggingface.co/openbmb/BitCPM4-1B) | [BitCPM4-1B](https://www.modelscope.cn/models/OpenBMB/BitCPM4-1B) |
| [BitCPM4-0.5B](https://huggingface.co/openbmb/BitCPM4-0.5B) | [BitCPM4-0.5B](https://www.modelscope.cn/models/OpenBMB/BitCPM4-0.5B) |
| [MiniCPM4-8B-Eagle-FRSpec](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec) | [MiniCPM4-8B-Eagle-FRSpec](https://www.modelscope.cn/models/OpenBMB/MiniCPM4-8B-Eagle-FRSpec) |
| [MiniCPM4-8B-Eagle-FRSpec-QAT](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec-QAT) | [MiniCPM4-8B-Eagle-FRSpec-QAT](https://www.modelscope.cn/models/OpenBMB/MiniCPM4-8B-Eagle-FRSpec-QAT) |
| [MiniCPM4-8B-Eagle-vLLM](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-vLLM) | [MiniCPM4-8B-Eagle-vLLM](https://www.modelscope.cn/models/OpenBMB/MiniCPM4-8B-Eagle-vLLM) |
| [MiniCPM4-8B-marlin-Eagle-vLLM](https://huggingface.co/openbmb/MiniCPM4-8B-marlin-Eagle-vLLM) | [MiniCPM4-8B-marlin-Eagle-vLLM](https://www.modelscope.cn/models/OpenBMB/MiniCPM4-8B-marlin-Eagle-vLLM) |
| [MiniCPM4-Survey](https://huggingface.co/openbmb/MiniCPM4-Survey) | [MiniCPM4-Survey](https://www.modelscope.cn/models/OpenBMB/MiniCPM4-Survey) |
| [MiniCPM4-MCP](https://huggingface.co/openbmb/MiniCPM4-MCP) | [MiniCPM4-MCP](https://www.modelscope.cn/models/OpenBMB/MiniCPM4-MCP) |
| [MiniCPM3-4B](https://huggingface.co/openbmb/MiniCPM3-4B) | [MiniCPM3-4B](https://www.modelscope.cn/models/OpenBMB/MiniCPM3-4B) |
| [MiniCPM-2B-sft](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) | [MiniCPM-2B-sft](https://modelscope.cn/models/OpenBMB/miniCPM-bf16) |
| [MiniCPM-2B-dpo](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16) | [MiniCPM-2B-dpo](https://modelscope.cn/models/OpenBMB/MiniCPM-2B-dpo-bf16/summary) |
| [MiniCPM-2B-128k](https://huggingface.co/openbmb/MiniCPM-2B-128k) | [MiniCPM-2B-128k](https://modelscope.cn/models/openbmb/MiniCPM-2B-128k/summary) |
| [MiniCPM-MoE-8x2B](https://huggingface.co/openbmb/MiniCPM-MoE-8x2B) | [MiniCPM-MoE-8x2B](https://modelscope.cn/models/OpenBMB/MiniCPM-MoE-8x2B) |
| [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16) | [MiniCPM-1B](https://modelscope.cn/models/OpenBMB/MiniCPM-1B-sft-bf16) |
| [MiniCPM-S-1B](https://huggingface.co/openbmb/MiniCPM-S-1B-sft) | [MiniCPM-S-1B](https://modelscope.cn/models/OpenBMB/MiniCPM-S-1B-sft) |

Note: More model versions can be found [here](https://huggingface.co/collections/openbmb/minicpm-2b-65d48bf958302b9fd25b698f).

## MiniCPM 4.0

MiniCPM 4 is an extremely efficient edge-side large model, optimized across four dimensions: model architecture, learning algorithms, training data, and inference systems.

- 🏗️ **Efficient Model Architecture:**
  - InfLLM v2 -- Trainable Sparse Attention Mechanism: Adopts a trainable sparse attention architecture in which each token only needs to compute relevance with less than 5% of the tokens when processing 128K-long texts, significantly reducing the computational overhead of long texts.
- 🧠 **Efficient Learning Algorithms:**
  - Model Wind Tunnel 2.0 -- Efficient Predictable Scaling: Introduces scaling prediction methods for downstream task performance, enabling more precise search over model training configurations.
  - BitCPM -- Ultimate Ternary Quantization: Compresses model parameters to three values (ternary), achieving an extreme 90% reduction in model bit-width.
  - Efficient Training Engineering Optimization: Adopts FP8 low-precision computing combined with a multi-token prediction training strategy.
- 📚 **High-Quality Training Data:**
  - UltraClean -- High-Quality Pre-training Data Filtering and Generation: Builds iterative data-cleaning strategies based on efficient data verification, and open-sources the high-quality Chinese and English pre-training dataset [Ultra-FineWeb](https://huggingface.co/datasets/openbmb/Ultra-FineWeb).
  - UltraChat v2 -- High-Quality Supervised Fine-tuning Data Generation: Constructs large-scale, high-quality supervised fine-tuning datasets covering multiple dimensions, including knowledge-intensive, reasoning-intensive, instruction-following, long-text understanding, and tool-calling data.
- ⚡ **Efficient Inference and Deployment System:**
  - CPM.cu -- Lightweight and Efficient CUDA Inference Framework: Integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding.
  - ArkInfer -- Cross-Platform Deployment System: Supports efficient deployment across multiple backend environments, providing flexible cross-platform adaptation capabilities.

### Evaluation Results

#### Efficiency Evaluation

On two typical end-side chips, Jetson AGX Orin and RTX 4090, MiniCPM4 processes long texts significantly faster than similar-size models. As text length increases, MiniCPM4's efficiency advantage becomes more pronounced. On the Jetson AGX Orin platform, MiniCPM4 achieves approximately a 7x decoding speed improvement over Qwen3-8B.

![benchmark](./assets/minicpm4/efficiency.png)

#### Comprehensive Evaluation

MiniCPM4 is released in end-side versions with 8B and 0.5B parameters, both achieving best-in-class performance at their scale.

![benchmark](./assets/minicpm4/benchmark.png)

#### Long Text Evaluation

MiniCPM4 is pre-trained on 32K-long texts and achieves length extension through YaRN. In the 128K needle-in-a-haystack task, MiniCPM4 demonstrates outstanding performance.
![long-niah](./assets/minicpm4/128k-niah.png)

### BitCPM4: Quantization

BitCPM4 is a family of ternary-quantized models derived from the MiniCPM series through quantization-aware training (QAT), achieving significant improvements in both training efficiency and parameter efficiency.

- Improved training method
  - Hyperparameters are searched with a wind tunnel on a small model.
  - A two-stage scheme trains in high precision first and then applies QAT, making the best use of the trained high-precision models and significantly reducing the compute required for the QAT phase.
- High parameter efficiency
  - With a bit-width of only 1.58 bits, BitCPM4 achieves performance comparable to full-precision models with a similar parameter count.

#### BitCPM4 Evaluation

BitCPM4's performance is comparable to that of full-precision models of the same size.

![bitcpm-benchmark](./assets/minicpm4/bitcpm4-benchmark.png)

#### BitCPM4 Inference

BitCPM4's parameters are stored in a fake-quantized format, which supports direct inference within the Huggingface framework.
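Because the weights are fake-quantized, a BitCPM4 checkpoint loads like any other Transformers model. The snippet below is a minimal sketch that mirrors the MiniCPM4-8B HuggingFace example later in this section; it assumes the BitCPM4-1B tokenizer ships a chat template, and the prompt and sampling settings are illustrative only.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# BitCPM4 weights are fake-quantized, so they load through the standard
# Transformers API; no dedicated ternary kernel is required for this path.
path = "openbmb/BitCPM4-1B"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16,
                                             device_map="cuda", trust_remote_code=True)

messages = [{"role": "user", "content": "Explain what ternary quantization is."}]  # placeholder prompt
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.7)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```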
"G2FT" stands for Gemini-2.0-Flash-Thinking, and "WTR1-7B" denotes Webthinker-R1-7B. FactScore evaluation was omitted for Webthinker, as it does not include citation functionality, and for OpenAI Deep Research, which does not provide citations when exporting the results.* #### MiniCPM4-MCP: Tool Use with Model Context Protocol **MiniCPM4-MCP** is an open-source on-device LLM agent model jointly developed by [THUNLP](https://nlp.csai.tsinghua.edu.cn), Renmin University of China and [ModelBest](https://modelbest.cn/en), built on [MiniCPM-4](https://huggingface.co/openbmb/MiniCPM4-8B) with 8 billion parameters. It is capable of solving a wide range of real-world tasks by interacting with various tool and data resources through MCP. As of now, MiniCPM4-MCP supports the following: - Utilization of tools across 16 MCP servers: These servers span various categories, including office, lifestyle, communication, information, and work management. - Single-tool-calling capability: It can perform single- or multi-step tool calls using a single tool that complies with the MCP. - Cross-tool-calling capability: It can perform single- or multi-step tool calls using different tools that complies with the MCP. ##### Demo Demo is available in this [link](./demo/minicpm4/MCP/README_en.md). ##### Performance Evaluation | MCP Server | | gpt-4o | | | qwen3 | | | minicpm4 | | |-----------------------|----------------|--------------|--------------|---------------|--------------|--------------|----------------|--------------|--------------| | | func | param | value | func | param | value | func | param | value | | Airbnb | 89.3 | 67.9 | 53.6 | 92.8 | 60.7 | 50.0 | 96.4 | 67.9 | 50.0 | | Amap-Maps | 79.8 | 77.5 | 50.0 | 74.4 | 72.0 | 41.0 | 89.3 | 85.7 | 39.9 | | Arxiv-MCP-Server | 85.7 | 85.7 | 85.7 | 81.8 | 54.5 | 50.0 | 57.1 | 57.1 | 52.4 | | Calculator | 100.0 | 100.0 | 20.0 | 80.0 | 80.0 | 13.3 | 100.0 | 100.0 | 6.67 | | Computor-Control-MCP | 90.0 | 90.0 | 90.0 | 90.0 | 90.0 | 90.0 | 90.0 | 90.0 | 86.7 | | Desktop-Commander | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | | Filesystem | 63.5 | 63.5 | 31.3 | 69.7 | 69.7 | 26.0 | 83.3 | 83.3 | 42.7 | |Github | 92.0 | 80.0 | 58.0 | 80.5 | 50.0 | 27.7 | 62.8 | 25.7 | 17.1 | | Gaode | 71.1 | 55.6 | 17.8 | 68.8 | 46.6 | 24.4 | 68.9 | 46.7 | 15.6 | | MCP-Code-Executor | 85.0 | 80.0 | 70.0 | 80.0 | 80.0 | 70.0 | 90.0 | 90.0 | 65.0 | | MCP-Docx | 95.8 | 86.7 | 67.1 | 94.9 | 81.6 | 60.1 | 95.1 | 86.6 | 76.1 | | PPT | 72.6 | 49.8 | 40.9 | 85.9 | 50.7 | 37.5 | 91.2 | 72.1 | 56.7 | | PPTx | 64.2 | 53.7 | 13.4 | 91.0 | 68.6 | 20.9 | 91.0 | 58.2 | 26.9 | | Simple-Time-Server | 90.0 | 70.0 | 70.0 | 90.0 | 90.0 | 90.0 | 90.0 | 60.0 | 60.0 | | Slack | 100.0 | 90.0 | 70.0 | 100.0 | 100.0 | 65.0 | 100.0 | 100.0 | 100.0 | | Whisper | 90.0 | 90.0 | 90.0 | 90.0 | 90.0 | 90.0 | 90.0 | 90.0 | 30.0 | | **Average** | **80.2** | **70.2** | **49.1** | **83.5** | **67.7** | **43.8** | **88.3** | **76.1** | **51.2** | #### MiniCPM Intel AIPC Client: A New Edge Large Model Powerhouse Developed in collaboration between Mianbi Intelligence and Intel, the MiniCPM Intel AIPC Client is an edge large model client specially designed for devices equipped with Intel Core Ultra series processors. It delivers a low-latency, high-efficiency, and privacy-preserving local large model experience for developers, researchers, and AI enthusiasts. 
Its core features include:

### Key Features

- **Deep Intel Hardware Adaptation**: Fully compatible with Intel Core Ultra series processors, enabling deep integration with the hardware to unleash peak performance. Users can run large models smoothly on local devices without relying on cloud services.
- **Extreme Optimization Based on OpenVINO**: Deeply optimized with the OpenVINO inference framework, it significantly boosts inference efficiency, reaching up to **80 tokens per second**. This ensures rapid model responses for both quick queries and complex task processing.
- **Privacy and Security Assurance**: With local deployment, all data processing is completed on the device, eliminating the privacy risks of cloud uploads. This is especially valuable in scenarios with high data-privacy requirements.
- **Catering to Diverse User Groups**: Whether for developers chasing cutting-edge technologies, researchers focused on academic studies, or enthusiasts eager to explore AI applications, the MiniCPM Intel AIPC Client enables easy access to the power of local large models, opening the door to personalized AI exploration.

### System Requirements

- Recommended processor: Intel Core Ultra 7 or higher (mobile version)
- Recommended RAM: 32GB or above

### Download

[Download](https://github.com/OpenBMB/MiniCPM/releases/tag/2.4.2)

### Inference

#### CPM.cu

We **recommend** using [CPM.cu](https://github.com/OpenBMB/CPM.cu) for MiniCPM4 inference. CPM.cu is a CUDA inference framework developed by OpenBMB that integrates sparse attention, speculative sampling, and quantization techniques, fully leveraging the efficiency advantages of MiniCPM4.

You can install CPM.cu by running the following commands:

```bash
git clone https://github.com/OpenBMB/CPM.cu.git --recursive
cd CPM.cu
python3 setup.py install
```

You can run the following commands to test the speed of the model:

```bash
python3 tests/long_prompt_gen.py # generate prompt.txt
python3 tests/test_generate.py --prompt-file prompt.txt
```

For more details about CPM.cu, please refer to the [CPM.cu](https://github.com/OpenBMB/CPM.cu) repository.

#### HuggingFace

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(0)

path = 'openbmb/MiniCPM4-8B'
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)

# User can directly use the chat interface
# responds, history = model.chat(tokenizer, "Write an article about Artificial Intelligence.", temperature=0.7, top_p=0.7)
# print(responds)

# User can also use the generate interface
messages = [
    {"role": "user", "content": "Write an article about Artificial Intelligence."},
]
prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([prompt_text], return_tensors="pt").to(device)

model_outputs = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    top_p=0.7,
    temperature=0.7
)
output_token_ids = [
    model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs['input_ids']))
]

responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(responses)
```

This model supports InfLLM v2, a sparse attention mechanism designed for efficient long-sequence inference. It requires the [infllmv2_cuda_impl](https://github.com/OpenBMB/infllmv2_cuda_impl) library.
You can install it by running the following commands:

```bash
git clone -b feature_infer https://github.com/OpenBMB/infllmv2_cuda_impl.git
cd infllmv2_cuda_impl
git submodule update --init --recursive
pip install -e . # or python setup.py install
```

To enable InfLLM v2, add the `sparse_config` field to `config.json`:

```json
{
    ...,
    "sparse_config": {
        "kernel_size": 32,
        "kernel_stride": 16,
        "init_blocks": 1,
        "block_size": 64,
        "window_size": 2048,
        "topk": 64,
        "use_nope": false,
        "dense_len": 8192
    }
}
```

These parameters control the behavior of InfLLM v2:

* `kernel_size` (default: 32): The size of semantic kernels.
* `kernel_stride` (default: 16): The stride between adjacent kernels.
* `init_blocks` (default: 1): The number of initial blocks that every query token attends to. This ensures attention to the beginning of the sequence.
* `block_size` (default: 64): The block size for key-value blocks.
* `window_size` (default: 2048): The size of the local sliding window.
* `topk` (default: 64): Specifies that each token computes attention with only the top-k most relevant key-value blocks.
* `use_nope` (default: false): Whether to use the NOPE technique in block selection for improved performance.
* `dense_len` (default: 8192): Since sparse attention offers limited benefits for short sequences, the model uses standard (dense) attention for sequences shorter than `dense_len` tokens and switches to sparse attention beyond this length. Set this to `-1` to always use sparse attention regardless of sequence length.

MiniCPM4 natively supports context lengths of up to 32,768 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend RoPE scaling techniques for effective handling of long texts. We have validated the model's performance on context lengths of up to 131,072 tokens by modifying the LongRoPE factor. To apply this modification, adjust the `rope_scaling` field in `config.json`:
```json { ..., "rope_scaling": { "rope_type": "longrope", "long_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116], "short_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116], "original_max_position_embeddings": 32768 } } ``` #### vLLM - Install vLLM Reference vLLM [official repository](https://github.com/vllm-project/vllm), install the latest version through *source code*. ``` pip install -U vllm \ --pre \ --extra-index-url https://wheels.vllm.ai/nightly ``` - Inference MiniCPM4-8B with vLLM: ```python from transformers import AutoTokenizer from vllm import LLM, SamplingParams model_name = "openbmb/MiniCPM4-8B" prompt = [{"role": "user", "content": "Please recommend 5 tourist attractions in Beijing. 
"}] tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True) llm = LLM( model=model_name, trust_remote_code=True, max_num_batched_tokens=32768, dtype="bfloat16", gpu_memory_utilization=0.8, ) sampling_params = SamplingParams(top_p=0.7, temperature=0.7, max_tokens=1024, repetition_penalty=1.02) outputs = llm.generate(prompts=input_text, sampling_params=sampling_params) print(outputs[0].outputs[0].text) ``` - Use Eagle Speculative Decoding in vLLM: initialize the inference engine as follows. ```python llm = LLM( model=model_name, trust_remote_code=True, max_num_batched_tokens=32768, dtype="bfloat16", gpu_memory_utilization=0.8, speculative_config={ "method": "eagle", "model": "openbmb/MiniCPM4-8B-Eagle-vLLM", "num_speculative_tokens": 2, "max_model_len": 32768, }, ) ``` - Inference quantized MiniCPM4-8B: initialize the inference engine as follows. ```python llm = LLM( model="openbmb/MiniCPM4-8B-marlin-vLLM", trust_remote_code=True, max_num_batched_tokens=32768, dtype="bfloat16", gpu_memory_utilization=0.8, ) ``` - Use Eagle Speculative Decoding for quantized MiniCPM4-8B: initialize the inference engine as follows. ```python llm = LLM( model="openbmb/MiniCPM4-8B-marlin-vLLM", trust_remote_code=True, max_num_batched_tokens=32768, dtype="bfloat16", gpu_memory_utilization=0.8, speculative_config={ "method": "eagle", "model": "openbmb/MiniCPM4-8B-marlin-Eagle-vLLM", "num_speculative_tokens": 2, "max_model_len": 32768, }, ) ``` > **Note**: If you're using an OpenAI-compatible server in vLLM, the `chat` API sets `add_special_tokens=False` by default. This will result in missing special tokens—such as the beginning-of-sequence (BOS) token—which are required for proper prompt formatting in **MiniCPM4**. To ensure correct behavior, you must explicitly set `extra_body={"add_special_tokens": True}` in your API call, like below: ```python import openai client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY") response = client.chat.completions.create( model="openbmb/MiniCPM4-8B", messages=[ {"role": "user", "content": "Write an article about Artificial Intelligence."}, ], temperature=0.7, max_tokens=1024, extra_body={"add_special_tokens": True}, # Ensures special tokens like BOS are added ) print(response.choices[0].message.content) ``` #### SGLang - Install SGLang Reference SGLang [official repository](https://github.com/sgl-project/sglang), install through *source code*. 
``` git clone -b openbmb https://github.com/sgl-project/sglang.git cd sglang pip install --upgrade pip pip install -e "python[all]" ``` - Start inference service ```shell python -m sglang.launch_server --model openbmb/MiniCPM4-8B --trust-remote-code --port 30000 --chat-template chatml ``` - Then, users can use the chat interface by running the following command: ```python import openai client = openai.Client(base_url=f"http://localhost:30000/v1", api_key="None") response = client.chat.completions.create( model="openbmb/MiniCPM4-8B", messages=[ {"role": "user", "content": "Write an article about Artificial Intelligence."}, ], temperature=0.7, max_tokens=1024, ) print(response.choices[0].message.content) ``` - Use speculative acceleration ```shell python3 -m sglang.launch_server --model-path [model] \ --speculative_draft_model_path [draft_model] \ --host 0.0.0.0 --trust-remote-code \ --speculative-algorithm EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \ --mem-fraction 0.5 ``` ## MiniCPM 3.0
MiniCPM 3.0 is a language model with 4 billion parameters. Compared to MiniCPM 1.0/2.0, it offers more comprehensive features and a significant improvement in overall capabilities. Its performance on most evaluation benchmarks rivals or even surpasses many models with 7B-9B parameters.

* **Supports Function Call🛠️ and Code Interpreter💻**: Achieved SOTA among models with fewer than 9B parameters on the [Berkeley Function Calling Leaderboard (BFCL)](https://gorilla.cs.berkeley.edu/leaderboard.html), outperforming GLM-4-9B-Chat and Qwen2-7B-Instruct.
* **Exceptional Reasoning Ability🧮**: In terms of math abilities, it outperforms GPT-3.5-Turbo and several 7B-9B models on [MathBench](https://open-compass.github.io/MathBench/). On the highly challenging [LiveCodeBench](https://livecodebench.github.io/), it surpasses Llama3.1-8B-Instruct.
* **Outstanding Instruction-Following in English and Chinese🤖**: Exceeds GLM-4-9B-Chat and Qwen2-7B-Instruct on English instruction following with [IFEval](https://huggingface.co/datasets/google/IFEval) and on Chinese instruction following with [FollowBench-zh](https://huggingface.co/datasets/YuxinJiang/FollowBench).
* **Long Context Capability**: Natively supports a 32k context length, with flawless performance. We introduce the [LLMxMapReduce](https://github.com/thunlp/LLMxMapReduce) framework, theoretically enabling processing of context lengths up to infinity. Enhanced by LLMxMapReduce, MiniCPM3-4B achieves performance comparable to GPT-4 and KimiChat on InfiniteBench.
* **RAG Capability**: We release the [MiniCPM RAG Suite](https://huggingface.co/collections/openbmb/minicpm-rag-suite-66d976b4204cd0a4f8beaabb). Based on the MiniCPM series models, [MiniCPM-Embedding](https://huggingface.co/openbmb/MiniCPM-Embedding) and [MiniCPM-Reranker](https://huggingface.co/openbmb/MiniCPM-Reranker) achieve SOTA performance on Chinese and Chinese-English cross-lingual retrieval tests. Specifically designed for the RAG scenario, [MiniCPM3-RAG-LoRA](https://huggingface.co/openbmb/MiniCPM3-RAG-LoRA) outperforms models like Llama3-8B and Baichuan2-13B on multiple tasks, such as open-domain question answering.

### Evaluation Results

#### Comprehensive Evaluation
| Benchmarks | Qwen2-7B-Instruct | GLM-4-9B-Chat | Gemma2-9B-it | Llama3.1-8B-Instruct | GPT-3.5-Turbo-0125 | Phi-3.5-mini-Instruct(3.8B) | MiniCPM3-4B |
|---|---|---|---|---|---|---|---|
| **English** | | | | | | | |
| MMLU | 70.5 | 72.4 | 72.6 | 69.4 | 69.2 | 68.4 | 67.2 |
| BBH | 64.9 | 76.3 | 65.2 | 67.8 | 70.3 | 68.6 | 70.2 |
| MT-Bench | 8.41 | 8.35 | 7.88 | 8.28 | 8.17 | 8.60 | 8.41 |
| IFEVAL (Prompt Strict-Acc.) | 51.0 | 64.5 | 71.9 | 71.5 | 58.8 | 49.4 | 68.4 |
| **Chinese** | | | | | | | |
| CMMLU | 80.9 | 71.5 | 59.5 | 55.8 | 54.5 | 46.9 | 73.3 |
| CEVAL | 77.2 | 75.6 | 56.7 | 55.2 | 52.8 | 46.1 | 73.6 |
| AlignBench v1.1 | 7.10 | 6.61 | 7.10 | 5.68 | 5.82 | 5.73 | 6.74 |
| FollowBench-zh (SSR) | 63.0 | 56.4 | 57.0 | 50.6 | 64.6 | 58.1 | 66.8 |
| **Mathematics** | | | | | | | |
| MATH | 49.6 | 50.6 | 46.0 | 51.9 | 41.8 | 46.4 | 46.6 |
| GSM8K | 82.3 | 79.6 | 79.7 | 84.5 | 76.4 | 82.7 | 81.1 |
| MathBench | 63.4 | 59.4 | 45.8 | 54.3 | 48.9 | 54.9 | 65.6 |
| **Coding** | | | | | | | |
| HumanEval+ | 70.1 | 67.1 | 61.6 | 62.8 | 66.5 | 68.9 | 68.3 |
| MBPP+ | 57.1 | 62.2 | 64.3 | 55.3 | 71.4 | 55.8 | 63.2 |
| LiveCodeBench v3 | 22.2 | 20.2 | 19.2 | 20.4 | 24.0 | 19.6 | 22.6 |
| **Tool Use** | | | | | | | |
| BFCL v2 | 71.6 | 70.1 | 19.2 | 73.3 | 75.4 | 48.4 | 76.0 |
| **Overall** | | | | | | | |
| Average | 65.3 | 65.0 | 57.9 | 60.8 | 61.0 | 57.2 | 66.3 |
#### Function Calling

We evaluate the function calling capability of MiniCPM3 on the [Berkeley Function Calling Leaderboard (BFCL)](https://gorilla.cs.berkeley.edu/leaderboard.html). MiniCPM3-4B outperforms several models with 7B-9B parameters on this leaderboard, surpassing GPT-3.5-Turbo-0125. An illustrative client-side example is shown after the table.

| Model | Overall Accuracy | AST Summary | Exec Summary | Irrelevance Detection | Relevance Detection |
|---|---|---|---|---|---|
| MiniCPM3-4B | 76.03% | 68.55% | 85.54% | 53.71% | 90.24% |
| Llama3.1-8B-Instruct | 73.28% | 64.61% | 86.48% | 43.12% | 85.37% |
| Qwen2-7B-Instruct | 71.61% | 65.71% | 79.57% | 44.70% | 90.24% |
| GLM-4-9B-Chat | 70.08% | 60.69% | 80.02% | 55.02% | 82.93% |
| Phi-3.5-mini-instruct | 48.44% | 38.89% | 54.04% | 46.78% | 65.85% |
| Gemma2-9B-it | 19.18% | 5.41% | 18.50% | 88.88% | 7.32% |
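The function-call demo under Advanced Features below starts an OpenAI-compatible server for MiniCPM3-4B. The snippet that follows is an illustrative client-side sketch against such a server; the port, the availability of the standard `tools` field, and the `get_weather` schema are assumptions rather than part of the official demo.

```python
import openai

# Client-side sketch for OpenAI-style tool calling against the demo server
# started in the Advanced Features section (python openai_api_server.py ...).
# The port, the `tools` support, and the get_weather schema are assumptions.
client = openai.Client(base_url="http://localhost:8000/v1", api_key="token-abc123")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="MiniCPM3-4B",  # matches --served-model-name in the demo command
    messages=[{"role": "user", "content": "What's the weather like in Beijing today?"}],
    tools=tools,
    temperature=0.7,
)
# If the model decides to call the tool, the structured call is returned here.
print(response.choices[0].message.tool_calls)
```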
#### Long Context Capability In the [Needle in a Haystack](https://github.com/gkamradt/LLMTest_NeedleInAHaystack) test with a context length of 32k, the results are shown as follows: ![needle](assets/eval_needle.jpeg) We also propose a divide-and-conquer long-sequence processing framework [LLMxMapReduce](https://github.com/thunlp/LLMxMapReduce) to support text with any length. MiniCPM3xMapReduce can achieve comparable performance with GPT-4 and KimiChat. | | Context length| Qwen2-70b | Kimi-Chat(2024.06) | GPT-4 (From InfiniteBench) | MiniCPM 3.0 x MR | Qwen2-70b x MR | Llama3-70bx MR | | ----------------------------- | ---------- | --------- | ------------------ | -------------------------- | --------------- | ------------ | ------------- | | Math.Find | 87.9k | 59.71% | 18.57% | 60.00% | 83.43% | 54.29% | **91.43%** | | Retrieve.KV | 89.9k | 29.00% | 69.20% | 89.00% | 93.80% | 98.80% | **98.89%** | | En.Dia | 103.6K | 23.00% | 23.00% | 7.50% | 12.50% | **46.50%** | 17.50% | | Code.Debug | 114.7k | 45.43% | 38.32% | 54.31% | 25.63% | 54.82% | **62.94%** | | Retrieve.Number | 122.4k | **100.00%** | 97.45% | **100.00%** | 99.32% | **100.00%** | 99.79% | | Retrieve.PassKey | 122.4k | **100.00%** | 99.32% | **100.00%** | 98.81% | **100.00%** | **100.00%** | | En.Sum | 171.5K | 31.85% | 29.94% | 14.73% | 25.89% | **32.39%** | 30.63% | | En.MC | 184.4k | 81.66% | 79.91% | 68.12% | 66.38% |**83.84%** | 82.10% | | En.QA | 192.6k | 21.97% | 18.80% | 22.44% | 28.39% | 23.13% | **34.70%** | | Zh.QA | 2068.6k | 21.40% | 19.84% | **25.96%** | 23.66% | 19.10% | N/A | | avg w/o Zh.QA | / | 51.92% | 52.96% | 55.33% | 59.29% | 64.98% | **68.64%** | | avg | / | 48.86% | 49.65% | 52.39% | 55.55% | **60.39%** | N/A | ### Inference #### Huggingface ```python from transformers import AutoModelForCausalLM, AutoTokenizer import torch torch.manual_seed(0) path = 'openbmb/MiniCPM3-4B' tokenizer = AutoTokenizer.from_pretrained(path) model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True) responds, history = model.chat(tokenizer, "Write an article about Artificial Intelligence.", temperature=0.7, top_p=0.7) print(responds) ``` #### SGLang (Recommended) * Installation Refer to SGLang [repo](https://github.com/sgl-project/sglang) to install the latest version *via source code*. 
* Launch a server ```shell python -m sglang.launch_server --model openbmb/MiniCPM3-4B --trust-remote-code --port 30000 --chat-template chatml ``` * Example code ```python from sglang import function, system, user, assistant, gen, set_default_backend, RuntimeEndpoint @function def multi_turn_question(s, question_1, question_2): s += user(question_1) s += assistant(gen("answer_1", max_tokens=1024)) s += user(question_2) s += assistant(gen("answer_2", max_tokens=1024)) set_default_backend(RuntimeEndpoint("http://localhost:30000")) state = multi_turn_question.run( question_1="Introduce artificial intelligence", question_2="Write an article about it", ) for m in state.messages(): print(m["role"], ":", m["content"]) ``` #### vLLM * Install vllm ```shell pip install "vllm>=0.6.2" ``` * Inference ```python from transformers import AutoTokenizer from vllm import LLM, SamplingParams model_name = "openbmb/MiniCPM3-4B" prompt = [{"role": "user", "content": "Write an article about Artificial Intelligence."}] tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True) llm = LLM(model=model_name, trust_remote_code=True, tensor_parallel_size=1 ) sampling_params = SamplingParams(top_p=0.7, temperature=0.7, max_tokens=1024) outputs = llm.generate(prompts=input_text, sampling_params=sampling_params) print(outputs[0].outputs[0].text) ``` #### llama.cpp We have provided the [GGUF formats]((https://huggingface.co/openbmb/MiniCPM3-4B-GGUF)) of MiniCPM3, which can be used in llama.cpp. * Install llama.cpp ```shell git clone https://github.com/ggerganov/llama.cpp cd llama.cpp make ``` * Inference ```shell ./llama-cli -c 1024 -m minicpm3-4b-fp16.gguf -n 1024 --top-p 0.7 --temp 0.7 --prompt "<|im_start|>user\nWrite an article about Artificial Intelligence.<|im_end|>\n<|im_start|>assistant\n" ``` ### Fine-Tuning #### LLaMA-Factory We have supported fine-tuning MiniCPM3 using [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory). For usage instructions, refer to [LLaMA-Factory Fine-tuning](https://modelbest.feishu.cn/docx/Z7USdW4lloZzkZxQ14icJ3senjb?from=from_copylink)." ### Advanced Features We use [vLLM](#vllm) in the example code for the following advanced features. #### Function calling We provide example code for using function calls with MiniCPM3: ```bash cd demo/minicpm3/function_call python function_call.py ``` If you want to start a function call service, use the following commands: ```bash cd demo/minicpm3/function_call pip install -r requirements.txt python openai_api_server.py \ --model openbmb/MiniCPM3-4B \ --served-model-name MiniCPM3-4B \ --chat-template chatml.jinja \ --dtype auto \ --api-key token-abc123 \ --tensor-parallel-size 1 \ --trust-remote-code ``` Below is a demo of using a search engine to answer the question: ![function_call](./assets/function_call.gif) #### Code Interpreter We provide example code for using the code interpreter with MiniCPM3: ```bash cd demo/minicpm3/code_interpreter pip install -r requirements.txt python code_interpreter.py openbmb/MiniCPM3-4B ``` Below is an example of using the code interpreter to generate a QR code: ![code_interpreter](./assets/code_interpreter.gif)
## MiniCPM 2.0
### Introduction

The MiniCPM 2.0 series upgrades MiniCPM along multiple dimensions, including:
- [MiniCPM-2B-128k](https://huggingface.co/openbmb/MiniCPM-2B-128k): Extends the MiniCPM-2B context window to 128k, outperforming larger models such as ChatGLM3-6B-128k and Yi-6B-200k on InfiniteBench.
- [MiniCPM-MoE-8x2B](https://huggingface.co/openbmb/MiniCPM-MoE-8x2B): Upcycled from MiniCPM-2B. Compared to MiniCPM-2B, the overall performance improves by an average of 4.5pp.
- [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16): 60% lower inference cost than MiniCPM-2B, while still showing better overall performance than LLaMA2-13B.
- [MiniCPM-S-1B](https://huggingface.co/openbmb/MiniCPM-S-1B-sft): The FFN layer achieves an average sparsity of 87.89% and reduces FFN FLOPs by 84%, with no performance loss on downstream tasks. Combined with PowerInfer, MiniCPM-S-1B achieves roughly a 2.8x inference speedup.

### Evaluation Results

#### MiniCPM-2B-128k

| Model | avg | avg w/o code&math | passkey | number_string | kv_retrieval | longbook_choice_eng | longbook_qa_chn | longbook_qa_eng | longbook_sum_eng | longdialogue_qa_eng | math_calc | math_find | code_debug | code_run |
|-------------------------------------|-------|-------------------|---------|---------------|--------------|---------------------|-----------------|-----------------|------------------|---------------------|-----------|-----------|------------|----------|
| LWM-Text-128k | 24.45 | 33.62 | 100 | 97.8 | 0.6 | 28.82 | 15.93 | 14.31 | 9.99 | 1.5 | 0 | 3.43 | 20.05 | 1 |
| Yarn-Mistral-7b-128k | 19.84 | 27.36 | 92.71 | | 0 | 27.95 | 15.49 | 9.55 | 9.06 | 7.5 | 0 | 17.14 | 0.76 | 1.25 |
| Mistral-7B-Instruct-v0.2(ABF 1000w) | 27.75 | 36.9 | 100 | 78.98 | 3.6 | 37.12 | 11.74 | 17.37 | 21.12 | 9.5 | 0 | 29.43 | 17.51 | 0 |
| Yi-6B-200k | 22.15 | 32.54 | 100 | 94.92 | 0 | 36.68 | 15.07 | 9.2 | 0.92 | 3.5 | 0 | 4.29 | 0.51 | 0.75 |
| chatglm3-6b-128k | 25.58 | 36.57 | 89.93 | 99.66 | 5.2 | 46.29 | 10.7 | 8.38 | 25.91 | 6.5 | 0 | 8 | 5.33 | 1 |
| MiniCPM-2.4B-128k | 27.32 | 37.68 | 98.31 | 99.83 | 9 | 29.69 | 23.06 | 16.33 | 15.73 | 9.5 | 0 | 4.29 | 22.08 | 0 |

#### MiniCPM-MoE-8x2B
| Model | BBH | MMLU | CEval | CMMLU | HumanEval | MBPP† | GSM8K | MATH |
|---|---|---|---|---|---|---|---|---|
| Llama2-34B* | 44.1 | 62.6 | - | - | 22.6 | 33.0 | 42.2 | 6.24 |
| Mistral-7B-Instruct-v0.2 | 39.81 | 60.51 | 42.55 | 41.92 | 36.59 | 39.63 | 40.49 | 4.95 |
| Gemma-7B* | 55.1 | 64.3 | - | - | 32.3 | 44.4 | 46.4 | 24.3 |
| Qwen1.5-7B* | 40.2 | 61 | 74.1 | 73.1 | 36 | 37.4 | 62.5 | 20.3 |
| Deepseek-MoE(16B)* | - | 45.0 | 40.6 | 42.5 | 26.8 | 39.2 | 18.8 | 4.3 |
| MiniCPM-2.4B | 36.87 | 53.46 | 51.13 | 51.07 | 50.00 | 35.93 | 53.83 | 10.24 |
| MiniCPM-MoE-8x2B | 39.22 | 58.90 | 58.11 | 58.80 | 55.49 | 41.68 | 61.56 | 10.52 |
Note: \* means evaluation results are directly taken from their technical reports. † means evaluation on the full set of MBPP, instead of the hand-verified set.

#### MiniCPM-S-1B

- Code Generation: average pass@1 score of HumanEval (0-shot) and MBPP (3-shot).
- Commonsense Reasoning: average 0-shot accuracy of PIQA, SIQA, HellaSwag, WinoGrande and COPA.
- Reading Comprehension: average 0-shot accuracy of BoolQ, LAMBADA and TyDi-QA.
- Other Benchmarks: average performance of GSM8K (8-shot), MMLU (5-shot), BBH (3-shot) and AGI-Eval (0-shot).

| Setting | Average Sparsity | Average Performance | Code Generation | Commonsense Reasoning | Reading Comprehension | GSM8K | MMLU | BBH | AGI-Eval |
| :-------------------: | :----------------: | :----------------------: | :----------------------: | :---: | :---: | :---: | :---------: | :-----: | :-----------------: |
| LLaMA2-7B | - | 37.96 | 16.37 | 69.59 | 61.87 | 12.96 | 44.45 | 32.96 | 27.53 |
| ReluLLaMA-7B | 66.98 | 37.62 | 15.85 | 69.64 | 70.54 | 5.84 | 38.64 | 35.07 | 27.73 |
| **ProSparse-7B**\* | 88.11 | 38.31 | 19.47 | 66.29 | 63.33 | 12.74 | 45.21 | 33.59 | 27.55 |
| **ProSparse-7B** | **89.32** | **38.46** | 19.42 | 66.27 | 63.50 | 12.13 | 45.48 | 34.99 | 27.46 |
| LLaMA2-13B | - | 44.06 | 20.19 | 72.58 | 71.55 | 22.21 | 54.69 | 37.89 | 29.33 |
| ReluLLaMA-13B | 71.56 | 42.74 | 20.19 | 70.44 | 73.29 | 18.50 | 50.58 | 37.97 | 28.22 |
| **ProSparse-13B**\* | 87.97 | **45.07** | 29.03 | 69.75 | 67.54 | 25.40 | 54.78 | 40.20 | 28.76 |
| **ProSparse-13B** | **88.80** | 44.90 | 28.42 | 69.76 | 66.91 | 26.31 | 54.35 | 39.90 | 28.67 |
| MiniCPM-1B | - | 44.44 | 36.85 | 63.67 | 60.90 | 35.48 | 50.44 | 35.03 | 28.71 |
| **MiniCPM-S-1B**\* | 86.25 | **44.72** | 41.38 | 64.55 | 60.69 | 34.72 | 49.36 | 34.04 | 28.27 |
| **MiniCPM-S-1B** | **87.89** | **44.72** | 42.04 | 64.37 | 60.73 | 34.57 | 49.51 | 34.08 | 27.77 |

Note:
1. [ReluLLaMA-7B](https://huggingface.co/SparseLLM/ReluLLaMA-7B) and [ReluLLaMA-13B](https://huggingface.co/SparseLLM/ReluLLaMA-13B). "ProSparse-7B\*", "ProSparse-13B\*" and "MiniCPM-S-1B\*" denote the ProSparse versions without activation-threshold offset.
2. For PIQA, SIQA, HellaSwag, WinoGrande, COPA, BoolQ, LAMBADA, TyDi QA and AGI-Eval, we adopt PPL-based evaluation. For GSM8K, MMLU and BBH, we perform generation-based evaluation.

### Inference

#### HuggingFace, vLLM

Please refer to the [Inference](#huggingface-inference) section of MiniCPM 1.0.

#### PowerInfer

Currently, PowerInfer is exclusively tailored to the MiniCPM-S-1B model; support for other versions is not yet available, stay tuned.

1. Ensure your CMake version is 3.17 or above. If you have already installed it, you can skip this step.

```bash
# Download the installation package
sudo wget https://cmake.org/files/v3.23/cmake-3.23.0.tar.gz
# Extract the installation package
sudo tar -zxvf cmake-3.23.0.tar.gz
# Enter the extracted directory and configure the installation environment
cd cmake-3.23.0
sudo ./configure
sudo make -j8
# Compile and install
sudo make install
# Check the version after installation
cmake --version
# If the version number is returned, the installation was successful
# cmake version 3.23.0
```

2. Install PowerInfer:

```bash
git clone https://github.com/SJTU-IPADS/PowerInfer
cd PowerInfer
pip install -r requirements.txt # install Python helpers' dependencies
```

3. Compile the CPU version of PowerInfer. If your machine only has a CPU, or if you want to perform inference using the CPU, run the following commands:

```bash
cmake -S . -B build
cmake --build build --config Release
```

4. Compile the GPU version of PowerInfer. If your machine has a GPU, you can run the following commands:

```bash
cmake -S . -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release
```

5. Retrieve the sparse model:

```bash
git clone https://huggingface.co/openbmb/MiniCPM-S-1B-sft-gguf
# or
git clone https://modelscope.cn/models/OpenBMB/MiniCPM-S-1B-sft-gguf
```

6. Model inference:

```bash
cd PowerInfer
# Below is the command template. output_token_count refers to the maximum output tokens, thread_num is the number of threads, and prompt is the input prompt text.
# ./build/bin/main -m /PATH/TO/MODEL -n $output_token_count -t $thread_num -p $prompt
# Below is an example
./build/bin/main -m /root/ld/ld_model_pretrain/1b-s-minicpm/MiniCPM-S-1B-sft.gguf -n 2048 -t 8 -p 'hello,tell me a story please.'
```
## MiniCPM 1.0
### Introduction

MiniCPM-2B is a dense language model with only 2.4B parameters excluding embeddings (2.7B in total).

- After SFT, MiniCPM performs very close to Mistral-7B on open-source general benchmarks, with better abilities in Chinese, mathematics, and coding. Its overall performance exceeds Llama2-13B, MPT-30B, Falcon-40B, etc.
- After DPO, MiniCPM outperforms Llama2-70B-Chat, Vicuna-33B, Mistral-7B-Instruct-v0.1, Zephyr-7B-alpha, etc. on MT-Bench.

Note: To ensure the generality of the model for academic research purposes, **we have not subjected it to any identity-specific training.** Meanwhile, as we use the ShareGPT open-source corpus as part of the training data, the model may output identity-related information similar to the GPT series models.

### Evaluation Results

#### Evaluation Settings

* Since it is difficult to standardize the evaluation of LLMs, and there is no public prompt and test code for a large number of evaluations, we can only try our best to make the specific evaluation methods suitable for all types of models.
* Overall, we use a unified prompt input for testing, and adjust the input according to the corresponding template for each model.
* **The evaluation scripts and prompts have been open-sourced in our GitHub repository, and we welcome more developers to continuously improve our evaluation methods.**
* For the text evaluation part, we use our open-source large model capability evaluation framework [UltraEval](https://github.com/OpenBMB/UltraEval). The reproduction process for open-source models is as follows:
  * Install UltraEval
```shell
git clone https://github.com/OpenBMB/UltraEval.git
cd UltraEval
pip install -e .
```
  * Download the relevant data and unzip it for processing
```shell
wget -O RawData.zip "https://cloud.tsinghua.edu.cn/f/71b5232264ae4833a4d0/?dl=1"
unzip RawData.zip
python data_process.py
```
  * Execute the evaluation scripts (templates are provided and can be customized)
```shell
bash run_eval.sh
```

#### Deployment mode

* Because MiniCPM uses the Mup structure, which differs slightly from existing models in some specific computations, we based the implementation of our model on vllm=0.2.2.
* **For non-MiniCPM models, we directly used the latest version, vllm=0.2.7, for inference.**

#### Evaluation method

* For QA tasks (multiple-choice tasks), we test in two ways:
  * PPL: the options are used as a continuation of the question, and the answer is selected based on the PPL of each option;
  * Direct generation: the model generates the answer option directly.
* For different models, the results obtained by these two approaches vary widely. The results of the two methods are closer on both MiniCPM models, while models such as Mistral-7B-v0.1 perform better on PPL and worse on direct generation.
* In the specific evaluation, we take the higher score of the two evaluation methods as the final result, so as to ensure fairness of the comparison (\* in the following table indicates PPL). A minimal sketch of the PPL-based option scoring is shown below.
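For concreteness, the following is a minimal sketch of the PPL-style option scoring described above. It is illustrative only and not the UltraEval implementation: the model path, the sample question, and the candidate options are placeholders, and the slicing assumes the question tokenizes identically with and without the appended option.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: not the UltraEval implementation.
path = "openbmb/MiniCPM-2B-sft-bf16"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.bfloat16, device_map="cuda", trust_remote_code=True
)

def option_logprob(question: str, option: str) -> float:
    """Average log-likelihood of `option` as a continuation of `question`."""
    q_len = tokenizer(question, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(question + option, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(ids).logits
    # Position t predicts token t+1, so shift and keep only the option tokens.
    log_probs = torch.log_softmax(logits[:, :-1].float(), dim=-1)
    token_ll = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_ll[:, q_len - 1:].mean().item()

question = "Question: Which planet is known as the Red Planet?\nAnswer:"
options = [" Mars", " Venus", " Jupiter", " Saturn"]
scores = [option_logprob(question, o) for o in options]
print(options[scores.index(max(scores))])  # pick the option with the lowest PPL
```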
#### Text evaluation |Model|Average Score|Average Score in English|Average Score in Chinese|C-Eval|CMMLU|MMLU|HumanEval|MBPP|GSM8K|MATH|BBH|ARC-E|ARC-C|HellaSwag| |-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| |Llama2-7B|35.40|36.21|31.765|32.42|31.11|44.32|12.2|27.17|13.57|1.8|33.23|75.25|42.75|75.62*| |Qwen-7B|49.46|47.19|59.655|58.96|60.35|57.65|17.07|42.15|41.24|5.34|37.75|83.42|64.76|75.32*| |Deepseek-7B|39.96|39.15|43.635|42.82|44.45|47.82|20.12|41.45|15.85|1.53|33.38|74.58*|42.15*|75.45*| |Mistral-7B|48.97|49.96|44.54|46.12|42.96|62.69|27.44|45.2|33.13|5.0|41.06|83.92|70.73|80.43*| |Llama2-13B|41.48|42.44|37.19|37.32|37.06|54.71|17.07|32.55|21.15|2.25|37.92|78.87*|58.19|79.23*| |MPT-30B|38.17|39.82|30.715|29.34|32.09|46.56|21.95|35.36|10.31|1.56|38.22|78.66*|46.08*|79.72*| |Falcon-40B|43.62|44.21|40.93|40.29|41.57|53.53|24.39|36.53|22.44|1.92|36.24|81.94*|57.68|83.26*| |MiniCPM-2B|52.33|52.6|51.1|51.13|51.07|53.46|50.00|47.31|53.83|10.24|36.87|85.44|68.00|68.25| |Model|Average Score|Average Score in English|Average Score in Chinese|C-Eval|CMMLU|MMLU|HumanEval|MBPP|GSM8K|MATH|BBH|ARC-E|ARC-C|HellaSwag| |-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| |TinyLlama-1.1B|25.36|25.55|24.525|25.02|24.03|24.3|6.71|19.91|2.27|0.74|28.78|60.77*|28.15*|58.33*|Qwen-1.8B|34.72|31.87|47.565|49.81|45.32|43.37|7.93|17.8|19.26|2.42|29.07|63.97*|43.69|59.28*| |Qwen-1.8B|34.72|31.87|47.565|49.81|45.32|43.37|7.93|17.8|19.26|2.42|29.07|63.97*|43.69|59.28*| |Gemini Nano-3B|-|-|-|-|-|-|-|27.2(report)|22.8(report)|-|42.4(report)|-|-|-| |StableLM-Zephyr-3B|43.46|46.31|30.615|30.34|30.89|45.9|35.37|31.85|52.54|12.49|37.68|73.78|55.38|71.87*| |Phi-2-2B|48.84|54.41|23.775|23.37|24.18|52.66|47.56|55.04|57.16|3.5|43.39|86.11|71.25|73.07*| |MiniCPM-2B|52.33|52.6|51.1|51.13|51.07|53.46|50.00|47.31|53.83|10.24|36.87|85.44|68.00|68.25| |Model|Average Score|Average Score in English|Average Score in Chinese|C-Eval|CMMLU|MMLU|HumanEval|MBPP|GSM8K|MATH|BBH|ARC-E|ARC-C|HellaSwag| |-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| |ChatGLM2-6B|37.98|35.17|50.63|52.05|49.21|45.77|10.37|9.38|22.74|5.96|32.6|74.45|56.82|58.48*| |Mistral-7B-Instruct-v0.1|44.36|45.89|37.51|38.06|36.96|53.56|29.27|39.34|28.73|3.48|39.52|81.61|63.99|73.47*| |Mistral-7B-Instruct-v0.2|50.91|52.83|42.235|42.55|41.92|60.51|36.59|48.95|40.49|4.95|39.81|86.28|73.38|84.55*| |Qwen-7B-Chat|44.93|42.05|57.9|58.57|57.23|56.03|15.85|40.52|42.23|8.3|37.34|64.44*|39.25*|74.52*| |Yi-6B-Chat|50.46|45.89|70.995|70.88|71.11|62.95|14.02|28.34|36.54|3.88|37.43|84.89|70.39|74.6*| |Baichuan2-7B-Chat|44.68|42.74|53.39|53.28|53.5|53|21.34|32.32|25.25|6.32|37.46|79.63|60.15|69.23*| |Deepseek-7B-chat|49.34|49.56|48.335|46.95|49.72|51.67|40.85|48.48|48.52|4.26|35.7|76.85|63.05|76.68*| |Llama2-7B-Chat|38.16|39.17|33.59|34.54|32.64|47.64|14.02|27.4|21.15|2.08|35.54|74.28|54.78|75.65*| |MiniCPM-2B|52.33|52.6|51.1|51.13|51.07|53.46|50.00|47.31|53.83|10.24|36.87|85.44|68.00|68.25| #### DPO evaluation |Model|MT-bench| |---|---| |GPT-4-turbo|9.32| |GPT-3.5-turbo|8.39| |Mistral-8*7b-Instruct-v0.1|8.30| |Claude-2.1|8.18| |Zephyr-7B-beta|7.34| |**MiniCPM-2B**|**7.25**| |Vicuna-33B|7.12| |Zephyr-7B-alpha|6.88| |LLaMA-2-70B-chat|6.86| |Mistral-7B-Instruct-v0.1|6.84| |MPT-34B-instruct|6.39| ### Quick Start #### Online - [Colab](https://colab.research.google.com/drive/1tJcfPyWGWA5HezO7GKLeyeIso0HyOc0l?usp=sharing) #### Web-demo based on Gradio Using the following command can launch the gradio-based demo. 
```shell
# generation powered by vllm
python demo/minicpm/vllm_based_demo.py --model_path
# generation powered by huggingface
python demo/minicpm/hf_based_demo.py --model_path
```

#### Huggingface Inference

##### MiniCPM-2B

Install `transformers>=4.36.0` and `accelerate`, then run the following Python code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(0)

path = 'openbmb/MiniCPM-2B-dpo-bf16'
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True)

responds, history = model.chat(tokenizer, "Which city is the capital of China?", temperature=0.8, top_p=0.8)
print(responds)
```

##### MiniCPM-2B (Llama Format)

To facilitate ease of use, we have converted the model weights of MiniCPM to adapt to the structure of the LLaMA model:

```python
import torch
from transformers import LlamaTokenizerFast, LlamaForCausalLM
model_path = "openbmb/MiniCPM-2B-dpo-bf16-llama-format"
tokenizer = LlamaTokenizerFast.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True)

prompt = "Now you act like a terminal situated within a beginner's C++ practice repository folder, please provide the output for the command: `ls -l`"
input_ids = tokenizer.encode("<用户>{}<AI>".format(prompt), return_tensors='pt', add_special_tokens=True).cuda()
responses = model.generate(input_ids, temperature=0.3, top_p=0.8, repetition_penalty=1.02, max_length=1024)
responses = tokenizer.decode(responses[0], skip_special_tokens=True)
print(responses)
```

#### vLLM Inference

Install [vLLM](https://github.com/vllm-project/vllm):

```shell
pip install "vllm>=0.4.1"
```

See [here](#vllm) for the inference code.

#### SGLang Inference

Install [SGLang](https://github.com/sgl-project/sglang).

* First, launch a server:

```bash
python -m sglang.launch_server --model-path openbmb/MiniCPM-2B-dpo-fp16 --trust-remote-code --port 30000
```

* You can then use it for inference as shown below:

```python
from sglang import function, gen, set_default_backend, RuntimeEndpoint

@function
def text_qa(s, question):
    s += "<用户>" + question + "<AI>"
    s += gen("answer", max_tokens=1024, temperature=0.7, top_p=0.7)

set_default_backend(RuntimeEndpoint("http://localhost:30000"))

state = text_qa.run(
    question="What is the capital of China?",
)

print(state["answer"])
```

#### llama.cpp, Ollama, fastllm, mlx_lm Inference

We support inference with [llama.cpp](https://github.com/ggerganov/llama.cpp/), [ollama](https://github.com/ollama/ollama), [fastllm](https://github.com/ztxz16/fastllm), and [mlx_lm](https://github.com/ml-explore/mlx-examples). Thanks to [@runfuture](https://github.com/runfuture) for the llama.cpp and ollama adaptations. Please refer to the [Edge Deployment Tutorial](https://modelbest.feishu.cn/wiki/VL5kw9DsEiRDmJkEyTUcydE0nie).

#### Quantization

Please refer to the [Quantization Tutorial](https://modelbest.feishu.cn/wiki/EatbwdLuvitbbMk2X5wcX6h5n7c).

#### Fine-Tuning

* With parameter-efficient tuning, MiniCPM can be tuned on a single NVIDIA GeForce GTX 1080/2080: [code](https://github.com/OpenBMB/MiniCPM/tree/main/finetune) (a generic LoRA sketch is also shown after this list).
* MLX fine-tuning: [Guideline](https://modelbest.feishu.cn/wiki/AIU3wbREcirOm9kkvd7cxujFnMb#share-ASrDdvFAloHtycxfy85cLNhAnd3)
* [xtuner](https://github.com/InternLM/xtuner): [the best choice for parameter-efficient tuning of MiniCPM](https://modelbest.feishu.cn/wiki/AIU3wbREcirOm9kkvd7cxujFnMb#AMdXdzz8qoadZhxU4EucELWznzd)
* [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory.git): [a one-click solution for fine-tuning MiniCPM](https://modelbest.feishu.cn/wiki/AIU3wbREcirOm9kkvd7cxujFnMb#BAWrdSjXuoFvX4xuIuzc8Amln5E)
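As a complement to the options above, here is a minimal LoRA sketch built on the generic `peft`, `transformers`, and `datasets` libraries. It is not the repository's finetune script: the target module names, the toy prompt format, and all hyperparameters are assumptions that you would adapt to your own data.

```python
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

path = "openbmb/MiniCPM-2B-sft-bf16"
tokenizer = AutoTokenizer.from_pretrained(path)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.bfloat16, device_map="cuda", trust_remote_code=True
)

# Assumed LLaMA-style projection names; adjust to the model's actual module names.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# Toy dataset: replace with real instruction data rendered in the model's chat format.
texts = ["<用户>Who are you?<AI>I am an assistant based on MiniCPM."]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="minicpm-lora", per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=1e-4, bf16=True,
                           logging_steps=1, report_to=[]),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("minicpm-lora")
```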
## LICENSE

#### Model LICENSE

* This repository and the MiniCPM models are released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.

#### Statement

* As a language model, MiniCPM generates content by learning from a vast amount of text.
* However, it does not possess the ability to comprehend or express personal opinions or value judgments.
* Any content generated by MiniCPM does not represent the viewpoints or positions of the model developers.
* Therefore, when using content generated by MiniCPM, users should take full responsibility for evaluating and verifying it on their own.

## Institutions

This project is developed by the following institutions:

- [Modelbest Inc.](https://modelbest.cn/)
- [THUNLP](https://nlp.csai.tsinghua.edu.cn/)
- [Gaoling School of Artificial Intelligence of RUC](https://linyankai.github.io/)

## Citation

* Please cite our papers [MiniCPM](https://arxiv.org/abs/2404.06395) and [MiniCPM4](https://github.com/OpenBMB/MiniCPM/blob/main/report/MiniCPM_4_Technical_Report.pdf) if you find our work valuable.

```
@article{minicpm4,
  title={MiniCPM4: Ultra-Efficient LLMs on End Devices},
  author={MiniCPM Team},
  year={2025}
}

@inproceedings{huminicpm,
  title={MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies},
  author={Hu, Shengding and Tu, Yuge and Han, Xu and Cui, Ganqu and He, Chaoqun and Zhao, Weilin and Long, Xiang and Zheng, Zhi and Fang, Yewei and Huang, Yuxiang and others},
  booktitle={First Conference on Language Modeling},
  year={2024}
}
```