v1.0.5

Commit 18ff696d authored Dec 03, 2024 by chenzk
20 changed files
--- a/.gitignore
+++ b/.gitignore
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+#   For a library or package, you might want to ignore these files since the code is
+#   intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+#Pipfile.lock
+
+# poetry
+#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+#   This is especially recommended for binary packages to ensure reproducibility, and is more
+#   commonly ignored for libraries.
+#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+
+# pdm
+#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+#pdm.lock
+#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+#   in version control.
+#   https://pdm.fming.dev/#use-with-ide
+.pdm.toml
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# PyCharm
+#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+#  and can be added to the global gitignore or merged into this file.  For a more nuclear
+#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
+#.idea/
+
+.vscode
+
+checkpoints/
+wandb/
--- a/.pre-commit-config-check.yaml
+++ b/.pre-commit-config-check.yaml
+repos:
+  - repo: https://github.com/psf/black
+    rev: 22.12.0
+    hooks:
+      - id: black
+        language_version: python3
+        args:
+          - --line-length=119
+          - --check
+  - repo: https://github.com/charliermarsh/ruff-pre-commit
+    # Ruff version.
+    rev: 'v0.0.271'
+    hooks:
+      - id: ruff
+        args:
+          - --no-fix
+  - repo: local
+    hooks:
+      - id: pylint-nanotron
+        name: pylint nanotron core
+        entry: pylint --init-hook='import sys; sys.path.append(".")'
+        exclude: ^examples/.*$ # ignore examples as for each example we need to go in and look
+        language: system
+        types: [ python ]
+        args:
+          - --errors-only
+      - id: pylint-example-dataloading
+        name: pylint example dataloading
+        entry: pylint --init-hook='import sys; sys.path.append(".")'
+        files: ^examples/dataloading/.*$
+        language: system
+        types: [ python ]
+        args:
+          - --errors-only
+      - id: pylint-example-gpt2-mqa
+        name: pylint example gpt2_mqa
+        entry: pylint --init-hook='import sys; sys.path.append(".")'
+        files: ^examples/gpt2_mqa/.*$
+        language: system
+        types: [ python ]
+        args:
+          - --errors-only
+      - id: pylint-example-gpt2
+        name: pylint example gpt2
+        entry: pylint --init-hook='import sys; sys.path.append(".")'
+        files: ^examples/gpt2/.*$
+        language: system
+        types: [ python ]
+        args:
+          - --errors-only
+      - id: pylint-example-llama
+        name: pylint example llama
+        entry: pylint --init-hook='import sys; sys.path.append(".")'
+        files: ^examples/llama/.*$
+        language: system
+        types: [ python ]
+        args:
+          - --errors-only
+      - id: pylint-example-p2p
+        name: pylint example p2p
+        entry: pylint --init-hook='import sys; sys.path.append(".")'
+        files: ^examples/p2p/.*$
+        language: system
+        types: [ python ]
+        args:
+          - --errors-only
+      - id: pylint-example-t5
+        name: pylint example t5
+        entry: pylint --init-hook='import sys; sys.path.append(".")'
+        files: ^examples/t5/.*$
+        language: system
+        types: [ python ]
+        args:
+          - --errors-only
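The local pylint hooks above are split by path: the core hook excludes everything under `examples/`, while each example hook opts in through a `files` pattern. The sketch below illustrates that selection logic in plain Python. It assumes the patterns are applied with `re.search` against repository-relative paths, and it leaves out the `types: [ python ]` filter. The helper `selected` and the sample paths are hypothetical and exist only for this illustration.

```python
import re

# Patterns copied from the hook definitions above.
CORE_EXCLUDE = re.compile(r"^examples/.*$")
LLAMA_FILES = re.compile(r"^examples/llama/.*$")


def selected(path, files=None, exclude=None):
    """Return True if a hook with these patterns would receive `path`.

    Hypothetical helper: a path must match `files` (when given) and must
    not match `exclude` (when given).
    """
    if files is not None and not files.search(path):
        return False
    if exclude is not None and exclude.search(path):
        return False
    return True


# Hypothetical paths, used only for illustration.
paths = [
    "src/nanotron/trainer.py",
    "examples/llama/train.py",
    "examples/t5/model.py",
]

core = [p for p in paths if selected(p, exclude=CORE_EXCLUDE)]
llama = [p for p in paths if selected(p, files=LLAMA_FILES)]

print("core hook:", core)
print("llama hook:", llama)
```

Under these assumptions every path is linted by at most one hook, which is why each example directory needs its own entry.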
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
+repos:
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v4.4.0
+    hooks:
+      - id: trailing-whitespace
+      - id: end-of-file-fixer
+  - repo: https://github.com/psf/black
+    rev: 22.12.0
+    hooks:
+      - id: black
+        language_version: python3
+        args:
+          - --line-length=119
+  - repo: https://github.com/charliermarsh/ruff-pre-commit
+    # Ruff version.
+    rev: 'v0.0.271'
+    hooks:
+      - id: ruff
+        args:
+          - --fix
+          - --exit-non-zero-on-fix
+  # - repo: https://github.com/PyCQA/isort
+  #   rev: 5.12.0
+  #   hooks:
+  #     - id: isort
+  #       args:
+  #         - --profile=black
+  #         - --skip-glob=wandb/**/*
+  #         - --thirdparty=wandb
+  - repo: https://github.com/codespell-project/codespell
+    rev: v2.1.0
+    hooks:
+      - id: codespell
+        args:
+          - -w
+          - --ignore-words-list=nd,reacher,thist,ths,magent,ba,fo,doesnt
--- a/.pylintrc
+++ b/.pylintrc
+[MASTER]
+# Use multiprocessing for pylint
+jobs=0
+
+# List of members which are set dynamically and missed by Pylint inference
+# system, and so shouldn't trigger E1101 when accessed.
+ignore-paths=
+
+load-plugins=linter.pylint.ban_rank,
+
+[MESSAGES CONTROL]
+# Disable list of rules
+disable=
+    no-member,                  # E1101: Module 'torch' has no 'allclose' member (no-member)
+    no-name-in-module,          # E0611: No name 'HFTensorBoardLogger' in module 'huggingface_hub' (no-name-in-module)
+    import-error,               # E0401: Unable to import 'tensorboardX' (import-error)
+    relative-beyond-top-level   # E0402: Attempted relative import beyond top-level package (relative-beyond-top-level)
--- a/CODE_OF_CONDUCT.md
+++ b/CODE_OF_CONDUCT.md
+# Contributor Covenant Code of Conduct
+
+## Our Pledge
+
+We as members, contributors, and leaders pledge to make participation in our
+community a harassment-free experience for everyone, regardless of age, body
+size, visible or invisible disability, ethnicity, sex characteristics, gender
+identity and expression, level of experience, education, socio-economic status,
+nationality, personal appearance, race, religion, or sexual identity
+and orientation.
+
+We pledge to act and interact in ways that contribute to an open, welcoming,
+diverse, inclusive, and healthy community.
+
+## Our Standards
+
+Examples of behavior that contributes to a positive environment for our
+community include:
+
+* Demonstrating empathy and kindness toward other people
+* Being respectful of differing opinions, viewpoints, and experiences
+* Giving and gracefully accepting constructive feedback
+* Accepting responsibility and apologizing to those affected by our mistakes,
+  and learning from the experience
+* Focusing on what is best not just for us as individuals, but for the
+  overall community
+
+Examples of unacceptable behavior include:
+
+* The use of sexualized language or imagery, and sexual attention or
+  advances of any kind
+* Trolling, insulting or derogatory comments, and personal or political attacks
+* Public or private harassment
+* Publishing others' private information, such as a physical or email
+  address, without their explicit permission
+* Other conduct which could reasonably be considered inappropriate in a
+  professional setting
+
+## Enforcement Responsibilities
+
+Community leaders are responsible for clarifying and enforcing our standards of
+acceptable behavior and will take appropriate and fair corrective action in
+response to any behavior that they deem inappropriate, threatening, offensive,
+or harmful.
+
+Community leaders have the right and responsibility to remove, edit, or reject
+comments, commits, code, wiki edits, issues, and other contributions that are
+not aligned to this Code of Conduct, and will communicate reasons for moderation
+decisions when appropriate.
+
+## Scope
+
+This Code of Conduct applies within all community spaces, and also applies when
+an individual is officially representing the community in public spaces.
+Examples of representing our community include using an official e-mail address,
+posting via an official social media account, or acting as an appointed
+representative at an online or offline event.
+
+## Enforcement
+
+Instances of abusive, harassing, or otherwise unacceptable behavior may be
+reported to the community leaders responsible for enforcement at
+feedback@huggingface.co.
+All complaints will be reviewed and investigated promptly and fairly.
+
+All community leaders are obligated to respect the privacy and security of the
+reporter of any incident.
+
+## Enforcement Guidelines
+
+Community leaders will follow these Community Impact Guidelines in determining
+the consequences for any action they deem in violation of this Code of Conduct:
+
+### 1. Correction
+
+**Community Impact**: Use of inappropriate language or other behavior deemed
+unprofessional or unwelcome in the community.
+
+**Consequence**: A private, written warning from community leaders, providing
+clarity around the nature of the violation and an explanation of why the
+behavior was inappropriate. A public apology may be requested.
+
+### 2. Warning
+
+**Community Impact**: A violation through a single incident or series
+of actions.
+
+**Consequence**: A warning with consequences for continued behavior. No
+interaction with the people involved, including unsolicited interaction with
+those enforcing the Code of Conduct, for a specified period of time. This
+includes avoiding interactions in community spaces as well as external channels
+like social media. Violating these terms may lead to a temporary or
+permanent ban.
+
+### 3. Temporary Ban
+
+**Community Impact**: A serious violation of community standards, including
+sustained inappropriate behavior.
+
+**Consequence**: A temporary ban from any sort of interaction or public
+communication with the community for a specified period of time. No public or
+private interaction with the people involved, including unsolicited interaction
+with those enforcing the Code of Conduct, is allowed during this period.
+Violating these terms may lead to a permanent ban.
+
+### 4. Permanent Ban
+
+**Community Impact**: Demonstrating a pattern of violation of community
+standards, including sustained inappropriate behavior, harassment of an
+individual, or aggression toward or disparagement of classes of individuals.
+
+**Consequence**: A permanent ban from any sort of public interaction within
+the community.
+
+## Attribution
+
+This Code of Conduct is adapted from the [Contributor Covenant][homepage],
+version 2.0, available at
+https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.
+
+Community Impact Guidelines were inspired by [Mozilla's code of conduct
+enforcement ladder](https://github.com/mozilla/diversity).
+
+[homepage]: https://www.contributor-covenant.org
+
+For answers to common questions about this code of conduct, see the FAQ at
+https://www.contributor-covenant.org/faq. Translations are available at
+https://www.contributor-covenant.org/translations.
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
+<!---
+Copyright 2022 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# How to contribute to 🤗 Nanotron?
+
+Everyone is welcome to contribute, and we value everybody's contribution. Code
+is thus not the only way to help the community. Answering questions, helping
+others, reaching out and improving the documentation are immensely valuable to
+the community.
+
+It also helps us if you spread the word: reference the library from blog posts
+on the awesome projects it made possible, shout out on Twitter every time it has
+helped you, or simply star the repo to say "thank you".
+
+Whichever way you choose to contribute, please be mindful to respect our
+[code of conduct](CODE_OF_CONDUCT.md).
+
+## You can contribute in so many ways!
+
+Some of the ways you can contribute to nanotron:
+* Fixing outstanding issues with the existing code;
+* Contributing to the examples or to the documentation;
+* Submitting issues related to bugs or desired new features.
+
+## Submitting a new issue or feature request
+
+Do your best to follow these guidelines when submitting an issue or a feature
+request. It will make it easier for us to come back to you quickly and with good
+feedback.
+
+### Did you find a bug?
+
+The 🤗 Nanotron library is robust and reliable thanks to the users who notify us of
+the problems they encounter. So thank you for reporting an issue.
+
+First, we would really appreciate it if you could **make sure the bug was not
+already reported** (use the search bar on GitHub under Issues).
+
+Did not find it? :( So we can act quickly on it, please follow these steps:
+
+* Include your **OS type and version**, and the versions of **Python** and **PyTorch**;
+* Provide a short, self-contained code snippet that allows us to reproduce the bug in
+  less than 30s;
+* Provide the Nanotron configuration used for the run;
+* Describe the expected behavior and the actual behavior.
+
+### Do you want a new feature?
+
+A good feature request addresses the following points:
+
+1. Motivation first:
+* Is it related to a problem/frustration with the library? If so, please explain
+  why. Providing a code snippet that demonstrates the problem is best.
+* Is it related to something you would need for a project? We'd love to hear
+  about it!
+* Is it something you worked on and think could benefit the community?
+  Awesome! Tell us what problem it solved for you.
+2. Write a *full paragraph* describing the feature;
+3. Provide a **code snippet** that demonstrates its future use;
+4. In case this is related to a paper, please attach a link;
+5. Attach any additional information (drawings, screenshots, etc.) you think may help.
+
+If your issue is well written we're already 80% of the way there by the time you
+post it.
+
+## Submitting a pull request (PR)
+
+Before writing code, we strongly advise you to search through the existing PRs or
+issues to make sure that nobody is already working on the same thing. If you are
+unsure, it is always a good idea to open an issue to get some feedback.
+
+You will need basic `git` proficiency to be able to contribute to
+🤗 Nanotron. `git` is not the easiest tool to use but it has the greatest
+manual. Type `git --help` in a shell and enjoy. If you prefer books, [Pro
+Git](https://git-scm.com/book/en/v2) is a very good reference.
+
+Follow these steps to start contributing:
+
+1. Fork the [repository](https://github.com/huggingface/nanotron) by
+   clicking on the 'Fork' button on the repository's page. This creates a copy of the code
+   under your GitHub user account.
+
+2. Clone your fork to your local disk, and add the base repository as a remote. The following command
+   assumes you have your public SSH key uploaded to GitHub. See the following guide for more
+   [information](https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository).
+
+   ```bash
+   $ git clone git@github.com:<your Github handle>/nanotron.git
+   $ cd nanotron
+   $ git remote add upstream https://github.com/huggingface/nanotron.git
+   ```
+
+3. Create a new branch to hold your development changes, and do this for every new PR you work on.
+
+   Start by synchronizing your `main` branch with the `upstream/main` branch (more details in the [GitHub Docs](https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/syncing-a-fork)):
+
+   ```bash
+   $ git checkout main
+   $ git fetch upstream
+   $ git merge upstream/main
+   ```
+
+   Once your `main` branch is synchronized, create a new branch from it:
+
+   ```bash
+   $ git checkout -b a-descriptive-name-for-my-changes
+   ```
+
+   **Do not** work on the `main` branch.
+
+4. Set up a development environment by running the following command in a conda or a virtual environment you've created for working on this library:
+
+   ```bash
+   $ pip install -e ".[dev]"
+   $ pip install -e ".[test]"
+   $ pre-commit install
+   ```
+
+   (If nanotron was already installed in the virtual environment, remove
+   it with `pip uninstall nanotron` before reinstalling it in editable
+   mode with the `-e` flag.)
+
+   Alternatively, if you are using [Visual Studio Code](https://code.visualstudio.com/Download), the fastest way to get set up is by using
+   the provided Dev Container. Documentation on how to get started with dev containers is available [here](https://code.visualstudio.com/docs/remote/containers).
+
+5. Develop the features on your branch.
+
+   As you work on the features, you should make sure that the test suite
+   passes. You should run the tests impacted by your changes like this:
+
+   ```bash
+   $ pytest tests/<TEST_TO_RUN>.py
+   ```
+
+   `nanotron` relies on `pre-commit` hooks (`black`, `ruff`, and `codespell`) to keep
+   its source code consistently formatted. After you make changes, apply the automatic
+   style corrections and code checks with:
+
+   ```bash
+   $ pre-commit run --all-files
+   ```
+
+   Once you're happy with your changes, add changed files using `git add` and
+   make a commit with `git commit` to record your changes locally:
+
+   ```bash
+   $ git add modified_file.py
+   $ git commit
+   ```
+
+   Please write [good commit messages](https://chris.beams.io/posts/git-commit/).
+
+   It is a good idea to sync your copy of the code with the original
+   repository regularly. This way you can quickly account for changes:
+
+   ```bash
+   $ git fetch upstream
+   $ git rebase upstream/main
+   ```
+
+   Push the changes to your account using:
+
+   ```bash
+   $ git push -u origin a-descriptive-name-for-my-changes
+   ```
+
+6. Once you are satisfied (**and the checklist below is happy too**), go to the
+   webpage of your fork on GitHub. Click on 'Pull request' to send your changes
+   to the project maintainers for review.
+
+7. It's ok if maintainers ask you for changes. It happens to core contributors
+   too! So everyone can see the changes in the Pull request, work in your local
+   branch and push the changes to your fork. They will automatically appear in
+   the pull request.
+
+
+### Checklist
+
+1. The title of your pull request should be a summary of its contribution;
+2. If your pull request addresses an issue, please mention the issue number in
+   the pull request description to make sure they are linked (and people
+   consulting the issue know you are working on it);
+3. To indicate a work in progress please prefix the title with `[WIP]`, or mark
+   the PR as a draft PR. These are useful to avoid duplicated work, and to differentiate
+   it from PRs ready to be merged;
+4. Make sure existing tests pass;
+5. Add high-coverage tests. No quality testing = no merge.
+
+See an example of a good PR here: https://github.com/huggingface/nanotron/pull/155
+
+### Tests
+
+An extensive test suite is included to test the library behavior and several examples. Library tests can be found in
+the [tests folder](https://github.com/huggingface/nanotron/tree/main/tests).
+
+We use `pytest` in order to run the tests. From the root of the
+repository, here's how to run tests with `pytest` for the library:
+
+```bash
+# Run all tests across 12 parallel workers (the `-n` option comes from pytest-xdist)
+$ pytest -n 12 tests
+```
+
+You can specify a smaller set of tests in order to test only the feature
+you're working on.
--- a/HuggingFaceTB/cosmo2-tokenizer/README.md
+++ b/HuggingFaceTB/cosmo2-tokenizer/README.md
+---
+library_name: transformers
+datasets:
+- HuggingFaceTB/cosmo2_training_data_subset_1M
+---
+
+# cosmo2-tokenizer
+ Tokenizer for the training of cosmo2. This tokenizer was trained on 1M samples from:
+ - FineWeb-Edu 70%
+ - Cosmopedia v2 15%
+ - StarCoderData  8%
+ - OpenWebMath 5%
+ - StackOverFlow 2%
\ No newline at end of file
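To make the mixture above concrete, the sketch below converts the listed percentages into per-source sample counts. It assumes the percentages are fractions of the 1M training samples (the README does not say whether they are measured in samples or tokens), and `samples_per_source` is a hypothetical helper, not part of the repository.

```python
# Mixture percentages as listed in the README above.
MIX = {
    "FineWeb-Edu": 70,
    "Cosmopedia v2": 15,
    "StarCoderData": 8,
    "OpenWebMath": 5,
    "StackOverFlow": 2,
}
TOTAL_SAMPLES = 1_000_000  # "1M samples"


def samples_per_source(mix, total):
    """Split `total` samples according to integer percentages in `mix`.

    Assumption: percentages are shares of the sample count.
    """
    if sum(mix.values()) != 100:
        raise ValueError("percentages must sum to 100")
    return {name: total * pct // 100 for name, pct in mix.items()}


counts = samples_per_source(MIX, TOTAL_SAMPLES)
for name, n in counts.items():
    print(f"{name}: {n:,}")
```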
--- a/HuggingFaceTB/cosmo2-tokenizer/merges.txt
+++ b/HuggingFaceTB/cosmo2-tokenizer/merges.txt
--- a/HuggingFaceTB/cosmo2-tokenizer/special_tokens_map.json
+++ b/HuggingFaceTB/cosmo2-tokenizer/special_tokens_map.json
+{
+  "additional_special_tokens": [
+    "<|endoftext|>",
+    "<|im_start|>",
+    "<|im_end|>",
+    "<repo_name>",
+    "<reponame>",
+    "<file_sep>",
+    "<filename>",
+    "<gh_stars>",
+    "<issue_start>",
+    "<issue_comment>",
+    "<issue_closed>",
+    "<jupyter_start>",
+    "<jupyter_text>",
+    "<jupyter_code>",
+    "<jupyter_output>",
+    "<jupyter_script>",
+    "<empty_output>"
+  ],
+  "bos_token": "<|endoftext|>",
+  "eos_token": "<|endoftext|>",
+  "unk_token": "<|endoftext|>"
+}
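The map above reuses `<|endoftext|>` for the BOS, EOS, and UNK roles and lists 17 additional special tokens. The following sketch is one way to sanity-check such a file using only the standard library. The JSON is inlined from the diff rather than read from disk, and the checks themselves are illustrative assumptions about what is worth verifying, not a validation the repository performs.

```python
import json

# Content inlined from special_tokens_map.json as shown above.
SPECIAL_TOKENS_MAP = json.loads("""
{
  "additional_special_tokens": [
    "<|endoftext|>", "<|im_start|>", "<|im_end|>", "<repo_name>",
    "<reponame>", "<file_sep>", "<filename>", "<gh_stars>",
    "<issue_start>", "<issue_comment>", "<issue_closed>",
    "<jupyter_start>", "<jupyter_text>", "<jupyter_code>",
    "<jupyter_output>", "<jupyter_script>", "<empty_output>"
  ],
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "unk_token": "<|endoftext|>"
}
""")

extra = SPECIAL_TOKENS_MAP["additional_special_tokens"]
roles = {key: SPECIAL_TOKENS_MAP[key] for key in ("bos_token", "eos_token", "unk_token")}

# No token is listed twice.
has_duplicates = len(extra) != len(set(extra))
# BOS, EOS, and UNK all map to one shared token.
shared_role_token = set(roles.values())
# Every role token also appears in the additional list.
roles_are_listed = all(token in extra for token in roles.values())

print(len(extra), has_duplicates, shared_role_token, roles_are_listed)
```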
--- a/HuggingFaceTB/cosmo2-tokenizer/tokenizer.json
+++ b/HuggingFaceTB/cosmo2-tokenizer/tokenizer.json
--- a/HuggingFaceTB/cosmo2-tokenizer/tokenizer_config.json
+++ b/HuggingFaceTB/cosmo2-tokenizer/tokenizer_config.json
+{
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<repo_name>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "<reponame>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "5": {
+      "content": "<file_sep>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "6": {
+      "content": "<filename>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "7": {
+      "content": "<gh_stars>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "8": {
+      "content": "<issue_start>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "9": {
+      "content": "<issue_comment>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "10": {
+      "content": "<issue_closed>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "11": {
+      "content": "<jupyter_start>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "12": {
+      "content": "<jupyter_text>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "13": {
+      "content": "<jupyter_code>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "14": {
+      "content": "<jupyter_output>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "15": {
+      "content": "<jupyter_script>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "16": {
+      "content": "<empty_output>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [
+    "<|endoftext|>",
+    "<|im_start|>",
+    "<|im_end|>",
+    "<repo_name>",
+    "<reponame>",
+    "<file_sep>",
+    "<filename>",
+    "<gh_stars>",
+    "<issue_start>",
+    "<issue_comment>",
+    "<issue_closed>",
+    "<jupyter_start>",
+    "<jupyter_text>",
+    "<jupyter_code>",
+    "<jupyter_output>",
+    "<jupyter_script>",
+    "<empty_output>"
+  ],
+  "bos_token": "<|endoftext|>",
+  "chat_template": "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|endoftext|>",
+  "model_max_length": 1000000000000000019884624838656,
+  "tokenizer_class": "GPT2Tokenizer",
+  "unk_token": "<|endoftext|>",
+  "vocab_size": 49152
+}
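The `chat_template` entry above wraps each message in `<|im_start|>` and `<|im_end|>` markers and optionally opens an assistant turn. The sketch below reproduces that logic in plain Python so the resulting string is easy to inspect. `render_chat` is a hypothetical stand-in written for this illustration. In practice the template is a Jinja string that the `transformers` library would evaluate, for example through `tokenizer.apply_chat_template`.

```python
def render_chat(messages, add_generation_prompt=False):
    """Mimic the chat_template above without Jinja (illustrative only)."""
    out = ""
    for message in messages:
        # One block per message: start marker, role, newline, content, end marker.
        out += "<|im_start|>" + message["role"] + "\n" + message["content"] + "<|im_end|>" + "\n"
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        out += "<|im_start|>assistant\n"
    return out


# Hypothetical conversation, used only for illustration.
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi"},
]

prompt = render_chat(conversation, add_generation_prompt=True)
print(prompt)
```

Because the template never emits `bos_token` or `eos_token` itself, any such tokens would have to come from the tokenizer or the training pipeline.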
--- a/HuggingFaceTB/cosmo2-tokenizer/vocab.json
+++ b/HuggingFaceTB/cosmo2-tokenizer/vocab.json
--- a/LICENSE
+++ b/LICENSE
+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+   1. Definitions.
+
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+
+   END OF TERMS AND CONDITIONS
+
+   APPENDIX: How to apply the Apache License to your work.
+
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "[]"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+
+   Copyright [yyyy] [name of copyright owner]
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
--- a/Llama-3.2-3B/README.md
+++ b/Llama-3.2-3B/README.md
--- a/Makefile
+++ b/Makefile
+# Run nanotron's tests and the examples' tests
+test:
+	pytest \
+        --color=yes \
+        --durations=0 \
+        --ignore tests/fp8 \
+        --verbose \
+        tests/
+
+	pip install -r examples/doremi/requirements.txt
+	pytest \
+        --color=yes \
+        --durations=0 \
+        --ignore tests/fp8 \
+        --verbose \
+        examples/doremi/tests/
+
+	pip install -r examples/llama/requirements.txt
+	pytest \
+        --color=yes \
+        --verbose \
+        examples/llama/tests/
--- a/Meta-Llama-3.1-8B/README.md
+++ b/Meta-Llama-3.1-8B/README.md
--- a/README.md
+++ b/README.md
+# Llama
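+A fully open-source project for pretraining large language models. It can pretrain LLMs that outperform qwen2.5 and llama3, and it is the kind of training code used at some major AI companies.
+
+Most current SOTA NLP large-model algorithms closely resemble Llama, so Llama is a suitable blueprint for algorithm R&D. This project is the first to fully open-source LLM algorithm code from datasets through pretraining to tuning, to help algorithm researchers worldwide work together and advance human civilization.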
+<div align=center>
+    <img src="./doc/llm.png"/>
+</div>
+
+## Paper
+`LLaMA: Open and Efficient Foundation Language Models`
+- https://arxiv.org/pdf/2302.13971
+
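+## Model architecture
+The Llama family uses a minimal decoder-only architecture derived from the basic transformer. The main body is attention (QKV dot-product self-attention) plus ffn (fully connected layers), followed by a final softmax that converts the output into probabilities. To normalize the data distribution and help training converge, an RMSNorm is added before the attention, the ffn, and the softmax.
+
+Overall, the Llama-family architectures are highly similar. llama1 builds on GPT: it introduces rotary position embeddings (rotation matrices) to address the complexity of earlier absolute position encodings, and RMSNorm to reduce the computational cost of LayerNorm. llama2 adds GQA on top of llama1 to reduce computation further. llama3 builds on llama2 with distillation, pruning, quantization, and similar techniques to reduce computation further still. The other modules in the model (such as flash-attn2 and the KV cache) only improve training and inference efficiency. This project is compatible with the Llama family; a simplified diagram and a detailed diagram of Llama follow to help readers understand it from all angles.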
+<div align=center>
+    <img src="./doc/llama3.png"/>
+</div>
+
+<div align=center>
+    <img src="./doc/llama3_detail.png"/>
+</div>
+
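+For the original llama3 from Facebook, see the code at [`Llama3`](https://github.com/meta-llama/llama3/blob/main/llama/model.py). The llama architecture in this project is slightly modified in layers such as the ffn; the other differences are only in model-size parameters and implementation. Readers who need the unmodified llama3 can make the changes themselves.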
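The RMSNorm mentioned in the architecture description can be illustrated with a minimal sketch in plain Python. It assumes the common formulation (divide by the root mean square of the vector, then multiply by a learned gain); the function name, the `eps` value, and the sample vector are illustrative assumptions, not code from this repository.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a per-element gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


# Hypothetical hidden vector with a unit gain (assumption for illustration).
hidden = [1.0, -2.0, 3.0, -4.0]
gain = [1.0] * len(hidden)
normed = rms_norm(hidden, gain)

# With a unit gain, the mean of squares of the output is close to 1.
mean_sq_after = sum(v * v for v in normed) / len(normed)
print(mean_sq_after)
```

Unlike LayerNorm, this form does not subtract the mean, which is where the saving in computation comes from.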
+
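+## Algorithm
+The whole Llama algorithm reflects the idea that the greatest truths are the simplest.
+
+The method relies on plain matrix computation. llama embeds the input (converting a sentence into a numeric matrix according to the vocabulary and each word's position and attributes), passes it through attention + ffn layers to extract features, and finally applies Softmax to turn the unnormalized score vector (logits) produced by the last decoder layer into a probability distribution in which each element is the probability of generating the corresponding token. The model can therefore produce a distribution and pick the most likely token as its prediction; predicting one token at a time gives the dialogue generation that users see.
+
+For the loss function, the simple and convenient CE (cross entropy) loss is sufficient.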
+<div align=center>
+    <img src="./doc/algorithm.png"/>
+</div>
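The softmax and cross-entropy steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the three-element `logits` vector and the target index are made-up values, and the function names are hypothetical rather than taken from this project.

```python
import math


def softmax(logits):
    """Convert unnormalized scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct token."""
    probs = softmax(logits)
    return -math.log(probs[target_index])


# Hypothetical logits over a 3-token vocabulary (assumption for illustration).
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)

# Greedy decoding picks the most likely token.
next_token = max(range(len(probs)), key=probs.__getitem__)

# Loss when token 0 is the correct next token.
loss = cross_entropy(logits, target_index=0)
print(next_token, loss)
```

The loss shrinks toward zero as the probability assigned to the correct token approaches 1, which is what pretraining optimizes over every position in the sequence.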
+
+## Environment setup
+```
+mv nanotron_pytorch nanotron # drop the framework-name suffix
+```
+
+### Docker (Method 1)
+```
+docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-py3.10-dtk24.04.3-ubuntu20.04
+# Replace <your IMAGE ID> with the ID of the docker image pulled above; for this image it is b272aae8ec72
+docker run -it --shm-size=64G -v $PWD/nanotron:/home/nanotron -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name llama <your IMAGE ID> bash
+cd /home/nanotron
+pip install -r requirements.txt
+pip install -e . # install the nanotron==0.4 library
+pip install whl/rotary_emb-0.1.0+das.opt2.dtk24043-cp310-cp310-manylinux_2_28_x86_64.whl # install rotary_emb==0.1.0
+```
+### Dockerfile (Method 2)
+```
+cd /home/nanotron/docker
+docker build --no-cache -t llama:latest .
+docker run --shm-size=64G --name llama -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v $PWD/../../nanotron:/home/nanotron -it llama bash
+# If building the environment through the Dockerfile takes too long, comment out the pip installs in it and install the python libraries after starting the container: pip install -r requirements.txt.
+cd /home/nanotron
+pip install -e . # install the nanotron==0.4 library
+pip install whl/rotary_emb-0.1.0+das.opt2.dtk24043-cp310-cp310-manylinux_2_28_x86_64.whl # install rotary_emb==0.1.0
+```
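+### Anaconda (Method 3)
+1. The special deep-learning libraries that this project needs for DCU cards can be downloaded and installed from the 光合 developer community:
+- https://developer.hpccube.com/tool/
+```
+DTK driver:dtk24.04.3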
+python:python3.10
+torch:2.3.0
+torchvision:0.18.1
+torchaudio:2.1.2
+triton:2.1.0
+flash-attn:2.6.1
+deepspeed:0.14.2
+apex:1.3.0
+xformers:0.0.25
+```
+
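+`Tips: the versions of the DCU-related tools above (dtk driver, python, torch, etc.) must correspond to one another exactly.`
+
+2. Install the other, non-special libraries according to requirements.txt
+```
+cd /home/nanotron
+pip install -r requirements.txt
+pip install -e . # install the nanotron==0.4 library
+pip install whl/rotary_emb-0.1.0+das.opt2.dtk24043-cp310-cp310-manylinux_2_28_x86_64.whl # install rotary_emb==0.1.0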
+```
+
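+## Datasets
+The experimental mini dataset [`openwebtext-10k`](./openwebtext-10k.tar.xz) is derived from openwebtext and is for experimentation only. For real training, readers can download open-source `*.parquet` datasets from HF, or convert their own datasets into this format following the official HF instructions.
+
+Before training, the dataset must be processed by a tokenizer into input tokens for the NLP model. Facebook's official implementation produces tokens with the tiktoken library, which is enough to train SOTA models: [`llama3 tokenizer`](https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py). In this project, readers can freely choose any open-source HF tokenizer and enter it in the `.yaml` file under `config`; the project then loads it automatically.
+
+`openwebtext-10k` is used for the tiny llama pretraining example, and [`fineweb-edu-dedup`](http://113.200.138.88:18080/aidatasets/argilla-warehouse/fineweb-edu-dedup-filtered.git) is used for the smollm pretraining example (smollm is an AI model developed in-house by HF). Download it through the SCNet fast download channel and rename it. The original `fineweb-edu-dedup` data (`*.parquet`) can be converted into `fineweb-edu-dedup-ds` data (`*.ds`) with the command below; for how `datatrove` produces `*.ds` data, see [`Nanosets`](./docs/nanoset.md):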
+```
+sh convert_data_to_ds.sh
+```
+```
+Workaround for a datatrove library error: Exception: Is a directory (os error 21)
+
+vim /usr/local/lib/python3.10/site-packages/datatrove/utils/tokenization.py , line 19
+modify:
+# return Tokenizer.from_file(name_or_path)
+return Tokenizer.from_file(name_or_path + "/tokenizer.json")
+```
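The patch above appends `/tokenizer.json` unconditionally, which assumes a directory is always passed in. A slightly more defensive variant can be sketched as follows; `resolve_tokenizer_file` is a hypothetical helper written for illustration and is not part of datatrove, and `some/tokenizer.json` is a placeholder path.

```python
import os
import tempfile


def resolve_tokenizer_file(name_or_path):
    """Return the tokenizer.json path, whether given that file or its parent directory."""
    if os.path.isdir(name_or_path):
        return os.path.join(name_or_path, "tokenizer.json")
    return name_or_path


# A directory is expanded to the tokenizer.json inside it.
with tempfile.TemporaryDirectory() as tokenizer_dir:
    resolved_from_dir = resolve_tokenizer_file(tokenizer_dir)
    expected = os.path.join(tokenizer_dir, "tokenizer.json")

# A path that is not a directory is returned unchanged.
resolved_from_file = resolve_tokenizer_file("some/tokenizer.json")
print(resolved_from_dir == expected, resolved_from_file)
```

The resolved path could then be handed to `Tokenizer.from_file`, so that both a directory such as `HuggingFaceTB/cosmo2-tokenizer` and a direct file path are accepted.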
+
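+The project already includes these tokenizers: [`dummy`](./robot-test/dummy-tokenizer-wordlevel) and [`cosmo2`](./HuggingFaceTB/cosmo2-tokenizer). Other tokenizers (e.g. llama3) can be downloaded by readers as needed.
+
+The complete directory structure of the pretraining data is as follows: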
+```
+/home/nanotron
+    ├── openwebtext-10k.tar.xz
+    ├── stas/openwebtext-10k
+        ├── dataset_infos.json
+        ├── openwebtext-10k.py
+        ├── process.txt
+        └── README.md
+    ├── datasets/fineweb-edu-dedup
+        ├── train-00000-of-00002.parquet
+        ├── train-00001-of-00002.parquet
+    └── datasets/fineweb-edu-dedup-ds
+        ├── 00000_unshuffled.ds
+        ├── 00000_unshuffled.ds.index
+        ├── 00000_unshuffled.ds.metadata
+        ...
+```
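+`Note:` This project is highly flexible and is suited only to researchers with a solid algorithmic background; it requires some grounding in both algorithms and code, so other users may face a learning curve. To get started, they can study the simple pretraining code llama3__scratch in the pretraining project [`allamo_pytorch`](http://developer.sourcefind.cn/codes/modelzoo/allamo_pytorch.git) hosted on 光源.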
+
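+## Training
+### Single node, multiple cards
+The defining feature of this project is that it is fully open source and fosters a free research environment: readers can freely modify and extend the algorithms and models in the project to develop their own algorithms and contribute to society. The model file to modify is [`llama`](./src/nanotron/models/llama.py). For ease of explanation, this step uses the small tiny llama model as an example:
+```
+cd /home/nanotron
+sh train.sh # For training with different numbers of cards, and for training full-scale llama3, see the notes in train.sh.
+# When prompted with Do you wish to run the custom code? [y/N], enter y.
+
+# Other features are still being optimized; contributions to optimize and extend them are welcome.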
+```
+
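+For the model parameters of Facebook's original llama3, see [`Llama-3.1-8B`](./checkpoints/Nanotron-Llama-3.1-8B/model_config.json) and [`Llama-3.2-3B`](./checkpoints/Nanotron-Llama-3.2-3B/model_config.json). These two parameter files can be obtained with the following command:
+```
+sh convert_hf_to_nanotron.sh # All base models in the Llama family can be converted
+
+# Once a model has been pretrained, it can be converted to HF-format weights for release or for further fine-tuning with other open-source libraries. For a different model, edit the corresponding parameters in this file before converting.
+# sh convert_nanotron_to_hf.sh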
+```
+
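+To help readers follow HF's official pretraining approach, the project also provides a `smollm` pretraining example; see the reference document [`pre-training`](https://github.com/huggingface/smollm/tree/main/pre-training):
+```
+sh train_smollm1_135M_demo.sh # The demo is for trial use only; study the details yourself. Readers with access to a supercomputing cluster can write their own slurm script based on launch.slurm.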
+```
+
+
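+`Tips:` After obtaining pretrained weights for your own model with this project, subsequent fine-tuning is more convenient in tools such as [`LLaMA-Factory`](https://github.com/hiyouga/LLaMA-Factory.git), [`ollama`](https://github.com/ollama/ollama.git), and [`open-instruct`](https://github.com/allenai/open-instruct.git), which open-sources everything about post-training.
+
+For more information, see the original project's [`README_origin`](./README_origin.md).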
+
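+## Inference
+```
+sh infer.sh # Uses the weights in checkpoints/10 as an example; for other weights, change the weight path following this example.
+```
+
+For more information, see the original project's [`README_origin`](./README_origin.md).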
+
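+## Result
+Because the example uses little training data, a simplified model, and a short training time, this inference result only illustrates what the output looks like, to help readers understand how to use the project. Obtained with the official example:
+
+`Input: `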
+```
+input: [CLS] the [UNK] [UNK] [UNK] is [SEP]
+```
+
+`Output:`
+```
+generation: [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP]
+```
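+The repeated tokens above are expected: the tokenizer in the example demo is too simple. Switching to a more sophisticated tokenizer yields normal output, for example pretraining with `cosmo2-tokenizer`: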
+```
+CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=4 run_train.py --config-file examples/config_tiny_llama_cosmo2tokenizer.yaml
+```
+As this shows, different tokenizers make a clear difference to training results, so a better-designed tokenizer is recommended for real training.
+
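+### Accuracy
+Accuracy on DCU matches that on GPU. Inference framework: pytorch.
+
+## Application scenarios
+### Algorithm category
+`Dialogue question answering`
+### Key application industries
+`Manufacturing, broadcast media, finance, energy, healthcare, home, education`
+## Pretrained weights
+Fast download center for pretrained weights: [SCNet AIModels](http://113.200.138.88:18080/aimodels). The pretrained weights used in this project can be downloaded through the fast download channel: [Llama-3.1-8B](http://113.200.138.88:18080/aimodels/meta-llama/Meta-Llama-3.1-8B.git) and [Llama-3.2-3B](http://113.200.138.88:18080/aimodels/meta-llama/Llama-3.2-3B.git).
+
+Hugging Face download links: [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) and [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B)
+## Source repository and issue feedback
+- http://developer.sourcefind.cn/codes/modelzoo/nanotron_pytorch.git
+## References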
+- https://github.com/huggingface/nanotron.git
+- https://github.com/meta-llama/llama3.git
+- https://github.com/hiyouga/LLaMA-Factory.git
+- https://github.com/ollama/ollama.git
+- https://github.com/allenai/open-instruct.git
+
--- a/README_origin.md
+++ b/README_origin.md
+<h1 align="center">⚡️ Nanotron</h1>
+
+<p align="center">
+    <a href="https://github.com/huggingface/nanotron/releases">
+        <img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/nanotron.svg">
+    </a>
+    <a href="https://github.com/huggingface/nanotron/blob/master/LICENSE">
+        <img alt="License" src="https://img.shields.io/github/license/huggingface/nanotron.svg?color=green">
+    </a>
+</p>
+
+<h4 align="center">
+    <p>
+        <a href="#installation">Installation</a> •
+        <a href="#quick-start">Quick Start</a> •
+        <a href="#features">Features</a> •
+        <a href="CONTRIBUTING.md">Contributing</a>
+    </p>
+</h4>
+
+<h3 align="center">
+    <a href="https://huggingface.co/nanotron"><img style="float: middle; padding: 10px 10px 10px 10px;" width="60" height="55" src="https://huggingface.co/datasets/huggingface/brand-assets/resolve/main/hf-logo.png" /></a>
+</h3>
+<h3 align="center">
+<p>Pretraining models made easy</p>
+</h3>
+
+
+Nanotron is a library for pretraining transformer models. It provides a simple and flexible API to pretrain models on custom datasets. Nanotron is designed to be easy to use, fast, and scalable. It is built with the following principles in mind:
+
+- **Simplicity**: Nanotron is designed to be easy to use. It provides a simple and flexible API to pretrain models on custom datasets.
+- **Performance**: Optimized for speed and scalability, Nanotron uses the latest techniques to train models faster and more efficiently.
+
+## Installation
+
+```bash
+# Requirements: Python>=3.10
+git clone https://github.com/huggingface/nanotron
+cd nanotron
+pip install --upgrade pip
+pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121
+pip install -e .
+
+# Install dependencies if you want to use the example scripts
+pip install datasets transformers
+pip install triton "flash-attn>=2.5.0" --no-build-isolation
+```
+> [!NOTE]
+> If you get an `undefined symbol: ncclCommRegister` error, install torch 2.1.2 instead: `pip install torch==2.1.2 --index-url https://download.pytorch.org/whl/cu121`
+
+> [!TIP]
+> We log to wandb automatically if it's installed. For that you can use `pip install wandb`. If you don't want to use wandb, you can run `wandb disabled`.
+
+## Quick Start
+### Training a tiny Llama model
+The following command will train a tiny Llama model on a single node with 8 GPUs. The model will be saved in the `checkpoints` directory as specified in the config file.
+```bash
+CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=8 run_train.py --config-file examples/config_tiny_llama.yaml
+```
+
+### Run generation from your checkpoint
+```bash
+torchrun --nproc_per_node=1 run_generate.py --ckpt-path checkpoints/10/ --tp 1 --pp 1
+# We could set a larger TP for faster generation, and a larger PP in case of very large models.
+```
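The generation command above launches one process with `--tp 1 --pp 1`. A minimal sketch of how the parallel sizes relate, assuming the usual constraint that the data-, tensor-, and pipeline-parallel sizes multiply to the number of launched processes; `data_parallel_size` is a hypothetical helper for illustration, not a nanotron API.

```python
def data_parallel_size(world_size, tp, pp):
    """Return the data-parallel size left over after tensor and pipeline parallelism."""
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp={model_parallel}"
        )
    return world_size // model_parallel


# The generation command: 1 process, tp=1, pp=1, so dp=1.
dp_generation = data_parallel_size(world_size=1, tp=1, pp=1)

# A hypothetical 8-GPU layout with tp=2 and pp=2 leaves dp=2.
dp_training = data_parallel_size(world_size=8, tp=2, pp=2)
print(dp_generation, dp_training)
```

Checking this arithmetic before launching `torchrun` can catch a mismatch between `--nproc_per_node` and the parallelism settings in the config file early.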
+
+### Custom examples
+You can find more examples in the [`/examples`](/examples) directory:
+<!-- Make a table of the examples we support -->
+| Example | Description |
+| --- | --- |
+| `custom-dataloader` | Plug a custom dataloader to nanotron |
+| `datatrove` | Use the datatrove library to load data |
+| `doremi` | Use DoReMi to speed up training |
+| `mamba` | Train an example Mamba model |
+| `moe` | Train an example Mixture-of-Experts (MoE) model |
+| `mup` | Use spectral µTransfer to scale up your model |
+| `examples/config_tiny_llama_with_s3_upload.yaml` | For automatically uploading checkpoints to S3 |
+
+We're working on adding more examples soon! Feel free to add a PR to add your own example. 🚀
+
+
+## Features
+We currently support the following features:
+- [x] 3D parallelism (DP+TP+PP)
+- [x] Expert parallelism for MoEs
+- [x] AFAB and 1F1B schedules for PP
+- [x] Explicit APIs for TP and PP, which enable easy debugging
+- [x] ZeRO-1 optimizer
+- [x] FP32 gradient accumulation
+- [x] Parameter tying/sharding
+- [x] Custom module checkpointing for large models
+- [x] Spectral µTransfer parametrization for scaling up neural networks
+- [x] Mamba example
+
+And we have on our roadmap:
+- [ ] FP8 training
+- [ ] ZeRO-3 optimizer (a.k.a. FSDP)
+- [ ] `torch.compile` support
+- [ ] Ring attention
+- [ ] Interleaved 1F1B schedule
+
+## Credits
+We would like to thank everyone working on LLMs, especially those who share their work openly and from whom we took great inspiration: Nvidia for `Megatron-LM/apex`, Microsoft for `DeepSpeed`, and HazyResearch for `flash-attn`.
--- a/convert_data_to_ds.sh
+++ b/convert_data_to_ds.sh
+python3 tools/preprocess_data.py \
+       --tokenizer-name-or-path HuggingFaceTB/cosmo2-tokenizer \
+       --output-folder datasets/fineweb-edu-dedup-ds \
+       --n-tasks 16 \
+       hf \
+       --dataset datasets/fineweb-edu-dedup
--- a/convert_hf_to_nanotron.sh
+++ b/convert_hf_to_nanotron.sh
+torchrun --nproc-per-node 1 examples/llama/convert_hf_to_nanotron.py --checkpoint_path Meta-Llama-3.1-8B --save_path checkpoints/Nanotron-Llama-3.1-8B
+# torchrun --nproc-per-node 1 examples/llama/convert_hf_to_nanotron.py --checkpoint_path Llama-3.2-3B --save_path checkpoints/Nanotron-Llama-3.2-3B