Commit 5d8eb93e authored by hugo-syn, committed by GitHub

chore: Fix multiple typos (#28574)

parent 81899778
@@ -50,7 +50,7 @@ The raw dataset contains many duplicates. We deduplicated and filtered the datas
 - fraction of alphanumeric characters < 0.25
 - containing the word "auto-generated" or similar in the first 5 lines
 - filtering with a probability of 0.7 of files with a mention of "test file" or "configuration file" or similar in the first 5 lines
-- filtering with a probability of 0.7 of files with high occurence of the keywords "test " or "config"
+- filtering with a probability of 0.7 of files with high occurrence of the keywords "test " or "config"
 - filtering with a probability of 0.7 of files without a mention of the keywords `def` , `for`, `while` and `class`
 - filtering files that use the assignment operator `=` less than 5 times
 - filtering files with ratio between number of characters and number of tokens after tokenization < 1.5 (the average ratio is 3.6)
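As a reading aid for the filter list in this hunk, here is a minimal Python sketch of a few of the listed heuristics. It is not the project's actual preprocessing code: the function names, the phrase lists standing in for "or similar", and the crude `=` count are all assumptions made for illustration. The keyword-frequency and token-ratio filters are left out because the hunk does not give their thresholds or tokenizer.

```python
import random
import re


def alphanumeric_fraction(text: str) -> float:
    """Fraction of characters in `text` that are alphanumeric."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def mentions_in_first_lines(text: str, phrases, n_lines: int = 5) -> bool:
    """True if any phrase occurs (case-insensitively) in the first `n_lines` lines."""
    head = "\n".join(text.splitlines()[:n_lines]).lower()
    return any(phrase in head for phrase in phrases)


def should_drop(text: str, rng: random.Random, drop_prob: float = 0.7) -> bool:
    """Hypothetical combination of some of the filters listed above."""
    # Deterministic filters.
    if alphanumeric_fraction(text) < 0.25:
        return True
    # The phrase list is an assumption standing in for "or similar".
    if mentions_in_first_lines(text, ("auto-generated", "autogenerated")):
        return True
    # Crude count: "==" contributes two hits, which a real filter may avoid.
    if text.count("=") < 5:
        return True

    # Probabilistic filters: a matching file is dropped with probability drop_prob.
    if mentions_in_first_lines(text, ("test file", "configuration file")):
        if rng.random() < drop_prob:
            return True
    # Assumed reading: the file mentions none of the four keywords.
    if not re.search(r"\b(def|for|while|class)\b", text):
        if rng.random() < drop_prob:
            return True
    return False
```

Passing the random generator in explicitly keeps the probabilistic filters reproducible, for example `should_drop(source, random.Random(0))`.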
......
@@ -1153,7 +1153,7 @@ In the following, we will describe how to do so using a standard console, but yo
 2. Once you've installed the google cloud sdk, you should set your account by running the following command. Make sure that `<your-email-address>` corresponds to the gmail address you used to sign up for this event.
 ```bash
-$ gcloud config set account <your-email-adress>
+$ gcloud config set account <your-email-address>
 ```
 3. Let's also make sure the correct project is set in case your email is used for multiple gcloud projects:
......
@@ -57,4 +57,4 @@ wget https://huggingface.co/datasets/vasudevgupta/natural-questions-validation/r
 python3 evaluate.py
 ```
-You can find our checkpoint on HuggingFace Hub ([see this](https://huggingface.co/vasudevgupta/flax-bigbird-natural-questions)). In case you are interested in PyTorch BigBird fine-tuning, you can refer to [this repositary](https://github.com/thevasudevgupta/bigbird).
+You can find our checkpoint on HuggingFace Hub ([see this](https://huggingface.co/vasudevgupta/flax-bigbird-natural-questions)). In case you are interested in PyTorch BigBird fine-tuning, you can refer to [this repository](https://github.com/thevasudevgupta/bigbird).
@@ -27,7 +27,7 @@ To adapt the script for other models, we need to also change the `ParitionSpec`
 TODO: Add more explantion.
-Before training, let's prepare our model first. To be able to shard the model, the sharded dimention needs to be a multiple of devices it'll be sharded on. But GPTNeo's vocab size is 50257, so we need to resize the embeddings accordingly.
+Before training, let's prepare our model first. To be able to shard the model, the sharded dimension needs to be a multiple of devices it'll be sharded on. But GPTNeo's vocab size is 50257, so we need to resize the embeddings accordingly.
 ```python
 from transformers import FlaxGPTNeoForCausalLM, GPTNeoConfig
......
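The divisibility requirement described in this hunk comes down to rounding the vocabulary size up to the next multiple of the device count. The sketch below shows only that arithmetic. The helper name and the device count of 8 are assumptions for illustration, and the elided code in the hunk may resize the embeddings differently.

```python
def round_up_to_multiple(value: int, multiple: int) -> int:
    """Return the smallest integer >= value that is divisible by multiple."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    # Ceiling division using integer arithmetic only.
    return -(-value // multiple) * multiple


vocab_size = 50257   # GPTNeo vocabulary size, as stated in the hunk
num_devices = 8      # assumption: number of devices the dimension is sharded over
padded_vocab_size = round_up_to_multiple(vocab_size, num_devices)
```

Under these assumptions the embedding matrix would need `padded_vocab_size - vocab_size` extra rows before it can be split evenly.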
@@ -95,4 +95,4 @@ python run_mlm_wwm.py \
 **Note1:** On TPU, you should the flag `--pad_to_max_length` to make sure all your batches have the same length.
-**Note2:** And if you have any questions or something goes wrong when runing this code, don't hesitate to pin @wlhgtc.
+**Note2:** And if you have any questions or something goes wrong when running this code, don't hesitate to pin @wlhgtc.