Commit 3552d0e0 authored by Julien Chaumond, committed by GitHub

[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)



* rm all model cards

* Update the .rst

@sgugger it is still not super crystal clear/streamlined so let me know if any ideas to make it simpler

* Add a rootlevel README.md with simple instructions/context

* Update docs/source/model_sharing.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* make style

* rm all model cards
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
parent 29e45979
---
language:
- sot
thumbnail: https://pbs.twimg.com/media/EVjR6BsWoAAFaq5.jpg
tags:
- sot
- fill-mask
- pytorch
- roberta
- masked-lm
license: MIT
---
# Takalani Sesame - Southern Sotho 🇿🇦
<img src="https://pbs.twimg.com/media/EVjR6BsWoAAFaq5.jpg" width="600"/>
## Model description
Takalani Sesame (named after the South African version of Sesame Street) is a project that aims to promote the use of South African languages in NLP, and in particular look at techniques for low-resource languages to equalise performance with larger languages around the world.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
# AutoModelForMaskedLM replaces the deprecated AutoModelWithLMHead for masked-LM checkpoints
tokenizer = AutoTokenizer.from_pretrained("jannesg/takalane_sot_roberta")
model = AutoModelForMaskedLM.from_pretrained("jannesg/takalane_sot_roberta")
```
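Since the card is tagged `fill-mask`, a masked-token query might look like the sketch below. It assumes the `pipeline("fill-mask")` API of a recent `transformers` release and the RoBERTa default mask token `<mask>` (check `tokenizer.mask_token` to be sure). The helper names `mask_word` and `fill_mask` are hypothetical and not part of the library.

```python
def mask_word(sentence: str, index: int, mask_token: str = "<mask>") -> str:
    """Replace the whitespace-separated word at `index` with the mask token."""
    words = sentence.split()
    if not 0 <= index < len(words):
        raise IndexError(f"word index {index} out of range for {len(words)} words")
    words[index] = mask_token
    return " ".join(words)


def fill_mask(masked_sentence: str,
              model_id: str = "jannesg/takalane_sot_roberta",
              top_k: int = 5):
    """Return (token, score) candidates; downloads the model on first use."""
    # Imported lazily so mask_word stays usable without transformers installed.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_id, tokenizer=model_id)
    predictions = unmasker(masked_sentence, top_k=top_k)
    return [(p["token_str"], p["score"]) for p in predictions]
```

Calling `fill_mask(mask_word(sentence, 2))` with a Southern Sotho sentence of your own would return the highest-scoring candidates for the third word. This requires network access to fetch the checkpoint.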
#### Limitations and bias
Updates will be added continuously to improve performance.
## Training data
Data collected from [https://wortschatz.uni-leipzig.de/en](https://wortschatz.uni-leipzig.de/en) <br/>
**Sentences:** 20000
## Training procedure
No preprocessing. Standard Huggingface hyperparameters.
## Author
Jannes Germishuys [website](http://jannesgg.github.io)
---
language:
- ss
thumbnail: https://pbs.twimg.com/media/EVjR6BsWoAAFaq5.jpg
tags:
- ss
- fill-mask
- pytorch
- roberta
- masked-lm
license: MIT
---
# Takalani Sesame - Swati 🇿🇦
<img src="https://pbs.twimg.com/media/EVjR6BsWoAAFaq5.jpg" width="600"/>
## Model description
Takalani Sesame (named after the South African version of Sesame Street) is a project that aims to promote the use of South African languages in NLP, and in particular look at techniques for low-resource languages to equalise performance with larger languages around the world.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
# AutoModelForMaskedLM replaces the deprecated AutoModelWithLMHead for masked-LM checkpoints
tokenizer = AutoTokenizer.from_pretrained("jannesg/takalane_ssw_roberta")
model = AutoModelForMaskedLM.from_pretrained("jannesg/takalane_ssw_roberta")
```
#### Limitations and bias
Updates will be added continuously to improve performance.
## Training data
Data collected from [https://wortschatz.uni-leipzig.de/en](https://wortschatz.uni-leipzig.de/en) <br/>
**Sentences:** 380
## Training procedure
No preprocessing. Standard Huggingface hyperparameters.
## Author
Jannes Germishuys [website](http://jannesgg.github.io)
---
language:
- tn
thumbnail: https://pbs.twimg.com/media/EVjR6BsWoAAFaq5.jpg
tags:
- tn
- fill-mask
- pytorch
- roberta
- masked-lm
license: MIT
---
# Takalani Sesame - Tswana 🇿🇦
<img src="https://pbs.twimg.com/media/EVjR6BsWoAAFaq5.jpg" width="600"/>
## Model description
Takalani Sesame (named after the South African version of Sesame Street) is a project that aims to promote the use of South African languages in NLP, and in particular look at techniques for low-resource languages to equalise performance with larger languages around the world.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
# AutoModelForMaskedLM replaces the deprecated AutoModelWithLMHead for masked-LM checkpoints
tokenizer = AutoTokenizer.from_pretrained("jannesg/takalane_tsn_roberta")
model = AutoModelForMaskedLM.from_pretrained("jannesg/takalane_tsn_roberta")
```
#### Limitations and bias
Updates will be added continuously to improve performance.
## Training data
Data collected from [https://wortschatz.uni-leipzig.de/en](https://wortschatz.uni-leipzig.de/en) <br/>
**Sentences:** 10000
## Training procedure
No preprocessing. Standard Huggingface hyperparameters.
## Author
Jannes Germishuys [website](http://jannesgg.github.io)
---
language:
- ts
thumbnail: https://pbs.twimg.com/media/EVjR6BsWoAAFaq5.jpg
tags:
- ts
- fill-mask
- pytorch
- roberta
- masked-lm
license: MIT
---
# Takalani Sesame - Tsonga 🇿🇦
<img src="https://pbs.twimg.com/media/EVjR6BsWoAAFaq5.jpg" width="600"/>
## Model description
Takalani Sesame (named after the South African version of Sesame Street) is a project that aims to promote the use of South African languages in NLP, and in particular look at techniques for low-resource languages to equalise performance with larger languages around the world.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
# AutoModelForMaskedLM replaces the deprecated AutoModelWithLMHead for masked-LM checkpoints
tokenizer = AutoTokenizer.from_pretrained("jannesg/takalane_tso_roberta")
model = AutoModelForMaskedLM.from_pretrained("jannesg/takalane_tso_roberta")
```
#### Limitations and bias
Updates will be added continuously to improve performance.
## Training data
Data collected from [https://wortschatz.uni-leipzig.de/en](https://wortschatz.uni-leipzig.de/en) <br/>
**Sentences:** 20000
## Training procedure
No preprocessing. Standard Huggingface hyperparameters.
## Author
Jannes Germishuys [website](http://jannesgg.github.io)
---
language:
- ven
thumbnail: https://pbs.twimg.com/media/EVjR6BsWoAAFaq5.jpg
tags:
- ven
- fill-mask
- pytorch
- roberta
- masked-lm
license: MIT
---
# Takalani Sesame - Venda 🇿🇦
<img src="https://pbs.twimg.com/media/EVjR6BsWoAAFaq5.jpg" width="600"/>
## Model description
Takalani Sesame (named after the South African version of Sesame Street) is a project that aims to promote the use of South African languages in NLP, and in particular look at techniques for low-resource languages to equalise performance with larger languages around the world.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
# AutoModelForMaskedLM replaces the deprecated AutoModelWithLMHead for masked-LM checkpoints
tokenizer = AutoTokenizer.from_pretrained("jannesg/takalane_ven_roberta")
model = AutoModelForMaskedLM.from_pretrained("jannesg/takalane_ven_roberta")
```
#### Limitations and bias
Updates will be added continuously to improve performance.
## Training data
Data collected from [https://wortschatz.uni-leipzig.de/en](https://wortschatz.uni-leipzig.de/en) <br/>
**Sentences:** 9279
## Training procedure
No preprocessing. Standard Huggingface hyperparameters.
## Author
Jannes Germishuys [website](http://jannesgg.github.io)
---
language:
- xho
thumbnail: https://pbs.twimg.com/media/EVjR6BsWoAAFaq5.jpg
tags:
- xho
- fill-mask
- pytorch
- roberta
- masked-lm
license: MIT
---
# Takalani Sesame - Xhosa 🇿🇦
<img src="https://pbs.twimg.com/media/EVjR6BsWoAAFaq5.jpg" width="600"/>
## Model description
Takalani Sesame (named after the South African version of Sesame Street) is a project that aims to promote the use of South African languages in NLP, and in particular look at techniques for low-resource languages to equalise performance with larger languages around the world.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
# AutoModelForMaskedLM replaces the deprecated AutoModelWithLMHead for masked-LM checkpoints
tokenizer = AutoTokenizer.from_pretrained("jannesg/takalane_xho_roberta")
model = AutoModelForMaskedLM.from_pretrained("jannesg/takalane_xho_roberta")
```
#### Limitations and bias
Updates will be added continuously to improve performance.
## Training data
Data collected from [https://wortschatz.uni-leipzig.de/en](https://wortschatz.uni-leipzig.de/en) <br/>
**Sentences:** 100000
## Training procedure
No preprocessing. Standard Huggingface hyperparameters.
## Author
Jannes Germishuys [website](http://jannesgg.github.io)
---
language:
- zul
thumbnail: https://pbs.twimg.com/media/EVjR6BsWoAAFaq5.jpg
tags:
- zul
- fill-mask
- pytorch
- roberta
- masked-lm
license: MIT
---
# Takalani Sesame - Zulu 🇿🇦
<img src="https://pbs.twimg.com/media/EVjR6BsWoAAFaq5.jpg" width="600"/>
## Model description
Takalani Sesame (named after the South African version of Sesame Street) is a project that aims to promote the use of South African languages in NLP, and in particular look at techniques for low-resource languages to equalise performance with larger languages around the world.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
# AutoModelForMaskedLM replaces the deprecated AutoModelWithLMHead for masked-LM checkpoints
tokenizer = AutoTokenizer.from_pretrained("jannesg/takalane_zul_roberta")
model = AutoModelForMaskedLM.from_pretrained("jannesg/takalane_zul_roberta")
```
#### Limitations and bias
Updates will be added continuously to improve performance.
## Training data
Data collected from [https://wortschatz.uni-leipzig.de/en](https://wortschatz.uni-leipzig.de/en) <br/>
**Sentences:** 410000
## Training procedure
No preprocessing. Standard Huggingface hyperparameters.
## Author
Jannes Germishuys [website](http://jannesgg.github.io)
---
language: tl
tags:
- bert
- tagalog
- filipino
license: gpl-3.0
inference: false
---
# BERT Tagalog Base Cased (Whole Word Masking)
Tagalog version of BERT trained on a large preprocessed text corpus scraped and sourced from the internet. This model is part of a larger research project. We open-source the model to allow greater usage within the Filipino NLP community. This particular version uses whole word masking.
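As a rough illustration of whole word masking, the sketch below groups WordPiece continuation pieces (those starting with `##`) with the word they belong to and masks all pieces of a selected word together. It is a simplified stand-in, not the training code used for this model: the function names, the 15% default rate, and the example tokens are assumptions, and the usual random/keep replacement scheme is omitted.

```python
import random


def group_whole_words(tokens):
    """Group token indices so that '##' continuation pieces stay with their word."""
    groups = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Mask every piece of each selected word; returns (masked_tokens, masked_indices)."""
    rng = random.Random(seed)
    masked = list(tokens)
    masked_indices = []
    for group in group_whole_words(tokens):
        # The decision is made once per word, never per piece.
        if rng.random() < mask_prob:
            for i in group:
                masked[i] = mask_token
            masked_indices.extend(group)
    return masked, masked_indices
```

With a made-up split such as `["mag", "##anda", "ang", "araw"]`, the first two pieces are either both masked or both left intact, whereas per-token masking could hide only one of them.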
## Usage
The model can be loaded and used in both PyTorch and TensorFlow through the HuggingFace Transformers package.
```python
from transformers import TFAutoModel, AutoModel, AutoTokenizer
# TensorFlow
model = TFAutoModel.from_pretrained('jcblaise/bert-tagalog-base-cased-WWM', from_pt=True)
tokenizer = AutoTokenizer.from_pretrained('jcblaise/bert-tagalog-base-cased-WWM', do_lower_case=False)
# PyTorch
model = AutoModel.from_pretrained('jcblaise/bert-tagalog-base-cased-WWM')
tokenizer = AutoTokenizer.from_pretrained('jcblaise/bert-tagalog-base-cased-WWM', do_lower_case=False)
```
Finetuning scripts and other utilities we use for our projects can be found in our centralized repository at https://github.com/jcblaisecruz02/Filipino-Text-Benchmarks
## Citations
All model details and training setups can be found in our papers. If you use our model or find it useful in your projects, please cite our work:
```
@inproceedings{localization2020cruz,
title={{Localization of Fake News Detection via Multitask Transfer Learning}},
author={Cruz, Jan Christian Blaise and Tan, Julianne Agatha and Cheng, Charibeth},
booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
pages={2589--2597},
year={2020},
url={https://www.aclweb.org/anthology/2020.lrec-1.315}
}
@article{cruz2020establishing,
title={Establishing Baselines for Text Classification in Low-Resource Languages},
author={Cruz, Jan Christian Blaise and Cheng, Charibeth},
journal={arXiv preprint arXiv:2005.02068},
year={2020}
}
@article{cruz2019evaluating,
title={Evaluating Language Model Finetuning Techniques for Low-resource Languages},
author={Cruz, Jan Christian Blaise and Cheng, Charibeth},
journal={arXiv preprint arXiv:1907.00409},
year={2019}
}
```
## Data and Other Resources
Data used to train this model as well as other benchmark datasets in Filipino can be found on my website at https://blaisecruz.com
## Contact
If you have questions, concerns, or if you just want to chat about NLP and low-resource languages in general, you may reach me through my work email at jan_christian_cruz@dlsu.edu.ph
---
language: tl
tags:
- bert
- tagalog
- filipino
license: gpl-3.0
inference: false
---
# BERT Tagalog Base Cased
Tagalog version of BERT trained on a large preprocessed text corpus scraped and sourced from the internet. This model is part of a larger research project. We open-source the model to allow greater usage within the Filipino NLP community.
## Usage
The model can be loaded and used in both PyTorch and TensorFlow through the HuggingFace Transformers package.
```python
from transformers import TFAutoModel, AutoModel, AutoTokenizer
# TensorFlow
model = TFAutoModel.from_pretrained('jcblaise/bert-tagalog-base-cased', from_pt=True)
tokenizer = AutoTokenizer.from_pretrained('jcblaise/bert-tagalog-base-cased', do_lower_case=False)
# PyTorch
model = AutoModel.from_pretrained('jcblaise/bert-tagalog-base-cased')
tokenizer = AutoTokenizer.from_pretrained('jcblaise/bert-tagalog-base-cased', do_lower_case=False)
```
Finetuning scripts and other utilities we use for our projects can be found in our centralized repository at https://github.com/jcblaisecruz02/Filipino-Text-Benchmarks
## Citations
All model details and training setups can be found in our papers. If you use our model or find it useful in your projects, please cite our work:
```
@inproceedings{localization2020cruz,
title={{Localization of Fake News Detection via Multitask Transfer Learning}},
author={Cruz, Jan Christian Blaise and Tan, Julianne Agatha and Cheng, Charibeth},
booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
pages={2589--2597},
year={2020},
url={https://www.aclweb.org/anthology/2020.lrec-1.315}
}
@article{cruz2020establishing,
title={Establishing Baselines for Text Classification in Low-Resource Languages},
author={Cruz, Jan Christian Blaise and Cheng, Charibeth},
journal={arXiv preprint arXiv:2005.02068},
year={2020}
}
@article{cruz2019evaluating,
title={Evaluating Language Model Finetuning Techniques for Low-resource Languages},
author={Cruz, Jan Christian Blaise and Cheng, Charibeth},
journal={arXiv preprint arXiv:1907.00409},
year={2019}
}
```
## Data and Other Resources
Data used to train this model as well as other benchmark datasets in Filipino can be found on my website at https://blaisecruz.com
## Contact
If you have questions, concerns, or if you just want to chat about NLP and low-resource languages in general, you may reach me through my work email at jan_christian_cruz@dlsu.edu.ph
---
language: tl
tags:
- bert
- tagalog
- filipino
license: gpl-3.0
inference: false
---
# BERT Tagalog Base Uncased (Whole Word Masking)
Tagalog version of BERT trained on a large preprocessed text corpus scraped and sourced from the internet. This model is part of a larger research project. We open-source the model to allow greater usage within the Filipino NLP community. This particular version uses whole word masking.
## Usage
The model can be loaded and used in both PyTorch and TensorFlow through the HuggingFace Transformers package.
```python
from transformers import TFAutoModel, AutoModel, AutoTokenizer
# TensorFlow
model = TFAutoModel.from_pretrained('jcblaise/bert-tagalog-base-uncased-WWM', from_pt=True)
tokenizer = AutoTokenizer.from_pretrained('jcblaise/bert-tagalog-base-uncased-WWM', do_lower_case=True)
# PyTorch
model = AutoModel.from_pretrained('jcblaise/bert-tagalog-base-uncased-WWM')
tokenizer = AutoTokenizer.from_pretrained('jcblaise/bert-tagalog-base-uncased-WWM', do_lower_case=True)
```
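The `do_lower_case=True` flag above asks the tokenizer to lowercase its input before splitting it into word pieces. As a hedged approximation of what uncased BERT preprocessing does to raw text (lowercasing plus stripping of combining accent marks), consider the sketch below. The function name `uncased_normalize` is hypothetical, and this is an illustration rather than the library's own code.

```python
import unicodedata


def uncased_normalize(text: str) -> str:
    """Lowercase text and drop combining accent marks (Unicode category Mn)."""
    lowered = text.lower()
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
```

For example, `uncased_normalize("Magandáng Umaga")` yields `"magandang umaga"`, so inputs that differ only in case or accents map to the same tokens. A cased model keeps those distinctions.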
Finetuning scripts and other utilities we use for our projects can be found in our centralized repository at https://github.com/jcblaisecruz02/Filipino-Text-Benchmarks
## Citations
All model details and training setups can be found in our papers. If you use our model or find it useful in your projects, please cite our work:
```
@inproceedings{localization2020cruz,
title={{Localization of Fake News Detection via Multitask Transfer Learning}},
author={Cruz, Jan Christian Blaise and Tan, Julianne Agatha and Cheng, Charibeth},
booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
pages={2589--2597},
year={2020},
url={https://www.aclweb.org/anthology/2020.lrec-1.315}
}
@article{cruz2020establishing,
title={Establishing Baselines for Text Classification in Low-Resource Languages},
author={Cruz, Jan Christian Blaise and Cheng, Charibeth},
journal={arXiv preprint arXiv:2005.02068},
year={2020}
}
@article{cruz2019evaluating,
title={Evaluating Language Model Finetuning Techniques for Low-resource Languages},
author={Cruz, Jan Christian Blaise and Cheng, Charibeth},
journal={arXiv preprint arXiv:1907.00409},
year={2019}
}
```
## Data and Other Resources
Data used to train this model as well as other benchmark datasets in Filipino can be found on my website at https://blaisecruz.com
## Contact
If you have questions, concerns, or if you just want to chat about NLP and low-resource languages in general, you may reach me through my work email at jan_christian_cruz@dlsu.edu.ph
---
language: tl
tags:
- bert
- tagalog
- filipino
license: gpl-3.0
inference: false
---
# BERT Tagalog Base Uncased
Tagalog version of BERT trained on a large preprocessed text corpus scraped and sourced from the internet. This model is part of a larger research project. We open-source the model to allow greater usage within the Filipino NLP community.
## Usage
The model can be loaded and used in both PyTorch and TensorFlow through the HuggingFace Transformers package.
```python
from transformers import TFAutoModel, AutoModel, AutoTokenizer
# TensorFlow
model = TFAutoModel.from_pretrained('jcblaise/bert-tagalog-base-uncased', from_pt=True)
tokenizer = AutoTokenizer.from_pretrained('jcblaise/bert-tagalog-base-uncased', do_lower_case=True)
# PyTorch
model = AutoModel.from_pretrained('jcblaise/bert-tagalog-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('jcblaise/bert-tagalog-base-uncased', do_lower_case=True)
```
Finetuning scripts and other utilities we use for our projects can be found in our centralized repository at https://github.com/jcblaisecruz02/Filipino-Text-Benchmarks
## Citations
All model details and training setups can be found in our papers. If you use our model or find it useful in your projects, please cite our work:
```
@inproceedings{localization2020cruz,
title={{Localization of Fake News Detection via Multitask Transfer Learning}},
author={Cruz, Jan Christian Blaise and Tan, Julianne Agatha and Cheng, Charibeth},
booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
pages={2589--2597},
year={2020},
url={https://www.aclweb.org/anthology/2020.lrec-1.315}
}
@article{cruz2020establishing,
title={Establishing Baselines for Text Classification in Low-Resource Languages},
author={Cruz, Jan Christian Blaise and Cheng, Charibeth},
journal={arXiv preprint arXiv:2005.02068},
year={2020}
}
@article{cruz2019evaluating,
title={Evaluating Language Model Finetuning Techniques for Low-resource Languages},
author={Cruz, Jan Christian Blaise and Cheng, Charibeth},
journal={arXiv preprint arXiv:1907.00409},
year={2019}
}
```
## Data and Other Resources
Data used to train this model as well as other benchmark datasets in Filipino can be found on my website at https://blaisecruz.com
## Contact
If you have questions, concerns, or if you just want to chat about NLP and low-resource languages in general, you may reach me through my work email at jan_christian_cruz@dlsu.edu.ph
---
language: tl
tags:
- distilbert
- bert
- tagalog
- filipino
license: gpl-3.0
inference: false
---
# DistilBERT Tagalog Base Cased
Tagalog version of DistilBERT, distilled from [`bert-tagalog-base-cased`](https://huggingface.co/jcblaise/bert-tagalog-base-cased). This model is part of a larger research project. We open-source the model to allow greater usage within the Filipino NLP community.
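This card does not spell out the distillation objective, so the following is only a generic sketch of a soft-target distillation term: the KL divergence between temperature-softened teacher and student distributions. The function names and the temperature value are assumptions. The DistilBERT paper combines a term of this kind with a masked-LM loss and a cosine embedding loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2
```

The loss is zero when the student reproduces the teacher's logits exactly, and it grows as the two distributions diverge.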
## Usage
The model can be loaded and used in both PyTorch and TensorFlow through the HuggingFace Transformers package.
```python
from transformers import TFAutoModel, AutoModel, AutoTokenizer
# TensorFlow
model = TFAutoModel.from_pretrained('jcblaise/distilbert-tagalog-base-cased', from_pt=True)
tokenizer = AutoTokenizer.from_pretrained('jcblaise/distilbert-tagalog-base-cased', do_lower_case=False)
# PyTorch
model = AutoModel.from_pretrained('jcblaise/distilbert-tagalog-base-cased')
tokenizer = AutoTokenizer.from_pretrained('jcblaise/distilbert-tagalog-base-cased', do_lower_case=False)
```
Finetuning scripts and other utilities we use for our projects can be found in our centralized repository at https://github.com/jcblaisecruz02/Filipino-Text-Benchmarks
## Citations
All model details and training setups can be found in our papers. If you use our model or find it useful in your projects, please cite our work:
```
@inproceedings{localization2020cruz,
title={{Localization of Fake News Detection via Multitask Transfer Learning}},
author={Cruz, Jan Christian Blaise and Tan, Julianne Agatha and Cheng, Charibeth},
booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
pages={2589--2597},
year={2020},
url={https://www.aclweb.org/anthology/2020.lrec-1.315}
}
@article{cruz2020establishing,
title={Establishing Baselines for Text Classification in Low-Resource Languages},
author={Cruz, Jan Christian Blaise and Cheng, Charibeth},
journal={arXiv preprint arXiv:2005.02068},
year={2020}
}
@article{cruz2019evaluating,
title={Evaluating Language Model Finetuning Techniques for Low-resource Languages},
author={Cruz, Jan Christian Blaise and Cheng, Charibeth},
journal={arXiv preprint arXiv:1907.00409},
year={2019}
}
```
## Data and Other Resources
Data used to train this model as well as other benchmark datasets in Filipino can be found on my website at https://blaisecruz.com
## Contact
If you have questions, concerns, or if you just want to chat about NLP and low-resource languages in general, you may reach me through my work email at jan_christian_cruz@dlsu.edu.ph
---
language: tl
tags:
- electra
- tagalog
- filipino
license: gpl-3.0
inference: false
---
# ELECTRA Tagalog Base Cased Discriminator
Tagalog ELECTRA model pretrained with a large corpus scraped from the internet. This model is part of a larger research project. We open-source the model to allow greater usage within the Filipino NLP community.
This is the discriminator model, which is the main Transformer used for finetuning to downstream tasks. For generation, mask-filling, and retraining, refer to the Generator models.
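To make the discriminator's pretraining task concrete, here is a minimal sketch of how replaced-token-detection targets can be derived. Each position is labelled 1 if the generator changed the token and 0 otherwise. The function name and the toy token ids are hypothetical, and this is not the authors' training code.

```python
def replaced_token_labels(original_ids, corrupted_ids):
    """Return 1 where the corrupted token differs from the original, else 0."""
    if len(original_ids) != len(corrupted_ids):
        raise ValueError("sequences must have the same length")
    # A sampled token that happens to equal the original counts as "original".
    return [int(o != c) for o, c in zip(original_ids, corrupted_ids)]
```

For instance, `replaced_token_labels([5, 8, 13, 21], [5, 9, 13, 21])` gives `[0, 1, 0, 0]`. The discriminator is trained to predict such a label for every input position.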
## Usage
The model can be loaded and used in both PyTorch and TensorFlow through the HuggingFace Transformers package.
```python
from transformers import TFAutoModel, AutoModel, AutoTokenizer
# TensorFlow
model = TFAutoModel.from_pretrained('jcblaise/electra-tagalog-base-cased-discriminator', from_pt=True)
tokenizer = AutoTokenizer.from_pretrained('jcblaise/electra-tagalog-base-cased-discriminator', do_lower_case=False)
# PyTorch
model = AutoModel.from_pretrained('jcblaise/electra-tagalog-base-cased-discriminator')
tokenizer = AutoTokenizer.from_pretrained('jcblaise/electra-tagalog-base-cased-discriminator', do_lower_case=False)
```
Finetuning scripts and other utilities we use for our projects can be found in our centralized repository at https://github.com/jcblaisecruz02/Filipino-Text-Benchmarks
## Citations
All model details and training setups can be found in our papers. If you use our model or find it useful in your projects, please cite our work:
```
@article{cruz2020investigating,
title={Investigating the True Performance of Transformers in Low-Resource Languages: A Case Study in Automatic Corpus Creation},
author={Jan Christian Blaise Cruz and Jose Kristian Resabal and James Lin and Dan John Velasco and Charibeth Cheng},
journal={arXiv preprint arXiv:2010.11574},
year={2020}
}
```
## Data and Other Resources
Data used to train this model as well as other benchmark datasets in Filipino can be found on my website at https://blaisecruz.com
## Contact
If you have questions, concerns, or if you just want to chat about NLP and low-resource languages in general, you may reach me through my work email at jan_christian_cruz@dlsu.edu.ph
---
language: tl
tags:
- electra
- tagalog
- filipino
license: gpl-3.0
inference: false
---
# ELECTRA Tagalog Base Cased Generator
Tagalog ELECTRA model pretrained with a large corpus scraped from the internet. This model is part of a larger research project. We open-source the model to allow greater usage within the Filipino NLP community.
This is the generator model used to sample synthetic text and pretrain the discriminator. Only use this model for retraining and mask-filling. For the actual model for downstream tasks, please refer to the discriminator models.
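The generator's role in pretraining can be sketched as follows: some positions are masked, the generator proposes a token for each masked position, and the resulting sequence is handed to the discriminator. In this toy version `sample_fn` is a hypothetical stand-in for sampling from the generator's output distribution. The function name, `mask_id`, and the 15% default rate are likewise assumptions.

```python
import random


def corrupt_with_generator(token_ids, sample_fn, mask_id, mask_prob=0.15, seed=None):
    """Mask random positions and fill them with tokens proposed by sample_fn."""
    rng = random.Random(seed)
    positions = [i for i in range(len(token_ids)) if rng.random() < mask_prob]
    chosen = set(positions)
    # The generator only ever sees the masked sequence, not the hidden tokens.
    masked = [mask_id if i in chosen else t for i, t in enumerate(token_ids)]
    corrupted = list(token_ids)
    for i in positions:
        corrupted[i] = sample_fn(masked, i)
    return corrupted, positions
```

Here `sample_fn(masked, i)` receives the masked sequence and a position and returns a replacement token id. In practice that would come from the generator's mask-filling head.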
## Usage
The model can be loaded and used in both PyTorch and TensorFlow through the HuggingFace Transformers package.
```python
from transformers import TFAutoModel, AutoModel, AutoTokenizer
# TensorFlow
model = TFAutoModel.from_pretrained('jcblaise/electra-tagalog-base-cased-generator', from_pt=True)
tokenizer = AutoTokenizer.from_pretrained('jcblaise/electra-tagalog-base-cased-generator', do_lower_case=False)
# PyTorch
model = AutoModel.from_pretrained('jcblaise/electra-tagalog-base-cased-generator')
tokenizer = AutoTokenizer.from_pretrained('jcblaise/electra-tagalog-base-cased-generator', do_lower_case=False)
```
Finetuning scripts and other utilities we use for our projects can be found in our centralized repository at https://github.com/jcblaisecruz02/Filipino-Text-Benchmarks
## Citations
All model details and training setups can be found in our papers. If you use our model or find it useful in your projects, please cite our work:
```
@article{cruz2020investigating,
title={Investigating the True Performance of Transformers in Low-Resource Languages: A Case Study in Automatic Corpus Creation},
author={Jan Christian Blaise Cruz and Jose Kristian Resabal and James Lin and Dan John Velasco and Charibeth Cheng},
journal={arXiv preprint arXiv:2010.11574},
year={2020}
}
```
## Data and Other Resources
Data used to train this model as well as other benchmark datasets in Filipino can be found on my website at https://blaisecruz.com
## Contact
If you have questions, concerns, or if you just want to chat about NLP and low-resource languages in general, you may reach me through my work email at jan_christian_cruz@dlsu.edu.ph
---
language: tl
tags:
- electra
- tagalog
- filipino
license: gpl-3.0
inference: false
---
# ELECTRA Tagalog Base Uncased Discriminator
Tagalog ELECTRA model pretrained with a large corpus scraped from the internet. This model is part of a larger research project. We open-source the model to allow greater usage within the Filipino NLP community.
This is the discriminator model, which is the main Transformer used for finetuning to downstream tasks. For generation, mask-filling, and retraining, refer to the Generator models.
## Usage
The model can be loaded and used in both PyTorch and TensorFlow through the HuggingFace Transformers package.
```python
from transformers import TFAutoModel, AutoModel, AutoTokenizer
# TensorFlow
model = TFAutoModel.from_pretrained('jcblaise/electra-tagalog-base-uncased-discriminator', from_pt=True)
tokenizer = AutoTokenizer.from_pretrained('jcblaise/electra-tagalog-base-uncased-discriminator', do_lower_case=True)
# PyTorch
model = AutoModel.from_pretrained('jcblaise/electra-tagalog-base-uncased-discriminator')
tokenizer = AutoTokenizer.from_pretrained('jcblaise/electra-tagalog-base-uncased-discriminator', do_lower_case=True)
```
Finetuning scripts and other utilities we use for our projects can be found in our centralized repository at https://github.com/jcblaisecruz02/Filipino-Text-Benchmarks
## Citations
All model details and training setups can be found in our papers. If you use our model or find it useful in your projects, please cite our work:
```
@article{cruz2020investigating,
title={Investigating the True Performance of Transformers in Low-Resource Languages: A Case Study in Automatic Corpus Creation},
author={Jan Christian Blaise Cruz and Jose Kristian Resabal and James Lin and Dan John Velasco and Charibeth Cheng},
journal={arXiv preprint arXiv:2010.11574},
year={2020}
}
```
## Data and Other Resources
Data used to train this model as well as other benchmark datasets in Filipino can be found on my website at https://blaisecruz.com
## Contact
If you have questions, concerns, or if you just want to chat about NLP and low-resource languages in general, you may reach me through my work email at jan_christian_cruz@dlsu.edu.ph
---
language: tl
tags:
- electra
- tagalog
- filipino
license: gpl-3.0
inference: false
---
# ELECTRA Tagalog Base Uncased Generator
Tagalog ELECTRA model pretrained with a large corpus scraped from the internet. This model is part of a larger research project. We open-source the model to allow greater usage within the Filipino NLP community.
This is the generator model used to sample synthetic text and pretrain the discriminator. Only use this model for retraining and mask-filling. For the actual model for downstream tasks, please refer to the discriminator models.
## Usage
The model can be loaded and used in both PyTorch and TensorFlow through the HuggingFace Transformers package.
```python
from transformers import TFAutoModel, AutoModel, AutoTokenizer
# TensorFlow
model = TFAutoModel.from_pretrained('jcblaise/electra-tagalog-base-uncased-generator', from_pt=True)
tokenizer = AutoTokenizer.from_pretrained('jcblaise/electra-tagalog-base-uncased-generator', do_lower_case=True)
# PyTorch
model = AutoModel.from_pretrained('jcblaise/electra-tagalog-base-uncased-generator')
tokenizer = AutoTokenizer.from_pretrained('jcblaise/electra-tagalog-base-uncased-generator', do_lower_case=True)
```
Finetuning scripts and other utilities we use for our projects can be found in our centralized repository at https://github.com/jcblaisecruz02/Filipino-Text-Benchmarks
## Citations
All model details and training setups can be found in our papers. If you use our model or find it useful in your projects, please cite our work:
```
@article{cruz2020investigating,
title={Investigating the True Performance of Transformers in Low-Resource Languages: A Case Study in Automatic Corpus Creation},
author={Jan Christian Blaise Cruz and Jose Kristian Resabal and James Lin and Dan John Velasco and Charibeth Cheng},
journal={arXiv preprint arXiv:2010.11574},
year={2020}
}
```
## Data and Other Resources
Data used to train this model as well as other benchmark datasets in Filipino can be found on my website at https://blaisecruz.com
## Contact
If you have questions, concerns, or if you just want to chat about NLP and low-resource languages in general, you may reach me through my work email at jan_christian_cruz@dlsu.edu.ph
---
language: tl
tags:
- electra
- tagalog
- filipino
license: gpl-3.0
inference: false
---
# ELECTRA Tagalog Small Cased Discriminator
Tagalog ELECTRA model pretrained with a large corpus scraped from the internet. This model is part of a larger research project. We open-source the model to allow greater usage within the Filipino NLP community.
This is the discriminator model, which is the main Transformer used for finetuning to downstream tasks. For generation, mask-filling, and retraining, refer to the Generator models.
## Usage
The model can be loaded and used in both PyTorch and TensorFlow through the HuggingFace Transformers package.
```python
from transformers import TFAutoModel, AutoModel, AutoTokenizer
# TensorFlow
model = TFAutoModel.from_pretrained('jcblaise/electra-tagalog-small-cased-discriminator', from_pt=True)
tokenizer = AutoTokenizer.from_pretrained('jcblaise/electra-tagalog-small-cased-discriminator', do_lower_case=False)
# PyTorch
model = AutoModel.from_pretrained('jcblaise/electra-tagalog-small-cased-discriminator')
tokenizer = AutoTokenizer.from_pretrained('jcblaise/electra-tagalog-small-cased-discriminator', do_lower_case=False)
```
Finetuning scripts and other utilities we use for our projects can be found in our centralized repository at https://github.com/jcblaisecruz02/Filipino-Text-Benchmarks
## Citations
All model details and training setups can be found in our papers. If you use our model or find it useful in your projects, please cite our work:
```
@article{cruz2020investigating,
title={Investigating the True Performance of Transformers in Low-Resource Languages: A Case Study in Automatic Corpus Creation},
author={Jan Christian Blaise Cruz and Jose Kristian Resabal and James Lin and Dan John Velasco and Charibeth Cheng},
journal={arXiv preprint arXiv:2010.11574},
year={2020}
}
```
## Data and Other Resources
Data used to train this model, as well as other benchmark datasets in Filipino, can be found on my website at https://blaisecruz.com
## Contact
If you have questions, concerns, or if you just want to chat about NLP and low-resource languages in general, you may reach me through my work email at jan_christian_cruz@dlsu.edu.ph
---
language: tl
tags:
- electra
- tagalog
- filipino
license: gpl-3.0
inference: false
---
# ELECTRA Tagalog Small Cased Generator
Tagalog ELECTRA model pretrained on a large corpus scraped from the internet. This model is part of a larger research project. We open-source the model to allow greater usage within the Filipino NLP community.
This is the generator model used to sample synthetic text and pretrain the discriminator. Only use this model for retraining and mask-filling. For downstream tasks, please use the discriminator models instead.
## Usage
The model can be loaded and used in both PyTorch and TensorFlow through the HuggingFace Transformers package.
```python
from transformers import TFAutoModel, AutoModel, AutoTokenizer
# TensorFlow (from_pt=True converts the PyTorch checkpoint on load)
model = TFAutoModel.from_pretrained('jcblaise/electra-tagalog-small-cased-generator', from_pt=True)
tokenizer = AutoTokenizer.from_pretrained('jcblaise/electra-tagalog-small-cased-generator', do_lower_case=False)
# PyTorch (use either this block or the TensorFlow one; both assign to `model`)
model = AutoModel.from_pretrained('jcblaise/electra-tagalog-small-cased-generator')
tokenizer = AutoTokenizer.from_pretrained('jcblaise/electra-tagalog-small-cased-generator', do_lower_case=False)
```
Finetuning scripts and other utilities we use for our projects can be found in our centralized repository at https://github.com/jcblaisecruz02/Filipino-Text-Benchmarks
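Mask-filling with the generator could look roughly like the sketch below, which uses the `fill-mask` pipeline from Transformers. The helpers `mask_word` and `fill_mask` are hypothetical names, the Tagalog sentence is only an illustration, and `[MASK]` is assumed to be the mask token of this checkpoint's BERT-style tokenizer (check `tokenizer.mask_token` to confirm).

```python
def mask_word(sentence, position, mask_token="[MASK]"):
    """Replace the whitespace-separated word at `position` with the mask token."""
    words = sentence.split()
    words[position] = mask_token
    return " ".join(words)


def fill_mask(masked_sentence, model_name="jcblaise/electra-tagalog-small-cased-generator"):
    """Return the pipeline's candidate fillers for the masked position."""
    # Imported lazily so mask_word works without transformers installed.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_name, tokenizer=model_name)
    return unmasker(masked_sentence)


# Example call (downloads the checkpoint on first use):
# masked = mask_word("Kumain ako ng mansanas", 3)  # "Kumain ako ng [MASK]"
# for candidate in fill_mask(masked):
#     print(candidate["token_str"], candidate["score"])
```

Each candidate returned by the pipeline is a dictionary that includes the predicted token and its score.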
## Citations
All model details and training setups can be found in our papers. If you use our model or find it useful in your projects, please cite our work:
```
@article{cruz2020investigating,
title={Investigating the True Performance of Transformers in Low-Resource Languages: A Case Study in Automatic Corpus Creation},
author={Jan Christian Blaise Cruz and Jose Kristian Resabal and James Lin and Dan John Velasco and Charibeth Cheng},
journal={arXiv preprint arXiv:2010.11574},
year={2020}
}
```
## Data and Other Resources
Data used to train this model, as well as other benchmark datasets in Filipino, can be found on my website at https://blaisecruz.com
## Contact
If you have questions, concerns, or if you just want to chat about NLP and low-resource languages in general, you may reach me through my work email at jan_christian_cruz@dlsu.edu.ph
---
language: tl
tags:
- electra
- tagalog
- filipino
license: gpl-3.0
inference: false
---
# ELECTRA Tagalog Small Uncased Discriminator
Tagalog ELECTRA model pretrained on a large corpus scraped from the internet. This model is part of a larger research project. We open-source the model to allow greater usage within the Filipino NLP community.
This is the discriminator model, which is the main Transformer used for finetuning on downstream tasks. For generation, mask-filling, and retraining, refer to the Generator models.
## Usage
The model can be loaded and used in both PyTorch and TensorFlow through the HuggingFace Transformers package.
```python
from transformers import TFAutoModel, AutoModel, AutoTokenizer
# TensorFlow (from_pt=True converts the PyTorch checkpoint on load)
model = TFAutoModel.from_pretrained('jcblaise/electra-tagalog-small-uncased-discriminator', from_pt=True)
# Uncased checkpoint: lowercase the input so it matches the vocabulary
tokenizer = AutoTokenizer.from_pretrained('jcblaise/electra-tagalog-small-uncased-discriminator', do_lower_case=True)
# PyTorch (use either this block or the TensorFlow one; both assign to `model`)
model = AutoModel.from_pretrained('jcblaise/electra-tagalog-small-uncased-discriminator')
tokenizer = AutoTokenizer.from_pretrained('jcblaise/electra-tagalog-small-uncased-discriminator', do_lower_case=True)
```
Finetuning scripts and other utilities we use for our projects can be found in our centralized repository at https://github.com/jcblaisecruz02/Filipino-Text-Benchmarks
## Citations
All model details and training setups can be found in our papers. If you use our model or find it useful in your projects, please cite our work:
```
@article{cruz2020investigating,
title={Investigating the True Performance of Transformers in Low-Resource Languages: A Case Study in Automatic Corpus Creation},
author={Jan Christian Blaise Cruz and Jose Kristian Resabal and James Lin and Dan John Velasco and Charibeth Cheng},
journal={arXiv preprint arXiv:2010.11574},
year={2020}
}
```
## Data and Other Resources
Data used to train this model, as well as other benchmark datasets in Filipino, can be found on my website at https://blaisecruz.com
## Contact
If you have questions, concerns, or if you just want to chat about NLP and low-resource languages in general, you may reach me through my work email at jan_christian_cruz@dlsu.edu.ph
---
language: tl
tags:
- electra
- tagalog
- filipino
license: gpl-3.0
inference: false
---
# ELECTRA Tagalog Small Uncased Generator
Tagalog ELECTRA model pretrained on a large corpus scraped from the internet. This model is part of a larger research project. We open-source the model to allow greater usage within the Filipino NLP community.
This is the generator model used to sample synthetic text and pretrain the discriminator. Only use this model for retraining and mask-filling. For downstream tasks, please use the discriminator models instead.
## Usage
The model can be loaded and used in both PyTorch and TensorFlow through the HuggingFace Transformers package.
```python
from transformers import TFAutoModel, AutoModel, AutoTokenizer
# TensorFlow (from_pt=True converts the PyTorch checkpoint on load)
model = TFAutoModel.from_pretrained('jcblaise/electra-tagalog-small-uncased-generator', from_pt=True)
# Uncased checkpoint: lowercase the input so it matches the vocabulary
tokenizer = AutoTokenizer.from_pretrained('jcblaise/electra-tagalog-small-uncased-generator', do_lower_case=True)
# PyTorch (use either this block or the TensorFlow one; both assign to `model`)
model = AutoModel.from_pretrained('jcblaise/electra-tagalog-small-uncased-generator')
tokenizer = AutoTokenizer.from_pretrained('jcblaise/electra-tagalog-small-uncased-generator', do_lower_case=True)
```
Finetuning scripts and other utilities we use for our projects can be found in our centralized repository at https://github.com/jcblaisecruz02/Filipino-Text-Benchmarks
## Citations
All model details and training setups can be found in our papers. If you use our model or find it useful in your projects, please cite our work:
```
@article{cruz2020investigating,
title={Investigating the True Performance of Transformers in Low-Resource Languages: A Case Study in Automatic Corpus Creation},
author={Jan Christian Blaise Cruz and Jose Kristian Resabal and James Lin and Dan John Velasco and Charibeth Cheng},
journal={arXiv preprint arXiv:2010.11574},
year={2020}
}
```
## Data and Other Resources
Data used to train this model, as well as other benchmark datasets in Filipino, can be found on my website at https://blaisecruz.com
## Contact
If you have questions, concerns, or if you just want to chat about NLP and low-resource languages in general, you may reach me through my work email at jan_christian_cruz@dlsu.edu.ph