# UmBERTo Wikipedia Uncased + Italian SQuAD v1 📚 🧐 ❓
[UmBERTo-Wikipedia-Uncased](https://huggingface.co/Musixmatch/umberto-wikipedia-uncased-v1) fine-tuned on the [Italian SQuAD v1 dataset](https://github.com/crux82/squad-it) for the **Q&A** downstream task.
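For a quick try-out, here is a minimal sketch using the 🤗 Transformers `pipeline`; the model ID below is a placeholder, since the exact Hub repository of this fine-tuned model is not stated here:

```python
from transformers import pipeline

# Placeholder ID: substitute this model's actual Hub repository.
model_id = "your-username/umberto-wikipedia-uncased-squad-it"

qa = pipeline("question-answering", model=model_id)

result = qa(
    question="Su quale corpus è stato addestrato UmBERTo-Wikipedia-Uncased?",
    context="UmBERTo-Wikipedia-Uncased è stato addestrato su un corpus di circa "
            "7 GB estratto dalla Wikipedia italiana.",
)
print(result)  # {'score': ..., 'start': ..., 'end': ..., 'answer': ...}
```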
## Details of the downstream task (Q&A) - Model 🧠
[UmBERTo](https://github.com/musixmatchresearch/umberto) is a RoBERTa-based language model trained on large Italian corpora. It uses two innovative approaches: SentencePiece and Whole Word Masking.
UmBERTo-Wikipedia-Uncased was trained on a relatively small corpus (~7 GB) extracted from the Italian Wikipedia (Wikipedia-ITA).
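Because UmBERTo uses SentencePiece, raw text is segmented into subword pieces (the `▁` symbol marks the start of a word). A quick way to inspect this with the base checkpoint linked above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")

# SentencePiece splits the text into subword pieces; "▁" marks a word boundary.
print(tokenizer.tokenize("umberto è un modello linguistico per l'italiano"))
```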
## Details of the downstream task (Q&A) - Dataset 📚
[SQuAD](https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/) [Rajpurkar et al., 2016] is a large-scale dataset for training question answering systems on factoid questions. It contains more than 100,000 question-answer pairs about passages from 536 articles chosen from various domains of Wikipedia.
**SQuAD-it** is derived from SQuAD through semi-automatic translation of the dataset into Italian. It is a large-scale resource for open-domain question answering on factoid questions in Italian, containing more than 60,000 question/answer pairs derived from the original English dataset.
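SQuAD-it keeps the SQuAD v1.1 schema: a context passage, a question, and answers given as text spans with character offsets. A minimal look at the data, assuming the `squad_it` dataset available through 🤗 Datasets mirrors the original repository:

```python
from datasets import load_dataset

squad_it = load_dataset("squad_it")  # "train" and "test" splits

example = squad_it["train"][0]
print(example["question"])
print(example["context"][:200])
print(example["answers"])  # {'text': [...], 'answer_start': [...]}
```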
## Model training 🏋️
The model was trained on a Tesla P100 GPU with 25 GB of RAM.
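The exact training command is not preserved in this card. The following is a minimal sketch of an equivalent fine-tuning run with the 🤗 Transformers `Trainer`; apart from the 10 epochs and the 1,000-step checkpoint interval described below, every hyperparameter is an assumption:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    default_data_collator,
)

base_model = "Musixmatch/umberto-wikipedia-uncased-v1"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForQuestionAnswering.from_pretrained(base_model)

squad_it = load_dataset("squad_it")

def prepare_train_features(examples):
    """Standard SQuAD preprocessing: tokenize question/context pairs and
    map the character-level answer span to token start/end positions."""
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",   # truncate the context, never the question
        max_length=384,
        padding="max_length",
        return_offsets_mapping=True,
    )
    start_positions, end_positions = [], []
    for i, offsets in enumerate(tokenized["offset_mapping"]):
        answer = examples["answers"][i]
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])
        sequence_ids = tokenized.sequence_ids(i)
        # Token span covered by the context (sequence id 1).
        ctx_start = sequence_ids.index(1)
        ctx_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)
        if not (offsets[ctx_start][0] <= start_char and offsets[ctx_end][1] >= end_char):
            # Answer truncated away: point both labels at the CLS token.
            start_positions.append(0)
            end_positions.append(0)
            continue
        idx = ctx_start
        while idx <= ctx_end and offsets[idx][0] <= start_char:
            idx += 1
        start_positions.append(idx - 1)
        idx = ctx_end
        while idx >= ctx_start and offsets[idx][1] >= end_char:
            idx -= 1
        end_positions.append(idx + 1)
    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions
    tokenized.pop("offset_mapping")
    return tokenized

train_dataset = squad_it["train"].map(
    prepare_train_features, batched=True,
    remove_columns=squad_it["train"].column_names,
)

# 10 epochs and 1,000-step checkpointing come from this card;
# batch size and learning rate are assumed values.
args = TrainingArguments(
    output_dir="umberto-wikipedia-uncased-squad-it",
    num_train_epochs=10,
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    save_steps=1000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=default_data_collator,
    tokenizer=tokenizer,
).train()
```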
With 10 epochs the model overfits the training set, so I evaluated the checkpoints saved during training (every 1,000 steps) and chose the best one (in this case, the checkpoint at 17,000 steps).
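A checkpoint sweep of that kind can be scripted. The sketch below (assuming the hypothetical output directory from the sketch above, and that each checkpoint folder also contains the tokenizer files) computes exact match and F1 on the SQuAD-it test split for every saved checkpoint:

```python
import os

from datasets import load_dataset
from evaluate import load
from transformers import pipeline

squad_metric = load("squad")
test_set = load_dataset("squad_it", split="test")

output_dir = "umberto-wikipedia-uncased-squad-it"  # same directory as above
for name in sorted(os.listdir(output_dir)):
    if not name.startswith("checkpoint-"):
        continue
    qa = pipeline("question-answering", model=os.path.join(output_dir, name))
    predictions, references = [], []
    for ex in test_set:
        pred = qa(question=ex["question"], context=ex["context"])
        predictions.append({"id": ex["id"], "prediction_text": pred["answer"]})
        references.append({"id": ex["id"], "answers": ex["answers"]})
    scores = squad_metric.compute(predictions=predictions, references=references)
    print(name, scores)  # {'exact_match': ..., 'f1': ...}
```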