# CodeBERTa-language-id: The World’s fanciest programming language identification algo 🤯
To demonstrate the usefulness of our CodeBERTa pretrained model on downstream tasks beyond language modeling, we fine-tune the [`CodeBERTa-small-v1`](https://huggingface.co/huggingface/CodeBERTa-small-v1) checkpoint on the task of classifying a sample of code into the programming language it's written in (*programming language identification*).
We add a sequence classification head on top of the model.
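Concretely, this amounts to loading the pretrained encoder with a freshly initialized classification head, roughly as in the sketch below (assuming `num_labels=6` for the six supported languages listed further down; the actual fine-tuning script may differ):
```python
from transformers import RobertaForSequenceClassification, RobertaTokenizer

# Load the pretrained CodeBERTa-small-v1 encoder and attach a randomly
# initialized sequence classification head with one logit per language.
model = RobertaForSequenceClassification.from_pretrained(
    "huggingface/CodeBERTa-small-v1",
    num_labels=6,  # go, java, javascript, php, python, ruby
)
tokenizer = RobertaTokenizer.from_pretrained("huggingface/CodeBERTa-small-v1")
```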
On the evaluation dataset, we attain an eval accuracy and F1 score > 0.999, which is not surprising given that the task of language identification is relatively easy (see below for an intuition as to why).
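The examples below call a `pipeline` object; a minimal sketch of how it could be built from the fine-tuned checkpoint follows (the checkpoint name `huggingface/CodeBERTa-language-id` is an assumption), along with an unambiguous JavaScript assignment as a baseline:
```python
from transformers import RobertaForSequenceClassification, RobertaTokenizer, TextClassificationPipeline

CODEBERTA_LANGUAGE_ID = "huggingface/CodeBERTa-language-id"  # assumed checkpoint name

pipeline = TextClassificationPipeline(
    model=RobertaForSequenceClassification.from_pretrained(CODEBERTA_LANGUAGE_ID),
    tokenizer=RobertaTokenizer.from_pretrained(CODEBERTA_LANGUAGE_ID),
)

# A plain JavaScript assignment should be classified as JavaScript with high
# confidence (exact score will vary).
pipeline("const foo = 'bar'")
# expected: [{'label': 'javascript', 'score': ...}]
```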
What if I remove the `const` token from the assignment?
```python
pipeline("foo = 'bar'")
# [{'label': 'javascript', 'score': 0.7176245}]
```
For some reason, this is still statistically detected as JS code, even though it's also valid Python code. However, if we slightly tweak it:
```python
pipeline("foo = u'bar'")
# [{'label': 'python', 'score': 0.7638422}]
```
This is now detected as Python (notice the `u` string prefix).
Okay, enough with the JS and Python domination already! Let's try fancier languages:
```python
pipeline("echo $FOO")
# [{'label': 'php', 'score': 0.9995257}]
```
(Yes, I used the word "fancy" to describe PHP 😅)
```python
pipeline("outcome := rand.Intn(6) + 1")
# [{'label': 'go', 'score': 0.9936151}]
```
Why is the problem of language identification so easy (with the correct toolkit)? Because code's syntax is rigid, and short tokens such as `:=` (Go's short variable declaration operator) are near-perfect predictors of the underlying language:
```python
pipeline(":=")
# [{'label': 'go', 'score': 0.9998052}]
```
By the way, because we trained our own custom tokenizer on the [CodeSearchNet](https://github.blog/2019-09-26-introducing-the-codesearchnet-challenge/) dataset, and it handles streams of bytes in a very generic way, syntactic constructs such as `:=` are represented by a single token:
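As a quick sanity check (a sketch; the exact token string, including the leading-space marker `Ġ`, may differ from what is shown):
```python
from transformers import RobertaTokenizer

# Assumption: the custom byte-level BPE tokenizer ships with the pretrained checkpoint.
tokenizer = RobertaTokenizer.from_pretrained("huggingface/CodeBERTa-small-v1")

print(tokenizer.tokenize("outcome := 1"))
# ':=' is expected to appear as a single piece (e.g. 'Ġ:='), not as ':' followed by '='.
```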
## About the base model: CodeBERTa-small-v1
CodeBERTa is a RoBERTa-like model trained on the [CodeSearchNet](https://github.blog/2019-09-26-introducing-the-codesearchnet-challenge/) dataset from GitHub.
Supported languages:
```shell
"go"
"java"
"javascript"
"php"
"python"
"ruby"
```
The **tokenizer** is a Byte-level BPE tokenizer trained on the corpus using Hugging Face `tokenizers`.
Because it is trained on a corpus of code (vs. natural language), it encodes the corpus efficiently (the resulting sequences are 33% to 50% shorter than the same corpus tokenized by the GPT-2/RoBERTa tokenizers).
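A minimal sketch of how such a tokenizer can be trained with `tokenizers` (the file path, vocabulary size, and frequency cutoff below are assumptions, not the exact settings used):
```python
from tokenizers import ByteLevelBPETokenizer

# Assumption: the CodeSearchNet functions have been dumped to a plain-text file.
files = ["codesearchnet_functions.txt"]  # hypothetical path

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=files,
    vocab_size=52_000,  # assumed vocabulary size
    min_frequency=2,    # assumed frequency cutoff
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

tokenizer.save_model("CodeBERTa-tokenizer")  # writes vocab.json and merges.txt
```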
The (small) **model** is a 6-layer, 84M-parameter, RoBERTa-like Transformer model – the same number of layers & heads as DistilBERT – initialized from the default initialization settings and trained from scratch on the full corpus (~2M functions) for 5 epochs.
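For reference, a configuration matching that description could be instantiated roughly as follows (only the 6 layers and 12 heads follow from the description above, via the DistilBERT comparison; the other hyperparameters are assumptions, and the masked-language-modeling training loop is omitted):
```python
from transformers import RobertaConfig, RobertaForMaskedLM

# 6 layers / 12 heads as stated above; vocab size and sequence length are assumptions.
config = RobertaConfig(
    vocab_size=52_000,
    num_hidden_layers=6,
    num_attention_heads=12,
    hidden_size=768,
    max_position_embeddings=514,
    type_vocab_size=1,
)

model = RobertaForMaskedLM(config)  # randomly initialized, trained from scratch
print(f"{model.num_parameters():,} parameters")  # on the order of 84M with these settings
```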