<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# What 馃 Transformers can do

馃 Transformers is a library of pretrained state-of-the-art models for natural language processing (NLP), computer vision, and audio and speech processing tasks. Not only does the library contain Transformer models, but it also has non-Transformer models like modern convolutional networks for computer vision tasks. If you look at some of the most popular consumer products today, like smartphones, apps, and televisions, odds are that some kind of deep learning technology is behind it. Want to remove a background object from a picture taken by your smartphone? This is an example of a panoptic segmentation task (don't worry if you don't know what this means yet, we'll describe it in the following sections!). 

This page provides an overview of the different speech and audio, computer vision, and NLP tasks that can be solved with the 🤗 Transformers library in just three lines of code!

## Audio

Audio and speech processing tasks are a little different from the other modalities mainly because audio as an input is a continuous signal. Unlike text, a raw audio waveform can't be neatly split into discrete chunks the way a sentence can be divided into words. To get around this, the raw audio signal is typically sampled at regular intervals. If you take more samples within an interval, the sampling rate is higher, and the audio more closely resembles the original audio source.
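
For example, here's a minimal sketch of resampling with the 🤗 Datasets library (a separate library, used here purely for illustration): casting an audio column to a new sampling rate resamples each waveform on the fly.

```py
>>> from datasets import Audio, load_dataset

>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
>>> # resample every example to 16kHz, the rate many pretrained speech models expect
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
>>> dataset[0]["audio"]["sampling_rate"]
16000
```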

Previous approaches preprocessed the audio to extract useful features from it. It is now more common to start audio and speech processing tasks by directly feeding the raw audio waveform to a feature encoder to extract an audio representation. This simplifies the preprocessing step and allows the model to learn the most essential features.
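
As a small sketch of what "feeding the raw waveform" looks like in practice (the checkpoint and dummy waveform here are just for illustration), the preprocessing step amounts to little more than padding and normalization:

```py
>>> import numpy as np
>>> from transformers import AutoFeatureExtractor

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
>>> waveform = np.zeros(16_000, dtype=np.float32)  # one second of dummy audio at 16kHz
>>> # the raw waveform goes in as-is; the model's feature encoder learns the rest
>>> inputs = feature_extractor(waveform, sampling_rate=16_000, return_tensors="pt")
>>> inputs["input_values"].shape
torch.Size([1, 16000])
```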

### Audio classification

Audio classification is a task that labels audio data from a predefined set of classes. It is a broad category with many specific applications, some of which include:

* acoustic scene classification: label audio with a scene label ("office", "beach", "stadium")
* acoustic event detection: label audio with a sound event label ("car horn", "whale calling", "glass breaking")
* tagging: label audio containing multiple sounds (birdsongs, speaker identification in a meeting)
* music classification: label music with a genre label ("metal", "hip-hop", "country")

```py
>>> from transformers import pipeline

>>> classifier = pipeline(task="audio-classification", model="superb/hubert-base-superb-er")
>>> preds = classifier("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
>>> preds
[{'score': 0.4532, 'label': 'hap'},
 {'score': 0.3622, 'label': 'sad'},
 {'score': 0.0943, 'label': 'neu'},
 {'score': 0.0903, 'label': 'ang'}]
```

### Automatic speech recognition

Automatic speech recognition (ASR) transcribes speech into text. It is one of the most common audio tasks due partly to speech being such a natural form of human communication. Today, ASR systems are embedded in "smart" technology products like speakers, phones, and cars. We can ask our virtual assistants to play music, set reminders, and tell us the weather. 

One of the key challenges Transformer architectures have helped with is low-resource languages. After pretraining on large amounts of speech data, a model finetuned on only one hour of labeled speech in a low-resource language can still produce high-quality results comparable to previous ASR systems trained on 100x more labeled data.

```py
>>> from transformers import pipeline

>>> transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-small")
>>> transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'}
```

## Computer vision

One of the earliest successful computer vision tasks was recognizing images of zip code numbers using a [convolutional neural network (CNN)](glossary#convolution). An image is composed of pixels, and each pixel has a numerical value. This makes it easy to represent an image as a matrix of pixel values. Each particular combination of pixel values describes the colors of an image.
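
To make this concrete, here's a small sketch (using Pillow and NumPy, purely for illustration) that loads the example image used throughout this section and inspects its pixel matrix:

```py
>>> import numpy as np
>>> import requests
>>> from PIL import Image

>>> url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> # a (height, width, channels) matrix where every entry is a pixel value from 0-255
>>> np.array(image).shape  # doctest: +SKIP
```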

Two general ways computer vision tasks can be solved are:

1. Use convolutions to learn an image's hierarchical features, from low-level features up to high-level abstract ones.
2. Split an image into patches and use a Transformer to gradually learn how the patches relate to each other to form an image. Unlike the bottom-up approach favored by a CNN, this is a bit like starting with a blurry image and gradually bringing it into focus (see the patch sketch below).
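
The patch idea behind the second approach is easy to see with plain NumPy. This sketch uses the 16x16 patch size of the original Vision Transformer, but the patch size is a modeling choice:

```py
>>> import numpy as np

>>> image = np.random.rand(224, 224, 3)  # a dummy 224x224 RGB image
>>> patch_size = 16
>>> patches = image.reshape(14, patch_size, 14, patch_size, 3).transpose(0, 2, 1, 3, 4)
>>> patches = patches.reshape(-1, patch_size, patch_size, 3)
>>> patches.shape  # a sequence of 196 patches the Transformer treats much like tokens
(196, 16, 16, 3)
```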

### Image classification

Image classification labels an entire image from a predefined set of classes. Like most classification tasks, there are many practical use cases for image classification, some of which include:

* healthcare: label medical images to detect disease or monitor patient health
* environment: label satellite images to monitor deforestation, inform wildland management or detect wildfires
* agriculture: label images of crops to monitor plant health or satellite images for land use monitoring 
* ecology: label images of animal or plant species to monitor wildlife populations or track endangered species

```py
>>> from transformers import pipeline

>>> classifier = pipeline(task="image-classification")
>>> preds = classifier(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
... )
>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
>>> print(*preds, sep="\n")
{'score': 0.4403, 'label': 'lynx, catamount'}
{'score': 0.0343, 'label': 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor'}
{'score': 0.0321, 'label': 'snow leopard, ounce, Panthera uncia'}
{'score': 0.0235, 'label': 'Egyptian cat'}
{'score': 0.023, 'label': 'tiger cat'}
```

### Object detection

Unlike image classification, object detection identifies multiple objects within an image along with each object's position (defined by a bounding box). Some example applications of object detection include:

* self-driving vehicles: detect everyday traffic objects such as other vehicles, pedestrians, and traffic lights
* remote sensing: disaster monitoring, urban planning, and weather forecasting
* defect detection: detect cracks or structural damage in buildings, and manufacturing defects

```py
>>> from transformers import pipeline

>>> detector = pipeline(task="object-detection")
>>> preds = detector(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
... )
>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"], "box": pred["box"]} for pred in preds]
>>> preds
[{'score': 0.9865,
  'label': 'cat',
  'box': {'xmin': 178, 'ymin': 154, 'xmax': 882, 'ymax': 598}}]
```

### Image segmentation

Image segmentation is a pixel-level task that assigns every pixel in an image to a class. It differs from object detection, which uses bounding boxes to label and predict objects in an image, because segmentation is more granular: it can detect objects at the pixel level. There are several types of image segmentation:

* semantic segmentation: label every pixel with a class ("sky", "road", "person") without distinguishing separate instances of the same class
* instance segmentation: in addition to labeling the class of an object, it also labels each distinct instance of an object ("dog-1", "dog-2")
* panoptic segmentation: a combination of semantic and instance segmentation; it labels each pixel with a semantic class **and** each distinct instance of an object

Segmentation tasks are helpful in self-driving vehicles to create a pixel-level map of the world around them so they can navigate safely around pedestrians and other vehicles. It is also useful for medical imaging, where the task's finer granularity can help identify abnormal cells or organ features. Image segmentation can also be used in ecommerce to virtually try on clothes or create augmented reality experiences by overlaying objects in the real world through your camera.

```py
>>> from transformers import pipeline

>>> segmenter = pipeline(task="image-segmentation")
>>> preds = segmenter(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
... )
>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
>>> preds
[{'score': 0.9856, 'label': 'LABEL_184'},
 {'score': 0.9976, 'label': 'snow'},
 {'score': 0.9962, 'label': 'cat'}]
```

### Depth estimation

Depth estimation predicts the distance of each pixel in an image from the camera. This computer vision task is especially important for scene understanding and reconstruction. For example, in self-driving cars, vehicles need to understand how far objects like pedestrians, traffic signs, and other vehicles are to avoid obstacles and collisions. Depth information is also helpful for constructing 3D representations from 2D images and can be used to create high-quality 3D representations of biological structures or buildings.

There are two approaches to depth estimation:

* stereo: depths are estimated by comparing two images of the same scene taken from slightly different angles
* monocular: depths are estimated from a single image

```py
>>> from transformers import pipeline

>>> depth_estimator = pipeline(task="depth-estimation")
>>> preds = depth_estimator(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
... )
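>>> # the pipeline returns a dict: preds["predicted_depth"] is the raw depth tensor and
>>> # preds["depth"] is a PIL image of the depth map (key names assumed from current behavior)
>>> preds["depth"]  # doctest: +SKIP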
```

## Natural language processing

NLP tasks are among the most common types of tasks because text is such a natural way for us to communicate. To get text into a format recognized by a model, it needs to be tokenized. This means dividing a sequence of text into separate words or subwords (tokens) and then converting these tokens into numbers. As a result, you can represent a sequence of text as a sequence of numbers, and once you have a sequence of numbers, it can be input into a model to solve all sorts of NLP tasks!
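
As a quick sketch of both steps (the checkpoint here is only for illustration):

```py
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
>>> # step 1: split the text into tokens; step 2: convert each token into a number
>>> tokenizer.tokenize("Tokenizing text is easy!")  # doctest: +SKIP
>>> tokenizer("Tokenizing text is easy!")["input_ids"]  # doctest: +SKIP
```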

### Text classification

Like classification tasks in any modality, text classification labels a sequence of text (at the sentence, paragraph, or document level) from a predefined set of classes. There are many practical applications for text classification, some of which include:

* sentiment analysis: label text according to some polarity like `positive` or `negative`, which can inform and support decision-making in fields like politics, finance, and marketing
* content classification: label text according to some topic to help organize and filter information in news and social media feeds (`weather`, `sports`, `finance`, etc.)

```py
>>> from transformers import pipeline

>>> classifier = pipeline(task="sentiment-analysis")
>>> preds = classifier("Hugging Face is the best thing since sliced bread!")
>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
>>> preds
[{'score': 0.9991, 'label': 'POSITIVE'}]
```

### Token classification

In any NLP task, text is preprocessed by separating the sequence of text into individual words or subwords. These are known as [tokens](/glossary#token). Token classification assigns each token a label from a predefined set of classes. 

Two common types of token classification are:

* named entity recognition (NER): label a token according to an entity category like organization, person, location or date. NER is especially popular in biomedical settings, where it can label genes, proteins, and drug names.
* part-of-speech tagging (POS): label a token according to its part-of-speech like noun, verb, or adjective. POS is useful for helping translation systems understand how two identical words are grammatically different (bank as a noun versus bank as a verb).

```py
>>> from transformers import pipeline

>>> classifier = pipeline(task="ner")
>>> preds = classifier("Hugging Face is a French company based in New York City.")
>>> preds = [
...     {
...         "entity": pred["entity"],
...         "score": round(pred["score"], 4),
...         "index": pred["index"],
...         "word": pred["word"],
...         "start": pred["start"],
...         "end": pred["end"],
...     }
...     for pred in preds
... ]
>>> print(*preds, sep="\n")
{'entity': 'I-ORG', 'score': 0.9968, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}
{'entity': 'I-ORG', 'score': 0.9293, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}
{'entity': 'I-ORG', 'score': 0.9763, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}
{'entity': 'I-MISC', 'score': 0.9983, 'index': 6, 'word': 'French', 'start': 18, 'end': 24}
{'entity': 'I-LOC', 'score': 0.999, 'index': 10, 'word': 'New', 'start': 42, 'end': 45}
{'entity': 'I-LOC', 'score': 0.9987, 'index': 11, 'word': 'York', 'start': 46, 'end': 50}
{'entity': 'I-LOC', 'score': 0.9992, 'index': 12, 'word': 'City', 'start': 51, 'end': 55}
```

### Question answering

Question answering is another token-level task that returns an answer to a question, sometimes with context (open-domain) and other times without context (closed-domain). This task happens whenever we ask a virtual assistant something like whether a restaurant is open. It can also provide customer or technical support and help search engines retrieve the relevant information you're asking for. 

There are two common types of question answering:

* extractive: given a question and some context, the answer is a span of text from the context the model must extract
* abstractive: given a question and some context, the answer is generated from the context; this approach is handled by the [`Text2TextGenerationPipeline`] instead of the [`QuestionAnsweringPipeline`] shown below (a sketch of the abstractive route follows the extractive example)


```py
>>> from transformers import pipeline

>>> question_answerer = pipeline(task="question-answering")
>>> preds = question_answerer(
...     question="What is the name of the repository?",
...     context="The name of the repository is huggingface/transformers",
... )
>>> print(
...     f"score: {round(preds['score'], 4)}, start: {preds['start']}, end: {preds['end']}, answer: {preds['answer']}"
... )
score: 0.9327, start: 30, end: 54, answer: huggingface/transformers
```
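
The example above is extractive. For the abstractive approach, a minimal sketch hands the same question and context to the text2text-generation pipeline instead (the T5-style prompt format below is an assumption for illustration):

```py
>>> from transformers import pipeline

>>> text2text = pipeline(task="text2text-generation")
>>> text2text(
...     "question: What is the name of the repository? context: The name of the repository is huggingface/transformers"
... )  # doctest: +SKIP
```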

### Summarization

Summarization creates a shorter version of a text from a longer one while trying to preserve most of the meaning of the original document. Summarization is a sequence-to-sequence task; it outputs a shorter text sequence than the input. There are a lot of long-form documents that can be summarized to help readers quickly understand the main points. Legislative bills, legal and financial documents, patents, and scientific papers are a few examples of documents that could be summarized to save readers time and serve as a reading aid.

Like question answering, there are two types of summarization:

* extractive: identify and extract the most important sentences from the original text
* abstractive: generate the target summary (which may include new words not in the input document) from the original text; the [`SummarizationPipeline`] uses the abstractive approach

```py
>>> from transformers import pipeline

>>> summarizer = pipeline(task="summarization")
>>> summarizer(
...     "In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles."
... )
[{'summary_text': ' The Transformer is the first sequence transduction model based entirely on attention . It replaces the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention . For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers .'}]
```

### Translation

Translation converts a sequence of text in one language to another. It helps people from different backgrounds communicate with each other, makes content accessible to wider audiences, and can even serve as a learning tool for people studying a new language. Along with summarization, translation is a sequence-to-sequence task, meaning the model receives an input sequence and returns a target output sequence.

In the early days, translation models were mostly monolingual, but recently, there has been increasing interest in multilingual models that can translate between many pairs of languages.

```py
>>> from transformers import pipeline

>>> text = "translate English to French: Hugging Face is a community-based open-source platform for machine learning."
>>> translator = pipeline(task="translation", model="t5-small")
>>> translator(text)
[{'translation_text': "Hugging Face est une tribune communautaire de l'apprentissage des machines."}]
```

### Language modeling

Language modeling is a task that predicts a word in a sequence of text. It has become a very popular NLP task because a pretrained language model can be finetuned for many other downstream tasks. Lately, there has been a lot of interest in large language models (LLMs) which demonstrate zero- or few-shot learning. This means the model can solve tasks it wasn't explicitly trained to do! Language models can be used to generate fluent and convincing text, though you need to be careful since the text may not always be accurate.

There are two types of language modeling:

* causal: the model's objective is to predict the next token in a sequence, and future tokens are masked

    ```py
    >>> from transformers import pipeline

    >>> prompt = "Hugging Face is a community-based open-source platform for machine learning."
    >>> generator = pipeline(task="text-generation")
    >>> generator(prompt)  # doctest: +SKIP
    ```

* masked: the model's objective is to predict a masked token in a sequence with full access to the tokens in the sequence
    
    ```py
    >>> from transformers import pipeline

    >>> text = "Hugging Face is a community-based open-source <mask> for machine learning."
    >>> fill_mask = pipeline(task="fill-mask")
    >>> preds = fill_mask(text, top_k=1)
    >>> preds = [
    ...     {
    ...         "score": round(pred["score"], 4),
    ...         "token": pred["token"],
    ...         "token_str": pred["token_str"],
    ...         "sequence": pred["sequence"],
    ...     }
    ...     for pred in preds
    ... ]
    >>> preds
    [{'score': 0.2236,
      'token': 1761,
      'token_str': ' platform',
      'sequence': 'Hugging Face is a community-based open-source platform for machine learning.'}]
    ```

Hopefully, this page has given you some more background information about all the types of tasks in each modality and the practical importance of each one. In the next [section](tasks_explained), you'll learn **how** 🤗 Transformers work to solve these tasks.