<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# What 🤗 Transformers can do

🤗 Transformers is a library of pretrained state-of-the-art models for natural language processing (NLP), computer vision, and audio and speech processing tasks. Not only does the library contain Transformer models, but it also has non-Transformer models like modern convolutional networks for computer vision tasks. If you look at some of the most popular consumer products today, like smartphones, apps, and televisions, odds are that some kind of deep learning technology is behind them. Want to remove a background object from a picture taken by your smartphone? This is an example of a panoptic segmentation task (don't worry if you don't know what this means yet, we'll describe it in the following sections!).

This page provides an overview of the different speech and audio, computer vision, and NLP tasks that can be solved with the 🤗 Transformers library in just three lines of code!

## Audio

Audio and speech processing tasks are a little different from the other modalities mainly because audio as an input is a continuous signal. Unlike text, a raw audio waveform can't be neatly split into discrete chunks the way a sentence can be divided into words. To get around this, the raw audio signal is typically sampled at regular intervals. If you take more samples within an interval, the sampling rate is higher, and the audio more closely resembles the original audio source.
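
To make the idea of sampling concrete, here is a minimal sketch that measures the same tone at two sampling rates using only the standard library. The 440 Hz frequency, the 10 ms duration, and both rates are illustrative assumptions, and `sample_sine` is a hypothetical helper rather than anything provided by the library.

```python
import math


def sample_sine(frequency_hz, duration_s, sampling_rate_hz):
    """Measure a sine wave at evenly spaced points in time."""
    num_samples = round(duration_s * sampling_rate_hz)
    return [
        math.sin(2 * math.pi * frequency_hz * n / sampling_rate_hz)
        for n in range(num_samples)
    ]


# The same 10 ms of a 440 Hz tone at two sampling rates.
low_rate = sample_sine(440, 0.01, 8_000)
high_rate = sample_sine(440, 0.01, 16_000)

print(len(low_rate), len(high_rate))  # 80 160
```

Doubling the sampling rate doubles the number of measurements taken over the same stretch of time, which is why the higher-rate version follows the original waveform more closely.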

Previous approaches preprocessed the audio to extract useful features from it. It is now more common to start audio and speech processing tasks by directly feeding the raw audio waveform to a feature encoder to extract an audio representation. This simplifies the preprocessing step and allows the model to learn the most essential features.

### Audio classification

Audio classification is a task that assigns audio data a label from a predefined set of classes. It is a broad category with many specific applications, some of which include:

* acoustic scene classification: label audio with a scene label ("office", "beach", "stadium")
* acoustic event detection: label audio with a sound event label ("car horn", "whale calling", "glass breaking")
* tagging: label audio containing multiple sounds (birdsongs, speaker identification in a meeting)
* music classification: label music with a genre label ("metal", "hip-hop", "country")

```py
>>> from transformers import pipeline

>>> classifier = pipeline(task="audio-classification")
>>> classifier("path/to/audio/file.mp3")
```

### Automatic speech recognition

Automatic speech recognition (ASR) transcribes speech into text. It is one of the most common audio tasks, partly because speech is such a natural form of human communication. Today, ASR systems are embedded in "smart" technology products like speakers, phones, and cars. We can ask our virtual assistants to play music, set reminders, and tell us the weather.

One of the key challenges Transformer architectures have helped with is low-resource languages. After a model is pretrained on large amounts of speech data, finetuning it on only one hour of labeled speech data in a low-resource language can still produce high-quality results compared to previous ASR systems trained on 100x more labeled data.

```py
>>> from transformers import pipeline

>>> transcriber = pipeline(task="automatic-speech-recognition")
>>> transcriber("path/to/audio/file.mp3")
```

## Computer vision

One of the earliest successful computer vision tasks was recognizing images of zip code numbers using a [convolutional neural network (CNN)](glossary#convolution). An image is composed of pixels, and each pixel has a numerical value. This makes it easy to represent an image as a matrix of pixel values. Each particular combination of pixel values describes the colors of an image.

Two general ways computer vision tasks can be solved are:

1. Use convolutions to learn the hierarchical features of an image, from low-level features to high-level abstract concepts.
2. Split an image into patches and use a Transformer to gradually learn how the patches relate to each other to form an image. Unlike the bottom-up approach favored by a CNN, this is kind of like starting out with a blurry image and then gradually bringing it into focus.
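
The sketch below illustrates two ideas from this section: an image stored as a matrix of pixel values, and the patch-splitting step that the second approach starts from. The 4x4 grayscale image and the 2x2 patch size are made-up toy values, and `split_into_patches` is a hypothetical helper, not a function from the library.

```python
def split_into_patches(image, patch_size):
    """Split a 2D grid of pixel values into non-overlapping square patches,
    listed row by row starting from the top-left corner."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [row[left:left + patch_size] for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


# A toy 4x4 grayscale image: each entry is one pixel's intensity.
image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]

patches = split_into_patches(image, patch_size=2)
print(len(patches))  # 4
print(patches[0])    # [[0, 1], [4, 5]]
```

Each patch then plays a role similar to a token in a sentence: it is flattened, embedded, and passed to the Transformer together with the other patches.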

### Image classification

Image classification labels an entire image from a predefined set of classes. Like most classification tasks, there are many practical use cases for image classification, some of which include:

* healthcare: label medical images to detect disease or monitor patient health
* environment: label satellite images to monitor deforestation, inform wildland management or detect wildfires
* agriculture: label images of crops to monitor plant health or satellite images for land use monitoring
* ecology: label images of animal or plant species to monitor wildlife populations or track endangered species

```py
>>> from transformers import pipeline

>>> classifier = pipeline(task="image-classification")
>>> classifier("path/to/image/file.jpg")
```

### Object detection

Unlike image classification, object detection identifies multiple objects within an image along with their positions (each defined by a bounding box). Some example applications of object detection include:

* self-driving vehicles: detect everyday traffic objects such as other vehicles, pedestrians, and traffic lights
* remote sensing: disaster monitoring, urban planning, and weather forecasting
* defect detection: detect cracks or structural damage in buildings, and manufacturing defects

```py
>>> from transformers import pipeline

>>> detector = pipeline(task="object-detection")
>>> detector("path/to/image/file.jpg")
```

### Image segmentation

Image segmentation is a pixel-level task that assigns every pixel in an image to a class. It is more granular than object detection, which uses bounding boxes to label and predict objects in an image, because segmentation can detect objects at the pixel level. There are several types of image segmentation:

* instance segmentation: in addition to labeling the class of an object, it also labels each distinct instance of an object ("dog-1", "dog-2")
* panoptic segmentation: a combination of semantic and instance segmentation; it labels each pixel with a semantic class **and** each distinct instance of an object
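
As a rough illustration of how these label types differ, the sketch below writes out per-pixel labels for a toy 3x4 image containing two dogs. The arrays, class ids, and the tuple layout of the panoptic view are all assumptions chosen for readability; they are not the output format of the pipeline shown later in this section.

```python
# Semantic map: one class id per pixel (0 = background, 1 = dog).
semantic = [
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
]
# Instance map: one object id per pixel (0 = no object).
instance = [
    [0, 1, 0, 2],
    [0, 1, 0, 2],
    [0, 0, 0, 0],
]
class_names = {0: "background", 1: "dog"}

# Panoptic view: every pixel carries both a class and an instance id.
panoptic = [
    [(class_names[c], i) for c, i in zip(sem_row, inst_row)]
    for sem_row, inst_row in zip(semantic, instance)
]
print(panoptic[0])
# [('background', 0), ('dog', 1), ('background', 0), ('dog', 2)]
```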

Segmentation tasks are helpful in self-driving vehicles to create a pixel-level map of the world around them so they can navigate safely around pedestrians and other vehicles. Segmentation is also useful for medical imaging, where the task's finer granularity can help identify abnormal cells or organ features. It can also be used in ecommerce to virtually try on clothes or to create augmented reality experiences by overlaying objects on the real world through your camera.

```py
>>> from transformers import pipeline

>>> segmenter = pipeline(task="image-segmentation")
>>> segmenter("path/to/image/file.jpg")
```

### Depth estimation

Depth estimation predicts the distance of each pixel in an image from the camera. This computer vision task is especially important for scene understanding and reconstruction. For example, self-driving cars need to understand how far away objects like pedestrians, traffic signs, and other vehicles are to avoid obstacles and collisions. Depth information is also helpful for constructing 3D representations from 2D images and can be used to create high-quality 3D representations of biological structures or buildings.

There are two approaches to depth estimation:

* stereo: depths are estimated by comparing two images of the same scene taken from slightly different angles
* monocular: depths are estimated from a single image

```py
>>> from transformers import pipeline

>>> depth_estimator = pipeline(task="depth-estimation")
>>> depth_estimator("path/to/image/file.jpg")
```

## Natural language processing

NLP tasks are among the most common types of tasks because text is such a natural way for us to communicate. To get text into a format recognized by a model, it needs to be tokenized. This means dividing a sequence of text into separate words or subwords (tokens) and then converting these tokens into numbers. As a result, you can represent a sequence of text as a sequence of numbers, and once you have a sequence of numbers, it can be input into a model to solve all sorts of NLP tasks!
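
As a rough illustration of that process, the sketch below splits text on whitespace and looks each token up in a tiny hand-written vocabulary. The vocabulary, the `[UNK]` fallback, and both helper functions are assumptions made for this example; the tokenizers that ship with real models use learned subword vocabularies and are considerably more involved.

```python
# Hypothetical toy vocabulary mapping tokens to integer ids.
vocab = {"[UNK]": 0, "hugging": 1, "face": 2, "is": 3, "great": 4}


def tokenize(text):
    """Lowercase the text and split it into whitespace-separated tokens."""
    return text.lower().split()


def convert_tokens_to_ids(tokens):
    """Map each token to its id, falling back to the unknown-token id."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


tokens = tokenize("Hugging Face is really great")
input_ids = convert_tokens_to_ids(tokens)

print(tokens)     # ['hugging', 'face', 'is', 'really', 'great']
print(input_ids)  # [1, 2, 3, 0, 4]
```

The list of integers in `input_ids` is the kind of numeric sequence a model consumes; the word missing from the toy vocabulary falls back to the unknown-token id.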

### Text classification

Like classification tasks in any modality, text classification labels a sequence of text (a sentence, a paragraph, or a document) from a predefined set of classes. There are many practical applications for text classification, some of which include:

* sentiment analysis: label text according to some polarity like `positive` or `negative`, which can inform and support decision-making in fields like politics, finance, and marketing
* content classification: label text according to some topic (`weather`, `sports`, `finance`, etc.) to help organize and filter information in news and social media feeds

```py
>>> from transformers import pipeline

>>> classifier = pipeline(task="sentiment-analysis")
>>> classifier("Hugging Face is the best thing since sliced bread!")
```

### Token classification

In any NLP task, text is preprocessed by separating the sequence of text into individual words or subwords. These are known as [tokens](/glossary#token). Token classification assigns each token a label from a predefined set of classes.

Two common types of token classification are:

* named entity recognition (NER): label a token according to an entity category like organization, person, location or date. NER is especially popular in biomedical settings, where it can label genes, proteins, and drug names.
* part-of-speech tagging (POS): label a token according to its part-of-speech like noun, verb, or adjective. POS is useful for helping translation systems understand how two identical words are grammatically different (bank as a noun versus bank as a verb).

```py
>>> from transformers import pipeline

>>> classifier = pipeline(task="ner")
>>> classifier("Hugging Face is a French company based in New York City.")
```

### Question answering

Question answering is another token-level task that returns an answer to a question, sometimes with context (open-domain) and other times without context (closed-domain). This task happens whenever we ask a virtual assistant something like whether a restaurant is open. It can also provide customer or technical support and help search engines retrieve the relevant information you're asking for.

There are two common types of question answering:

* extractive: given a question and some context, the answer is a span of text from the context the model must extract
* abstractive: given a question and some context, the answer is generated from the context; this approach is handled by the [`Text2TextGenerationPipeline`] instead of the [`QuestionAnsweringPipeline`] shown below


```py
>>> from transformers import pipeline

>>> question_answerer = pipeline(task="question-answering")
>>> question_answerer(
...     question="What is the name of the repository?",
...     context="The name of the repository is huggingface/transformers",
... )
```

### Summarization

Summarization creates a shorter version of a text from a longer one while trying to preserve most of the meaning of the original document. Summarization is a sequence-to-sequence task; it outputs a shorter text sequence than the input. There are a lot of long-form documents that can be summarized to help readers quickly understand the main points. Legislative bills, legal and financial documents, patents, and scientific papers are a few examples of documents that could be summarized to save readers time and serve as a reading aid.

Like question answering, there are two types of summarization:

* extractive: identify and extract the most important sentences from the original text
* abstractive: generate the target summary (which may include new words not in the input document) from the original text; the [`SummarizationPipeline`] uses the abstractive approach

```py
>>> from transformers import pipeline

>>> summarizer = pipeline(task="summarization")
>>> summarizer(
...     "Hugging Face is a French company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge which is visible from the window."
... )
```

### Translation

Translation converts a sequence of text in one language to another. It helps people from different backgrounds communicate with each other, makes content available to wider audiences, and can even serve as a tool for learning a new language. Along with summarization, translation is a sequence-to-sequence task, meaning the model receives an input sequence and returns a target output sequence.

In the early days, translation models were mostly monolingual, but recently, there has been increasing interest in multilingual models that can translate between many pairs of languages.

```py
>>> from transformers import pipeline

>>> text = "translate English to French: Hugging Face is a community-based open-source platform for machine learning."
>>> translator = pipeline(task="translation")
>>> translator(text)
```

### Language modeling

Language modeling is a task that predicts a word in a sequence of text. It has become a very popular NLP task because a pretrained language model can be finetuned for many other downstream tasks. Lately, there has been a lot of interest in large language models (LLMs) which demonstrate zero- or few-shot learning. This means the model can solve tasks it wasn't explicitly trained to do! Language models can be used to generate fluent and convincing text, though you need to be careful since the text may not always be accurate.

There are two types of language modeling:

* causal: the model's objective is to predict the next token in a sequence, and future tokens are masked

    ```py
    >>> from transformers import pipeline

    >>> prompt = "Hugging Face is a"
    >>> text_generator = pipeline(task="text-generation")
    >>> text_generator(prompt)
    ```

* masked: the model's objective is to predict a masked token in a sequence with full access to the tokens in the sequence

    ```py
    >>> from transformers import pipeline

    >>> text = "Hugging Face is a <mask> company based in New York City."
    >>> fill_mask = pipeline(task="fill-mask")
    >>> fill_mask(text, top_k=3)
    ```
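
The difference between the two objectives comes down to which positions each token is allowed to see. The sketch below builds both visibility patterns as plain boolean grids; the helper names are hypothetical, and real models apply this kind of masking inside their attention layers rather than in standalone functions like these.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i may look at position j.
    In causal modeling, a position sees only itself and earlier positions."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def bidirectional_mask(seq_len):
    """In masked language modeling, every position sees the whole sequence."""
    return [[True] * seq_len for _ in range(seq_len)]


for row in causal_mask(4):
    print("".join("x" if visible else "." for visible in row))
# x...
# xx..
# xxx.
# xxxx
```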

Hopefully, this page has given you some more background information about all the types of tasks in each modality and the practical importance of each one. In the next [section](tasks_explained), you'll learn **how** 🤗 Transformers work to solve these tasks.