.. 
    Copyright 2020 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

LayoutLM
-----------------------------------------------------------------------------------------------------------------------

.. _Overview:

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The LayoutLM model was proposed in the paper `LayoutLM: Pre-training of Text and Layout for Document Image
Understanding <https://arxiv.org/abs/1912.13318>`__ by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and
Ming Zhou. It's a simple but effective pretraining method of text and layout for document image understanding and
information extraction tasks, such as form understanding and receipt understanding. It obtains state-of-the-art results
on several downstream tasks:

- form understanding: the `FUNSD <https://guillaumejaume.github.io/FUNSD/>`__ dataset (a collection of 199 annotated
  forms comprising more than 30,000 words).
- receipt understanding: the `SROIE <https://rrc.cvc.uab.es/?ch=13>`__ dataset (a collection of 626 receipts for
  training and 347 receipts for testing).
- document image classification: the `RVL-CDIP <https://www.cs.cmu.edu/~aharley/rvl-cdip/>`__ dataset (a collection of
  400,000 images belonging to one of 16 classes).

The abstract from the paper is the following:

*Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the
widespread use of pretraining models for NLP applications, they almost exclusively focus on text-level manipulation,
while neglecting layout and style information that is vital for document image understanding. In this paper, we propose
the LayoutLM to jointly model interactions between text and layout information across scanned document images, which is
beneficial for a great number of real-world document image understanding tasks such as information extraction from
scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into LayoutLM.
To the best of our knowledge, this is the first time that text and layout are jointly learned in a single framework for
document-level pretraining. It achieves new state-of-the-art results in several downstream tasks, including form
understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image classification
(from 93.07 to 94.42).*

Tips:

- In addition to :obj:`input_ids`, :meth:`~transformers.LayoutLMModel.forward` also expects the input :obj:`bbox`,
  which are the bounding boxes (i.e. 2D-positions) of the input tokens. These can be obtained using an external OCR
  engine such as Google's `Tesseract <https://github.com/tesseract-ocr/tesseract>`__ (there's a `Python wrapper
  <https://pypi.org/project/pytesseract/>`__ available). Each bounding box should be in (x0, y0, x1, y1) format, where
  (x0, y0) corresponds to the position of the upper left corner of the bounding box, and (x1, y1) represents the
  position of the lower right corner. Note that the bounding boxes first need to be normalized to be on a 0-1000
  scale. To normalize, you can use the following function:

.. code-block:: python

   def normalize_bbox(bbox, width, height):
       # Scale an (x0, y0, x1, y1) box from pixel coordinates to the 0-1000 range.
       return [
           int(1000 * (bbox[0] / width)),
           int(1000 * (bbox[1] / height)),
           int(1000 * (bbox[2] / width)),
           int(1000 * (bbox[3] / height)),
       ]

Here, :obj:`width` and :obj:`height` correspond to the width and height of the original document in which the token
occurs. Those can be obtained using the Python Imaging Library (PIL), for example, as follows:

.. code-block:: python

   from PIL import Image

   # The document image; PIL opens image formats such as png or jpg
   # (pdf pages first need to be converted to images).
   image = Image.open("name_of_your_document.png")

   width, height = image.size
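
Putting these pieces together, the following is a minimal sketch of a full forward pass, assuming the
:obj:`microsoft/layoutlm-base-uncased` checkpoint and illustrative OCR output for two words (the boxes below are
made-up values, already normalized to the 0-1000 scale). Since the tokenizer may split a word into several subword
tokens, each word's box has to be repeated for each of its subtokens:

.. code-block:: python

   import torch
   from transformers import LayoutLMTokenizer, LayoutLMModel

   tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
   model = LayoutLMModel.from_pretrained("microsoft/layoutlm-base-uncased")

   # Illustrative OCR output: two words with boxes already normalized to 0-1000.
   words = ["Hello", "world"]
   normalized_boxes = [[637, 773, 693, 782], [698, 773, 733, 782]]

   # Repeat each word's box for every subword token it is split into.
   token_boxes = []
   for word, box in zip(words, normalized_boxes):
       token_boxes.extend([box] * len(tokenizer.tokenize(word)))
   # Boxes for the special [CLS] and [SEP] tokens.
   token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

   encoding = tokenizer(" ".join(words), return_tensors="pt")
   bbox = torch.tensor([token_boxes])

   outputs = model(
       input_ids=encoding["input_ids"],
       attention_mask=encoding["attention_mask"],
       token_type_ids=encoding["token_type_ids"],
       bbox=bbox,
   )
   last_hidden_state = outputs.last_hidden_state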

- For a demo which shows how to fine-tune :class:`LayoutLMForTokenClassification` on the `FUNSD dataset
  <https://guillaumejaume.github.io/FUNSD/>`__ (a collection of annotated forms), see `this notebook
  <https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb>`__.
  It includes an inference part, which shows how to use Google's Tesseract on a new document.
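
Complementing that notebook, here is a rough sketch of setting up :class:`LayoutLMForTokenClassification` for
fine-tuning. The value of :obj:`num_labels` is dataset-specific and purely illustrative here, and the snippet reuses
the :obj:`encoding` and :obj:`bbox` built in the sketch above:

.. code-block:: python

   import torch
   from transformers import LayoutLMForTokenClassification

   # num_labels depends on your dataset's label scheme; 13 is an illustrative value.
   model = LayoutLMForTokenClassification.from_pretrained(
       "microsoft/layoutlm-base-uncased", num_labels=13
   )

   # labels holds one class index per token; dummy zeros here for illustration.
   labels = torch.zeros_like(encoding["input_ids"])

   outputs = model(
       input_ids=encoding["input_ids"],
       attention_mask=encoding["attention_mask"],
       bbox=bbox,
       labels=labels,
   )
   loss, logits = outputs.loss, outputs.logits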

The original code can be found `here <https://github.com/microsoft/unilm/tree/master/layoutlm>`_.


LayoutLMConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LayoutLMConfig
    :members:


LayoutLMTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LayoutLMTokenizer
    :members:


LayoutLMTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LayoutLMTokenizerFast
    :members:


LayoutLMModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LayoutLMModel
    :members:


LayoutLMForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LayoutLMForMaskedLM
    :members:


LayoutLMForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LayoutLMForSequenceClassification
    :members:


LayoutLMForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LayoutLMForTokenClassification
    :members: