strings.rst 9.15 KB
Newer Older
1
2
3
4
5
Strings, bytes and Unicode conversions
######################################

.. note::

Jason Rhinelander's avatar
Jason Rhinelander committed
6
7
8
9
10
    This section discusses string handling in terms of Python 3 strings. For
    Python 2.7, replace all occurrences of ``str`` with ``unicode`` and
    ``bytes`` with ``str``.  Python 2.7 users may find it best to use ``from
    __future__ import unicode_literals`` to avoid unintentionally using ``str``
    instead of ``unicode``.
11
12
13
14

Passing Python strings to C++
=============================

Jason Rhinelander's avatar
Jason Rhinelander committed
15
16
17
18
When a Python ``str`` is passed from Python to a C++ function that accepts
``std::string`` or ``char *`` as arguments, pybind11 will encode the Python
string to UTF-8. All Python ``str`` can be encoded in UTF-8, so this operation
does not fail.
19

Jason Rhinelander's avatar
Jason Rhinelander committed
20
21
22
The C++ language is encoding agnostic. It is the responsibility of the
programmer to track encodings. It's often easiest to simply `use UTF-8
everywhere <http://utf8everywhere.org/>`_.
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50

.. code-block:: c++

    m.def("utf8_test",
        [](const std::string &s) {
            cout << "utf-8 is icing on the cake.\n";
            cout << s;
        }
    );
    m.def("utf8_charptr",
        [](const char *s) {
            cout << "My favorite food is\n";
            cout << s;
        }
    );

.. code-block:: python

    >>> utf8_test('🎂')
    utf-8 is icing on the cake.
    🎂

    >>> utf8_charptr('🍕')
    My favorite food is
    🍕

.. note::

Jason Rhinelander's avatar
Jason Rhinelander committed
51
52
    Some terminal emulators do not support UTF-8 or emoji fonts and may not
    display the example above correctly.
53

Jason Rhinelander's avatar
Jason Rhinelander committed
54
55
The results are the same whether the C++ function accepts arguments by value or
reference, and whether or not ``const`` is used.
56
57
58
59

Passing bytes to C++
--------------------

Jason Rhinelander's avatar
Jason Rhinelander committed
60
A Python ``bytes`` object will be passed to C++ functions that accept
61
62
63
``std::string`` or ``char*`` *without* conversion.  On Python 3, in order to
make a function *only* accept ``bytes`` (and not ``str``), declare it as taking
a ``py::bytes`` argument.
64
65
66
67
68


Returning C++ strings to Python
===============================

Jason Rhinelander's avatar
Jason Rhinelander committed
69
70
71
72
73
When a C++ function returns a ``std::string`` or ``char*`` to a Python caller,
**pybind11 will assume that the string is valid UTF-8** and will decode it to a
native Python ``str``, using the same API as Python uses to perform
``bytes.decode('utf-8')``. If this implicit conversion fails, pybind11 will
raise a ``UnicodeDecodeError``.
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88

.. code-block:: c++

    m.def("std_string_return",
        []() {
            return std::string("This string needs to be UTF-8 encoded");
        }
    );

.. code-block:: python

    >>> isinstance(example.std_string_return(), str)
    True


Jason Rhinelander's avatar
Jason Rhinelander committed
89
90
91
92
Because UTF-8 is inclusive of pure ASCII, there is never any issue with
returning a pure ASCII string to Python. If there is any possibility that the
string is not pure ASCII, it is necessary to ensure the encoding is valid
UTF-8.
93
94
95

.. warning::

Jason Rhinelander's avatar
Jason Rhinelander committed
96
97
    Implicit conversion assumes that a returned ``char *`` is null-terminated.
    If there is no null terminator a buffer overrun will occur.
98
99
100
101

Explicit conversions
--------------------

Jason Rhinelander's avatar
Jason Rhinelander committed
102
103
104
If some C++ code constructs a ``std::string`` that is not a UTF-8 string, one
can perform a explicit conversion and return a ``py::str`` object. Explicit
conversion has the same overhead as implicit conversion.
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121

.. code-block:: c++

    // This uses the Python C API to convert Latin-1 to Unicode
    m.def("str_output",
        []() {
            std::string s = "Send your r\xe9sum\xe9 to Alice in HR"; // Latin-1
            py::str py_s = PyUnicode_DecodeLatin1(s.data(), s.length());
            return py_s;
        }
    );

.. code-block:: python

    >>> str_output()
    'Send your résumé to Alice in HR'

Jason Rhinelander's avatar
Jason Rhinelander committed
122
123
124
The `Python C API
<https://docs.python.org/3/c-api/unicode.html#built-in-codecs>`_ provides
several built-in codecs.
125
126


Jason Rhinelander's avatar
Jason Rhinelander committed
127
128
One could also use a third party encoding library such as libiconv to transcode
to UTF-8.
129
130
131
132

Return C++ strings without conversion
-------------------------------------

Jason Rhinelander's avatar
Jason Rhinelander committed
133
134
135
If the data in a C++ ``std::string`` does not represent text and should be
returned to Python as ``bytes``, then one can return the data as a
``py::bytes`` object.
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151

.. code-block:: c++

    m.def("return_bytes",
        []() {
            std::string s("\xba\xd0\xba\xd0");  // Not valid UTF-8
            return py::bytes(s);  // Return the data without transcoding
        }
    );

.. code-block:: python

    >>> example.return_bytes()
    b'\xba\xd0\xba\xd0'


Jason Rhinelander's avatar
Jason Rhinelander committed
152
153
Note the asymmetry: pybind11 will convert ``bytes`` to ``std::string`` without
encoding, but cannot convert ``std::string`` back to ``bytes`` implicitly.
154
155
156
157
158
159
160

.. code-block:: c++

    m.def("asymmetry",
        [](std::string s) {  // Accepts str or bytes from Python
            return s;  // Looks harmless, but implicitly converts to str
        }
Jason Rhinelander's avatar
Jason Rhinelander committed
161
    );
162
163
164
165
166
167
168
169
170
171
172
173
174

.. code-block:: python

    >>> isinstance(example.asymmetry(b"have some bytes"), str)
    True

    >>> example.asymmetry(b"\xba\xd0\xba\xd0")  # invalid utf-8 as bytes
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte


Wide character strings
======================

Jason Rhinelander's avatar
Jason Rhinelander committed
175
176
177
When a Python ``str`` is passed to a C++ function expecting ``std::wstring``,
``wchar_t*``, ``std::u16string`` or ``std::u32string``, the ``str`` will be
encoded to UTF-16 or UTF-32 depending on how the C++ compiler implements each
Jason Rhinelander's avatar
Jason Rhinelander committed
178
179
180
type, in the platform's native endianness. When strings of these types are
returned, they are assumed to contain valid UTF-16 or UTF-32, and will be
decoded to Python ``str``.
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208

.. code-block:: c++

    #define UNICODE
    #include <windows.h>

    m.def("set_window_text",
        [](HWND hwnd, std::wstring s) {
            // Call SetWindowText with null-terminated UTF-16 string
            ::SetWindowText(hwnd, s.c_str());
        }
    );
    m.def("get_window_text",
        [](HWND hwnd) {
            const int buffer_size = ::GetWindowTextLength(hwnd) + 1;
            auto buffer = std::make_unique< wchar_t[] >(buffer_size);

            ::GetWindowText(hwnd, buffer.data(), buffer_size);

            std::wstring text(buffer.get());

            // wstring will be converted to Python str
            return text;
        }
    );

.. warning::

Jason Rhinelander's avatar
Jason Rhinelander committed
209
210
    Wide character strings may not work as described on Python 2.7 or Python
    3.3 compiled with ``--enable-unicode=ucs2``.
211

Jason Rhinelander's avatar
Jason Rhinelander committed
212
213
Strings in multibyte encodings such as Shift-JIS must transcoded to a
UTF-8/16/32 before being returned to Python.
214
215
216
217
218


Character literals
==================

Jason Rhinelander's avatar
Jason Rhinelander committed
219
220
221
C++ functions that accept character literals as input will receive the first
character of a Python ``str`` as their input. If the string is longer than one
Unicode character, trailing characters will be ignored.
222

Jason Rhinelander's avatar
Jason Rhinelander committed
223
224
225
When a character literal is returned from C++ (such as a ``char`` or a
``wchar_t``), it will be converted to a ``str`` that represents the single
character.
226
227
228
229
230
231
232

.. code-block:: c++

    m.def("pass_char", [](char c) { return c; });
    m.def("pass_wchar", [](wchar_t w) { return w; });

.. code-block:: python
Jason Rhinelander's avatar
Jason Rhinelander committed
233

234
235
236
    >>> example.pass_char('A')
    'A'

Jason Rhinelander's avatar
Jason Rhinelander committed
237
238
239
While C++ will cast integers to character types (``char c = 0x65;``), pybind11
does not convert Python integers to characters implicitly. The Python function
``chr()`` can be used to convert integers to characters.
240
241

.. code-block:: python
Jason Rhinelander's avatar
Jason Rhinelander committed
242

243
244
245
246
247
248
    >>> example.pass_char(0x65)
    TypeError

    >>> example.pass_char(chr(0x65))
    'A'

Jason Rhinelander's avatar
Jason Rhinelander committed
249
250
If the desire is to work with an 8-bit integer, use ``int8_t`` or ``uint8_t``
as the argument type.
251
252
253
254

Grapheme clusters
-----------------

Jason Rhinelander's avatar
Jason Rhinelander committed
255
256
257
258
259
260
A single grapheme may be represented by two or more Unicode characters. For
example 'é' is usually represented as U+00E9 but can also be expressed as the
combining character sequence U+0065 U+0301 (that is, the letter 'e' followed by
a combining acute accent). The combining character will be lost if the
two-character sequence is passed as an argument, even though it renders as a
single grapheme.
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277

.. code-block:: python

    >>> example.pass_wchar('é')
    'é'

    >>> combining_e_acute = 'e' + '\u0301'

    >>> combining_e_acute
    'é'

    >>> combining_e_acute == 'é'
    False

    >>> example.pass_wchar(combining_e_acute)
    'e'

Jason Rhinelander's avatar
Jason Rhinelander committed
278
279
Normalizing combining characters before passing the character literal to C++
may resolve *some* of these issues:
280
281
282
283
284
285

.. code-block:: python

    >>> example.pass_wchar(unicodedata.normalize('NFC', combining_e_acute))
    'é'

Jason Rhinelander's avatar
Jason Rhinelander committed
286
287
288
289
In some languages (Thai for example), there are `graphemes that cannot be
expressed as a single Unicode code point
<http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>`_, so there is
no way to capture them in a C++ character type.
290
291


292
293
294
295
296
297
298
299
300
C++17 string views
==================

C++17 string views are automatically supported when compiling in C++17 mode.
They follow the same rules for encoding and decoding as the corresponding STL
string type (for example, a ``std::u16string_view`` argument will be passed
UTF-16-encoded data, and a returned ``std::string_view`` will be decoded as
UTF-8).

301
302
303
References
==========

304
* `The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) <https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/>`_
Jason Rhinelander's avatar
Jason Rhinelander committed
305
* `C++ - Using STL Strings at Win32 API Boundaries <https://msdn.microsoft.com/en-ca/magazine/mt238407.aspx>`_