Strings, bytes and Unicode conversions
######################################

Passing Python strings to C++
=============================

When a Python ``str`` is passed from Python to a C++ function that accepts
``std::string`` or ``char *`` as arguments, pybind11 will encode the Python
string to UTF-8. All Python ``str`` can be encoded in UTF-8, so this operation
does not fail.

The C++ language is encoding agnostic. It is the responsibility of the
programmer to track encodings. It's often easiest to simply `use UTF-8
everywhere <http://utf8everywhere.org/>`_.

.. code-block:: c++

    m.def("utf8_test",
        [](const std::string &s) {
            std::cout << "utf-8 is icing on the cake.\n";
            std::cout << s;  // s holds the UTF-8 encoding of the Python str
        }
    );
    m.def("utf8_charptr",
        [](const char *s) {
            std::cout << "My favorite food is\n";
            std::cout << s;  // null-terminated UTF-8 data
        }
    );

.. code-block:: pycon

    >>> utf8_test("🎂")
    utf-8 is icing on the cake.
    🎂

    >>> utf8_charptr("🍕")
    My favorite food is
    🍕

.. note::

    Some terminal emulators do not support UTF-8 or emoji fonts and may not
    display the example above correctly.

The results are the same whether the C++ function accepts arguments by value or
reference, and whether or not ``const`` is used.
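
To see what the C++ side receives, the encoding step can be reproduced in plain
Python. This is only a sketch of the equivalent ``str.encode`` call, not
pybind11's own code, and the variable names are illustrative.

.. code-block:: python

    # Reproduce in pure Python the UTF-8 encoding applied to a str argument.
    cake = "🎂"  # one code point, U+1F382

    encoded = cake.encode("utf-8")  # the bytes a std::string would hold
    print(len(cake))     # number of code points: 1
    print(len(encoded))  # number of UTF-8 bytes: 4
    print(encoded)       # b'\xf0\x9f\x8e\x82'

    # Decoding the same bytes restores the original str.
    assert encoded.decode("utf-8") == cake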

Passing bytes to C++
--------------------

A Python ``bytes`` object will be passed to C++ functions that accept
``std::string`` or ``char*`` *without* conversion.  In order to make a function
*only* accept ``bytes`` (and not ``str``), declare it as taking a ``py::bytes``
argument.


Returning C++ strings to Python
===============================

When a C++ function returns a ``std::string`` or ``char*`` to a Python caller,
**pybind11 will assume that the string is valid UTF-8** and will decode it to a
native Python ``str``, using the same API as Python uses to perform
``bytes.decode('utf-8')``. If this implicit conversion fails, pybind11 will
raise a ``UnicodeDecodeError``.

.. code-block:: c++

    m.def("std_string_return",
        []() {
            return std::string("This string needs to be UTF-8 encoded");
        }
    );

.. code-block:: pycon

    >>> isinstance(example.std_string_return(), str)
    True


Because UTF-8 is inclusive of pure ASCII, there is never any issue with
returning a pure ASCII string to Python. If there is any possibility that the
string is not pure ASCII, it is necessary to ensure the encoding is valid
UTF-8.
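
The decoding step can be approximated in plain Python with
``bytes.decode('utf-8')``, the operation named above as equivalent. The sketch
below uses a hypothetical helper, ``is_valid_utf8``, to show that ASCII data
always decodes, while Latin-1 data containing the byte ``0xE9`` does not.

.. code-block:: python

    def is_valid_utf8(data: bytes) -> bool:
        """Return True if data decodes as UTF-8 (hypothetical helper)."""
        try:
            data.decode("utf-8")
        except UnicodeDecodeError:
            return False
        return True


    ascii_data = b"plain ASCII text"
    latin1_data = "résumé".encode("latin-1")  # b'r\xe9sum\xe9'

    print(is_valid_utf8(ascii_data))   # True: ASCII is a subset of UTF-8
    print(is_valid_utf8(latin1_data))  # False: 0xE9 is not followed by continuation bytes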

.. warning::

    Implicit conversion assumes that a returned ``char *`` is null-terminated.
    If there is no null terminator a buffer overrun will occur.

Explicit conversions
--------------------

If some C++ code constructs a ``std::string`` that is not a UTF-8 string, one
can perform an explicit conversion and return a ``py::str`` object. Explicit
conversion has the same overhead as implicit conversion.

.. code-block:: c++

    // This uses the Python C API to convert Latin-1 to Unicode
    m.def("str_output",
        []() {
            std::string s = "Send your r\xe9sum\xe9 to Alice in HR"; // Latin-1
            py::str py_s = PyUnicode_DecodeLatin1(s.data(), s.length());
            return py_s;
        }
    );

.. code-block:: pycon

    >>> str_output()
    'Send your résumé to Alice in HR'

The `Python C API
<https://docs.python.org/3/c-api/unicode.html#built-in-codecs>`_ provides
several built-in codecs.


One could also use a third party encoding library such as libiconv to transcode
to UTF-8.
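
Whatever tool performs it, transcoding amounts to decoding from the source
encoding and re-encoding as UTF-8. The sketch below shows that round trip with
Python's built-in codecs rather than libiconv, reusing the Latin-1 bytes from
the ``str_output`` example.

.. code-block:: python

    # Latin-1 bytes, as a C++ std::string might hold them.
    latin1_bytes = b"Send your r\xe9sum\xe9 to Alice in HR"

    text = latin1_bytes.decode("latin-1")  # bytes -> str, like PyUnicode_DecodeLatin1
    utf8_bytes = text.encode("utf-8")      # str -> UTF-8 bytes

    print(text)                                # Send your résumé to Alice in HR
    print(len(latin1_bytes), len(utf8_bytes))  # 31 33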

Return C++ strings without conversion
-------------------------------------

If the data in a C++ ``std::string`` does not represent text and should be
returned to Python as ``bytes``, then one can return the data as a
``py::bytes`` object.

.. code-block:: c++

    m.def("return_bytes",
        []() {
            std::string s("\xba\xd0\xba\xd0");  // Not valid UTF-8
            return py::bytes(s);  // Return the data without transcoding
        }
    );

.. code-block:: pycon

    >>> example.return_bytes()
    b'\xba\xd0\xba\xd0'


Note the asymmetry: pybind11 will convert ``bytes`` to ``std::string`` without
encoding, but cannot convert ``std::string`` back to ``bytes`` implicitly.

.. code-block:: c++

    m.def("asymmetry",
        [](std::string s) {  // Accepts str or bytes from Python
            return s;  // Looks harmless, but implicitly converts to str
        }
    );

.. code-block:: pycon

    >>> isinstance(example.asymmetry(b"have some bytes"), str)
    True

    >>> example.asymmetry(b"\xba\xd0\xba\xd0")  # invalid utf-8 as bytes
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte


Wide character strings
======================

When a Python ``str`` is passed to a C++ function expecting ``std::wstring``,
``wchar_t*``, ``std::u16string`` or ``std::u32string``, the ``str`` will be
encoded to UTF-16 or UTF-32 depending on how the C++ compiler implements each
type, in the platform's native endianness. When strings of these types are
returned, they are assumed to contain valid UTF-16 or UTF-32, and will be
decoded to Python ``str``.

.. code-block:: c++

    #define UNICODE
    #include <windows.h>

    m.def("set_window_text",
        [](HWND hwnd, std::wstring s) {
            // Call SetWindowText with null-terminated UTF-16 string
            ::SetWindowText(hwnd, s.c_str());
        }
    );
    m.def("get_window_text",
        [](HWND hwnd) {
            const int buffer_size = ::GetWindowTextLength(hwnd) + 1;
            auto buffer = std::make_unique< wchar_t[] >(buffer_size);

            ::GetWindowText(hwnd, buffer.get(), buffer_size);

            std::wstring text(buffer.get());

            // wstring will be converted to Python str
            return text;
        }
    );
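
The wide encodings described above can be inspected from plain Python. As a
rough illustration (not pybind11 code), the same text occupies a different
number of code units in UTF-16 and UTF-32, and a character outside the Basic
Multilingual Plane needs a surrogate pair in UTF-16.

.. code-block:: python

    import sys

    text = "é🎂"  # U+00E9 (inside the BMP) and U+1F382 (outside the BMP)

    # Pick the codec matching this machine's native byte order.
    suffix = "le" if sys.byteorder == "little" else "be"
    utf16 = text.encode("utf-16-" + suffix)
    utf32 = text.encode("utf-32-" + suffix)

    print(len(utf16) // 2)  # 3 UTF-16 code units: one for é, two for 🎂
    print(len(utf32) // 4)  # 2 UTF-32 code units: one per code point

    # Both encodings decode back to the same str.
    assert utf16.decode("utf-16-" + suffix) == text
    assert utf32.decode("utf-32-" + suffix) == text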

Strings in multibyte encodings such as Shift-JIS must be transcoded to UTF-8,
UTF-16 or UTF-32 before being returned to Python.
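
A minimal sketch of such a transcoding step, written with Python's built-in
``shift_jis`` codec for illustration (a C++ program would use its own
transcoding library), also shows why the raw bytes cannot be returned as they
are.

.. code-block:: python

    # Shift-JIS bytes, as a legacy C++ library might return them.
    sjis_bytes = "日本語".encode("shift_jis")

    text = sjis_bytes.decode("shift_jis")  # decode from the source encoding
    utf8_bytes = text.encode("utf-8")      # re-encode as UTF-8

    print(len(sjis_bytes))  # 6: two bytes per character in Shift-JIS
    print(len(utf8_bytes))  # 9: three bytes per character in UTF-8

    # The raw Shift-JIS bytes are not valid UTF-8.
    try:
        sjis_bytes.decode("utf-8")
    except UnicodeDecodeError:
        print("not valid UTF-8")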


Character literals
==================

C++ functions that accept character literals as input will receive the first
character of a Python ``str`` as their input. If the string is longer than one
Unicode character, trailing characters will be ignored.

When a character literal is returned from C++ (such as a ``char`` or a
``wchar_t``), it will be converted to a ``str`` that represents the single
character.

.. code-block:: c++

    m.def("pass_char", [](char c) { return c; });
    m.def("pass_wchar", [](wchar_t w) { return w; });

.. code-block:: pycon

    >>> example.pass_char("A")
    'A'

While C++ will cast integers to character types (``char c = 0x65;``), pybind11
does not convert Python integers to characters implicitly. The Python function
``chr()`` can be used to convert integers to characters.

.. code-block:: pycon

    >>> example.pass_char(0x65)
    TypeError

    >>> example.pass_char(chr(0x65))
    'e'

If the desire is to work with an 8-bit integer, use ``int8_t`` or ``uint8_t``
as the argument type.

Grapheme clusters
-----------------

A single grapheme may be represented by two or more Unicode characters. For
example 'é' is usually represented as U+00E9 but can also be expressed as the
combining character sequence U+0065 U+0301 (that is, the letter 'e' followed by
a combining acute accent). The combining character will be lost if the
two-character sequence is passed as an argument, even though it renders as a
single grapheme.

.. code-block:: pycon

    >>> example.pass_wchar("é")
    'é'

    >>> combining_e_acute = "e" + "\u0301"

    >>> combining_e_acute
    'é'

    >>> combining_e_acute == "é"
    False

    >>> example.pass_wchar(combining_e_acute)
    'e'

Normalizing combining characters before passing the character literal to C++
may resolve *some* of these issues:

.. code-block:: pycon

    >>> example.pass_wchar(unicodedata.normalize("NFC", combining_e_acute))
    'é'

In some languages (Thai for example), there are `graphemes that cannot be
expressed as a single Unicode code point
<http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>`_, so there is
no way to capture them in a C++ character type.
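
The difference between the two cases can be checked from Python with
``unicodedata.normalize``. The Thai sequence below is an illustrative choice,
not taken from the linked report: it pairs a consonant with a vowel sign for
which no precomposed code point exists.

.. code-block:: python

    import unicodedata

    combining_e_acute = "e" + "\u0301"  # 'e' followed by a combining acute accent
    thai_cluster = "\u0e01\u0e34"       # THAI CHARACTER KO KAI + THAI CHARACTER SARA I

    # NFC composes e + U+0301 into the single code point U+00E9.
    print(len(unicodedata.normalize("NFC", combining_e_acute)))  # 1

    # No precomposed form exists for the Thai cluster, so it stays two code points.
    print(len(unicodedata.normalize("NFC", thai_cluster)))  # 2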


C++17 string views
==================

C++17 string views are automatically supported when compiling in C++17 mode.
They follow the same rules for encoding and decoding as the corresponding STL
string type (for example, a ``std::u16string_view`` argument will be passed
UTF-16-encoded data, and a returned ``std::string_view`` will be decoded as
UTF-8).

References
==========

* `The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) <https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/>`_
* `C++ - Using STL Strings at Win32 API Boundaries <https://msdn.microsoft.com/en-ca/magazine/mt238407.aspx>`_