Strings, bytes and Unicode conversions
######################################

.. note::

    This section discusses string handling in terms of Python 3 strings. For
    Python 2.7, replace all occurrences of ``str`` with ``unicode`` and
    ``bytes`` with ``str``.  Python 2.7 users may find it best to use ``from
    __future__ import unicode_literals`` to avoid unintentionally using ``str``
    instead of ``unicode``.

Passing Python strings to C++
=============================

When a Python ``str`` is passed from Python to a C++ function that accepts
``std::string`` or ``char *`` as arguments, pybind11 will encode the Python
string to UTF-8. All Python ``str`` can be encoded in UTF-8, so this operation
does not fail.
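
As a rough illustration of what the C++ side receives, the following
pure-Python sketch (it does not involve pybind11) applies the same UTF-8
encoding to a few strings. The variable names are illustrative only. The byte
count, which is what ``std::string::size()`` would report, can exceed the
number of characters seen in Python.

.. code-block:: python

    # Pure-Python illustration; no pybind11 module is needed to run this.
    samples = ["cake", "r\u00e9sum\u00e9", "\U0001F382"]  # last one is the cake emoji

    # Map each string to (code points in Python, bytes after UTF-8 encoding)
    sizes = {text: (len(text), len(text.encode("utf-8"))) for text in samples}

    cake_bytes = "\U0001F382".encode("utf-8")  # four bytes for one code point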

The C++ language is encoding agnostic. It is the responsibility of the
programmer to track encodings. It's often easiest to simply `use UTF-8
everywhere <http://utf8everywhere.org/>`_.

.. code-block:: c++

    m.def("utf8_test",
        [](const std::string &s) {
            cout << "utf-8 is icing on the cake.\n";
            cout << s;
        }
    );
    m.def("utf8_charptr",
        [](const char *s) {
            cout << "My favorite food is\n";
            cout << s;
        }
    );

.. code-block:: python

    >>> utf8_test('🎂')
    utf-8 is icing on the cake.
    🎂

    >>> utf8_charptr('🍕')
    My favorite food is
    🍕

.. note::

    Some terminal emulators do not support UTF-8 or emoji fonts and may not
    display the example above correctly.

The results are the same whether the C++ function accepts arguments by value or
reference, and whether or not ``const`` is used.

Passing bytes to C++
--------------------

A Python ``bytes`` object will be passed to C++ functions that accept
``std::string`` or ``char*`` *without* conversion.
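
Because no conversion takes place, the Python caller decides which encoding
the C++ side sees. A minimal sketch, assuming a hypothetical bound function
``example.legacy_api`` whose ``std::string`` argument is expected to hold
Shift-JIS data (the binding is not shown and the name is made up):

.. code-block:: python

    text = "\u65e5\u672c\u8a9e"  # the word for "Japanese language", three code points

    as_utf8 = text.encode("utf-8")      # what a str argument would be encoded to
    as_sjis = text.encode("shift_jis")  # explicit choice made by the caller

    # example.legacy_api(as_sjis) would hand these bytes to C++ unchanged
    # (hypothetical binding, shown only as a comment).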


Returning C++ strings to Python
===============================

When a C++ function returns a ``std::string`` or ``char*`` to a Python caller,
**pybind11 will assume that the string is valid UTF-8** and will decode it to a
native Python ``str``, using the same API as Python uses to perform
``bytes.decode('utf-8')``. If this implicit conversion fails, pybind11 will
raise a ``UnicodeDecodeError``.

.. code-block:: c++

    m.def("std_string_return",
        []() {
            return std::string("This string needs to be UTF-8 encoded");
        }
    );

.. code-block:: python

    >>> isinstance(example.std_string_return(), str)
    True


Because UTF-8 is inclusive of pure ASCII, there is never any issue with
returning a pure ASCII string to Python. If there is any possibility that the
string is not pure ASCII, it is necessary to ensure the encoding is valid
UTF-8.
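
The decoding step can be mimicked in pure Python to see which byte sequences
are accepted. This is only a sketch of the behaviour described above, not
pybind11's implementation, and ``decodes_as_utf8`` is a helper invented for
illustration.

.. code-block:: python

    def decodes_as_utf8(data: bytes) -> bool:
        """Return True if data survives a strict UTF-8 decode."""
        try:
            data.decode("utf-8")
        except UnicodeDecodeError:
            return False
        return True

    ascii_ok = decodes_as_utf8(b"plain ASCII")           # ASCII is a subset of UTF-8
    utf8_ok = decodes_as_utf8(b"r\xc3\xa9sum\xc3\xa9")   # UTF-8 encoded text
    latin1_ok = decodes_as_utf8(b"r\xe9sum\xe9")         # Latin-1 encoded text

Only the Latin-1 input is rejected, which corresponds to the case in which a
returned ``std::string`` would raise ``UnicodeDecodeError``.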

.. warning::

    Implicit conversion assumes that a returned ``char *`` is null-terminated.
    If there is no null terminator a buffer overrun will occur.

Explicit conversions
--------------------

If some C++ code constructs a ``std::string`` that is not a UTF-8 string, one
can perform an explicit conversion and return a ``py::str`` object. Explicit
conversion has the same overhead as implicit conversion.

.. code-block:: c++

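    // This uses the Python C API to convert Latin-1 to Unicode
    m.def("str_output",
        []() {
            std::string s = "Send your r\xe9sum\xe9 to Alice in HR"; // Latin-1
            // The third argument selects the error handler; nullptr means "strict"
            py::handle py_s = PyUnicode_DecodeLatin1(s.data(), s.length(), nullptr);
            if (!py_s) {
                throw py::error_already_set();
            }
            // Take ownership of the new reference returned by the C API
            return py::reinterpret_steal<py::str>(py_s);
        }
    );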

.. code-block:: python

    >>> str_output()
    'Send your résumé to Alice in HR'

The `Python C API
<https://docs.python.org/3/c-api/unicode.html#built-in-codecs>`_ provides
several built-in codecs.


One could also use a third party encoding library such as libiconv to transcode
to UTF-8.
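
The transcoding step itself can be sketched in Python, which may help when
checking what a C++ transcoder should produce. The byte string below is the
Latin-1 data from the ``str_output`` example; the variable names are
illustrative.

.. code-block:: python

    latin1_data = b"Send your r\xe9sum\xe9 to Alice in HR"

    text = latin1_data.decode("latin-1")  # every byte maps to one code point
    utf8_data = text.encode("utf-8")      # each e-acute becomes two bytes

    growth = len(utf8_data) - len(latin1_data)  # extra bytes after transcoding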

Return C++ strings without conversion
-------------------------------------

If the data in a C++ ``std::string`` does not represent text and should be
returned to Python as ``bytes``, then one can return the data as a
``py::bytes`` object.

.. code-block:: c++

    m.def("return_bytes",
        []() {
            std::string s("\xba\xd0\xba\xd0");  // Not valid UTF-8
            return py::bytes(s);  // Return the data without transcoding
        }
    );

.. code-block:: python

    >>> example.return_bytes()
    b'\xba\xd0\xba\xd0'


Note the asymmetry: pybind11 will convert ``bytes`` to ``std::string`` without
encoding, but cannot convert ``std::string`` back to ``bytes`` implicitly.

.. code-block:: c++

    m.def("asymmetry",
        [](std::string s) {  // Accepts str or bytes from Python
            return s;  // Looks harmless, but implicitly converts to str
        }
    );

.. code-block:: python

    >>> isinstance(example.asymmetry(b"have some bytes"), str)
    True

    >>> example.asymmetry(b"\xba\xd0\xba\xd0")  # invalid utf-8 as bytes
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte
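
The asymmetry can be modelled in pure Python as follows. This is a sketch for
intuition only; ``asymmetry_model`` is a made-up name and does not reflect how
pybind11 is implemented internally.

.. code-block:: python

    from typing import Union

    def asymmetry_model(arg: Union[str, bytes]) -> str:
        # Argument conversion: str is encoded to UTF-8, bytes pass through as-is.
        raw = arg.encode("utf-8") if isinstance(arg, str) else arg
        # Return conversion: the std::string is always decoded as UTF-8.
        return raw.decode("utf-8")

    from_bytes = asymmetry_model(b"have some bytes")  # comes back as str

    try:
        asymmetry_model(b"\xba\xd0\xba\xd0")  # not valid UTF-8
        failed = False
    except UnicodeDecodeError:
        failed = True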


Wide character strings
======================

When a Python ``str`` is passed to a C++ function expecting ``std::wstring``,
``wchar_t*``, ``std::u16string`` or ``std::u32string``, the ``str`` will be
encoded to UTF-16 or UTF-32 depending on how the C++ compiler implements each
type, in the platform's native endianness. When strings of these types are
returned, they are assumed to contain valid UTF-16 or UTF-32, and will be
decoded to Python ``str``.

.. code-block:: c++

    #define UNICODE
    #include <windows.h>

    m.def("set_window_text",
        [](HWND hwnd, std::wstring s) {
            // Call SetWindowText with null-terminated UTF-16 string
            ::SetWindowText(hwnd, s.c_str());
        }
    );
    m.def("get_window_text",
        [](HWND hwnd) {
            const int buffer_size = ::GetWindowTextLength(hwnd) + 1;
            auto buffer = std::make_unique< wchar_t[] >(buffer_size);

            ::GetWindowText(hwnd, buffer.get(), buffer_size);

            std::wstring text(buffer.get());

            // wstring will be converted to Python str
            return text;
        }
    );
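
How many code units the C++ side sees depends on the width of the character
type. The following pure-Python sketch counts them for UTF-16 and UTF-32 in the
platform's byte order; the helper ``code_units`` is hypothetical and serves
only as an illustration.

.. code-block:: python

    import sys

    # Pick the codec suffix matching the platform's native endianness.
    suffix = "le" if sys.byteorder == "little" else "be"

    def code_units(text: str, bits: int) -> int:
        """Number of UTF-16 or UTF-32 code units needed for text."""
        encoded = text.encode(f"utf-{bits}-{suffix}")  # no byte order mark
        return len(encoded) // (bits // 8)

    cake = "\U0001F382"  # cake emoji, outside the Basic Multilingual Plane
    utf16_units = code_units(cake, 16)  # stored as a surrogate pair
    utf32_units = code_units(cake, 32)  # stored as a single unit

Where ``wchar_t`` is 16 bits wide (as is typical on Windows), a
``std::wstring`` would therefore hold two elements for this single Python
character.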

.. warning::

    Wide character strings may not work as described on Python 2.7 or Python
    3.3 compiled with ``--enable-unicode=ucs2``.

Strings in multibyte encodings such as Shift-JIS must be transcoded to
UTF-8/16/32 before being returned to Python.


Character literals
==================

C++ functions that accept character literals as input will receive the first
character of a Python ``str`` as their input. If the string is longer than one
Unicode character, trailing characters will be ignored.

When a character literal is returned from C++ (such as a ``char`` or a
``wchar_t``), it will be converted to a ``str`` that represents the single
character.

.. code-block:: c++

    m.def("pass_char", [](char c) { return c; });
    m.def("pass_wchar", [](wchar_t w) { return w; });

.. code-block:: python

    >>> example.pass_char('A')
    'A'

While C++ will cast integers to character types (``char c = 0x65;``), pybind11
does not convert Python integers to characters implicitly. The Python function
``chr()`` can be used to convert integers to characters.

.. code-block:: python

    >>> example.pass_char(0x65)
    TypeError

    >>> example.pass_char(chr(0x65))
    'e'

If the desire is to work with an 8-bit integer, use ``int8_t`` or ``uint8_t``
as the argument type.
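
On the Python side, ``chr()`` and its inverse ``ord()`` make the conversion
between integers and one-character strings explicit. A short sketch in plain
Python, with no bound functions involved:

.. code-block:: python

    code = 0x65

    char = chr(code)        # integer -> one-character str
    round_trip = ord(char)  # one-character str -> integer

    # Code points below 0x80 encode to a single UTF-8 byte.
    single_byte = len(char.encode("utf-8")) == 1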

Grapheme clusters
-----------------

A single grapheme may be represented by two or more Unicode characters. For
example 'é' is usually represented as U+00E9 but can also be expressed as the
combining character sequence U+0065 U+0301 (that is, the letter 'e' followed by
a combining acute accent). The combining character will be lost if the
two-character sequence is passed as an argument, even though it renders as a
single grapheme.

.. code-block:: python

    >>> example.pass_wchar('é')
    'é'

    >>> combining_e_acute = 'e' + '\u0301'

    >>> combining_e_acute
    'é'

    >>> combining_e_acute == 'é'
    False

    >>> example.pass_wchar(combining_e_acute)
    'e'

Normalizing combining characters before passing the character literal to C++
may resolve *some* of these issues:

.. code-block:: python

    >>> example.pass_wchar(unicodedata.normalize('NFC', combining_e_acute))
    'é'

In some languages (Thai for example), there are `graphemes that cannot be
expressed as a single Unicode code point
<http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>`_, so there is
no way to capture them in a C++ character type.
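
As a sketch of this limitation, NFC normalization leaves the Thai consonant and
vowel sign below as two code points, because Unicode defines no precomposed
form for the pair. The specific characters were chosen for illustration.

.. code-block:: python

    import unicodedata

    # 'e' + combining acute accent composes to the single code point U+00E9.
    e_acute = unicodedata.normalize("NFC", "e\u0301")

    # THAI CHARACTER KO KAI + THAI CHARACTER SARA I: one grapheme, no composed form.
    thai = unicodedata.normalize("NFC", "\u0e01\u0e34")

    lengths = (len(e_acute), len(thai))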


References
==========

* `The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) <https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/>`_
* `C++ - Using STL Strings at Win32 API Boundaries <https://msdn.microsoft.com/en-ca/magazine/mt238407.aspx>`_