It turns out that the Python 2 str type doesn't hold a string; it holds a sequence of bytes. In the old ASCII world, a string was simply a sequence of bytes, so this distinction didn't matter. But in our new Unicode world, the str type is not appropriate for working with strings. In Python 2, str is not a string! It's just a sequence of bytes.
Py2 vs Py3
            bytes type     text type
Python 2    str (bytes)    unicode
Python 3    bytes          str
Python 2 has unicode(), but Python 3 does not.
Proof
Python 2.7.15 (default, Jun 17 2018, 12:46:58)
>>> bytes
<type 'str'>
>>> str
<type 'str'>
unicode vs bytes
bytes <-- utf-8(or other) --> unicode
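The arrow above is simply encode in one direction and decode in the other. A minimal sketch, assuming UTF-8 as the codec; the variable names are only illustrative:

```python
# -*- coding: utf-8 -*-
text = u'a中国'                     # text: unicode in py2, str in py3
data = text.encode('utf-8')        # text -> bytes
roundtrip = data.decode('utf-8')   # bytes -> text

print(len(text))   # 3 (code points)
print(len(data))   # 7 (bytes: 1 for 'a', 3 for each Chinese character)
```

The same text has a different length depending on whether you count characters or bytes, which is why the two types must not be mixed.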
The io module
New in version 2.6.
The io module provides the Python interfaces to stream handling. Under Python 2.x, this is proposed as an alternative to the built-in file object, but in Python 3.x it is the default interface to access files and streams.
Open a file in Python 2 and Python 3
import io
with io.open(filename, newline='', encoding='utf-8') as f:
    content = f.read()
newline controls how universal newlines works (it only applies to text mode). It can be None, '', '\n', '\r', and '\r\n'.
It works as follows:
- On input, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.
- On output, if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep. If newline is '', no translation takes place. If newline is any of the other legal values, any '\n' characters written are translated to the given string.
Conclusion
So the most programmer-friendly approach is to set newline to '' when writing a file, which disables translation to the system line separator, and to keep the default None when reading a file, which translates every kind of line separator to the standard \n.
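The newline rules above can be checked with a small round trip. This is only a sketch: the temporary file path and the sample text are made up for the demonstration.

```python
import io
import os
import tempfile

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'newline_demo.txt')

# Write with newline='' so nothing is translated on output.
with io.open(path, 'w', newline='', encoding='utf-8') as f:
    f.write(u'one\r\ntwo\rthree\n')

# Read with the default newline=None: every line ending becomes '\n'.
with io.open(path, encoding='utf-8') as f:
    translated = f.read()

# Read with newline='': line endings come back untranslated.
with io.open(path, newline='', encoding='utf-8') as f:
    raw = f.read()

print(repr(translated))
print(repr(raw))
```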
Files in memory
io.StringIO accepts only unicode text in both py2 and py3, that is, unicode in py2 and str in py3.
StringIO.StringIO and cStringIO.StringIO exist only in Python 2.
Please use StringIO.StringIO().
http://docs.python.org/library/io.html#io.StringIO
http://docs.python.org/library/stringio.html
io.StringIO is a class. It handles Unicode. It reflects the preferred Python 3 library structure.
StringIO.StringIO is a class. It handles strings. It reflects the legacy Python 2 library structure.
python - How can I use io.StringIO() with the csv module? - Stack Overflow
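A short sketch of the io.StringIO behaviour described above: it accepts text and rejects bytes. The names buf, value and rejected are illustrative only.

```python
# -*- coding: utf-8 -*-
import io

buf = io.StringIO()
buf.write(u'a中国\n')        # text is accepted (unicode in py2, str in py3)
value = buf.getvalue()

try:
    buf.write(b'bytes')      # bytes (py2 str) are rejected
    rejected = False
except TypeError:
    rejected = True

print(repr(value), rejected)
```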
I/O
file
f.read()
- py2: str (bytes)
- py3: str (in text mode)
f.write()
- py2
  - str
    - w: success (the bytes are written as-is, apart from platform newline translation)
    - wb: success (written directly)
  - unicode ('a中国'.decode('utf8'))
    - w: UnicodeEncodeError: 'ascii' codec can't encode characters
    - wb: UnicodeEncodeError: 'ascii' codec can't encode characters
    - in both modes, Python 2 tries to encode unicode to str
- py3
  - str
    - w: success
    - wb: TypeError: a bytes-like object is required, not 'str'
  - bytes ('a中国'.encode('utf8'))
    - w: TypeError: write() argument must be str, not bytes
    - wb: success
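The Python 3 half of the list above can be reproduced with a few lines. This sketch assumes Python 3 and a throwaway temporary file; it records the TypeError raised for each mismatched mode.

```python
import os
import tempfile

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'write_demo.txt')
text = 'a中国'
data = text.encode('utf8')

# Matching pairs succeed: str with 'w', bytes with 'wb'.
with open(path, 'w', encoding='utf8') as f:
    f.write(text)
with open(path, 'wb') as f:
    f.write(data)

# Mismatched pairs raise TypeError.
errors = {}
for mode, payload in [('w', data), ('wb', text)]:
    try:
        with open(path, mode) as f:
            f.write(payload)
    except TypeError as exc:
        errors[mode] = str(exc)

print(errors)
```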
JSON
json.loads
- py2: Deserialize s (a str or unicode instance containing a JSON document) to a Python object using this conversion table.
- py3: Deserialize s (a str, bytes or bytearray instance containing a JSON document) to a Python object using this conversion table.
json.dumps
- py2: Serialize obj to a JSON formatted str using this conversion table. If ensure_ascii is false, the result may contain non-ASCII characters and the return value may be a unicode instance.
- py3: Serialize obj to a JSON formatted str using this conversion table.
Conclusion
In both Python 2 and Python 3, the json module treats str as its main string type.
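A small sketch of the Python 3 behaviour quoted above, assuming Python 3.6 or later (the version in which json.loads started accepting bytes). The sample object is made up.

```python
import json

obj = {'name': 'a中国'}

escaped = json.dumps(obj)                        # ensure_ascii=True by default
readable = json.dumps(obj, ensure_ascii=False)   # keeps non-ASCII characters

from_text = json.loads(readable)                    # str input
from_bytes = json.loads(readable.encode('utf-8'))   # bytes input

print(escaped)
print(readable)
```

Both calls to json.dumps return str; only the escaping of non-ASCII characters differs.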
CSV
Python 2's stdlib csv module is nice, but it doesn't support unicode. Use GitHub - jdunck/python-unicodecsv instead.
Detail:
In Python 2, str is actually bytes. So if you want to write unicode to a csv file, you must first encode the unicode to str using the utf-8 encoding.
def py2_unicode_to_str(u):
    # unicode only exists in Python 2
    assert isinstance(u, unicode)
    return u.encode('utf-8')
Use class csv.DictWriter(csvfile, fieldnames, restval='', extrasaction='raise', dialect='excel', *args, **kwds):
- py2
  - The csvfile: open(fp, 'w')
  - pass keys and values as bytes encoded with utf-8:
    writer.writerow({py2_unicode_to_str(k): py2_unicode_to_str(v) for k,v in row.items()})
- py3
  - The csvfile: open(fp, 'w')
  - pass a normal dict containing str as row to writer.writerow(row)
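The py3 case can be tried without touching the disk. This sketch assumes Python 3.7+ (for ordered dicts) and uses an in-memory io.StringIO as a stand-in for the opened file; the row contents are invented for the demonstration.

```python
import csv
import io

rows = [{'name': 'Python中国', 'note': 'Python中国2'}]

# In-memory stand-in for the csvfile; newline='' leaves the
# csv module's own '\r\n' line terminator untouched.
buf = io.StringIO(newline='')
writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
writer.writeheader()
for row in rows:
    writer.writerow(row)

output = buf.getvalue()
print(repr(output))
```

No encoding step is needed here because the csv module in Python 3 works with str directly.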
Final code
# -*- coding: utf-8 -*-
import csv
import sys

is_py2 = sys.version_info[0] == 2

def py2_unicode_to_str(u):
    # unicode only exists in Python 2
    assert isinstance(u, unicode)
    return u.encode('utf-8')

with open('file.csv', 'w') as f:
    if is_py2:
        # data is a list of rows; each row is a dict
        data = [{u'Python中国': u'Python中国', u'Python中国2': u'Python中国2'}]
        # just one more line to handle this
        data = [{py2_unicode_to_str(k): py2_unicode_to_str(v) for k, v in row.items()} for row in data]
        fields = list(data[0])
        writer = csv.DictWriter(f, fieldnames=fields)
        for row in data:
            writer.writerow(row)
    else:
        data = [{'Python中国': 'Python中国', 'Python中国2': 'Python中国2'}]
        fields = list(data[0])
        writer = csv.DictWriter(f, fieldnames=fields)
        for row in data:
            writer.writerow(row)
See my answer on Stack Overflow: Read and Write CSV files including unicode with Python 2.7 - Stack Overflow
My library on PyPI
GitHub - weaming/fuck-python-str contains several functions for working around str differences between Python major versions.