It turns out that the Python 2 str type doesn’t hold a string; it holds a sequence of bytes. In the old ASCII world, a string was simply a sequence of bytes, so this distinction didn’t matter. But in our new Unicode world, the str type is not appropriate for working with strings. In Python 2, str is not a string! It’s just a sequence of bytes.
Py2 vs Py3
           byte string    text string
Python 2   str (bytes)    unicode
Python 3   bytes          str
Python 2 has unicode(), but Python 3 does not.
Proof (in Python 2, bytes is just an alias for str)
Python 2.7.15 (default, Jun 17 2018, 12:46:58)
>>> bytes
<type 'str'>
>>> str
<type 'str'>
unicode vs bytes
bytes <-- utf-8(or other) --> unicode
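To make the arrow above concrete, here is a minimal sketch of the round trip. The sample text and variable names are mine, chosen only for illustration. It should behave the same on Python 2.7 and Python 3, because the u'' prefix is valid in both (Python 3.3+).

```python
# -*- coding: utf-8 -*-
text = u'a中国'                     # text: unicode in py2, str in py3
data = text.encode('utf-8')         # text -> bytes

assert isinstance(data, bytes)      # bytes is an alias of str in py2
assert len(text) == 3               # three characters
assert len(data) == 7               # 1 + 3 + 3 bytes in UTF-8

round_trip = data.decode('utf-8')   # bytes -> text
assert round_trip == text
```

The character count and the byte count differ, which is exactly why the two types must not be mixed up.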
The io module
New in version 2.6.
The io module provides the Python interfaces to stream handling.
Under Python 2.x, this is proposed as an alternative to the built-in file object, but in Python 3.x it is the default interface to
access files and streams.
Opening files in Python 2 and Python 3
import io
with io.open(filename, newline='', encoding='utf-8') as f:
    content = f.read()
newline controls how universal newlines works (it only applies to
text mode). It can be None, '', '\n', '\r', and '\r\n'.
It works as follows:
- On input, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.
- On output, if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep. If newline is '', no translation takes place. If newline is any of the other legal values, any '\n' characters written are translated to the given string.
Conclusion
So the most programmer-friendly approach is to set newline to '' when writing a file, which disables translation to the system line separator, and to keep the default None when reading a file, which translates every kind of line separator to the standard '\n'.
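A small sketch of that recommendation, assuming a throwaway file in a temporary directory (the path and the sample text are made up for the demo). It writes mixed line endings with newline='' and then reads them back both ways.

```python
import io
import os
import tempfile

# Hypothetical demo file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'newline_demo.txt')

# Write with newline='' so every line ending reaches the disk unchanged.
with io.open(path, 'w', newline='', encoding='utf-8') as f:
    f.write(u'unix\nwindows\r\nold mac\r')

# Read with the default newline=None: all line endings become '\n'.
with io.open(path, encoding='utf-8') as f:
    translated = f.read()

# Read with newline='' to get the line endings back untranslated.
with io.open(path, newline='', encoding='utf-8') as f:
    raw = f.read()

print(repr(translated))
print(repr(raw))
```

Using io.open rather than the built-in open keeps the snippet usable on Python 2.6+ as well as Python 3.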
Files in memory
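In both Python 2 and Python 3, io.StringIO accepts only unicode text, that is, unicode in Python 2 and str in Python 3.
StringIO.StringIO and cStringIO.StringIO exist only in Python 2.
When Python 2 code needs an in-memory file for str (bytes), for example with the csv module, use StringIO.StringIO().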
http://docs.python.org/library/io.html#io.StringIO
http://docs.python.org/library/stringio.html
io.StringIO is a class. It handles Unicode. It reflects the preferred Python 3 library structure.
StringIO.StringIO is a class. It handles strings. It reflects the legacy Python 2 library structure.
python - How can I use io.StringIO() with the csv module? - Stack Overflow
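The following sketch shows what "only accepts unicode" means in practice. The variable names are mine. io.BytesIO is not mentioned above; I include it as the in-memory counterpart for bytes that the io module offers on both Python 2.6+ and Python 3.

```python
# -*- coding: utf-8 -*-
import io

buf = io.StringIO()
buf.write(u'Python中国\n')          # text is accepted
text_value = buf.getvalue()

try:
    buf.write(b'raw bytes')         # bytes (py2 str) are rejected
    rejected = False
except TypeError:
    rejected = True

# For bytes, io.BytesIO plays the same role.
raw = io.BytesIO()
raw.write(u'Python中国\n'.encode('utf-8'))
raw_value = raw.getvalue()
```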
I/O
file
- f.read()
  - py2: str (bytes)
  - py3: str
- f.write()
  - py2
    - str
      - w: success (convert bytes to text string first automatically)
      - wb: success (write directly)
    - unicode ('a中国'.decode('utf8'))
      - w: UnicodeEncodeError: ‘ascii’ codec can’t encode characters
      - wb: UnicodeEncodeError: ‘ascii’ codec can’t encode characters
      - in both modes, Python 2 tries to encode unicode to str
  - py3
    - str
      - w: success
      - wb: TypeError: a bytes-like object is required, not 'str'
    - bytes ('a中国'.encode('utf8'))
      - w: TypeError: write() argument must be str, not bytes
      - wb: success
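The Python 3 half of that list can be checked with a short script. This is a Python 3 only sketch: the temporary path and the labels in the results dict are my own, and I pass encoding='utf8' in text mode so the outcome does not depend on the system locale.

```python
import os
import tempfile

# Hypothetical demo file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'write_demo.txt')
text = 'a中国'
data = text.encode('utf8')

cases = [
    ('w', text, 'w + str'),
    ('wb', text, 'wb + str'),
    ('w', data, 'w + bytes'),
    ('wb', data, 'wb + bytes'),
]

results = {}
for mode, payload, label in cases:
    # Binary mode takes no encoding argument.
    kwargs = {} if 'b' in mode else {'encoding': 'utf8'}
    try:
        with open(path, mode, **kwargs) as f:
            f.write(payload)
        results[label] = 'success'
    except TypeError:
        results[label] = 'TypeError'

for label, outcome in results.items():
    print(label, '->', outcome)
```

Text mode only takes str and binary mode only takes bytes, so Python 3 never guesses an encoding for you the way Python 2 does.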
JSON
- json.loads
  - py2: Deserialize s (a str or unicode instance containing a JSON document) to a Python object using this conversion table.
  - py3: Deserialize s (a str, bytes or bytearray instance containing a JSON document) to a Python object using this conversion table.
- json.dumps
  - py2: Serialize obj to a JSON formatted str using this conversion table. If ensure_ascii is false, the result may contain non-ASCII characters and the return value may be a unicode instance.
  - py3: Serialize obj to a JSON formatted str using this conversion table.
Conclusion
Both Python 2 and Python 3 use str as the standard string type for JSON input and output.
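Here is a minimal Python 3 sketch of the behaviour described above (the sample object is made up). It shows the default ASCII-escaped output, the ensure_ascii=False variant, and json.loads accepting both str and bytes (bytes input needs Python 3.6+).

```python
import json

obj = {'name': 'Python中国'}

escaped = json.dumps(obj)                        # ensure_ascii=True is the default
readable = json.dumps(obj, ensure_ascii=False)   # keep non-ASCII characters as-is

from_str = json.loads(readable)                  # str input
from_bytes = json.loads(readable.encode('utf-8'))  # bytes input (Python 3.6+)

print(escaped)
print(readable)
```

In Python 3 both dumps calls return str; only the escaping differs.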
CSV
Python 2’s stdlib csv module is nice, but it doesn’t support unicode. Use jdunck/python-unicodecsv (on GitHub) instead.
Detail:
Because str in Python 2 is actually bytes, writing unicode to a CSV file requires encoding it to str first, using UTF-8.
def py2_unicode_to_str(u):
    # unicode only exists in Python 2
    assert isinstance(u, unicode)
    return u.encode('utf-8')
Use class csv.DictWriter(csvfile, fieldnames, restval='', extrasaction='raise', dialect='excel', *args, **kwds):
- py2
  - The csvfile: open(fp, 'w')
  - pass keys and values as bytes encoded with utf-8: writer.writerow({py2_unicode_to_str(k): py2_unicode_to_str(v) for k,v in row.items()})
- py3
  - The csvfile: open(fp, 'w')
  - pass a normal dict containing str as row to writer.writerow(row)
Final code
# -*- coding: utf-8 -*-
import csv
import sys

is_py2 = sys.version_info[0] == 2


def py2_unicode_to_str(u):
    # unicode only exists in Python 2
    assert isinstance(u, unicode)
    return u.encode('utf-8')


with open('file.csv', 'w') as f:
    if is_py2:
        # data is a list of rows; each row is a dict
        data = [{u'Python中国': u'Python中国', u'Python中国2': u'Python中国2'}]
        # just one more line to handle this
        data = [{py2_unicode_to_str(k): py2_unicode_to_str(v) for k, v in row.items()} for row in data]
        fields = list(data[0])
        writer = csv.DictWriter(f, fieldnames=fields)
        for row in data:
            writer.writerow(row)
    else:
        data = [{'Python中国': 'Python中国', 'Python中国2': 'Python中国2'}]
        fields = list(data[0])
        writer = csv.DictWriter(f, fieldnames=fields)
        for row in data:
            writer.writerow(row)
See my answer on Stack Overflow: Read and Write CSV files including unicode with Python 2.7.
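For comparison, here is a Python 3 only sketch that needs no encoding helper at all. It writes to an in-memory io.StringIO instead of a file so that it is self-contained. The field names and the row are made up for the demo.

```python
import csv
import io

# Hypothetical sample rows: plain str keys and values.
rows = [{'name': 'Python中国', 'alias': 'Python中国2'}]

# newline='' leaves the '\r\n' written by the csv module untouched.
buf = io.StringIO(newline='')
writer = csv.DictWriter(buf, fieldnames=['name', 'alias'])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# Read the same buffer back into dicts.
buf.seek(0)
read_back = [dict(row) for row in csv.DictReader(buf)]

print(csv_text)
```

When writing to a real file on Python 3, the csv documentation recommends opening it with newline='' for the same reason, and passing an explicit encoding such as 'utf-8' avoids depending on the system locale.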
My library on PyPI
weaming/fuck-python-str (on GitHub) contains several functions for working around the str differences between Python major versions.