String in Python2 and Python3

Published on 2018-08-19 03:42:33

It turns out that the Python 2 str type doesn’t hold a string; it holds a sequence of bytes. In the old ASCII world, a string was simply a sequence of bytes, so this distinction didn’t matter. But in our new Unicode world, the str type is not appropriate for working with strings. In Python 2, str is not a string! It’s just a sequence of bytes.

Py2 vs Py3

          bytes type    text type
Python 2  str (bytes)   unicode
Python 3  bytes         str

Python 2 has a built-in unicode type (and unicode() constructor); Python 3 does not.

Proof

Python 2.7.15 (default, Jun 17 2018, 12:46:58)
>>> bytes
<type 'str'>
>>> str
<type 'str'>

unicode vs bytes

bytes <-- utf-8(or other) --> unicode
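The conversion above can be sketched in a few lines. This is a minimal illustration written for Python 3; the variable names are only for the example. Encoding turns text into bytes, decoding turns bytes back into text, and the byte sequence depends on the codec chosen.

```python
# Text (str in Python 3) and its encoded forms.
text = u'a中国'

# Encoding produces bytes; the result depends on the codec.
data_utf8 = text.encode('utf-8')   # 1 byte for 'a', 3 bytes for each Chinese character
data_gbk = text.encode('gbk')      # 1 byte for 'a', 2 bytes for each Chinese character

# Decoding with the matching codec recovers the original text.
round_trip = data_utf8.decode('utf-8')

print(len(text), len(data_utf8), len(data_gbk))
print(round_trip == text)
```

Decoding with the wrong codec either raises UnicodeDecodeError or silently yields different characters, which is why the encoding has to be known on both sides.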

The io module

New in version 2.6.

The io module provides the Python interfaces to stream handling.

Under Python 2.x, this is proposed as an alternative to the built-in file object, but in Python 3.x it is the default interface to access files and streams.

Open file in python2 and python3

import io

with io.open(filename, newline='', encoding='utf-8') as f:
    content = f.read()

newline controls how universal newlines works (it only applies to text mode). It can be None, '', '\n', '\r', and '\r\n'. It works as follows:

  • On input, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.
  • On output, if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep. If newline is '', no translation takes place. If newline is any of the other legal values, any '\n' characters written are translated to the given string.
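The rules above can be checked with a small experiment. The sketch below assumes a temporary file and a hypothetical helper named read_with; it writes mixed line endings in binary mode, reads them back with two newline settings, and then writes text with newline='\r\n'.

```python
import io
import os
import tempfile


def read_with(path, newline):
    """Read the whole file as text using the given newline setting."""
    with io.open(path, newline=newline, encoding='utf-8') as f:
        return f.read()


fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # Write raw bytes so the three line-ending styles reach the disk unchanged.
    with io.open(path, 'wb') as f:
        f.write(b'a\nb\r\nc\rd')

    translated = read_with(path, None)   # every line ending becomes '\n'
    untranslated = read_with(path, '')   # line endings are returned as stored

    # On output, '\n' is translated to the given newline string.
    with io.open(path, 'w', newline='\r\n', encoding='utf-8') as f:
        f.write(u'x\ny\n')
    with io.open(path, 'rb') as f:
        raw = f.read()
finally:
    os.remove(path)

print(repr(translated))
print(repr(untranslated))
print(repr(raw))
```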

Conclusion

So the most programmer-friendly approach is to set newline to '' when writing a file, which disables translation of '\n' to the system line separator, and to keep the default None when reading a file, which translates every kind of line separator to the standard '\n'.

Files in memory

io.StringIO accepts only unicode text in both Python 2 and Python 3: the unicode type in Python 2 and the str type in Python 3.

StringIO.StringIO and cStringIO.StringIO exist only in Python 2.
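A short sketch of the in-memory classes from the io module, which exist in both major versions. The variable names are illustrative only. io.StringIO holds text and rejects bytes, while io.BytesIO is the counterpart for encoded data.

```python
import io

# io.StringIO stores text.
text_buf = io.StringIO()
text_buf.write(u'Python中国\n')
text_value = text_buf.getvalue()

# Writing bytes to io.StringIO is rejected with a TypeError.
try:
    io.StringIO().write(b'bytes')
    rejected = False
except TypeError:
    rejected = True

# io.BytesIO stores bytes, so text must be encoded first.
byte_buf = io.BytesIO()
byte_buf.write(u'Python中国'.encode('utf-8'))
byte_value = byte_buf.getvalue()

print(repr(text_value), rejected, repr(byte_value))
```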


In Python 2, use StringIO.StringIO() with the csv module, because that module works with byte strings.

http://docs.python.org/library/io.html#io.StringIO

http://docs.python.org/library/stringio.html

io.StringIO is a class. It handles Unicode. It reflects the preferred Python 3 library structure.

StringIO.StringIO is a class. It handles strings. It reflects the legacy Python 2 library structure.

python - How can I use io.StringIO() with the csv module? - Stack Overflow

I/O

file

  • f.read()
    • py2: str (bytes)
    • py3: str
  • f.write()
    • py2
      • str
        • w: success (the bytes are written as-is, without decoding to text)
        • wb: success (write directly)
      • unicode ('a中国'.decode('utf8'))
        • w: UnicodeEncodeError: 'ascii' codec can't encode characters
        • wb: UnicodeEncodeError: 'ascii' codec can't encode characters
        • in both modes, Python 2 tries to encode the unicode to str with the default ASCII codec
    • py3:
      • str:
        • w: success
        • wb: TypeError: a bytes-like object is required, not 'str'
      • bytes ('a中国'.encode('utf8')):
        • w: TypeError: write() argument must be str, not bytes
        • wb: success
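The Python 3 half of the list above can be reproduced with the sketch below. It assumes Python 3; try_write is a hypothetical helper, and text mode is given an explicit UTF-8 encoding so the result does not depend on the locale.

```python
import os
import tempfile


def try_write(path, mode, data):
    """Write data with the given mode; return 'success' or the error name."""
    # Binary mode does not accept an encoding argument, so pass it in text mode only.
    kwargs = {} if 'b' in mode else {'encoding': 'utf-8'}
    try:
        with open(path, mode, **kwargs) as f:
            f.write(data)
        return 'success'
    except TypeError as exc:
        return type(exc).__name__


text = 'a中国'
data = text.encode('utf-8')

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'demo.txt')
try:
    results = {
        ('w', 'str'): try_write(path, 'w', text),
        ('wb', 'str'): try_write(path, 'wb', text),
        ('w', 'bytes'): try_write(path, 'w', data),
        ('wb', 'bytes'): try_write(path, 'wb', data),
    }
finally:
    os.remove(path)
    os.rmdir(tmp_dir)

for key in sorted(results):
    print(key, results[key])
```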

JSON

  • json.loads
    • py2: Deserialize s (a str or unicode instance containing a JSON document) to a Python object using this conversion table.
    • py3: Deserialize s (a str, bytes or bytearray instance containing a JSON document) to a Python object using this conversion table.
  • json.dumps
    • py2: Serialize obj to a JSON formatted str using this conversion table. If ensure_ascii is false, the result may contain non-ASCII characters and the return value may be a unicode instance.
    • py3: Serialize obj to a JSON formatted str using this conversion table.

Conclusion

Both Python 2 and Python 3 treat str as the standard string type for JSON documents.
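A minimal sketch of the Python 3 behaviour described above, assuming Python 3.6 or later (the first release in which json.loads accepts bytes). The sample document is made up for the example.

```python
import json

document = '{"name": "中国"}'

# json.loads accepts str, and also UTF-8 encoded bytes on Python 3.6+.
from_text = json.loads(document)
from_bytes = json.loads(document.encode('utf-8'))

# json.dumps always returns str; by default non-ASCII characters are escaped.
escaped = json.dumps({'name': '中国'})
readable = json.dumps({'name': '中国'}, ensure_ascii=False)

print(from_text == from_bytes)
print(escaped)
print(readable)
```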

CSV

Python2’s stdlib csv module is nice, but it doesn’t support unicode. Use GitHub - jdunck/python-unicodecsv instead.

Detail:

In Python 2, str is actually bytes, so to write unicode to a CSV file you must first encode the unicode to str, for example with UTF-8.

def py2_unicode_to_str(u):
    # the unicode type exists only in Python 2
    assert isinstance(u, unicode)
    return u.encode('utf-8')

Use class csv.DictWriter(csvfile, fieldnames, restval='', extrasaction='raise', dialect='excel', *args, **kwds):

  • py2
    • The csvfile: open(fp, 'w')
    • pass keys and values as bytes encoded with UTF-8
      • writer.writerow({py2_unicode_to_str(k): py2_unicode_to_str(v) for k,v in row.items()})
  • py3
    • The csvfile: open(fp, 'w')
    • pass a normal dict containing str keys and values as the row to writer.writerow(row)
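The Python 3 case can be tried without touching the disk by writing into io.StringIO. This is only a sketch under the assumption of Python 3; the field names and row values are hypothetical.

```python
import csv
import io

# newline='' keeps the buffer from translating the line endings csv emits.
buf = io.StringIO(newline='')

writer = csv.DictWriter(buf, fieldnames=['name', 'city'])
writer.writeheader()
writer.writerow({'name': 'Python中国', 'city': '北京'})

output = buf.getvalue()
print(repr(output))
```

The default 'excel' dialect terminates each row with '\r\n', which is why the csv documentation recommends opening real files with newline='' in Python 3.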

Final code

# -*- coding: utf-8 -*-
import csv
import sys

is_py2 = sys.version_info[0] == 2

def py2_unicode_to_str(u):
    # the unicode type exists only in Python 2
    assert isinstance(u, unicode)
    return u.encode('utf-8')

# each row is a dict that maps a field name to a value
data = [{u'Python中国': u'Python中国', u'Python中国2': u'Python中国2'}]

if is_py2:
    # just one more line to handle this: encode keys and values to UTF-8 bytes
    data = [{py2_unicode_to_str(k): py2_unicode_to_str(v) for k, v in row.items()} for row in data]

with open('file.csv', 'w') as f:
    # field names come from the keys of the first row
    fields = list(data[0])
    writer = csv.DictWriter(f, fieldnames=fields)

    for row in data:
        writer.writerow(row)

See my answer on Stack Overflow: Read and Write CSV files including unicode with Python 2.7 - Stack Overflow

My library on pypi

GitHub - weaming/fuck-python-str contains several functions for working around str differences between major Python versions.
