Python encoding

Declaring the source encoding

# -*- coding:utf-8 -*-

This declares that the source file is encoded in UTF-8.

Without this declaration, Python 2 raises an exception when a non-ASCII character appears in the source file:

SyntaxError: Non-ASCII character '\xe4' in file code_test.py on line 6,
but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
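
As a rough illustration of how the declaration is recognized, the sketch below applies the regular expression given in PEP 263 to the first two lines of a source text. The helper name declared_encoding is hypothetical, and the real interpreter does more than this (byte order mark handling, for example), so treat it as an approximation.

```python
import re

# Pattern from PEP 263: a comment containing "coding:" or "coding=" followed by a name.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

def declared_encoding(source_text):
    """Return the encoding named in the first two lines, or None (hypothetical helper)."""
    for line in source_text.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return None

print(declared_encoding("# -*- coding:utf-8 -*-\nx = 1\n"))
print(declared_encoding("x = 1\n"))
```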

Setting default encoding

import sys
reload(sys)
sys.setdefaultencoding('utf8')

This tells Python to use UTF-8 as the default encoding for implicit conversions between str and unicode.
In Python 2.7 the str type uses ASCII as the default encoding, so without this setting non-ASCII text raises:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u4e07' in position 0:
ordinal not in range(128)
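
An alternative to changing the default is to name the codec explicitly wherever text is encoded. The sketch below, written to run on both Python 2.7 and Python 3, reproduces the ASCII failure and then encodes the same character as UTF-8; the variable names are illustrative.

```python
text = u"\u4e07"  # the character from the error message above

# Encoding with the ASCII codec fails, because the code point is outside range(128).
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Naming the codec explicitly avoids relying on the interpreter's default.
utf8_bytes = text.encode("utf-8")
print(ascii_failed)
print(repr(utf8_bytes))
```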

Exceptions in the database

UnicodeEncodeError:'latin-1' codec can't encode character ...

Declare the encoding for both the connection and the cursor:

# With PyMySQL (import pymysql):
db = pymysql.connect(host="localhost", user="user", password="passwd", database="db_name", use_unicode=True, charset="utf8")
# Or with MySQLdb (import MySQLdb):
db = MySQLdb.connect(host="localhost", user="root", passwd="", db="testdb", use_unicode=True, charset="utf8")
cursor = db.cursor()
cursor.execute('SET NAMES utf8;')
cursor.execute('SET CHARACTER SET utf8;')
cursor.execute('SET character_set_connection=utf8;')
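
The latin-1 error can be reproduced without a database, which may help confirm that the connection charset is the cause. This sketch assumes the failing text contains characters outside Latin-1, such as the CJK character used elsewhere in this post.

```python
text = u"\u4e07"

# Latin-1 only covers code points U+0000 to U+00FF, so this character cannot be encoded.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# UTF-8 can represent this character, so it survives an encode/decode round trip.
utf8_ok = text.encode("utf-8").decode("utf-8") == text
print(latin1_ok)
print(utf8_ok)
```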

Exceptions on the web

If a web page is encoded in GBK or GB2312, displaying it as UTF-8 produces garbled text.

html = unicode(html, "gbk").encode("utf8")

This first decodes the byte stream to unicode using GBK, then encodes the result as UTF-8.
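
The same conversion can be written with the decode method of a byte string, which behaves the same way on Python 2.7 and Python 3. In this sketch the GBK input is fabricated by encoding a sample character; in practice it would be the raw response body.

```python
# Stand-in for a page body received as GBK-encoded bytes (sample data, not a real page).
gbk_html = u"\u4e07".encode("gbk")

# Decode with the page's actual encoding first, then re-encode as UTF-8.
utf8_html = gbk_html.decode("gbk").encode("utf-8")

print(repr(gbk_html))
print(repr(utf8_html))
```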

Encoding conversion

  1. If a string literal in the source has the prefix u, it is a unicode string.
  2. If a string literal has no u prefix, it is a byte string in the source file's encoding.
  3. Use unicode(str, codec) to convert a byte string to unicode.
  4. More commonly, use str.decode(codec) to decode a byte string to unicode, and unicode.encode(codec) to encode unicode back into a byte stream.
# -*- coding:utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf8')

# u'\u4e07'
a = u"万"
print a
print ord(a)
print unichr(ord(a))

print "----------------"

b = a.encode('utf8')
print b
c = a.encode('gbk')
print c

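In a bash terminal on Windows the encoding is UTF-8, so the UTF-8 output displays correctly and the GBK output is garbled.
In the Windows DOS terminal the encoding is GBK, so the GBK output displays correctly and the UTF-8 output is garbled.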
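
The garbling comes from decoding bytes with a codec other than the one that produced them. Below is a small sketch of that mismatch; it uses the "replace" error handler so that the wrong decode yields replacement characters instead of raising an exception.

```python
text = u"\u4e07"
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

# Decoding with the matching codec recovers the original character.
round_trip_ok = utf8_bytes.decode("utf-8") == text and gbk_bytes.decode("gbk") == text

# Decoding with the wrong codec does not (this is what a mismatched terminal shows).
utf8_read_as_gbk = utf8_bytes.decode("gbk", "replace")
gbk_read_as_utf8 = gbk_bytes.decode("utf-8", "replace")
garbled = utf8_read_as_gbk != text and gbk_read_as_utf8 != text

print(round_trip_ok)
print(garbled)
```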

Encoding in JSON

json.dumps # serialize a Python object -> JSON string
json.loads # parse a JSON string -> Python object
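# Python 2 only: tell json.loads which encoding a byte string uses (the parameter was removed in Python 3.9)
json.loads(jStr, encoding="GB2312")
# keep non-ASCII characters as-is instead of escaping them as \uXXXX
json.dumps(js, ensure_ascii=False)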
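
A short sketch of both points follows, written to run on Python 2.7 and Python 3. Because the encoding parameter of json.loads is not available on newer Python 3 releases, the bytes are decoded explicitly before parsing; the sample payload is made up for illustration.

```python
import json

# Sample GB2312-encoded JSON bytes (fabricated for illustration).
raw = u'{"name": "\u4e07"}'.encode("gb2312")

# Decode to text first, then parse; this avoids the encoding parameter entirely.
obj = json.loads(raw.decode("gb2312"))

# By default non-ASCII characters are escaped as \uXXXX sequences.
escaped = json.dumps(obj)
# With ensure_ascii=False they are written out as-is.
readable = json.dumps(obj, ensure_ascii=False)

print(escaped)
```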
