Python Encoding Conversion and Chinese Text Handling
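Unicode handling in Python (Python 2 in the examples here) is confusing and hard to get right. UTF-8 is one encoding of Unicode, while Unicode, GBK and GB2312 are coded character sets.
decode parses an ordinary byte string (str) according to the encoding passed as its argument and produces the corresponding unicode object; encode goes the other way, from a unicode object to a byte string.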
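As a minimal sketch of that round trip, the snippet below decodes some UTF-8 bytes into text and encodes the text again as GBK and UTF-8. The sample string is my own choice, not taken from the scripts in this post, and the code sticks to syntax that should work in both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample: the UTF-8 bytes of a two-character Chinese word.
raw_utf8 = u'天气'.encode('utf-8')

# decode: bytes -> text, using the encoding the bytes were written in.
text = raw_utf8.decode('utf-8')

# encode: text -> bytes, here in two different encodings.
as_gbk = text.encode('gbk')
as_utf8 = text.encode('utf-8')

print(len(text))     # 2 (characters)
print(len(as_utf8))  # 6 (bytes, 3 per character)
print(len(as_gbk))   # 4 (bytes, 2 per character)
```

The same text therefore has different byte representations, which is why the encoding always has to be named when crossing between bytes and text.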
A Chinese encoding problem I ran into while writing Python:
➜ /test sudo vim test.py
#!/usr/bin/python
# -*- coding: utf-8 -*-
def weather():
    import time
    import re
    import urllib2
    import itchat
    headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36")
    url = "https://tianqi.moji.com/weather/china/guangdong/shantou"
    par = '(<meta name="description" content=")(.*?)(">)'
    opener = urllib2.build_opener()
    opener.addheaders = [headers]
    urllib2.install_opener(opener)
    # read() returns a byte string; decode() turns it into a unicode object
    html = urllib2.urlopen(url).read().decode("utf-8")
    data = re.search(par, html).group(2)
    print type(data)
    data.encode('gb2312')  # the result is discarded, so data stays unicode
    b = '天气预报'  # a UTF-8 encoded str
    print type(b)
    c = b + '\n' + data  # str + unicode: b is implicitly decoded as ASCII
    print c
weather()
➜ /test sudo python test.py
<type 'unicode'>
<type 'str'>
Traceback (most recent call last):
  File "test.py", line 30, in <module>
    weather()
  File "test.py", line 28, in weather
    c = b + '\n' + data
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
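The traceback comes from mixing the two types: in Python 2, adding a str to a unicode object makes the interpreter decode the str with the default encoding, which is ASCII. The sketch below reproduces that decoding step explicitly and then shows one possible alternative, decoding with the encoding the bytes are really in. The values of `b` and `data` are hypothetical stand-ins for the ones in the script, and the code is written to run on Python 2.7 or Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-ins for the two values in the script.
b = u'天气预报'.encode('utf-8')  # bytes, like the str literal in a UTF-8 source file
data = u'多云'                   # text, like the decoded page description

# Decoding the bytes as ASCII fails on the very first byte, 0xe5.
try:
    b.decode('ascii')
except UnicodeDecodeError as exc:
    print(exc)

# Decoding with the real encoding works, and the concatenation is text + text.
c = b.decode('utf-8') + u'\n' + data
print(len(c))  # 7
```

Decoding explicitly keeps the fix local to the line that needs it, rather than changing behaviour for the whole process.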
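Solution: set the default encoding to UTF-8 at the top of the script, so that the implicit str-to-unicode conversion no longer uses ASCII: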
➜ /test sudo vim test.py
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
reload(sys)  # restores sys.setdefaultencoding, which is removed at startup
sys.setdefaultencoding('utf-8')
def weather():
    import time
    import re
    import urllib2
    import itchat
    headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36")
    url = "https://tianqi.moji.com/weather/china/guangdong/shantou"
    par = '(<meta name="description" content=")(.*?)(">)'
    opener = urllib2.build_opener()
    opener.addheaders = [headers]
    urllib2.install_opener(opener)
    # read() returns a byte string; decode() turns it into a unicode object
    html = urllib2.urlopen(url).read().decode("utf-8")
    data = re.search(par, html).group(2)
    print type(data)
    data.encode('gb2312')  # the result is discarded, so data stays unicode
    b = '天气预报'  # a UTF-8 encoded str
    print type(b)
    c = b + '\n' + data  # b is now implicitly decoded as UTF-8
    print c
weather()
After the change:
➜ /test sudo python test.py
<type 'unicode'>
<type 'str'>
天气预报
汕头市今天实况:20度 多云,湿度:57%,东风:2级。白天:20度,多云。 夜间:晴,13度,天气偏凉了,墨迹天气建议您穿上厚些的外套或是保暖的羊毛衫,年老体弱者可以选择保暖的摇粒绒外套。
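The extraction step of the script can also be tried offline. The sketch below runs the same regular expression against a hypothetical one-line page (`page_bytes` is made up for illustration, not fetched from the real site) and keeps every value as text, so no implicit decoding is involved. It assumes Python 2.7 or Python 3.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical page bytes, standing in for urllib2.urlopen(url).read().
page_bytes = u'<meta name="description" content="汕头市今天实况:20度 多云">'.encode('utf-8')

par = u'(<meta name="description" content=")(.*?)(">)'
html = page_bytes.decode('utf-8')

match = re.search(par, html)
# Fall back to an empty string instead of failing when nothing matches.
data = match.group(2) if match else u''

title = u'天气预报'  # a text literal, so the concatenation is text + text
c = title + u'\n' + data
print(len(data) > 0)  # True
```

Keeping the title as a text literal is one way to avoid depending on the process-wide default encoding at all.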
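In my experience, the "universal" fixes for garbled Chinese text that circulate online are wrong, because the right fix depends on the types involved. I ran into exactly this problem recently, tried many suggestions from the web without success, and only after reading the documentation myself did I realise that the cure has to match the cause.
Here is a Python script that fetches the source of a web page: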
➜ /test sudo cat file.py
#!/usr/bin/python
import urllib, urllib2
import re
url = 'http://sports.sohu.com/nba.shtml'
par = '20180125.*\">(.*?)</a></li>'
req = urllib2.Request(url)
response = urllib2.urlopen(req).read()
print type(response)
print response
Problem:
When fetching a Chinese web page, the Chinese text comes out garbled when printed:
➜ /test sudo python file.py
special.wait({
itemspaceid : 99999,
form: "bigView" ,
adsrc : 200,
order : 1,
max_turn : 1,
spec :{
onBeforeRender: function (){
},
onAfterRender: function (){
},
isCloseBtn: true // �Ƿ��йرհ�ť
}
});
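The garbled comment in that output is what GBK bytes look like when they are interpreted as UTF-8, presumably by a terminal that expects UTF-8. A small sketch of the effect, using the comment text as sample data and assuming Python 2.7 or Python 3:

```python
# -*- coding: utf-8 -*-
comment = u'是否有关闭按钮'

# The bytes as a GBK-encoded page would deliver them.
gbk_bytes = comment.encode('gbk')

# Read as UTF-8, the bytes are partly invalid; 'replace' substitutes U+FFFD for them.
garbled = gbk_bytes.decode('utf-8', 'replace')
print(u'\ufffd' in garbled)                # True

# Read as GBK, the original text comes back.
print(gbk_bytes.decode('gbk') == comment)  # True
```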
Solution:
The page source declares charset=GBK, so the bytes have to be converted in Python:
➜ /test sudo cat file.py
#!/usr/bin/python
import urllib, urllib2
import re
url = 'http://sports.sohu.com/nba.shtml'
par = '20180125.*\">(.*?)</a></li>'
req = urllib2.Request(url)
response = urllib2.urlopen(req).read()
# decode the GBK bytes to unicode, then re-encode them as UTF-8 for printing
response = unicode(response, 'GBK').encode('UTF-8')
print type(response)
print response
➜ /test sudo python file.py
special.wait({
itemspaceid : 99999,
form: "bigView" ,
adsrc : 200,
order : 1,
max_turn : 1,
spec :{
onBeforeRender: function (){
},
onAfterRender: function (){
},
isCloseBtn: true // 是否有关闭按钮
}
});
The garbled Chinese text is now fixed.
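Instead of reading the charset off the page source by hand, the declaration can be looked up in the bytes themselves. The following is only a sketch under my own assumptions: `to_utf8` is a hypothetical helper, the sample page is made up, and the regular expression handles just the simple `charset=...` form. It should run on Python 2.7 or Python 3.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical GBK-encoded page, standing in for urllib2.urlopen(req).read().
page = u'<meta charset="GBK"><p>是否有关闭按钮</p>'.encode('gbk')

def to_utf8(raw, fallback='utf-8'):
    """Re-encode raw page bytes as UTF-8, using the charset the page declares."""
    # The charset declaration is plain ASCII, so it can be searched in the raw bytes.
    found = re.search(br'charset=["\']?([\w-]+)', raw)
    charset = found.group(1).decode('ascii') if found else fallback
    return raw.decode(charset).encode('utf-8')

utf8_page = to_utf8(page)
print(len(utf8_page) > len(page))  # True: these characters take 3 bytes in UTF-8, 2 in GBK
```

A page that declares a charset it does not actually use would still fail to decode, so this is a convenience rather than a guarantee.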
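To print a dict or list that contains Chinese text without \u escapes, pass ensure_ascii=False to json.dumps:
# -*- coding: utf-8 -*-
import json
# Print a dict
person = {'name': '张三'}
print json.dumps(person, encoding="UTF-8", ensure_ascii=False)  # >>> {"name": "张三"}
# Print a list
people = [{'name': '张三'}]
print json.dumps(people, encoding="UTF-8", ensure_ascii=False)  # >>> [{"name": "张三"}]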
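For comparison, here is a short sketch of what json.dumps produces with and without ensure_ascii=False. It uses text keys and values and leaves out the `encoding` argument, which as far as I know is accepted only by Python 2's json module, so the snippet should run on Python 2.7 and Python 3 alike.

```python
# -*- coding: utf-8 -*-
import json

person = {u'name': u'张三'}

# Default behaviour: every non-ASCII character is written as a \uXXXX escape.
escaped = json.dumps(person)

# With ensure_ascii=False the characters are kept as they are.
readable = json.dumps(person, ensure_ascii=False)

print(escaped)  # {"name": "\u5f20\u4e09"}
```

Both strings parse back to the same dict with json.loads; only the readability of the serialized form differs.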