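Fixing Chinese text turning into English when scraping Chinese web pages with requests
When scraping a web page with the Python requests library, Chinese text in the page source came back as English text in the downloaded response.
For example: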
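import requests
# The URL is shown as it appears in the source; supply the full scheme and domain before running.
r = requests.get('apps.webofknowledge', headers={'User-Agent': 'Mozilla/5.0'})
print(r.text[:1000])  # first 1000 characters of the response body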
The result is:
'<!DOCTYPE html>
<html>
<head><link rel="icon" href="images.webofknowledge/WOKRS5272R3/images/wok_favicon.ico" type="image/x-icon"/><title>Web of Science [v.5.27.2] - All Databases Home </title><link rel="stylesheet"
href="images.webofknowledge/WOKRS5272R3/css/WoKcommon.css" type="text/css" /><link rel="stylesheet" href="images.webofknowledge/WOKRS5272R3/css/WoKcomponents.css" type="text/css" /><link
rel="stylesheet" h'
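However, the page source as seen in the browser looks like this:
Clearly, the Chinese text “所有数据库主页” in the page source became the English “All Databases Home” after scraping.
Solution: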
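Add 'Accept-Language': 'zh-CN' to the request headers. This header tells the server which language the client prefers, and requests does not send it by default. The request code becomes:
import requests
# Same request as before, now asking the server for Simplified Chinese content.
r = requests.get('apps.webofknowledge', headers={'User-Agent': 'Mozilla/5.0', 'Accept-Language': 'zh-CN'})
print(r.text[:1000])  # first 1000 characters of the response body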
Now the result is correct:
'<!DOCTYPE html>
<html>
<head><link rel="icon" href="images.webofknowledge/WOKRS5272R3/images/zh_CN/wok_favicon.ico"
type="image/x-icon"/><title>Web of Science [v.5.27.2] - 所有数据库主页 </title><link rel="stylesheet"
href="images.webofknowledge/WOKRS5272R3/css/WoKcommon.css" type="text/css" /><link rel="stylesheet" href="images.webofknowledge/WOKRS5272R3/css/WoKcomponents.css" type="text/css" /><link
rel="stylesheet" href="'
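The same fix can be written as a small reusable helper. The sketch below is only an illustration, not part of the original solution: the names accept_language and fetch_localized are hypothetical, and the fallback list with q-weights (for example 'zh-CN,zh;q=0.9,en;q=0.8') assumes the server honours standard Accept-Language weighting, which not every site does.

```python
def accept_language(*langs):
    """Build an Accept-Language header value, most-preferred language first.

    Hypothetical helper: the first language gets no explicit weight
    (implicitly q=1.0); each later one gets a q-value 0.1 lower, floored at 0.1.
    """
    parts = []
    for i, lang in enumerate(langs):
        if i == 0:
            parts.append(lang)
        else:
            q = max(1.0 - 0.1 * i, 0.1)
            parts.append(f"{lang};q={q:.1f}")
    return ",".join(parts)


def fetch_localized(url, *langs, timeout=10):
    """Fetch a page while asking the server for the given languages.

    Hypothetical wrapper around requests.get; returns the decoded body text.
    """
    # Imported here so that accept_language() stays usable without requests installed.
    import requests

    headers = {
        "User-Agent": "Mozilla/5.0",
        "Accept-Language": accept_language(*langs),
    }
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of returning an error page
    return response.text


if __name__ == "__main__":
    # Show the header value that would be sent; no network access is needed for this.
    print(accept_language("zh-CN", "zh", "en"))
```

Calling fetch_localized(url, 'zh-CN') sends the same Accept-Language value as the fix above. Passing more languages only adds fallbacks for servers that lack a Chinese version of the page.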