– Web crawler
An automated program that requests websites and extracts data
Differences between GET and POST
GET puts the request parameters in the URL; POST submits them in the request body, like a form (see the sketch below)
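A minimal sketch of the difference, using httpbin.org (an echo service) with an illustrative parameter:
import requests
# GET: parameters are appended to the URL as a query string
r = requests.get('http://httpbin.org/get', params={'key': 'value'})
print(r.url)  # http://httpbin.org/get?key=value
# POST: parameters travel in the request body, like an HTML form
r = requests.post('http://httpbin.org/post', data={'key': 'value'})
print(r.json()['form'])  # {'key': 'value'}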
URL: Uniform Resource Locator
The initial request a browser sends is usually of type document; the images in a page are loaded by follow-up requests
Request: 1. request method (GET, POST); 2. request URL; 3. request headers, e.g. User-Agent; 4. request body: extra data carried with the request
Response status: 200 means success, 301 is a permanent redirect, 404 is not found, 502 is a bad gateway (a server-side error)
Response headers
Response body
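A short sketch that sets a request header and inspects the corresponding response fields (the User-Agent string is only an example):
import requests
headers = {'User-Agent': 'Mozilla/5.0'}  # request header
response = requests.get('http://httpbin.org/get', headers=headers)
print(response.status_code)  # response status
print(response.headers)      # response headers
print(response.text)         # response body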
Crawlers can fetch web page text, images, and video
Parsing methods: 1. direct processing 2. JSON parsing 3. regular expressions 4. BeautifulSoup 5. PyQuery 6. XPath
How to handle JavaScript-rendered pages
Analyze the Ajax requests, or render the page with Selenium/WebDriver or Splash (a Selenium sketch follows)
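A minimal Selenium sketch, assuming Chrome and a matching chromedriver are installed; the browser executes the page's JavaScript and hands back the rendered HTML:
from selenium import webdriver
driver = webdriver.Chrome()           # requires chromedriver on PATH
driver.get('https://www.taobao.com')  # the browser runs the JavaScript
html = driver.page_source             # HTML after rendering
driver.quit()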
How to store results: plain text, relational databases, non-relational databases, binary files (see the sketch below)
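A small sketch of the two simplest cases, saving text and saving binary content such as an image (file names are illustrative):
import requests
# save page text
response = requests.get('http://httpbin.org/html')
with open('page.html', 'w', encoding='utf-8') as f:
    f.write(response.text)
# save binary content, e.g. an image
response = requests.get('http://httpbin.org/image/png')
with open('image.png', 'wb') as f:
    f.write(response.content)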
re: extract text with regular expressions
XPath: extract by node path
CSS selectors: extract by tag/selector (a comparison sketch follows)
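A sketch of all three approaches on the same tiny HTML snippet, assuming lxml and beautifulsoup4 are installed:
import re
from lxml import etree
from bs4 import BeautifulSoup
html = '<div><a href="/book">Python</a></div>'
# 1. regular expression
print(re.search('<a href="(.*?)">(.*?)</a>', html).groups())  # ('/book', 'Python')
# 2. XPath
tree = etree.HTML(html)
print(tree.xpath('//a/@href'), tree.xpath('//a/text()'))
# 3. CSS selector
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('div > a')['href'])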
import requests
response = requests.get('http://www.jianshu.com')
exit() if not response.status_code == 200 else print("Requests Successful")
Checks the request status code and exits unless it is 200
Advanced operations:
1. File upload
import requests
files = {'file': open('filename', 'rb')}  # 'filename' is a placeholder for a local file path
response = requests.post("http://httpbin.org/post", files=files)
print(response.text)
2. Getting cookies
import requests
response = requests.get('https://www.baidu.com')
print(response.cookies)  # a dict-like RequestsCookieJar; cookies are used for session persistence
for key, value in response.cookies.items():
    print(key + '=' + value)
3. Session persistence:
Simulating a login
import requests
s = requests.Session()  # a Session carries cookies across requests, as if they came from the same browser
s.get('http://httpbin.org/cookies/set/number/123456789')  # set a cookie value
response = s.get('http://httpbin.org/cookies')
print(response.text)
4. Certificate verification:
import requests
from requests.packages import urllib3
urllib3.disable_warnings()  # suppress the InsecureRequestWarning
response = requests.get('https://www.12306.cn', verify=False)  # skip certificate verification
print(response.status_code)
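Instead of switching verification off, requests can also be pointed at a trusted CA bundle; the path below is hypothetical:
import requests
# verify against a local CA bundle (hypothetical path) instead of disabling verification
response = requests.get('https://www.12306.cn', verify='/path/to/ca-bundle.crt')
print(response.status_code)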
5. Proxy settings:
import requests
proxies = {
    "http": "http://127.0.0.1:9743",
    "https": "https://127.0.0.1:9743",
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)
# if the proxy requires a username and password
import requests
proxies = {
    "http": "http://user:password@127.0.0.1:9743/",
    "https": "https://user:password@127.0.0.1:9743/",  # the key must match the request scheme, so an https entry is needed for an https URL
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)
# SOCKS proxy (first: pip install 'requests[socks]')
import requests
proxies = {
    'http': 'socks5://127.0.0.1:9742',
    'https': 'socks5://127.0.0.1:9742'
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)
6. Timeout settings
import requests
from requests.exceptions import ReadTimeout
try:
    response = requests.get("https://www.taobao.com", timeout=1)  # raise an exception if no response arrives within the limit
    print(response.status_code)
except ReadTimeout:
    print('Timeout')
7. Authentication settings
import requests
from requests.auth import HTTPBasicAuth
r = requests.get('http://120.27.34.24:9001', auth=HTTPBasicAuth('user', '123'))
print(r.status_code)
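requests also accepts a plain (user, password) tuple as shorthand for HTTPBasicAuth:
import requests
r = requests.get('http://120.27.34.24:9001', auth=('user', '123'))  # shorthand, equivalent to HTTPBasicAuth
print(r.status_code)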
8. Exception handling
import requests
from requests.exceptions import ReadTimeout, HTTPError, RequestException
try:
    response = requests.get('http://httpbin.org/get', timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    print("Timeout")
except HTTPError:
    print("http error")
except RequestException:  # the base class of all requests exceptions, so it is caught last
    print("error")
Regular expressions:
Generic matching
import re
content = "Hello 123 4567 World_this is a Regex Demo"
result = re.match('^Hello.*Demo$', content)
print(result)
print(result.group())
print(result.span())  # print the range of the matched characters
Targeted matching (capturing groups)
import re
content = "Hello 1234567 World_this is a Regex Demo"  # a single number here, so the pattern below can match
result = re.match(r'^Hello\s(\d+)\sWorld.*Demo$', content)
print(result)
print(result.group(1))  # '1234567'
print(result.span())  # print the range of the matched characters
Greedy matching
import re
content = "Hello 123 4567 World_this is a Regex Demo"
result = re.match(r'^He.*(\d+).*Demo$', content)
print(result)
print(result.group(1))  # '7': the greedy .* swallows '123 456' first
Non-greedy matching
import re
content = "Hello 123 4567 World_this is a Regex Demo"
result = re.match(r'^He.*?(\d+).*Demo$', content)
print(result)
print(result.group(1))  # '123': the non-greedy .*? stops as early as possible
Escaping
import re
content = 'price is $5.00'
result = re.match(r'price is \$5\.00', content)  # $ and . are metacharacters and must be escaped
print(result)
Summary: prefer generic patterns, use parentheses to capture targets, prefer non-greedy mode, and use re.S when the text contains newlines
For convenience, prefer re.search: unlike re.match, it scans the whole string instead of anchoring at the start (see the sketch below)
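A quick sketch of both tips, re.search versus re.match and the effect of re.S:
import re
content = 'Extra text Hello 1234567 World_this is a Regex Demo'
print(re.match(r'Hello.*?(\d+).*Demo', content))            # None: match anchors at position 0
print(re.search(r'Hello.*?(\d+).*Demo', content).group(1))  # '1234567': search scans the whole string
multiline = 'Hello\n1234567 World'
print(re.search(r'Hello.*?(\d+)', multiline, re.S).group(1))  # re.S lets . match the newline too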
Hands-on exercise: scraping Douban Books
import requests
import re
# Douban may reject requests that lack a browser-like User-Agent
headers = {'User-Agent': 'Mozilla/5.0'}
content = requests.get("https://book.douban.com/", headers=headers).text
pattern = re.compile('<li.*?cover.*?href="(.*?)".*?title="(.*?)".*?more-meta.*?author">(.*?)</span>.*?year">(.*?)</span>.*?</li>', re.S)
results = re.findall(pattern, content)
print(results)
for result in results:
    url, name, author, date = result
    author = re.sub(r'\s', '', author)  # strip whitespace from the captured fields
    date = re.sub(r'\s', '', date)
    print(url, name, author, date)