Getting Started with Web Crawlers | 凯

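Attribute	Description
r.status_code	HTTP status code
r.text	HTTP response body as a string, decoded using r.encoding
r.encoding	Response encoding taken from the HTTP headers
r.apparent_encoding	Fallback encoding inferred from the response content
r.content	HTTP response body as raw bytes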
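
The difference between r.encoding and r.apparent_encoding is easiest to see on raw bytes. The sketch below uses only the standard library rather than Requests itself, and the sample string is an assumption chosen for illustration. It mimics what happens when a text response declares no charset in its headers, in which case Requests falls back to ISO-8859-1.

```python
# Raw bytes as a server might send them (comparable to r.content).
content = "爬虫入门".encode("utf-8")

# Decoding with the header-derived fallback ISO-8859-1 garbles Chinese text
# (comparable to r.text while r.encoding is ISO-8859-1).
garbled = content.decode("iso-8859-1")

# Decoding with the encoding the content really uses recovers the text
# (comparable to r.text after r.encoding = r.apparent_encoding).
readable = content.decode("utf-8")

print(repr(garbled))
print(readable)
```

This is why crawler code often assigns r.apparent_encoding to r.encoding before reading r.text.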

Scale	Library
Small scale, speed not critical	Requests library
Medium scale, speed critical	Scrapy library
Whole-web data	Custom development

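import requests

url = "https://www.amazon.cn/gp/product/xxx"
try:
    # Send a browser-like User-Agent instead of the default python-requests one
    kv = {'user-agent': 'Mozilla/5.0'}
    r = requests.get(url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    # Print characters 1000 through 1999 of the response text
    print(r.text[1000:2000])
    # Print the length of the response text
    print(len(r.text))
except requests.RequestException:
    print("Crawl failed")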
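
To check which User-Agent would actually be sent without touching the network, the request can be prepared but not sent. This is a minimal sketch, assuming the same placeholder URL and header value as the example above; the variable names default_ua and prepared are only illustrative.

```python
import requests

# The User-Agent that Requests sends when no header is supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)

# Build the request without sending it, so the final headers can be inspected.
url = "https://www.amazon.cn/gp/product/xxx"
req = requests.Request("GET", url, headers={"user-agent": "Mozilla/5.0"})
prepared = requests.Session().prepare_request(req)

# Header names are case-insensitive; the per-request value overrides the default.
print(prepared.headers["User-Agent"])
```

The default value identifies the client as python-requests, which some sites reject, so overriding it is often the first thing to try when a page refuses the request.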

Full image download code

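import os
import requests

url = "http://image.xxx.com/sss.jpg"
root = "D:/pics/"
# Name the local file after the last segment of the URL
path = root + url.split('/')[-1]
try:
    # Create the target directory if it does not exist yet
    os.makedirs(root, exist_ok=True)
    if not os.path.exists(path):
        r = requests.get(url)
        r.raise_for_status()
        # r.content holds the raw bytes of the image
        with open(path, 'wb') as f:
            f.write(r.content)
        print("File saved")
    else:
        print("File already exists")
except (requests.RequestException, OSError):
    print("Crawl failed")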
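
Taking the last segment of the URL as the file name works for plain image links, but a query string would end up in the name. One possible refinement, sketched below with the standard library only, parses the URL first. The helper name local_path_for, the relative directory "pics", and the "?size=large" query are hypothetical and used purely for illustration.

```python
import os
from urllib.parse import urlsplit


def local_path_for(url: str, root: str) -> str:
    """Build the local save path from the last segment of the URL path."""
    # urlsplit separates the query string and fragment from the path,
    # so they do not leak into the file name as they would with url.split('/')[-1].
    filename = os.path.basename(urlsplit(url).path)
    return os.path.join(root, filename)


plain = local_path_for("http://image.xxx.com/sss.jpg", "pics")
with_query = local_path_for("http://image.xxx.com/sss.jpg?size=large", "pics")
print(plain)
print(with_query)
```

Both calls yield the same path, and os.path.join picks the separator that suits the current operating system.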

Contents

Getting started with the Requests library

Scale of web crawlers

Robots protocol

Modifying request headers

Full image download code