Crawler Proxy Pool in Python – Maple's Blog

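Background

I recently needed to write some web crawlers to take over part of my workload. Some of them have to collect large amounts of data, so they inevitably run into the target sites' anti-crawling mechanisms.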

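Proxy pool

At first the company supplied a few IPs, but the request volume is so large that I started looking for a proxy pool and came across ProxyPool, a Python project.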

ProxyPool

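ProxyPool is a proxy IP pool project for crawlers. It periodically collects free proxies published online, validates them, and stores them. It also periodically re-validates the stored proxies to keep them usable. It can be used through an API or a CLI. You can also add your own proxy sources to improve the quality and quantity of the IPs in the pool.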
https://github.com/jhao104/proxy_pool

Usage guide

Setting up the proxy service

docker run -d -it --name redis -p 6379:6379 redis
docker run --env DB_CONN=redis://:password@ip:port/0 -p 5010:5010 jhao104/proxy_pool:latest

Using the proxy service
Once the web service is running, the default configuration exposes an API at http://127.0.0.1:5010:

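API	Method	Description	Params
/	GET	API overview	None
/get	GET	Get one random proxy	Optional: `?type=https` returns only proxies that support HTTPS
/pop	GET	Get one proxy and remove it from the pool	Optional: `?type=https` returns only proxies that support HTTPS
/all	GET	Get all proxies	Optional: `?type=https` returns only proxies that support HTTPS
/count	GET	Show the number of proxies	None
/delete	GET	Delete a proxy	`?proxy=host:port`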
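The endpoints in the table differ only in their path and optional query string, so the request URLs can be assembled by one small helper. The sketch below is only an illustration: `build_url` and `BASE_URL` are hypothetical names that are not part of ProxyPool, and the trailing slash follows the style of the Python example in this post.

```python
from urllib.parse import urlencode

# Default address of the ProxyPool API, as given in the text above.
BASE_URL = "http://127.0.0.1:5010"


def build_url(endpoint, base_url=BASE_URL, **params):
    """Return the URL for an endpoint such as "get", "pop", "all" or "delete".

    Keyword arguments become the query string; values set to None are skipped.
    """
    url = "{}/{}/".format(base_url.rstrip("/"), endpoint.strip("/"))
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return url + "?" + query if query else url


if __name__ == "__main__":
    print(build_url("get", type="https"))
    print(build_url("delete", proxy="1.2.3.4:8080"))
```

`urlencode` percent-encodes the colon in `host:port`; the server decodes it again when it reads the query string.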

Python example

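import requests


def get_proxy():
    # Ask the pool for one random proxy; the JSON reply carries it under "proxy"
    return requests.get("http://127.0.0.1:5010/get/").json()


def delete_proxy(proxy):
    # Remove the given proxy (host:port) from the pool
    requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))


# your spider code
def getHtml():
    # ....
    retry_count = 5
    proxy = get_proxy().get("proxy")
    while retry_count > 0:
        try:
            # Fetch the page through the proxy; the timeout keeps a dead proxy from hanging the request
            html = requests.get('http://www.example.com', proxies={"http": "http://{}".format(proxy)}, timeout=10)
            return html
        except Exception:
            retry_count -= 1
    # Every retry failed: delete the proxy from the pool
    delete_proxy(proxy)
    return None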

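Usage notes

All proxies in the pool come from third-party free proxy sources.
The pool checks whether its proxies are valid, but you still need to verify each one yourself before relying on it.
If a proxy returned by get_proxy turns out to be unusable, remove it with delete_proxy so that it is not handed out again.
After a period of use, the pool may become empty.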
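The notes above can be combined into one helper: fetch a proxy, check it a second time yourself, delete it if the check fails, and wait when the pool is empty. This is a minimal sketch under several assumptions, not ProxyPool's own code. `is_usable`, `get_working_proxy`, `TEST_URL` and the `http_get` parameter are hypothetical names. "Usable" is taken to mean that the test page answers with HTTP 200 through the proxy. An empty pool is assumed to produce a reply without a `proxy` field.

```python
import time

API = "http://127.0.0.1:5010"        # default ProxyPool address from the text above
TEST_URL = "http://www.example.com"  # assumption: a page you expect to be reachable


def _http_get(url, **kwargs):
    # requests is imported lazily so the helpers can be exercised with a stub
    import requests
    return requests.get(url, **kwargs)


def is_usable(proxy, http_get=_http_get, timeout=5):
    """Second check: True if TEST_URL answers with HTTP 200 through the proxy."""
    proxies = {"http": "http://{}".format(proxy)}
    try:
        response = http_get(TEST_URL, proxies=proxies, timeout=timeout)
        return response.status_code == 200
    except Exception:
        return False


def get_working_proxy(http_get=_http_get, attempts=10, wait=3.0):
    """Return a proxy that passed the second check, or None after `attempts` tries."""
    for _ in range(attempts):
        try:
            proxy = http_get(API + "/get/", timeout=5).json().get("proxy")
        except Exception:
            proxy = None
        if not proxy:
            # Pool is empty or the API is unreachable: give it time to refill
            time.sleep(wait)
            continue
        if is_usable(proxy, http_get):
            return proxy
        try:
            # Failed the second check: delete it so it is not handed out again
            http_get(API + "/delete/", params={"proxy": proxy}, timeout=5)
        except Exception:
            pass
    return None
```

Calling `get_working_proxy()` with no arguments uses `requests`. Passing a different `http_get` makes it possible to try the logic without a running pool.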
