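Preface
While scraping a website I ran into Cloudflare (CF) blocking. I tried adding the options suggested by a Baidu search, but the block still could not be bypassed: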
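from selenium import webdriver
from selenium.webdriver.edge.options import Options
from selenium.webdriver.edge.service import Service

service = Service('msedgedriver.exe')
options = Options()
# Exclude the "enable-automation" switch that Selenium passes by default
options.add_experimental_option('excludeSwitches', ['enable-automation'])
# Disable the Blink runtime feature "AutomationControlled"
options.add_argument('--disable-blink-features=AutomationControlled')
# Pass the options to the driver; otherwise they are never applied
driver = webdriver.Edge(service=service, options=options)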
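One way to see whether options like these have any visible effect is to ask the page itself. The sketch below assumes the site's detection looks at the `navigator.webdriver` property, which is only one of many possible signals, so a clean result does not prove the browser is undetectable. The helper name `looks_automated` and the `FakeDriver` stub are hypothetical and not part of Selenium; the helper relies only on `execute_script`, which Selenium drivers provide.

```python
def looks_automated(driver) -> bool:
    """Return True if the current page reports navigator.webdriver as true."""
    # execute_script runs JavaScript in the page and returns the result to Python
    flag = driver.execute_script('return navigator.webdriver')
    return flag is True


class FakeDriver:
    """Illustrative stub so the helper can be tried without launching a browser."""

    def __init__(self, value):
        self._value = value

    def execute_script(self, script):
        # A real driver would evaluate the script; the stub returns a fixed value
        return self._value
```

With a real session this would be called as `looks_automated(driver)` after `driver.get(...)`.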
undetected-chromedriver
Optimized Selenium Chromedriver patch which does not trigger anti-bot services like Distill Network / Imperva / DataDome / Botprotect.io. Automatically downloads the driver binary and patches it.
- Tested until current chrome beta versions
- Works also on Brave Browser and many other Chromium based browsers, with some tweaking
- Python 3.6++
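I mainly use Edge. The description says a download happens automatically, but that refers to the driver binary; Chrome itself was not downloaded in my case, so I installed the Chrome browser myself.
The code differs little from the earlier Selenium version. It solved the problem, and the CF block has not appeared since: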
import re
import time

import undetected_chromedriver as uc
from pyquery import PyQuery as pq
from undetected_chromedriver import ChromeOptions

# Run Chrome headless
options = ChromeOptions()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = uc.Chrome(options=options)
try:
    driver.get('http://...')
    html_source = driver.page_source
finally:
    # Always close the browser, even if loading the page fails
    driver.quit()
# Parse the page source; 'tag' is a placeholder selector
doc = pq(html_source)
titles = doc.find('tag')
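Before parsing, it can help to confirm that the fetched page is real content rather than a challenge interstitial. The following is a rough heuristic sketch, not an official API: the marker strings are assumptions based on commonly seen Cloudflare pages (such as the "Just a moment..." title) and may change at any time, and `page_title` and `looks_like_cf_challenge` are hypothetical helper names.

```python
import re

# Assumed markers of a Cloudflare challenge or block page; not an official list
TITLE_MARKERS = ('just a moment', 'attention required')
BODY_MARKERS = ('cf-chl', 'checking your browser')


def page_title(html: str) -> str:
    """Return the text of the first <title> element, or '' if there is none."""
    match = re.search(r'<title[^>]*>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ''


def looks_like_cf_challenge(html: str) -> bool:
    """Heuristic: True if the title or the body contains an assumed marker."""
    title = page_title(html).lower()
    if any(marker in title for marker in TITLE_MARKERS):
        return True
    lowered = html.lower()
    return any(marker in lowered for marker in BODY_MARKERS)
```

In the script above, this could be called as `looks_like_cf_challenge(html_source)` before handing the source to PyQuery.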
References
1. ultrafunkamsterdam/undetected-chromedriver: https://github.com/ultrafunkamsterdam/undetected-chromedriver
2. Chrome Headless Detection (Round II): https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html
3. How to keep a Selenium crawler from being caught by browser-fingerprint anti-scraping: here comes undetected_chromedriver: https://blog.csdn.net/wywinstonwy/article/details/118479162