Python Crawler Notes: Techniques

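Preface
Garbled Chinese text with requests
Using a proxy
Using BeautifulSoup
Using Selenium
- Basic usage
- Selenium page loads that take too long
- Running Chrome under Selenium and hiding the window
Unable to exit the exe from multiple processes
scrapy
Small crawler demos
- Scraping today's hottest posts from the Zhihu Explore page
- Downloading an image
- Downloading a video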

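Preface

This is the techniques installment. It covers how to use the various crawling libraries; installing them was already covered in the installation installment.

Garbled Chinese text with requests

This happens when the page does not declare an encoding. requests cannot detect one, falls back to a default encoding, and the Chinese text comes out garbled.

One extra line fixes it: before reading response.text, set the encoding the page actually uses (gb2312 in my case).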

response.encoding='gb2312'
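
To see why that one line helps, the sketch below reproduces the problem with the standard library only. requests falls back to ISO-8859-1 for text responses whose Content-Type header names no charset, so GB2312 bytes decoded that way turn into mojibake. The sample string and the variable names are made up for illustration.

```python
# Bytes as a GB2312-encoded page would send them (the sample text is an assumption)
raw = "中文网页".encode("gb2312")

# ISO-8859-1 can decode any byte sequence, so nothing fails; the result is just garbage
garbled = raw.decode("iso-8859-1")

# Decoding with the page's real encoding restores the text
fixed = raw.decode("gb2312")

print(repr(garbled))
print(fixed)
```

When the real encoding is not known in advance, requests also exposes response.apparent_encoding, a guess based on the response body, which can be assigned to response.encoding instead of a hard-coded name.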

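Using a proxy

The free proxies I tried did not work, so I have set them aside for now. Here is how to use a paid one.

I bought mine from Xun Proxy (讯代理). After purchasing, first add your own IP to the whitelist, then click Generate API.

Select the order, pick a city, and generate the JSON link.

Next, fetch the proxy IP with Python. I store it in a database once fetched.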

import pyodbc
import json
import requests
import sys
import time


headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=192.168.3.8,1433;DATABASE=VaeDB;UID=sa;PWD=test123')
cursor = conn.cursor()
# The API link generated in the proxy vendor's console (masked here)
r = requests.get('http://apXXXXXXXXXXXXXXXXXXXXX40300', headers=headers)
print(r.text)
jsonobj = json.loads(r.text)

# Timestamp for the UpdateDate column; named "now" so it does not shadow the datetime module
now = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())

# Overwrite row 1 with the fresh proxy in "ip:port" form
proxy = jsonobj['RESULT'][0]
cursor.execute(
    'update ProxyIP set IP=?,UpdateDate=? where Id=1', (f"{proxy['ip']}:{proxy['port']}", now))
conn.commit()
sys.exit()
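
The JSON handling in the script above can be pulled out into a small helper, which makes it easy to check without calling the paid API. This is only a sketch: the sample payload is invented, proxy_from_response is a hypothetical name, and only the RESULT, ip and port keys are taken from the code above. The real response may carry more fields.

```python
import json


def proxy_from_response(body: str) -> str:
    """Return the first proxy in the response as "ip:port"."""
    data = json.loads(body)
    first = data["RESULT"][0]
    # str.format copes with the port arriving either as text or as a number
    return "{}:{}".format(first["ip"], first["port"])


# Invented sample, shaped like the keys used in the script above
sample = '{"RESULT": [{"ip": "203.0.113.7", "port": "8080"}]}'
print(proxy_from_response(sample))
```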

How you use the proxy depends on the client. I use Selenium, so that is what I show here; I will add the requests version when I need it.

import pyodbc
from selenium import webdriver

conn = pyodbc.connect(
    'DRIVER={SQL Server};SERVER=111.108.8.2,1433;DATABASE=VaeDB;UID=sa;PWD=testxxxx')

cursor = conn.cursor()
cursor.execute("""
select IP from dbo.ProxyIP
"""
)
data = cursor.fetchone()
proxyip = str(data[0])  # "ip:port", as stored by the script above
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=http://' + proxyip)
# Pass the options as options=...; the older chrome_options=... keyword is deprecated
browser = webdriver.Chrome(options=chrome_options)
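
For plain requests calls, the same "ip:port" string can be passed through the proxies argument of requests.get. A minimal sketch, assuming an HTTP proxy that accepts whitelisted IPs without credentials; build_proxies and fetch_via_proxy are illustrative names, not part of the original notes.

```python
def build_proxies(proxyip: str) -> dict:
    """Map both URL schemes to the same HTTP proxy, given "ip:port"."""
    proxy_url = "http://" + proxyip
    return {"http": proxy_url, "https": proxy_url}


def fetch_via_proxy(url: str, proxyip: str, timeout: float = 10.0) -> str:
    """Download url through the proxy and return the response body."""
    import requests  # imported here so build_proxies works without requests installed

    response = requests.get(url, proxies=build_proxies(proxyip), timeout=timeout)
    response.raise_for_status()
    return response.text


print(build_proxies("203.0.113.7:8080"))
```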

Using BeautifulSoup

I have scraped tens of thousands of records with BeautifulSoup. For ordinary websites it works really well.

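# Constructing a BeautifulSoup object repairs the HTML automatically and closes unclosed elements
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')  # html: the page source as a string, e.g. response.text
print(soup.prettify())  # pretty-print with standard indentation
print(soup.title)  # the title element
print(soup.title.string)  # the text inside the title
print(soup.title.name)  # the node name of the title, i.e. "title"
print(soup.p)  # the first p element
print(soup.p.attrs)  # all attributes and values of the first p element
print(soup.p['class'])  # value of the class attribute of the first p element
print(soup.p['name'])  # value of the name attribute of the first p element
print(soup.p.b.string)  # text of the b tag inside the first p tag
print(soup.p.contents)  # direct children of the first p element, grandchildren not included
# All descendants of the first p element (children, grandchildren, and so on)
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)

print(soup.p.parent)  # parent of the first p element
# All ancestors of the first p element
print(soup.p.parents)
print(list(enumerate(soup.p.parents)))

print(soup.p.next_sibling)  # next sibling node of the first p element; when a line break follows the tag this is the newline text, so chain next_sibling twice
print(list(enumerate(soup.p.next_siblings)))  # all sibling nodes after the first p element
print(soup.p.previous_sibling)  # previous sibling node of the first p element
print(list(enumerate(soup.p.previous_siblings)))  # all sibling nodes before the first p element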

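#########################################################
# The snippets below are the ones I use most; the ones above are just good to know
import re

# Check whether a tag has an attribute. An img tag usually has alt but sometimes does not; checking first avoids a KeyError
if img.attrs.get('alt'):
    print(img['alt'])

soup.find(id='text-7').find_all(name='li')
# Find a tag by its text
# For example, find the first p element whose text contains "Description"
test = soup.find(lambda e: e.name == 'p' and 'Description' in e.text)
# If the text is a direct child of the tag, .parent also works, but it is rarely applicable
# (string= is the current name of the older text= argument)
test = soup.find(string=re.compile('Description')).parent

# Find tags whose attribute contains a given value
# An attribute can hold several values, e.g. an img whose alt contains both "item" and "img"; match on one of them like this
noscript.find_all('img', attrs={'alt': re.compile('item')})

# Check whether an attribute contains a given value
if 'Datasheet' in img['alt']:
    pass  # handle the matching image here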

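# Remove every br line break
html = get_one_page(url)  # get_one_page: a helper defined elsewhere that returns the page HTML
html = html.replace('<br>', '').replace('<br />', '').replace('<br/>', '')

# Strip the trailing comma (strings are immutable, so keep the result)
datasheet_url = datasheet_url.rstrip(',')

# Drop the keyword and the spaces, keeping only what follows
# For example: Function : Sensitive Gate Silicon Controlled Rectifiers
# yields: Sensitive Gate Silicon Controlled Rectifiers
value = re.sub(keywords + r'.*?\s:.*?\s', '', child.find(string=re.compile(keywords)).string)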

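# Return the text before a given character
import re

text = "K6X4008C1F-BF55 ( 32-SOP, 55ns, LL )"

b = re.search(r'^[^\(]*(?=\()', text, re.M)
if b:
    print(b.group(0))
    print(len(b.group(0)))
else:
    print('no match')

# The key point: the character matched here is ( and a parenthesis must be escaped with \
#   ^[^\(]*(?=\()
# For a comma, write
#   ^[^,]*(?=,)
# For a word, say I want everything before the word Vae, my first attempt was
text = 'XuSong Vae hahaha'
text2 = 'VV Vae hahaha'
b = re.search('^[^Vae]*(?=Vae)', text, re.M)
# This example matters. text yields "XuSong ", but text2 (VV ...) yields nothing.
# [^Vae] is a character class: it excludes each of the characters V, a and e, not the word Vae,
# so any of those characters before the word breaks the match, and VV contains V.
# Putting Vae in parentheses did not help either. My workaround is to replace Vae with a
# single character such as a comma, since the class only works on single characters:
text = 'XuSong Vae hahaha'
text2 = 'VV Vae hahaha'
b = re.search('^[^,]*(?=,)', text2.replace('Vae', ','), re.M)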
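
An alternative to replacing the word with a comma, offered as a sketch rather than the author's method: a lazy .*? in front of the lookahead stops at the first occurrence of the whole word, so no character class is needed. text_before is a hypothetical helper name.

```python
import re


def text_before(word: str, text: str):
    """Return everything before the first occurrence of word, or None."""
    # re.escape keeps characters such as ( or . in word literal
    match = re.search(r"^.*?(?=" + re.escape(word) + ")", text, re.S)
    return match.group(0) if match else None


print(repr(text_before("Vae", "XuSong Vae hahaha")))
print(repr(text_before("Vae", "VV Vae hahaha")))
print(repr(text_before("(", "K6X4008C1F-BF55 ( 32-SOP, 55ns, LL )")))
```

For a fixed word, text.split('Vae', 1)[0] gives the same prefix without a regex, though it returns the whole string when the word is absent.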

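# Strip the a tags from a piece of HTML but keep the text inside them
# The pattern matches <a ...> and </a> only, not other tags that start with "a" such as <abbr>
description_html = re.sub(r'</?a(?:\s[^>]*)?>', '', description_element)

# Sometimes I want an element's HTML rather than its text, because the layout (ul and li, for example) lives in the markup; .text gives one run of text with the line breaks gone. In that case:
str(child.find(class_='ul2'))   # once you have the element, str() turns it into an HTML string

# For the next sibling element, find_next_sibling() is the better choice:
# next_sibling returns the very next node, which may be a whitespace string, while find_next_sibling() returns the next tag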
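
The difference between next_sibling and find_next_sibling() can be checked on a small document. A sketch with made-up HTML, using the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Two paragraphs separated by a line break (made-up sample)
html = "<div><p id='a'>one</p>\n<p id='b'>two</p></div>"
soup = BeautifulSoup(html, "html.parser")
first = soup.find("p")

# next_sibling returns the very next node, here the newline text between the tags
print(repr(first.next_sibling))

# Chaining it twice reaches the second paragraph
print(first.next_sibling.next_sibling)

# find_next_sibling skips text nodes and returns the next matching tag
print(first.find_next_sibling("p"))
```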

# Insert the scraped data into MongoDB
import pymongo
client = pymongo.MongoClient("mongodb://admin:test123@192.168.3.80:27017/")
db = client.datasheetcafe
collection = db.datasheetcafe
collection.insert_one(message)  # message: a dict holding one scraped record

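Using Selenium

When I started on a new site, I found that all of its page data was loaded dynamically and only appeared after the browser had rendered the page. BeautifulSoup alone is useless here: the HTML nodes are simply not in the response.

In this situation Selenium can load the page dynamically. I have scraped nearly 2 million records this way.

Basic usage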

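from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

# There are browsers that run without a window, but I chose Chrome
# Note: the find_element_by_* helpers used below were removed in Selenium 4.3; there, write find_element(By.CSS_SELECTOR, ...) instead
browser = webdriver.Chrome()
# Load the url first, then select elements
browser.get(url)
div = browser.find_element_by_css_selector('.information')
# find_element_by_css_selector takes a CSS selector. Use it to select one element; to select several, add an s: find_elements_by_css_selector
# In the selector a class is written with . and an id with #, while a tag name is written as-is:
table = browser.find_element_by_css_selector('#part-specs')
for tr in table.find_elements_by_css_selector('tr'):
    # Read the value of an attribute
    tr.get_attribute('class')
# Note that the call below returns a WebElement, not HTML
table = browser.find_element_by_css_selector('#part-specs')
# So if you want the HTML, write
table = browser.find_element_by_css_selector('#part-specs').get_attribute('outerHTML')
# And to get the text inside, write
browser.find_element_by_css_selector('.part-number').text
# As for checking whether an element exists: an if test did not work for me, because find_element raises NoSuchElementException when nothing matches, so I use try/except
try:
    button = browser.find_element_by_css_selector('#show-secondary-part-list')
    button.click()
except NoSuchElementException:
    print('element not found')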

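Selenium page loads that take too long

Sometimes Selenium takes too long to load a page, so a timeout has to be set.

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

browser = webdriver.Chrome()
browser.set_page_load_timeout(20)
browser.set_script_timeout(20)

try:
    browser.get(address)
except TimeoutException:
    # Stop loading and work with whatever has been rendered so far
    browser.execute_script('window.stop()')

div = browser.find_element_by_css_selector('.information')
# ...

Set the load timeout to 20 seconds first. If the page still has not loaded after 20 seconds, give up: stop loading and scrape whatever is there. If not a single HTML element was captured (a blank page), skip that page with try/except.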

Running Chrome under Selenium and hiding the window

By default Chrome opens a visible window. It can also be hidden; the windowless PhantomJS is obsolete by now.

chrome_options = webdriver.ChromeOptions()
# Uncomment the next two lines to keep the Chrome window from popping up; PhantomJS and the like are obsolete
# chrome_options.add_argument('--headless')
# chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--proxy-server=http://' + proxyip)
# This line is meant to keep sites from detecting that the browser is driven by Selenium. Reference: https://zhuanlan.zhihu.com/p/65077940
chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])
browser = webdriver.Chrome(options=chrome_options)

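Unable to exit the exe from multiple processes

My crawler would stop after running for a while. The Selenium-controlled browser either turned into a blank page or failed to load, and worst of all the page sometimes just hung. At that point the only option was to restart the crawler, so I packaged it as an exe and used the Windows Task Scheduler to launch the exe at intervals. Then the next problem appeared.

I have to close the exe's console windows; otherwise more and more exe instances pile up and memory runs out. So I turned to Python multiprocessing.

One process crawls and another keeps time; once the timer reaches 10 minutes, exit the exe.

But it never took effect. Whether I used sys.exit() or os._exit(), the exe would not exit, and to this day I do not know why.

With no other option, I fell back to another approach: killing the processes.

import os

os.system('taskkill /im conhost.exe /F')
# The conhost kill and the Chrome kills cannot run in the same script, so I wrote two .py files and published them as two exes
os.system('taskkill /im chromedriver.exe /F')
os.system('taskkill /im chrome.exe /F')

I kill the console and Chrome outright... Then the Windows scheduled task restarts the exe, and 10 minutes later I kill everything again...

By the way, the Win10 Task Scheduler also cost me a lot of time before it worked; for details see my post on Windows scheduled tasks (Windows计划任务).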
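
Back to the exit problem. One likely reason the exits had no effect (an assumption, since the original multiprocessing code is not shown) is that sys.exit() and os._exit() end only the process that calls them, so a timer process that calls either one exits by itself and leaves the crawler process running. A different approach from the taskkill scripts above is to let a small launcher own the deadline: it starts the crawler as a child process and kills it when time runs out. A minimal sketch using only the standard library; run_with_deadline is a hypothetical name.

```python
import subprocess
import sys


def run_with_deadline(command, seconds: float) -> bool:
    """Run command as a child process; return False if it had to be killed."""
    try:
        # On timeout, subprocess.run kills the child and then raises TimeoutExpired
        subprocess.run(command, timeout=seconds)
        return True
    except subprocess.TimeoutExpired:
        return False


if __name__ == "__main__":
    # Stand-in for the real crawler: a child that would run for 30 seconds
    slow_child = [sys.executable, "-c", "import time; time.sleep(30)"]
    finished = run_with_deadline(slow_child, seconds=1)
    print("finished in time" if finished else "killed after the deadline")
```

In real use the command would point at the crawler script or exe and seconds would be 600. Only the direct child is killed; browsers started by the crawler may survive, so the chromedriver.exe and chrome.exe cleanup above can still be needed.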

scrapy

Placeholder for now...

scrapy startproject tutorial

scrapy genspider quotes quotes.toscrape.com

scrapy crawl quotes

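Small crawler demos

I probably will not update these demos any more. I cannot publish what I actually scrape, so I only present the techniques: teach a man to fish.

Scraping today's hottest posts from the Zhihu Explore page

import requests
import re
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}
r = requests.get("https://www.zhihu.com/explore", headers=headers)
# re.S lets . match line breaks, so one title can be found across several lines of HTML
pattern = re.compile('explore-feed.*?question_link.*?>(.*?)</a>', re.S)
titles = re.findall(pattern, r.text)
print(titles)

Explanation: headers carries a browser identifier (the User-Agent). Without it, Zhihu refuses the request.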
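
To see what the pattern captures without hitting Zhihu, it can be run against a small hand-written snippet. The HTML below is invented to mimic the class names the pattern looks for; the live page's markup may differ.

```python
import re

# Invented markup that uses the same class names as the pattern
sample_html = """
<div class="explore-feed feed-item">
  <h2><a class="question_link" href="/question/1">First question</a></h2>
</div>
<div class="explore-feed feed-item">
  <h2><a class="question_link" href="/question/2">Second question</a></h2>
</div>
"""

# re.S lets . match newlines, so one match can span several lines
pattern = re.compile(r'explore-feed.*?question_link.*?>(.*?)</a>', re.S)
titles = pattern.findall(sample_html)
print(titles)
```

On a real page the captured titles may carry surrounding whitespace, in which case calling strip() on each one tidies them up.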

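Downloading an image

import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}
r = requests.get("https://avatars0.githubusercontent.com/u/13572737?s=460&v=4", headers=headers)
# r.content is the raw bytes of the response, so write the file in binary mode
with open('Vae.jpg', 'wb') as f:
    f.write(r.content)

This works not only for images; video and audio are downloaded the same way.
Note: when downloading images, leaving out headers raises an error (source not found).
Below is code that saves images to a specified location. It adds a check that creates the folder if it does not exist. I derive the folder name from a slice of the image URL. Creating folders uses os, so import it.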

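import os

def get_allpath(imgurl):
    # Relative file path: the URL from 'uploads' on, with the 2nd and 3rd segments merged into one folder name
    info = imgurl[imgurl.index('uploads'):]
    infos = info.split('/')
    return infos[0] + "\\" + infos[1] + infos[2] + "\\" + infos[3]

def get_path(imgurl):
    # Same as get_allpath, but without the file name: the folder only
    info = imgurl[imgurl.index('uploads'):]
    infos = info.split('/')
    return infos[0] + "\\" + infos[1] + infos[2]

def create_makedirs(dirpath):
    # Create the folder (and any missing parents) if it does not exist yet
    if not os.path.exists(dirpath):
        os.makedirs(dirpath)

def get_image(child, path):
    # child: a BeautifulSoup tag that contains the img elements
    # requests and headers are the ones defined in the image example above
    for img in child.find_all(name='img'):
        imgurl = img.attrs.get('data-lazy-src')
        # Only URLs that contain 'gif' are saved
        if imgurl and 'gif' in imgurl:
            create_makedirs(path + get_path(imgurl))
            r = requests.get(imgurl, headers=headers)
            with open(path + get_allpath(imgurl), 'wb') as f:
                f.write(r.content)

# When calling it, just pass the folder to save into
path = 'D:\\datasheetcafe\\'
get_image(child, path)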
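
The same folder logic can be written without hard-coded backslashes. This sketch is a variation, not the author's code: it keeps the layout used above (uploads, then the second and third URL segments merged into one folder, then the file name) but builds it with os.path.join and creates the folder with os.makedirs(..., exist_ok=True). local_path_for is a hypothetical name and the URL is made up.

```python
import os
import tempfile
from urllib.parse import urlparse


def local_path_for(imgurl: str, root: str) -> str:
    """Return the local file path for imgurl under root, creating its folder."""
    # Use only the URL path, so query strings never end up in file names
    url_path = urlparse(imgurl).path
    parts = url_path[url_path.index("uploads"):].split("/")
    folder = os.path.join(root, parts[0], parts[1] + parts[2])
    # exist_ok=True makes a separate os.path.exists check unnecessary
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, parts[3])


# A temporary folder stands in for D:\datasheetcafe\
root = tempfile.mkdtemp()
target = local_path_for("https://example.com/wp-content/uploads/2019/07/pinout.gif", root)
print(target)
```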

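Downloading a video

This is really the same as downloading an image, only with a different url. I use one of my own Bilibili videos as the example.

Open one of the videos I posted on Bilibili, press F12, and find the video request in the Network tab; it is usually the one with the largest Size.

Click into it; the long string shown is the video address.

Copy the video address into the Python crawler and save the result as .mp4.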

import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}
# The signed address below was copied from the Network tab; it carries a deadline and expires
r =requests.get("https://113-219-141-2.ksyungslb/upos-sz-mirrorks32u.acgvideo/upgcxcode/63/76/19157663/19157663-1-48.mp4?e=ig8euxZM2rNcNbKVhwdVhoMMhwdVhwdEto8g5X10ugNcXBlqNxHxNEVE5XREto8KqJZHUa6m5J0SqE85tZvEuENvNC8xNEVE9EKE9IMvXBvE2ENvNCImNEVEK9GVqJIwqa80WXIekXRE9IMvXBvEuENvNCImNEVEua6m2jIxux0CkF6s2JZv5x0DQJZY2F8SkXKE9IB5QK==&deadline=1562826409&gen=playurl&nbs=1&oi=1947754487&os=ks3u&platform=pc&trid=28dcba5166d84f6f8b078fccbdd41f2e&uipk=5&upsig=cae257a7f3504f28137a3036304288df&uparams=e,deadline,gen,nbs,oi,os,platform,trid,uipk&mid=32059965&ksy_gslb_referer=https%3A%2F%2Fwww.bilibili%2Fvideo%2Fav11592898", headers=headers) 
with open('Vae.mp4','wb') as f:
    f.write(r.content)
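
Videos can be large, and r.content holds the whole file in memory. A sketch of a streamed variant, assuming the server is fine with requests' stream=True and iter_content; save_chunks and download_video are hypothetical names, and the 1 MiB chunk size is an arbitrary choice.

```python
def save_chunks(chunks, dest: str) -> int:
    """Write an iterable of byte chunks to dest and return the bytes written."""
    written = 0
    with open(dest, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


def download_video(url: str, dest: str, headers: dict) -> int:
    """Stream url to dest in 1 MiB pieces instead of loading it all at once."""
    import requests  # imported here so save_chunks can be used on its own

    with requests.get(url, headers=headers, stream=True, timeout=30) as r:
        r.raise_for_status()
        return save_chunks(r.iter_content(chunk_size=1024 * 1024), dest)
```

It would be called as download_video(video_url, 'Vae.mp4', headers), with video_url being the address copied from the Network tab.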

Reposted from: https://www.cnblogs.com/yunquan/p/11169484.html
