Scraping the Douban Books Top List with Requests + XPath and Exporting to txt, csv, json, and Excel Files

Updated: 2023-12-03

Notes:
## Source: .html

1 Scraping the Douban Books Top list with Requests + XPath

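'''
Scraping the Douban Books Top list with Requests + XPath

Install the Python packages:
pip install requests
pip install lxml

Getting an element's XPath and extracting its text:

Manually: locate the target element on the page, then right-click > Inspect.
file = s.xpath('<XPath of the element>/text()')

Alternatively, press Ctrl+Shift+C and hover over an element to see the corresponding page markup.

Right-click the markup for the book title and choose Copy > Copy XPath to get the XPath of the book title.

'''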
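
As a minimal sketch of the `s.xpath('.../text()')` pattern described above, the snippet below parses a small inline HTML string instead of a downloaded page. The markup, the `example.com` link, and the variable names are made up for illustration; only `etree.HTML` and `xpath` come from lxml.

```python
from lxml import etree

# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <div id="content">
    <h1>Sample page</h1>
    <a href="https://example.com/book/1" title="Book One">Book One</a>
  </div>
</body></html>
"""

s = etree.HTML(SAMPLE_HTML)

# xpath() always returns a list, even when only one node matches.
heading = s.xpath('//*[@id="content"]/h1/text()')   # text inside the element
link_title = s.xpath('//*[@id="content"]/a/@title')  # value of an attribute
link_href = s.xpath('//*[@id="content"]/a/@href')

print(heading, link_title, link_href)
```

Because the result is a list, take `[0]` only after checking that the list is not empty.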

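# Note: an XPath copied from the browser sometimes contains extra tags, such as tbody, which must be removed. The browser adds them when it normalizes the page markup.
'''
Copy the XPaths of Dream of the Red Chamber (红楼梦), To Live (活着), One Hundred Years of Solitude (百年孤独), and 1984, and compare them:

//*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/div[1]/a
//*[@id="content"]/div/div[1]/div/table[2]/tbody/tr/td[2]/div[1]/a
//*[@id="content"]/div/div[1]/div/table[3]/tbody/tr/td[2]/div[1]/a
//*[@id="content"]/div/div[1]/div/table[4]/tbody/tr/td[2]/div[1]/a

The comparison shows that the XPaths differ only in the index after table, and that this index matches the book's position on the page. Dropping the index (and tbody) gives a general XPath that matches every book title:

//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div[1]/a
'''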
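
The effect of dropping the index and `tbody` can be checked on a small stand-in document. This is a sketch under assumptions: the three-table markup and the "Book A/B/C" titles are invented, and the nesting only imitates the path shown above.

```python
from lxml import etree

# Hypothetical markup that mimics the nesting of the XPath above.
SAMPLE_HTML = """
<div id="content"><div><div><div>
  <table><tr><td>cover</td><td><div><a title="Book A">Book A</a></div></td></tr></table>
  <table><tr><td>cover</td><td><div><a title="Book B">Book B</a></div></td></tr></table>
  <table><tr><td>cover</td><td><div><a title="Book C">Book C</a></div></td></tr></table>
</div></div></div></div>
"""

s = etree.HTML(SAMPLE_HTML)
prefix = '//*[@id="content"]/div/div[1]/div'

# With an index, only one book is matched.
first_only = s.xpath(prefix + '/table[1]/tr/td[2]/div[1]/a/@title')

# Without the index, every table (every book) is matched.
all_titles = s.xpath(prefix + '/table/tr/td[2]/div[1]/a/@title')

# The browser-style path with tbody matches nothing here, because the
# markup contains no tbody and the lxml parser does not add one.
with_tbody = s.xpath(prefix + '/table/tbody/tr/td[2]/div[1]/a/@title')

print(first_only, all_titles, with_tbody)
```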

3 Matching fields to the right book when scraping several fields per page

strip("(") removes parentheses from the ends of the string; strip() with no argument removes leading and trailing whitespace.
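
The chained `strip` calls can be traced step by step. The raw string below is a made-up example of a rating-count text wrapped in parentheses and whitespace; the real page text may differ.

```python
import re

# Hypothetical raw text, with line breaks and padding inside the parentheses.
raw = "(\n    12345人评价\n  )"

step1 = raw.strip("(")    # removes "(" from the ends only
step2 = step1.strip()     # removes whitespace from both ends
step3 = step2.strip(")")  # removes the trailing ")"
num = step3.strip()       # removes the whitespace that was in front of ")"

# Optional extra step: keep only the digits as an integer.
count = int(re.search(r"\d+", num).group())

print(repr(num), count)
```

`strip` only touches the two ends of the string, which is why whitespace has to be stripped again after each parenthesis is removed.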

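'''

Problem: if titles and ratings are scraped as separate page-wide lists, we are assuming both lists are complete. If one field picks up too few or too many items, the lists fall out of step and values are paired with the wrong book.

Approach: a book's title element always sits inside that book's own container. Treat each book as a unit and extract its fields relative to that container, so the fields always belong together.

//*[@id="content"]/div/div[1]/div/table[1] # the whole book
//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div[1]/a # title
//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div[2]/span[2] # rating

The title and rating XPaths both start with the XPath of the whole book, so we can select the book containers first and then locate each field with a relative XPath:

file = s.xpath('//*[@id="content"]/div/div[1]/div/table')
for div in file:
    title = div.xpath("./tr/td[2]/div[1]/a/@title")
    score = div.xpath("./tr/td[2]/div[2]/span[2]/text()")

'''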
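
The mismatch, and how per-book extraction avoids it, can be reproduced on a small example. This is a sketch with invented markup: the second book deliberately has no rating, and the structure is simpler than the real page.

```python
from lxml import etree

# Hypothetical markup: "Book B" has no rating element.
SAMPLE_HTML = """
<div id="content">
  <table><tr><td><a title="Book A">Book A</a><span>9.6</span></td></tr></table>
  <table><tr><td><a title="Book B">Book B</a></td></tr></table>
  <table><tr><td><a title="Book C">Book C</a><span>9.1</span></td></tr></table>
</div>
"""

s = etree.HTML(SAMPLE_HTML)

# Page-wide lists: three titles but only two ratings, so zip() pairs
# "Book B" with the rating that belongs to "Book C".
titles = s.xpath('//*[@id="content"]/table/tr/td/a/@title')
scores = s.xpath('//*[@id="content"]/table/tr/td/span/text()')
misaligned = list(zip(titles, scores))

# Per-book extraction: each field is looked up inside its own table.
books = []
for table in s.xpath('//*[@id="content"]/table'):
    title = table.xpath('./tr/td/a/@title')
    score = table.xpath('./tr/td/span/text()')
    books.append({
        "title": title[0] if title else None,
        "score": score[0] if score else None,
    })

print(misaligned)
print(books)
```

Checking for an empty list before indexing keeps a missing field from raising an `IndexError` and records it as `None` instead.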

4 Pagination

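'''
=0 # page 1
=25 # page 2
=50 # page 3

Each page holds 25 books, so the pages differ only in the start= value, which increases by 25 per page.

Write a loop:

for a in range(3):
    url = '={}'.format(a * 25)
    # 3 pages; a * 25 makes the offset increase in steps of 25

'''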
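
The offset arithmetic can be wrapped in a small helper. The base URL below is a placeholder, since the source leaves the real address out; `page_urls`, `BASE_URL`, and `PAGE_SIZE` are names introduced only for this sketch.

```python
# Placeholder address; substitute the real list URL ending in "start={}".
BASE_URL = "https://example.com/top?start={}"
PAGE_SIZE = 25  # books per page, as described above


def page_urls(pages, base_url=BASE_URL, page_size=PAGE_SIZE):
    """Return one URL per page, with start = 0, 25, 50, ..."""
    return [base_url.format(page * page_size) for page in range(pages)]


urls = page_urls(3)
for url in urls:
    print(url)
```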

import requests
from requests.exceptions import RequestException
from lxml import etree
import time, json, csv, xlwt, xlrd
import pandas as pd
from xlutils.copy import copy


# Step 1: fetch the HTML of one page
def get_one_page(url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/91.0.4472.114 Mobile Safari/537.36',
            # 'Cookie': '...',  # optional; the original session cookie value is omitted here
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            # print(response.text)
            '''
            with open('douban_dushu_top.html', 'a', encoding='utf-8') as f:
                f.write(response.text)
            '''
            return response.text
        else:
            return None
    except RequestException:
        return None


# Step 2: parse the page with XPath
def parse_one_page(html):
    if html is None:  # the request failed, so there is nothing to parse
        return []
    s = etree.HTML(html)
    file = s.xpath('//*[@id="content"]/div/div[1]/div/table')  # one table per book
    time.sleep(3)
    result_lists = []
    for div in file:
        result_list = []
        title = div.xpath("./tr/td[2]/div[1]/a/@title")[0]
        href = div.xpath("./tr/td[2]/div[1]/a/@href")[0]
        score = div.xpath("./tr/td[2]/div[2]/span[2]/text()")[0]
        num = div.xpath("./tr/td[2]/div[2]/span[3]/text()")[0].strip("(").strip().strip(")").strip()
        scrible = div.xpath("./tr/td[2]/p[2]/span/text()")  # may be empty for some books
        # Store each book as a list (used for the txt and Excel output)
        if len(scrible) > 0:
            result_list = [title, href, score, num, scrible[0]]
            # print(result_list)
        else:
            result_list = [title, href, score, num]
        result_lists.append(result_list)
    # print(result_lists)
    return result_lists
    '''
    ## Alternative: yield each book as a dict
    if len(scrible) > 0:
        yield {  # generator
            'title': title,
            'href': href,
            'score': score,
            'num': num,
            'scrible': scrible[0]
        }
    else:
        yield {  # generator
            'title': title,
            'href': href,
            'score': score,
            'num': num,
        }
    '''


# Step 3: write to a txt file
def write_to_file_txt(conent):
    with open('douban_dushu_top1.txt', 'a', encoding='utf-8') as f:
        # List form
        # f.write(','.join(conent))
        # f.write('\n' + '=' * 50 + '\n')
        # Dict form
        f.write(json.dumps(conent, indent=2, ensure_ascii=False))


# Step 3: write to a json file
def write_to_file_json(conent):
    with open('douban_dushu_top1.json', 'a', encoding='utf-8') as f:
        # print(type(json.dumps(conent)))
        # ensure_ascii=False keeps Chinese characters readable in the output
        f.write(json.dumps(conent, indent=2, ensure_ascii=False))


# Step 3: write to a csv file
def write_to_file_csv(conent):
    '''
    :param conent:
    :return:
    '''
    # Writing the csv with pandas adds an index column; it also takes dicts
    '''
    df = pd.DataFrame(conent)
    df.to_csv('douban_dushu_top1.csv', encoding='utf-8')
    '''
    # Write dicts to csv
    header = ['title', 'href', 'score', 'num', 'scrible']
    with open('douban_dushu_top1.csv', 'a', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, header)
        writer.writeheader()
        writer.writerows(conent)
    # Write lists to csv
    '''
    header = ['title', 'href', 'score', 'num', 'scrible']
    with open('douban_dushu_top.csv', 'a', encoding='utf-8') as f:
        writer = csv.writer(f, dialect='excel')
        writer.writerow(header)
        for item in conent:
            writer.writerow(item)
    '''


# Step 3: write to an Excel file
def write_excel_xls_hotal(path, sheet_name, content):
    index = len(content)  # number of rows to write
    workbook = xlwt.Workbook()  # create a new workbook
    sheet = workbook.add_sheet(sheet_name)  # add a sheet to the workbook
    # Write each value to its row and column
    for i in range(0, index):
        for j in range(0, len(content[i])):
            sheet.write(i, j, content[i][j])
    workbook.save(path)  # save the workbook
    print("Data written to the xls workbook.")


# Append rows to an existing workbook
def write_excel_xls_append(path, content):
    index = len(content)  # number of rows to write
    workbook = xlrd.open_workbook(path)  # open the workbook
    sheets = workbook.sheet_names()  # names of all sheets in the workbook
    worksheet = workbook.sheet_by_name(sheets[0])  # first sheet
    rows_old = worksheet.nrows  # number of rows already in the sheet
    new_workbook = copy(workbook)  # convert the xlrd object into an xlwt object
    new_worksheet = new_workbook.get_sheet(0)  # first sheet of the converted workbook
    for i in range(0, index):
        for j in range(0, len(content[i])):
            new_worksheet.write(i + rows_old, j, content[i][j])  # append, starting at row i + rows_old
    new_workbook.save(path)  # save the workbook
    print("Data appended to the xls workbook.")


# Step 4: pagination
def main(offset):
    # The pages follow a regular pattern; only the offset value changes
    url = '={}'.format(offset * 25)
    print(url)
    html = get_one_page(url)
    items = parse_one_page(html)
    # Save as csv (dict form)
    # write_to_file_csv(items)
    # save_excel(items)
    return items


if __name__ == '__main__':
    # Write lists to Excel
    book_name_xls = 'douban_dushu_top1.xls'  # xlwt writes the legacy .xls format, not .xlsx
    sheet_name_xls = 'douban_dushu_top'
    value_title = [['title', 'href', 'score', 'num', 'scrible'], ]
    write_excel_xls_hotal(book_name_xls, sheet_name_xls, value_title)
    for i in range(6):
        item = main(i)
        write_excel_xls_append(book_name_xls, item)
        time.sleep(2)

    # Save as json
    '''
    for i in range(2):
        items = main(i)
        # write_to_file_json(items)  # list form
        for item in items:  # dict form
            write_to_file_json(item)
        time.sleep(2)
    '''

    # Save as csv
    '''
    for i in range(2):
        items = main(i)
        write_to_file_csv(items)  # list or dict form
        time.sleep(2)
    '''

    # Save as txt
    '''
    for i in range(2):
        items = main(i)
        for item in items:
            write_to_file_txt(item)  # list or dict form
        time.sleep(2)
    '''
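
The script above opens its csv and json files in append mode on every page, so the csv header is written once per page and the json file ends up as several objects written back to back rather than one valid JSON document. One possible way around this, shown as a sketch rather than as the author's method, is to write the csv header only when the file is new and to write the json once at the end. The helper names, the sample rows, and the output file names below are all hypothetical.

```python
import csv
import json
import os
import tempfile

FIELDS = ["title", "href", "score", "num", "scrible"]


def rows_to_dicts(rows):
    """Turn list rows into dicts, padding rows that lack the last field."""
    return [
        dict(zip(FIELDS, list(row) + [""] * (len(FIELDS) - len(row))))
        for row in rows
    ]


def append_csv(records, path):
    """Append records, writing the header only if the file is new or empty."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerows(records)


def save_json(records, path):
    """Write all records as a single JSON array."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)


# Hypothetical rows standing in for two scraped pages.
page1 = [["Book A", "https://example.com/a", "9.6", "100人评价", "A quote"]]
page2 = [["Book B", "https://example.com/b", "9.1", "50人评价"]]

out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "books.csv")
json_path = os.path.join(out_dir, "books.json")

all_records = []
for rows in (page1, page2):
    records = rows_to_dicts(rows)
    append_csv(records, csv_path)
    all_records.extend(records)
save_json(all_records, json_path)
```

Padding short rows keeps every record at five fields, so books without a quote still line up with the csv header.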

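Copyright notice: the content of "Scraping the Douban Books Top List with Requests + XPath and Exporting to txt, csv, json, and Excel Files" was contributed by a user and reflects only the author's views. To republish it, contact the author and cite the source: http://www.betaflare.com/biancheng/1701567582a438261.html. This site only provides information storage, does not own the content, and bears no related legal liability. Content found to involve plagiarism, infringement, or violations of law will be removed once verified.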
