I am writing this code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.imdb.com/title/tt5189554/episodes/'
headers = {
"Connection": "keep-alive",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
I want to get data such as the episode ID and name for all 80 episodes, but when I run this code it only gives me 50 episodes; the other 30 are hidden behind the '30 more' pagination button.
I tried several things, such as inspecting the HTML of the page and finding the class
<div class="sc-f09bd1f5-1 hoKmdt pagination-container">
<span class="ipc-see-more sc-33e570c-0 cMGrFN single-page-see-more-button">
<button class="ipc-btn ipc-btn--single-padding ipc-btn--center-align-content ipc-btn--default-height ipc-btn--core-base ipc-btn--theme-base ipc-btn--button-radius ipc-btn--on-accent2 ipc-text-button ipc-see-more__button" tabindex="151" aria-disabled="false">
<span class="ipc-btn__text">
<span class="ipc-see-more__text">
30 more
</span>
</span>
<svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" class="ipc-icon ipc-icon--expand-more ipc-btn__icon ipc-btn__icon--post" viewBox="0 0 24 24" fill="currentColor" role="presentation">
<path opacity=".87" fill="none" d="M24 24H0V0h24v24z"></path>
<path d="M15.88 9.29L12 13.17 8.12 9.29a.996.996 0 1 0-1.41 1.41l4.59 4.59c.39.39 1.02.39 1.41 0l4.59-4.59a.996.996 0 0 0 0-1.41c-.39-.38-1.03-.39-1.42 0z"></path>
</svg>
</button>
</span>
</div>
but I couldn't find a way to get all the data
asked Jan 5 at 1:01 by Kasper Jcob, edited Jan 8 at 9:09 by HedgeHog. 2 Answers
In this specific scenario, all of the URLs you need follow the same pattern:
https://www.imdb.com/title/tt5189554/episodes/?season=<season-number>
Therefore you can just perform the requests in a loop that iterates over the seasons. Example:
import requests
from bs4 import BeautifulSoup

headers = {
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
}

for season in range(1, 5):
    url = f"https://www.imdb.com/title/tt5189554/episodes/?season={season}"
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    # find_all is the current name for the deprecated findAll
    titles = soup.find_all("div", {'class': "ipc-title__text"})
    for title in titles:
        print(title.text)
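The question also asks for the episode ID. Assuming the anchor wrapping each title carries an href of the form /title/ttXXXXXXX/... (as in the output shown further down), the ID can be pulled out with a regex. A minimal sketch — the href values below are hypothetical samples, not live data:

```python
import re

# Hypothetical href values, shaped like the ones on IMDb episode links
hrefs = [
    '/title/tt5378740/?ref_=ttep_ep1',
    '/title/tt5585752/?ref_=ttep_ep2',
]

def episode_id(href):
    """Pull the ttXXXXXXX identifier out of an episode href."""
    m = re.search(r'/title/(tt\d+)', href)
    return m.group(1) if m else None

for href in hrefs:
    print(episode_id(href))  # tt5378740, then tt5585752
```

In the real scraper you would feed this the href of each title's anchor instead of the sample strings.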
Based on the comment that mentioned selenium, here is a basic example that handles the issue by clicking the specific element to load the additional episodes. You could also take a closer look at the IMDb API, or replay the additional requests the button triggers, to load the extra content:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
# open the episodes page of the season
driver.get('https://www.imdb.com/title/tt5189554/episodes/?season=1')
# wait for the "more episodes" button and click it
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.ipc-see-more__button'))
)
driver.execute_script("arguments[0].click();", element)
# give the page some time to load the additional episodes
time.sleep(2)
# convert the driver's page source into a bs4 object
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

data = []
# iterate over the episodes and select the specific content
for e in soup.select('article.episode-item-wrapper'):
    data.append({
        'title': e.h4.get_text(),
        'link': 'https://www.imdb.com' + e.a.get('href')
    })
print(data)
Output
[{'title': 'S1. E1 ∙ Un sueño, sobre ruedas',
'link': 'https://www.imdb.com/title/tt5378740/?ref_=ttep_ep1'},
{'title': 'S1. E2 ∙ Una nueva historia, sobre ruedas',
'link': 'https://www.imdb.com/title/tt5585752/?ref_=ttep_ep2'},
{'title': 'S1. E3 ∙ Nuevas aventuras, sobre ruedas',
'link': 'https://www.imdb.com/title/tt5585754/?ref_=ttep_ep3'},
...
{'title': 'S1. E79 ∙ La final de la InterContinental, sobre ruedas (Parte 1)',
'link': 'https://www.imdb.com/title/tt6139312/?ref_=ttep_ep79'},
{'title': 'S1. E80 ∙ La final de la InterContinental, sobre ruedas (Parte 2)',
'link': 'https://www.imdb.com/title/tt6139318/?ref_=ttep_ep80'}]
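Once collected, the list of dicts can be persisted, for example to CSV with the standard library. A minimal sketch, using a hypothetical two-entry subset of the output above:

```python
import csv
import io

# Hypothetical subset of the scraped results shown above
data = [
    {'title': 'S1. E1 ∙ Un sueño, sobre ruedas',
     'link': 'https://www.imdb.com/title/tt5378740/?ref_=ttep_ep1'},
    {'title': 'S1. E80 ∙ La final de la InterContinental, sobre ruedas (Parte 2)',
     'link': 'https://www.imdb.com/title/tt6139318/?ref_=ttep_ep80'},
]

# write the rows to an in-memory buffer; swap io.StringIO for open('episodes.csv', 'w', newline='')
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['title', 'link'])
writer.writeheader()
writer.writerows(data)
csv_text = buf.getvalue()
print(csv_text)
```

The same `data` structure produced by either answer drops straight into `writer.writerows`.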
Tags: python, How to load additional episodes for a series via IMDB pagination-container, Stack Overflow