I am writing this code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/title/tt5189554/episodes/'
headers = {
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
}
response = requests.get(url,headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

I want to get data such as the episode ID and name for all 80 episodes, but when I run this code it only gives me 50 episodes. The others are hidden behind the pagination button '30 more'.

I tried many things, such as inspecting the HTML of the website and finding the class of the pagination container:

<div class="sc-f09bd1f5-1 hoKmdt pagination-container">
        <span class="ipc-see-more sc-33e570c-0 cMGrFN single-page-see-more-button">
            <button class="ipc-btn ipc-btn--single-padding ipc-btn--center-align-content ipc-btn--default-height ipc-btn--core-base ipc-btn--theme-base ipc-btn--button-radius ipc-btn--on-accent2 ipc-text-button ipc-see-more__button" tabindex="151" aria-disabled="false">
                <span class="ipc-btn__text">
                    <span class="ipc-see-more__text">
                        30 more
                    </span>
                </span>
                <svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" class="ipc-icon ipc-icon--expand-more ipc-btn__icon ipc-btn__icon--post" viewBox="0 0 24 24" fill="currentColor" role="presentation">
                    <path opacity=".87" fill="none" d="M24 24H0V0h24v24z"></path>
                    <path d="M15.88 9.29L12 13.17 8.12 9.29a.996.996 0 1 0-1.41 1.41l4.59 4.59c.39.39 1.02.39 1.41 0l4.59-4.59a.996.996 0 0 0 0-1.41c-.39-.38-1.03-.39-1.42 0z"></path>
                </svg>
            </button>
        </span>
    </div>

But I couldn't find a way to get all of the data.

Asked Jan 5 at 1:01 by Kasper Jcob; edited Jan 8 at 9:09 by HedgeHog.

2 Answers

In this specific scenario, all of the URLs you need follow the same pattern:

https://www.imdb.com/title/tt5189554/episodes/?season=<season-number>

Therefore, you can perform the requests in a loop that iterates over the seasons. Example:

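import requests
from bs4 import BeautifulSoup

# the headers do not change between requests, so define them once
headers = {
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
}

# seasons 1 to 4; adjust the range to the number of seasons of the series
for season in range(1, 5):
    url = f"https://www.imdb.com/title/tt5189554/episodes/?season={season}"
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    # episode titles are selected via the "ipc-title__text" class
    titles = soup.find_all("div", class_="ipc-title__text")
    for title in titles:
        print(title.text)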
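
If the same pattern is needed for another series, the URL building can be separated from the requests. The following is only a sketch: `season_urls` and `BASE` are hypothetical names introduced here, and it assumes the `?season=<season-number>` pattern shown above also holds for other title IDs.

```python
from urllib.parse import urlencode

# URL pattern taken from the answer above; title_id is the "tt..." part
BASE = "https://www.imdb.com/title/{title_id}/episodes/"


def season_urls(title_id, seasons):
    """Return one episode-list URL per season number (hypothetical helper)."""
    return [
        BASE.format(title_id=title_id) + "?" + urlencode({"season": season})
        for season in seasons
    ]


# print the URLs for seasons 1 to 3 of the series from the question
for url in season_urls("tt5189554", range(1, 4)):
    print(url)
```

Each returned URL can then be passed to `requests.get` exactly as in the loop above.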

Based on the comment that mentioned selenium, here is a basic example that handles the issue by clicking the specific element to load the additional episodes. You may also want to take a closer look at the IMDb API, or perform additional requests to load the additional content:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
# open the episode list of season 1
driver.get('https://www.imdb.com/title/tt5189554/episodes/?season=1')

# wait until the "30 more" button is present, then click it via JavaScript
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.ipc-see-more__button'))
)
driver.execute_script("arguments[0].click();", element)

# give the additional episodes some time to load
time.sleep(2)

# convert the driver's page source into a bs4 object, then close the browser
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

data = []

# iterate the episodes and select specific content
for e in soup.select('article.episode-item-wrapper'):
    data.append({
        'title': e.h4.get_text(),
        'link': 'https://www.imdb.com'+e.a.get('href')
    })

print(data)

Output

[{'title': 'S1. E1 ∙ Un sueño, sobre ruedas',
  'link': 'https://www.imdb.com/title/tt5378740/?ref_=ttep_ep1'},
 {'title': 'S1. E2 ∙ Una nueva historia, sobre ruedas',
  'link': 'https://www.imdb.com/title/tt5585752/?ref_=ttep_ep2'},
 {'title': 'S1. E3 ∙ Nuevas aventuras, sobre ruedas',
  'link': 'https://www.imdb.com/title/tt5585754/?ref_=ttep_ep3'},
...
 {'title': 'S1. E79 ∙ La final de la InterContinental, sobre ruedas (Parte 1)',
  'link': 'https://www.imdb.com/title/tt6139312/?ref_=ttep_ep79'},
 {'title': 'S1. E80 ∙ La final de la InterContinental, sobre ruedas (Parte 2)',
  'link': 'https://www.imdb.com/title/tt6139318/?ref_=ttep_ep80'}]
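
Since the question asks for the episode ID and name, the entries in `data` can be split further. This is a minimal sketch that assumes the titles and links keep the shape shown in the output above; `parse_episode`, `ID_RE` and `TITLE_RE` are hypothetical names, not part of any library.

```python
import re

# matches the "tt..." ID in links such as
# https://www.imdb.com/title/tt5378740/?ref_=ttep_ep1
ID_RE = re.compile(r"/title/(tt\d+)")
# matches headings such as "S1. E1 ∙ Un sueño, sobre ruedas";
# \W stands for the separator character between number and name
TITLE_RE = re.compile(r"^S(\d+)\.\s*E(\d+)\s*\W\s*(.+)$")


def parse_episode(item):
    """Split one scraped entry into id, season, episode and name.

    Returns None if the entry does not have the assumed shape.
    """
    id_match = ID_RE.search(item["link"])
    title_match = TITLE_RE.match(item["title"].strip())
    if not id_match or not title_match:
        return None
    season, episode, name = title_match.groups()
    return {
        "id": id_match.group(1),
        "season": int(season),
        "episode": int(episode),
        "name": name.strip(),
    }


# first entry of the output above
sample = {
    "title": "S1. E1 ∙ Un sueño, sobre ruedas",
    "link": "https://www.imdb.com/title/tt5378740/?ref_=ttep_ep1",
}
print(parse_episode(sample))
```

Applied to the whole list, `[parse_episode(e) for e in data]` would give one dictionary per episode, with `None` for entries that do not match.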
