Hi guys, I am brand new to web scraping.
I am trying to scrape the data within the "Informations détaillées" section of this webpage (https://gallica.bnf.fr/ark:/12148/cb42768809f/date) so that I can fill a SQL database with every field it has.
This is a test URL. I have a list of 500 URLs similar to this one, which I requested from this website's API, and I intend to apply my Python function to every URL in that list once it works.
Any advice to help me extract the information I need from this webpage, please? Thank you so much!
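For context on the SQL part: once the extraction works, I plan to loop over my URL list and write the fields into a SQLite database, roughly like this (just a sketch; the table name and columns are placeholders until I know exactly which fields the section contains):

import sqlite3

def save_records(records, db_path="gallica.db"):
    # records: list of (url, field, value) tuples produced by the scraper.
    # Placeholder schema; to be adjusted once the real field list is known.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS notices (url TEXT, field TEXT, value TEXT)")
    conn.executemany("INSERT INTO notices (url, field, value) VALUES (?, ?, ?)", records)
    conn.commit()
    conn.close()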
First I tried BeautifulSoup, but the problem is that the "Informations détaillées" section only appears when you click the dropdown button.
I tried several BeautifulSoup snippets, like the following one, but it didn't work:
import requests
from bs4 import BeautifulSoup

def get_metadata_bs4(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    try:
        title = soup.find("h1").text.strip() if soup.find("h1") else "Titre inconnu"
        publisher = soup.select_one("dl dd:nth-of-type(1)").text.strip() if soup.select_one("dl dd:nth-of-type(1)") else "Auteur inconnu"
        date_of_publication = soup.select_one("dl dd:nth-of-type(2)").text.strip() if soup.select_one("dl dd:nth-of-type(2)") else "Date inconnue"
        return {"title": title, "publisher": publisher, "date_of_publication": date_of_publication}
    except Exception as e:
        print(f"Error for {url}: {e}")
        return None

# Test on a single URL
url_test = "https://gallica.bnf.fr/ark:/12148/cb42768809f/date"
print(get_metadata_bs4(url_test))
So I tried Selenium, but this is my first time using this Python library… I tried to find the correct XPath in the page source and replace "metadata-class" with that XPath in the following code block:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

# Selenium configuration
chrome_options = Options()
chrome_options.add_argument("--headless")  # Headless mode (no GUI)
driver = webdriver.Chrome(options=chrome_options)

def get_metadata_from_notice(url):
    driver.get(url)
    time.sleep(2)  # Give the page time to load
    try:
        # Click the "Informations détaillées" dropdown
        dropdown = WebDriverWait(driver, 5).until(
            EC.element_to_be_clickable((By.XPATH, "//div[contains(text(), 'Informations détaillées')]"))
        )
        dropdown.click()
        time.sleep(2)  # Wait for the content to load after the click
    except Exception as e:
        print(f"⚠️ Error while clicking on {url}: {e}")
        return None
    try:
        # Extract the metadata once the dropdown is open
        metadata_section = driver.find_element(By.XPATH, "//div[@class='metadata-class']")  # To be replaced with the right class
        metadata_text = metadata_section.text
        return {"url": url, "metadata": metadata_text}
    except Exception as e:
        print(f"⚠️ Could not retrieve the metadata for {url}: {e}")
        return None

# Test on a single URL
test_url = "https://gallica.bnf.fr/ark:/12148/cb42768809f/date"
print(get_metadata_from_notice(test_url))

# Close Selenium
driver.quit()
but it keeps giving me results like this one:
⚠️ Could not retrieve the metadata for https://gallica.bnf.fr/ark:/12148/cb42768809f/date
⚠️ Error on https://gallica.bnf.fr/ark:/12148/cb452698066/date: Message:
Stacktrace:
GetHandleVerifier [0x00007FF7940A02F5+28725]
(No symbol) [0x00007FF794002AE0]
(No symbol) [0x00007FF793E9510A]
(No symbol) [0x00007FF793EE93D2]
(No symbol) [0x00007FF793EE95FC]
(No symbol) [0x00007FF793F33407]
(No symbol) [0x00007FF793F0FFEF]
(No symbol) [0x00007FF793F30181]
(No symbol) [0x00007FF793F0FD53]
(No symbol) [0x00007FF793EDA0E3]
(No symbol) [0x00007FF793EDB471]
GetHandleVerifier [0x00007FF7943CF30D+3366989]
GetHandleVerifier [0x00007FF7943E12F0+3440688]
GetHandleVerifier [0x00007FF7943D78FD+3401277]
GetHandleVerifier [0x00007FF79416AAAB+858091]
(No symbol) [0x00007FF79400E74F]
(No symbol) [0x00007FF79400A304]
(No symbol) [0x00007FF79400A49D]
(No symbol) [0x00007FF793FF8B69]
BaseThreadInitThunk [0x00007FFC0A7D259D+29]
RtlUserThreadStart [0x00007FFC0BA0AF38+40]
2 Answers
No need to use Selenium, just a simple curl request in a shell will yield the result:
curl https://gallica.bnf.fr/services/ajax/notice/ark:/12148/cb42768809f/date
How did I find this? Simply open your browser's devtools, select the Network tab, and click on "Informations détaillées"; a new GET entry will appear.
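If you prefer to stay in Python, the equivalent of that curl call is a plain GET with requests (a minimal sketch; it only prints the start of the raw response so you can see what the endpoint returns before deciding how to parse it):

import requests

ajax_url = "https://gallica.bnf.fr/services/ajax/notice/ark:/12148/cb42768809f/date"
response = requests.get(ajax_url, timeout=30)
response.raise_for_status()
print(response.text[:2000])  # inspect the beginning of the payload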
As @iliak mentioned, you can get the information with a plain GET request. You have to insert services/ajax/notice/
into your URLs.
Then you have to parse the JSON to get the data.
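A small helper (just a sketch) to derive the AJAX URL from each of your catalogue URLs could look like this:

def to_ajax_url(url):
    # Insert "services/ajax/notice/" between the host and the ark path, e.g.
    # https://gallica.bnf.fr/ark:/12148/cb42768809f/date
    # -> https://gallica.bnf.fr/services/ajax/notice/ark:/12148/cb42768809f/date
    return url.replace(
        "https://gallica.bnf.fr/ark:",
        "https://gallica.bnf.fr/services/ajax/notice/ark:",
    )

print(to_ajax_url("https://gallica.bnf.fr/ark:/12148/cb42768809f/date"))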
For Selenium, try the following code. It extracts the information and formats the data using pandas.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

# Selenium configuration
chrome_options = Options()
chrome_options.add_argument("--headless")  # Headless mode (no GUI)
driver = webdriver.Chrome(options=chrome_options)
wait = WebDriverWait(driver, 10)

def get_metadata_from_notice(url):
    driver.get(url)
    # Open the "Informations détaillées" dropdown
    details = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "div#moreInfosRegion")))
    details.click()
    # Wait for the details list to become visible
    metadata_section = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "dl.noticeDetailsArea")))
    # metadata_text = metadata_section.text
    # return {"url": url, "metadata": metadata_text}
    titles = metadata_section.find_elements(By.XPATH, "./dt")
    data = []
    for title in titles:
        # Each <dt> label is followed by its <dd> value
        content = title.find_element(By.XPATH, "./following-sibling::dd[1]").text
        data.append({"Title": title.text, "Content": content})
    return data

# Test on a single URL
test_url = "https://gallica.bnf.fr/ark:/12148/cb42768809f/date"
df = pd.DataFrame(get_metadata_from_notice(test_url))
print(df)

# Close Selenium
driver.quit()
OUTPUT:
     Title                                            Content
0 Title : Bulletin paroissial (Valence (Drôme), Paroiss...
1 Title : Bulletin paroissial mensuel de la cathédrale ...
2 Publisher : F. Rouet ()
3 Publication date : 1907
4 Subject : Guerre mondiale (1914-1918) -- Aspect religie...
5 Relationship : http://catalogue.bnf.fr/ark:/12148/cb42768809f
6 Language : french
7 Language : French
8 Identifier : ark:/12148/cb42768809f/date
9 Source : Bibliothèque nationale de France, département...
10
11
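Since the question mentions a list of about 500 URLs, the same function can be reused in a loop and the per-notice rows concatenated into one DataFrame before writing them to the database (a sketch; `urls` stands for your list, and notices that fail are simply skipped):

# Sketch: `urls` is assumed to be your list of ~500 notice URLs from the API.
all_rows = []
for url in urls:
    try:
        rows = get_metadata_from_notice(url)
        for row in rows:
            row["URL"] = url  # keep track of which notice each field came from
        all_rows.extend(rows)
    except Exception as e:
        print(f"Skipping {url}: {e}")

df_all = pd.DataFrame(all_rows)
print(df_all.head())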