I am trying to scrape https://www.rotowire.com/betting/nba/player-props.php and have the results returned in a dataframe. There are about a dozen tables on the page, and ideally I could loop through all of them while appending a new column denoting which market they correspond to. I have been unable to find the direct API link in the developer tools and then tried using requests/BeautifulSoup like this unsuccessfully, but was able to get the visible part of the first table using Selenium with the code below (it returns one column but I figured I could find a way to reshape it later).

import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import os
import pandas as pd  # needed to build the DataFrame below
from dotenv import load_dotenv
load_dotenv()

## Some code to log in

def get_nba_props():
    driver = webdriver.Chrome()
    driver.get('https://www.rotowire.com/betting/nba/player-props.php')
    data = []
    markets = driver.find_elements(By.CLASS_NAME, "prop-table")
    for market in markets:
        bodies = market.find_elements(By.CLASS_NAME, "webix_ss_body")
        for body in bodies:
            cells = body.find_elements(By.CLASS_NAME, "webix_cell")
            for cell in cells:
                data.append(cell.text)

    # build the DataFrame once, after every cell has been collected
    df = pd.DataFrame(data, columns=['Value'])
    driver.quit()
    return df

Have tried a few variations but can't seem to get anything to work (both in terms of the other tables and in returning the non-visible rows). New to web scraping/Selenium so any help is greatly appreciated. I am a site subscriber so using the 'export CSV' button is also a possibility if that's an easier route.

asked Nov 22, 2024 at 23:16 by AMJ
  • The CSV option is much better than parsing the HTML. – LMC, Nov 22, 2024 at 23:25
  • Thanks - any guidance on how I could implement that to create a dataframe of all the tables without having to manually open a bunch of locally downloaded CSV files? – AMJ, Nov 29, 2024 at 19:47
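For the follow-up question: rather than opening each file by hand, you can glob the download folder and stack everything with pandas. This is a sketch that assumes each exported CSV is named after its market (e.g. PTS.csv) — adjust the naming logic to however the site's exports are actually named:

```python
from pathlib import Path
import pandas as pd

def combine_csvs(folder: str) -> pd.DataFrame:
    """Stack every CSV in `folder` into one DataFrame, tagging each row
    with a 'market' column taken from the file name (PTS.csv -> PTS)."""
    frames = [
        pd.read_csv(path).assign(market=path.stem)
        for path in sorted(Path(folder).glob("*.csv"))
    ]
    return pd.concat(frames, ignore_index=True)
```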

1 Answer

For the requests method: the data is there, but you have to parse some raw JavaScript. As LMC said in the comments, downloading the CSV is the simpler route — but the JavaScript here is easy to parse, so why not? Each table's data sits mostly alone on one big line and only needs some trimming.
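As a self-contained illustration of the trick — with made-up HTML and field names standing in for the real page, whose script contents will differ — the whole pipeline looks like this:

```python
import json
from bs4 import BeautifulSoup

# Stand-in for the real page: one .prop-table section whose <script>
# holds the rows as a raw JavaScript object literal on a single line.
html = '''
<div class="prop-table" data-prop="PTS">
  <script>
    webix.ui({
      data: [{"name": "Player A", "line": 25.5}],
    });
  </script>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

for prop_table in soup.select(".prop-table"):
    # grab the one line of the script that starts with "data"
    raw = next(
        line.strip()
        for line in prop_table.script.text.splitlines()
        if line.strip().startswith("data")
    )
    rows = json.loads(raw[6:-1])  # drop the "data: " prefix and trailing comma
    print(prop_table["data-prop"], rows)
```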

Assuming you already have the soup, this should do:

import json

data = {}

for prop_table in soup.select('.prop-table'):
    
    # each prop-table has a script tag in it containing the data
    # the data is alone on its line, isolate it
    raw_javascript = [
        line.strip()
        for line in prop_table.script.text.splitlines()
        if line.strip().startswith('data')
    ]

    # [0]: there's only one line starting with "data" per script
    # [6:-1]: remove the "data: " part and the trailing comma
    json_string = raw_javascript[0][6:-1]

    # prop_table['data-prop']: PTS, REB, AST...
    data[prop_table['data-prop']] = json.loads(json_string)

print(data)
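From there, getting the single DataFrame the question asks for (with a column denoting the market) is just a matter of stacking the per-market lists with pandas. The field names in this sample dict are invented for illustration — use whatever keys the real JSON contains:

```python
import pandas as pd

# `data` as produced above: market name -> list of row dicts
data = {
    "PTS": [{"name": "Player A", "line": 25.5}, {"name": "Player B", "line": 18.5}],
    "REB": [{"name": "Player A", "line": 7.5}],
}

# one frame per market, each tagged with its market name, then stacked
df = pd.concat(
    [pd.DataFrame(rows).assign(market=market) for market, rows in data.items()],
    ignore_index=True,
)
print(df)
```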

Tags: python, Scraping RotoWire Player Props and Returning Them as a DataFrame, Stack Overflow