I'm trying to write a Selenium script that watches a specific user's Twitter profile and picks up only the tweets posted after the script starts running. Each such tweet should be printed and saved to a CSV file.

Here’s the code I’m working with:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time
from datetime import datetime, timezone

class TwitterScraper:
    def __init__(self, username, max_scrolls=10):
        self.username = username
        self.max_scrolls = max_scrolls
        # Set start time to timezone-aware datetime
        self.start_time = datetime.now(timezone.utc)
        options = webdriver.ChromeOptions()
        options.add_argument("--disable-gpu")
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")
        self.driver = webdriver.Chrome(options=options)
        self.url = f'/{username}'
        self.last_tweet_time = None

    def start(self):
        self.driver.get(self.url)
        scroll_attempts = 0

        while scroll_attempts < self.max_scrolls:
            try:
                WebDriverWait(self.driver, 10).until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, "article"))
                )
                
                tweets = self.driver.find_elements(By.CSS_SELECTOR, "article")
                for tweet in tweets:
                    timestamp_element = tweet.find_element(By.TAG_NAME, "time")
                    tweet_time = datetime.fromisoformat(timestamp_element.get_attribute("datetime"))

                    # Ensure tweet_time is timezone-aware
                    if tweet_time.tzinfo is None:
                        tweet_time = tweet_time.replace(tzinfo=timezone.utc)
                    else:
                        tweet_time = tweet_time.astimezone(timezone.utc)

                    if tweet_time > self.start_time:
                        tweet_text = tweet.text
                        print("New Tweet Detected:")
                        print(tweet_text)
                        
                        # Save to CSV
                        df = pd.DataFrame([{ "text": tweet_text, "time": tweet_time }])
                        df.to_csv('latest_tweet.csv', mode='a', header=False, index=False)
                        print("Latest tweet saved to latest_tweet.csv")
                
                # Scroll down
                self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                time.sleep(5)
                scroll_attempts += 1
            
            except Exception as e:
                print(f"Error: {e}")
                scroll_attempts += 1

        self.driver.quit()

if __name__ == "__main__":
    scraper = TwitterScraper("FabrizioRomano", max_scrolls=20)
    scraper.start()

Issue: When I run the script, the Chrome browser opens but closes after a few seconds, and nothing is printed: no tweets are detected and the output is empty. I plan to switch to headless mode later, but for now I just need the core functionality to work.
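Because the output is empty, it is hard to tell whether no article elements were found at all or whether they were found and then filtered out. A minimal sketch of a per-pass summary that would make the difference visible is below. `summarize_pass` is a hypothetical helper, not part of the script above, and it assumes the timestamps have already been parsed into timezone-aware datetimes, with `None` standing in for an article that has no time element.

```python
from datetime import datetime, timezone
from typing import Optional, Sequence


def summarize_pass(
    tweet_times: Sequence[Optional[datetime]], start_time: datetime
) -> str:
    """Describe one scrape pass so that an empty result is still visible."""
    # Total number of article elements seen in this pass.
    found = len(tweet_times)
    # Articles where no <time> element could be read (assumed to be None).
    missing = sum(1 for t in tweet_times if t is None)
    # Articles whose timestamp is strictly later than the script start.
    newer = sum(1 for t in tweet_times if t is not None and t > start_time)
    return f"articles={found} without_time={missing} newer_than_start={newer}"
```

Printing this string once per loop iteration would show whether the selectors match anything (`articles=0`) or whether everything matched is simply older than `start_time` (`newer_than_start=0`).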

I suspect the issue is related to either:

1. How tweets are detected (perhaps the article or time selectors are incorrect or not being found).
2. The scrolling logic or the start_time comparison.

I would appreciate any guidance on why the browser closes so quickly without scraping tweets, and on whether the approach for detecting tweets posted after the script starts is correct.
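To make the start_time comparison easier to check on its own, here is a minimal sketch of the timestamp handling without Selenium. It assumes the `datetime` attribute holds an ISO 8601 string with a trailing `Z` (for example `2024-05-01T12:00:00.000Z`); `datetime.fromisoformat` only accepts that suffix from Python 3.11 onward, so the sketch rewrites it first. `parse_tweet_time` and `is_new` are hypothetical helper names, not part of the script above.

```python
from datetime import datetime, timezone


def parse_tweet_time(raw: str) -> datetime:
    """Parse an ISO 8601 timestamp into a timezone-aware UTC datetime."""
    # Before Python 3.11, fromisoformat() rejects a trailing "Z",
    # so rewrite it as an explicit UTC offset.
    if raw.endswith("Z"):
        raw = raw[:-1] + "+00:00"
    parsed = datetime.fromisoformat(raw)
    # Treat naive timestamps as UTC (an assumption), then normalize to UTC.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def is_new(raw: str, start_time: datetime) -> bool:
    """Return True if the timestamp is strictly later than start_time."""
    return parse_tweet_time(raw) > start_time
```

With `start_time = datetime.now(timezone.utc)` taken at launch, `is_new` is false for every tweet that was already on the page when it loaded, so a pass that only sees existing tweets would print nothing even if the selectors are correct.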

Tags: python · Selenium Twitter Scraper Closes Immediately – Not Detecting New Tweets · Stack Overflow