I am trying to scrape this website here:

However, it requires scrolling down in order to load additional data, and I have no idea how to scroll down using Beautiful Soup or Python. Does anybody here know how?

The code is a bit of a mess but here it is.

import scrapy
from scrapy.selector import Selector
from testtest.items import TesttestItem
import datetime
from selenium import webdriver
from bs4 import BeautifulSoup
from HTMLParser import HTMLParser  # Python 2
import re
import time

class MLStripper(HTMLParser):
    # The posted class body was lost; this is the standard tag-stripping
    # recipe that strip_tags() below expects.
    def __init__(self):
        self.reset()
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

class MySpider(scrapy.Spider):
    name = "A1Locker"
    allowed_domains = ['a1lockerrental.com']
    start_urls = ['http://www.a1lockerrental.com/self-storage/mo/st-louis/'
                  '4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?'
                  'category=all']

    def strip_tags(self, html):
        s = MLStripper()
        s.feed(html)
        return s.get_data()

    def parse(self, response):
        url = ('http://www.a1lockerrental.com/self-storage/mo/st-louis/'
               '4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?'
               'category=Small')
        driver = webdriver.Firefox()
        driver.get(url)
        html = driver.page_source
        soup = BeautifulSoup(html, 'html.parser')

        url2 = ('http://www.a1lockerrental.com/self-storage/mo/st-louis/'
                '4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?'
                'category=Medium')
        driver2 = webdriver.Firefox()
        driver2.get(url2)
        html2 = driver2.page_source  # was driver.page_source, which re-read the first page
        soup2 = BeautifulSoup(html2, 'html.parser')

        inside = "Indoor"
        outside = "Outdoor"
        inside_units = ["5 x 5", "5 x 10"]
        outside_units = ["10 x 15", "5 x 15", "8 x 10", "10 x 10",
                         "10 x 20", "10 x 25", "10 x 30"]

        sizeTagz = soup.findAll('span', {"class": "sss-unit-size"})
        sizeTagz2 = soup2.findAll('span', {"class": "sss-unit-size"})
        rateTagz = soup.findAll('p', {"class": "unit-special-offer"})
        specialTagz = soup.findAll('span', {"class": "unit-special-offer"})
        typesTagz = soup.findAll('div', {"class": "unit-info"})
        rateTagz2 = soup2.findAll('p', {"class": "unit-special-offer"})
        specialTagz2 = soup2.findAll('span', {"class": "unit-special-offer"})
        typesTagz2 = soup2.findAll('div', {"class": "unit-info"})

        yield {'date': datetime.datetime.now().strftime("%m-%d-%y"),
               'name': "A1Locker"}

        size = []
        for n in range(len(sizeTagz)):
            print len(rateTagz)
            print len(typesTagz)
            if "Outside" in typesTagz[n].get_text():
                size.append(re.findall(r'\d+', sizeTagz[n].get_text()))
                size.append(re.findall(r'\d+', sizeTagz2[n].get_text()))
                print "logic hit"

        for i in range(len(size)):
            yield {'size': size[i]}

        driver.close()
        driver2.close()

The desired output of the code is to have it display the data collected from this webpage: http://www.a1lockerrental.com/self-storage/mo/st-louis/4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?category=all

To do so would require being able to scroll down to view the rest of the data. At least that is how it would be done in my mind.

Thanks, DM123

Asked Aug 10, 2017 at 17:49 by Daveyman123; edited Aug 10, 2017 at 18:09. 5 comments:
  • what have you tried? what's the current output and what is the desired output? – Kevin Pasquarella Commented Aug 10, 2017 at 17:51
  • You need to provide the code you have tried, what it does do and output, and what it fails to output; in addition I am unable to see a link to a website – Professor_Joykill Commented Aug 10, 2017 at 17:52
  • ok give me a second i will add the code – Daveyman123 Commented Aug 10, 2017 at 18:03
  • Possible duplicate of: stackoverflow.com/questions/14147441/… – Greenstick Commented Aug 10, 2017 at 18:25
  • driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") I managed to get it working with this code – Daveyman123 Commented Aug 10, 2017 at 19:48
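As the asker's last comment notes, running `window.scrollTo(0, document.body.scrollHeight)` via `execute_script` does the trick. For pages that keep loading content as you scroll, a common pattern is to repeat that scroll until the page height stops growing. A minimal sketch (the helper name, pause, and round limit are illustrative, not from the original post):

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll until document.body.scrollHeight stops growing, then
    return the fully rendered page source."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # no new content appeared; we are at the bottom
        last_height = new_height
    return driver.page_source
```

In the spider above, `html = scroll_to_bottom(driver)` could then replace the bare `driver.page_source` read, so the soup sees the units that only appear after scrolling.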

2 Answers

The website you're trying to scrape is loading content dynamically using JavaScript. Unfortunately, many web scrapers, such as Beautiful Soup, cannot execute JavaScript on their own. There are a number of options, however, many in the form of headless browsers. A classic one is PhantomJS, but it may be worth taking a look at this great list of options on GitHub, some of which may play nicely with Beautiful Soup, such as Selenium.

Keeping Selenium in mind, the answer to this Stack Overflow question may also help.

There is a webdriver function that provides this capability. BeautifulSoup doesn't do anything besides parse the site.

Check this out: http://webdriver.io/api/utility/scroll.html
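To make that division of labor concrete: Selenium fetches and scrolls, and the rendered `driver.page_source` then goes to a parser. A minimal Python 3 stdlib sketch of pulling out the `sss-unit-size` spans (the sample HTML below is illustrative and stands in for the real page source, which this example does not fetch):

```python
from html.parser import HTMLParser  # Python 3 module; the question used the Python 2 HTMLParser

class UnitSizeParser(HTMLParser):
    """Collect the text of every <span class="sss-unit-size">."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.sizes = []
        self._in_size = False

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "sss-unit-size") in attrs:
            self._in_size = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_size = False

    def handle_data(self, data):
        if self._in_size:
            self.sizes.append(data.strip())

# Illustrative HTML standing in for driver.page_source:
html = ('<div><span class="sss-unit-size">5 x 10</span>'
        '<span class="sss-unit-size">10 x 20</span></div>')
parser = UnitSizeParser()
parser.feed(html)
print(parser.sizes)  # ['5 x 10', '10 x 20']
```

BeautifulSoup's `soup.find_all('span', {"class": "sss-unit-size"})`, as used in the question, does the same job with less code; the point is only that the parser, whichever one you pick, must be fed HTML that the browser has already finished rendering.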
