I am writing a spider with scrapy; however, I come across some websites which are rendered with JS, so urllib2.open_url does not work. I have found that I can open a browser with webbrowser.open_new(url), but I did not find how to get the source code of the page with webbrowser. Is there any way to do this with webbrowser, or is there another solution that does not use webbrowser to deal with the JS sites?
Comment: A webbrowser does not store the markup of a page, it holds a DOM. – Bergi
4 Answers
You can use a scraper with one of the WebKit engines available out there.
One of them is dryscrape.
Example:
import dryscrape
search_term = 'dryscrape'
# set up a web scraping session
sess = dryscrape.Session(base_url = 'http://google.com')
# we don't need images
sess.set_attribute('auto_load_images', False)
# visit homepage and search for a term
sess.visit('/')
q = sess.at_xpath('//*[@name="q"]')
q.set(search_term)
q.form().submit()
# extract all links
for link in sess.xpath('//a[@href]'):
print link['href']
# save a screenshot of the web page
sess.render('google.png')
print "Screenshot written to 'google.png'"
See more info at:
https://github.com/niklasb/dryscrape
https://dryscrape.readthedocs.io/en/latest/index.html
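Since the question is specifically about getting the rendered page source, it's worth noting that a dryscrape session can hand the HTML back directly. A minimal sketch, assuming Session.body() returns the DOM as rendered after JavaScript has run (the URL is just a placeholder):
import dryscrape

sess = dryscrape.Session()
sess.visit('http://example.com/js-heavy-page')  # placeholder URL
html = sess.body()  # rendered HTML, after JavaScript has executed
print(html)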
If you need a full JS engine, there are a number of ways you can drive WebKit from Python. Until recently, this sort of thing was done with Selenium. Selenium drives an entire browser.
More recently there are newer and simpler ways to run a webkit engine (which includes the v8 javascript engine) from Python. See this SO question: Headless Browser for Python (Javascript support REQUIRED!)
It references this blog post as an example: Scraping Javascript Webpages with Webkit. It looks to do more or less just what you need.
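For illustration, the Selenium route boils down to something like the following sketch (assumes Firefox and the Selenium Python bindings are installed; the URL is a placeholder):
from selenium import webdriver

driver = webdriver.Firefox()                # launches a real browser
driver.get('http://example.com/js-page')    # placeholder URL
html = driver.page_source                   # source after JavaScript has run
driver.quit()
print(html)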
I've been trying to find an answer to the same problem for a few days now.
I suggest you try the Qt framework with WebKit. There are two Python bindings. One is PyQt and the other one is PySide. You can use them directly if you want to create something more complex or you want to have 100% control over your code.
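As a rough sketch of the PyQt route, modelled on the usual QtWebKit rendering recipe (assumes PyQt4 with QtWebKit is installed; the URL is a placeholder):
import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._finished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _finished(self, result):
        # toHtml() returns the DOM after JavaScript has modified it
        self.html = self.mainFrame().toHtml()
        self.app.quit()

page = Render('http://example.com/js-page')  # placeholder URL
print(page.html)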
For trivial stuff like executing JavaScript in a browser environment you can use Ghost.py. It has some sort of documentation and some problems when using it from the command line, but otherwise it's just great.
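A minimal Ghost.py sketch, assuming the older API where Ghost() itself exposes open(), evaluate() and content (the URL and the script are placeholders):
from ghost import Ghost

ghost = Ghost()
page, resources = ghost.open('http://example.com/js-page')  # placeholder URL
result, resources = ghost.evaluate('document.title;')       # run arbitrary JavaScript
html = ghost.content                                         # rendered HTML
print(html)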
If you need to process JavaScript you'll need to implement a JavaScript engine. This makes your spider much more complex, mainly because JavaScript almost always modifies the DOM based on time or on an action taken by the user, which makes it extremely challenging to process JS in a crawler. If you really need to process JavaScript in your spider you can have a look at the JavaScript engine by Mozilla: https://developer.mozilla.org/en/docs/SpiderMonkey