How can Perl's WWW::Mechanize expand HTML pages that add to themselves with JavaScript?
As mentioned in a previous question, I'm coding a crawler for the QuakeLive website.
I've been using WWW::Mechanize to get the web content, and this worked fine for every page except the one listing matches. The problem is that I need to get all the IDs of this kind:
<div id="ffa_c14065c8-d433-11df-a920-001a6433f796_50498929" class="areaMapC">
These are used to build the URLs of specific matches, but I simply can't get at them.
I managed to see those IDs only via FireBug; no page downloader, parser, or getter I tried was able to retrieve them. All I can get is a simpler version of the page, whose code is what you see by showing the source code in Firefox.
Since FireBug shows the IDs, I can safely assume they are already loaded, but then I can't understand why nothing else gets them. It might have something to do with JavaScript.
You can find a page example HERE
A comment from Alec: It's added to the DOM using JavaScript after the page has been loaded, so retrieving the page source will only show you the container divs that haven't been filled with this data yet. FireBug monitors the site and keeps track of everything that happens even after the initial load, which is why it also sees this data. No idea how you can retrieve it, though.
6 Answers
To get at the DOM containing those IDs you'll probably have to execute the JavaScript code on that site. I'm not aware of any library that would let you do that and then introspect the resulting DOM from within Perl, so controlling an actual browser and later asking it for the DOM, or only the parts of it you need, seems like a good way to go about this.
Various browsers provide ways to be controlled programmatically. With a Mozilla-based browser such as Firefox, this can be as easy as loading mozrepl into the browser, opening a socket from Perl, sending a few lines of JavaScript over to actually load that page, and then some more JavaScript to hand back the parts of the DOM you're interested in. The result of that you could then parse with one of the many JSON modules on CPAN.
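For illustration, here is a rough sketch of that socket approach in Perl. Everything beyond what the answer describes is an assumption: it assumes mozrepl is listening on its default port 4242, that the match page is already open in the current Firefox tab, and that the div class from the question (areaMapC) is the one to collect; the prompt matching and result extraction are deliberately crude.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use IO::Socket::INET;

    # Connect to the mozrepl extension inside a running Firefox
    # (default: localhost:4242).
    my $repl = IO::Socket::INET->new(
        PeerAddr => 'localhost',
        PeerPort => 4242,
        Proto    => 'tcp',
    ) or die "Cannot connect to mozrepl: $!";

    # Read until mozrepl shows its "repl>" prompt again.
    sub read_until_prompt {
        my ($sock) = @_;
        my $buf = '';
        until ($buf =~ /repl\d*>\s*$/) {
            sysread($sock, my $chunk, 4096) or last;
            $buf .= $chunk;
        }
        return $buf;
    }

    read_until_prompt($repl);    # discard the greeting banner

    # One line of JavaScript: collect the ids of all divs with the
    # class used on the match page and join them with commas.
    my $js = q{var d=content.document.querySelectorAll('div.areaMapC'),ids=[];}
           . q{for(var i=0;i<d.length;i++)ids.push(d[i].id);ids.join(',');};
    print {$repl} "$js\n";

    # mozrepl prints the resulting string in double quotes.
    my $out = read_until_prompt($repl);
    my ($list) = $out =~ /"([^"]*)"/;
    my @ids = split /,/, defined $list ? $list : '';
    print "$_\n" for @ids;

In practice you may prefer a CPAN wrapper around the protocol (for example the MozRepl module) rather than talking to the socket by hand.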
Alternatively, you could work through the JavaScript code executed on that page, figure out what it actually does, and then mimic that in your crawler.
The problem is that Mechanize mimics the networking layer of the browser, but not the rendering or JavaScript execution layer.
Many folks use the web browser control provided by Microsoft. This is a full instance of IE in a control that you can host in a WinForms, WPF, or plain old console app. It allows you to, among other things, load the web page, run JavaScript, and send and receive JavaScript commands.
Here's a reasonable intro to hosting the browser control: http://www.switchonthecode.com/tutorials/csharp-snippet-tutorial-the-web-browser-control
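That tutorial is about hosting the control from .NET. Staying in Perl, a roughly equivalent trick is to drive a full Internet Explorer instance through Win32::OLE and read the DOM back after the page's scripts have run. This is only a sketch under assumptions: it requires Windows with IE installed, the URL is a placeholder for the real matches page, the fixed extra sleep is a crude stand-in for properly waiting on the AJAX content, and the id pattern is taken from the question.

    use strict;
    use warnings;
    use Win32::OLE;

    # Start a hidden Internet Explorer instance via COM automation.
    my $ie = Win32::OLE->new('InternetExplorer.Application')
        or die "Cannot start Internet Explorer: ", Win32::OLE->LastError;
    $ie->{Visible} = 0;

    # Placeholder URL -- substitute the actual QuakeLive matches page.
    $ie->Navigate('http://www.quakelive.com/#matches');

    # Wait until IE reports the page as loaded, then give the
    # AJAX-generated content a little extra time to arrive.
    sleep 1 while $ie->{Busy} || $ie->{ReadyState} != 4;
    sleep 5;

    # Grab the rendered DOM and pull out the match ids. IE may omit
    # the quotes around simple attribute values in innerHTML, so the
    # opening quote is optional in the pattern.
    my $html = $ie->{Document}{documentElement}{innerHTML};
    my @ids  = $html =~ /id="?(ffa_[^"\s>]+)/g;
    print "$_\n" for @ids;

    $ie->Quit;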
A ton of data is sent over AJAX requests. You need to account for that in your crawler somehow.
It looks like they are using AJAX. I can see where the requests are being sent using FireBug. You may need to pick up on this, either by issuing the same AJAX requests yourself or by parsing and executing the JavaScript that affects the DOM.
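If you go the first route, the usual workflow is to open Firebug's Net panel, reload the matches page, and note the URL of the XHR request that carries the match data; you can then often fetch that URL directly with the WWW::Mechanize object you already have. A sketch, with a purely hypothetical endpoint and the assumption that it returns JSON:

    use strict;
    use warnings;
    use WWW::Mechanize;
    use JSON;           # provides decode_json
    use Data::Dumper;

    my $mech = WWW::Mechanize->new( autocheck => 1 );

    # Hypothetical endpoint -- replace it with the request URL that
    # Firebug's Net panel shows when the match list loads.
    $mech->get('http://www.quakelive.com/some/ajax/endpoint');

    # Assuming the response is JSON, decode it and inspect the
    # structure to find where the match identifiers live.
    my $data = decode_json( $mech->content );
    print Dumper($data);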
You should be able to use WWW::HtmlUnit - it loads pages and executes their JavaScript.
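A sketch along those lines, based on WWW::HtmlUnit's documented getPage/asXml interface (it needs Java and Inline::Java installed; the URL is a placeholder and the id pattern is the one from the question):

    use strict;
    use warnings;
    use WWW::HtmlUnit;

    # HtmlUnit is a headless Java browser; WWW::HtmlUnit drives it from
    # Perl and lets it execute the page's JavaScript.
    my $webclient = WWW::HtmlUnit->new;

    # Placeholder URL -- point this at the QuakeLive matches page.
    my $page = $webclient->getPage('http://www.quakelive.com/#matches');

    # asXml returns the DOM as it looks *after* the scripts have run,
    # so the generated div ids should be present in it.
    my $xml = $page->asXml;
    my @ids = $xml =~ /id="(ffa_[^"]+)"/g;
    print "$_\n" for @ids;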
Read the FAQ. WWW::Mechanize doesn't do JavaScript. They're probably using JavaScript to change the page. You'll need a different approach.