basically, I am trying to scrape webpages with php but I want to do so after the initial javascript on a page executes - I want access to the DOM after initial ajax requests, etc... is there any way to do this?
asked Jun 26, 2012 at 18:54 by Justin
- What have you tried? Your question is a bit ambiguous. If you can post some trial code, we'll get a clearer picture. – Jonathan M Commented Jun 26, 2012 at 18:55
- I think OP wants to grab the contents of a web page, and if it contains JS, it should be executed as if the page was opened in a browser. – madfriend Commented Jun 26, 2012 at 18:56
- i'm using Simple HTML Dom simplehtmldom.sourceforge/manual.htm to scrape webpages, but so many webpages today are dynamic and I'd like the initial javascript to execute before grabbing the code... if this makes any sense! – Justin Commented Jun 26, 2012 at 18:57
- possible duplicate of Server side browser that can execute JavaScript – Bergi Commented Jun 26, 2012 at 19:02
2 Answers
Short answer: no.
Scraping a site gives you whatever the server responds with to the HTTP request that you make (from which the "initial" state of the DOM tree is derived, if that content is HTML). It cannot take into account the "current" state of the DOM after it has been modified by JavaScript.
I'm revising this answer because there are now several projects that do a really good job of this:
2020 update: Puppeteer is a Node.js library that can control a Chromium browser, with experimental support for Firefox also.
2020 update: Playwright is a Node.js library that can control multiple browsers.
You need to install Node.js and write JavaScript code to interact with both of these projects. Especially with async and await they work quite well, and you can use any Node.js/npm modules in your code.
There are also other projects like Selenium, but I wouldn't recommend them.
- PhantomJS is a headless version of WebKit, and there are some helpful wrappers such as CasperJS.
- Zombie.js, a wrapper over jsdom, written in JavaScript (Node.js).
You need to write JavaScript code to interact with both of these projects. I like Zombie.js better so far, since it is easier to set up, and you can use any Node.js/npm modules in your code.
Old answer:
No, there's no way to do that. You'd have to emulate a full browser environment inside PHP. I don't know of anyone who is doing this kind of scraping except Google, and it's far from comprehensive.
Instead, you should use Firebug or another web debugging tool to find the request (or sequence of requests) that generates the data you're actually interested in. Then, use PHP to perform only the needed request(s).
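For example, if the debugging tool shows the page's data coming from a JSON endpoint, you can request that endpoint directly and skip the rendered HTML entirely. A hedged sketch in Node.js (the endpoint URL is hypothetical; in PHP the equivalent is a single file_get_contents() or cURL call followed by json_decode()):

```javascript
// Sketch: once you've identified the underlying AJAX endpoint (e.g. in the
// network tab of your debugging tool), request it directly and parse the JSON.
// Uses the built-in fetch available in Node.js 18+.

async function fetchEndpointData(url) {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Request failed: ${res.status}`);
  return res.json();
}

// Usage (the endpoint below is made up; substitute the one you discover):
// fetchEndpointData('https://example.com/api/items?page=1')
//   .then(data => console.log(data));
```

This is usually far faster and more robust than driving a headless browser, since you only transfer and parse the data you actually want.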
Tags: php, file_get_contents - AFTER javascript executes - Stack Overflow