We have a requirement to cache web pages as accurately as possible, so that we can go back and view a version of a page at any previous point in time. We'd like to be able to view the page as it really was - with the right css, javascript, images etc.
Are there any OS libraries (any language) that will fetch a page, download all externally-linked assets and re-write the links such they they point to the locally-cached assets?
Or is this a case of rolling our own?
Thanks
Edit: I realise that without rendering dynamically generated links etc that this is not going to be 100% possible unless we do DOM rendering. However for the time being we can probably live without this.
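If it does come to rolling our own, the core of it is: fetch the page, collect the asset URLs from `img`/`script`/`link` tags, download each asset, and rewrite the links to point at the cached copies. Below is a minimal sketch of the collect-and-rewrite part using only the Python standard library; the names (`AssetCollector`, `rewrite_links`) are illustrative, and a real archiver would also handle `url(...)` references inside CSS, `srcset`, etc.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Which attribute holds the asset URL, per tag.
ASSET_ATTRS = {"img": "src", "script": "src", "link": "href"}

class AssetCollector(HTMLParser):
    """Collects externally-linked asset URLs from img/script/link tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        wanted = ASSET_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative URLs against the page URL.
                self.assets.append(urljoin(self.base_url, value))

def rewrite_links(html, url_to_local):
    """Replace each remote asset URL with its locally-cached path."""
    for remote, local in url_to_local.items():
        html = html.replace(remote, local)
    return html

page = ('<html><img src="http://example.com/a.png">'
        '<script src="http://example.com/s.js"></script></html>')
collector = AssetCollector("http://example.com/")
collector.feed(page)
# Map each asset URL to a local cache path (download step omitted here).
mapping = {u: "assets/" + u.rsplit("/", 1)[1] for u in collector.assets}
rewritten = rewrite_links(page, mapping)
```

The naive string replace works here because the collected URLs are absolute; relative links would need to be rewritten attribute-by-attribute instead.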
Asked Oct 22, 2010 at 13:20 by Richard H; edited Oct 22, 2010 at 13:39.
Comment: "Richard, please choose the right answer, or tell us what you still need that the presented solutions don't." – Paulo Coghi, May 16, 2014 at 21:50
3 Answers
I suggest HTTrack: http://www.httrack.com/
Because the software is free, open source, and supports both a visual interface and a command line, I believe you can integrate it or customize it to your needs smoothly.
See the description:
"HTTrack allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your puter.
It arranges the original site's relative link-structure. Simply open a page of the "mirrored" website in your browser, and you can browse the site from link to link, as if you were viewing it online.
It can also update an existing mirrored site, and resume interrupted downloads."
Operating systems it runs on:
WebHTTrack for Linux/Unix/BSD: Debian, Ubuntu, Gentoo, RPM package (Mandriva & RedHat), OSX (MacPorts), Fedora and FreeBSD i386 packages.
WinHTTrack for Windows 2000/XP/Vista/Seven
--
Update: the project is active and the latest version was released on 04/01/2017.
Why not apply a base href to the pages, replace internal absolute links with relative ones, and keep the structure?
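The link-rewriting step this suggests can be sketched in a few lines: if a URL's host matches the archived site, strip the scheme and host so it becomes a root-relative link that resolves against the base href; external links are left alone. The function name and the exact policy (keeping the query string, ignoring fragments) are assumptions for illustration.

```python
from urllib.parse import urlsplit

def to_relative(url, site_netloc):
    """Turn an absolute link to our own site into a root-relative one;
    leave links to other hosts untouched."""
    parts = urlsplit(url)
    if parts.netloc == site_netloc:
        # Keep the path and query; drop scheme and host.
        return parts.path + (("?" + parts.query) if parts.query else "")
    return url
```

A `<base href="...">` pointing at the local cache directory then makes those root-relative links resolve to the cached assets.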
You could use the mht/mhtml format to save as a unified document.
Wiki description: http://en.wikipedia.org/wiki/MHTML
A quick search will reveal some sources of code to do this.
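Since MHTML is essentially a MIME `multipart/related` message, Python's standard `email` package is enough to sketch the idea. This is a minimal, illustrative bundler (the `build_mhtml` name and the `assets` structure are assumptions, not a standard API), not a full MHTML writer:

```python
from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def build_mhtml(html, assets):
    """Bundle a page and its assets into one multipart/related document.
    `assets` maps each asset's original URL to (mime_type, raw_bytes)."""
    root = MIMEMultipart("related")
    root.attach(MIMEText(html, "html", "utf-8"))
    for url, (ctype, data) in assets.items():
        maintype, subtype = ctype.split("/", 1)
        part = MIMEBase(maintype, subtype)
        part.set_payload(data)
        encoders.encode_base64(part)
        # Content-Location ties the part back to its original URL,
        # which is how MHTML viewers resolve the page's links.
        part.add_header("Content-Location", url)
        root.attach(part)
    return root.as_string()

doc = build_mhtml("<html><img src='http://example.com/a.png'></html>",
                  {"http://example.com/a.png": ("image/png", b"\x89PNG")})
```

Browsers that speak MHTML resolve the `Content-Location` headers against the page's links, so no link rewriting is needed at all with this approach.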
Original title: javascript - Saving a web page and externally linked assets as an independent static resource - Stack Overflow