We have a couple of legacy sites undergoing an upgrade. It would be useful to screenshot every page, MD5-sum the results for both domains, and then check that everything that renders matches 100%.

I am unsure how to do this. We have looked at Cheerio, which can crawl the site but cannot take screenshots, and Nightwatch, which can take screenshots but cannot crawl the site. Does anyone have experience doing this?

asked Jun 7, 2018 at 8:59 by jackdaw
  • @Patrick Roberts - have you actually experienced this while screenshotting wikipedia? – pguardiario Commented Jun 8, 2018 at 9:25

2 Answers

An easy solution is to use Chrome in headless mode, which can also be controlled from Node modules such as Puppeteer.

Taken from the Google Developers page:

chrome --headless --disable-gpu --screenshot https://www.chromestatus.com/

As for crawling, you can combine Cheerio and Puppeteer to follow links and take screenshots. Alternatively, you could use a tool that exports a sitemap listing all of the website's URLs; from there it should be easy to loop through them and take a screenshot of each.
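The crawl-and-screenshot loop described above could be sketched roughly as follows. This is a minimal illustration, not a tested production crawler: it assumes Puppeteer is installed (`npm install puppeteer`) and uses Puppeteer alone for link discovery instead of Cheerio. The names `md5`, `sameOriginLinks`, and `crawlAndHash`, the viewport size, and the `maxPages` cap are all hypothetical choices made for the example.

```javascript
// Sketch: crawl one site with Puppeteer and record an MD5 per page screenshot.
// Function names, viewport, and the page cap are illustrative assumptions.
const crypto = require('crypto');

// MD5 hex digest of a screenshot buffer.
function md5(buffer) {
  return crypto.createHash('md5').update(buffer).digest('hex');
}

// Keep only links on the same origin, with fragments stripped so that
// /page and /page#section count as one page. Duplicates are dropped.
function sameOriginLinks(hrefs, origin) {
  const out = new Set();
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, origin);
    } catch {
      continue; // skip malformed hrefs
    }
    if (url.origin !== origin) continue;
    url.hash = '';
    out.add(url.href);
  }
  return [...out];
}

// Breadth-first crawl from startUrl; resolves to a Map of path -> md5.
async function crawlAndHash(startUrl, maxPages = 500) {
  const puppeteer = require('puppeteer'); // loaded lazily; helpers above do not need it
  const origin = new URL(startUrl).origin;
  const queue = [new URL(startUrl).href];
  const seen = new Set(queue);
  const digests = new Map();
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // A fixed viewport keeps screenshots comparable between runs.
    await page.setViewport({ width: 1280, height: 800 });
    while (queue.length > 0 && digests.size < maxPages) {
      const url = queue.shift();
      try {
        await page.goto(url, { waitUntil: 'networkidle0' });
      } catch (err) {
        console.error(`skipping ${url}: ${err.message}`); // e.g. downloads, timeouts
        continue;
      }
      const shot = await page.screenshot({ fullPage: true });
      const { pathname, search } = new URL(url);
      // Key by path so the same page on two domains lines up later.
      digests.set(pathname + search, md5(shot));
      const hrefs = await page.$$eval('a[href]', (as) => as.map((a) => a.href));
      for (const link of sameOriginLinks(hrefs, origin)) {
        if (!seen.has(link)) {
          seen.add(link);
          queue.push(link);
        }
      }
    }
  } finally {
    await browser.close();
  }
  return digests;
}
```

Running `crawlAndHash` once per domain would give two maps keyed by path. Note that a byte-for-byte MD5 match is strict: animations, timestamps, or ads on a page will change the hash even when the layout is identical.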

You could use StormCrawler with Selenium and write a custom NavigationFilter that takes the screenshot and stores its md5sum in the document metadata. See the tutorial for an introduction to StormCrawler with Selenium.

The next step could be to write a custom indexer and dump the URLs with their MD5s into a database or file. Finally, you'd do the same for the newer version of the site and compare the contents of the files or the rows in the table.
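The final comparison step does not depend on which crawler produced the hashes. As a rough sketch in plain Node (not StormCrawler code), assume each run has been loaded into a `Map` from URL path to MD5; the function name `compareDigests` and the shape of the report object are hypothetical.

```javascript
// Sketch: compare two path -> md5 maps (one per site version) and
// report which pages match, differ, or exist on only one side.
function compareDigests(oldDigests, newDigests) {
  const report = { matching: [], changed: [], missingInNew: [], missingInOld: [] };
  for (const [path, hash] of oldDigests) {
    if (!newDigests.has(path)) {
      report.missingInNew.push(path);
    } else if (newDigests.get(path) === hash) {
      report.matching.push(path);
    } else {
      report.changed.push(path);
    }
  }
  for (const path of newDigests.keys()) {
    if (!oldDigests.has(path)) report.missingInOld.push(path);
  }
  return report;
}
```

Keying on the path rather than the full URL is what lets the old and new domains be compared directly; pages listed under `changed` are the ones worth opening side by side.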

Tags: javascript. Title: Is there a way to take a screenshot of every page on a website (Stack Overflow)