Friday 6 December 2013

Capturing the source and a screen shot of a web page via cron

You can capture the source of a web page using wget like this –

 wget http://somesite/somepage.html -O somepage.html 

However, I needed to do this periodically, capture a screenshot as well, and have the resulting filenames set to the date and time that the captures were taken. Doing this in a one-liner would have been a great exercise in code golf, but would have been nightmarish to debug.
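For the wget half on its own, a date-stamped filename can be built inline. This is a sketch rather than the script's actual code, and the HH-MM-SS.MM-DD-YYYY timestamp format is my assumption based on the example filenames further down:

```shell
# Fetch a page into a file named after the current time and date.
# The timestamp format is an assumption, not the script's real format.
fetch_page() {
  url=$1
  out="$(date +%H-%M-%S.%m-%d-%Y).html"
  wget -q "$url" -O "$out" && printf '%s\n' "$out"
}

# usage: fetch_page https://news.ycombinator.com/
```

Wrapping it in a function keeps the one-off fetch reusable from a larger script.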

To solve my problem I created a crude shell script, curle-capture, that does exactly what I needed it to.

It can be run from cron like this –

 27 * * * * cd /home/jamiecurle/captures; ./capture.sh https://news.ycombinator.com/ > /dev/null 2>&1 

It will output into a folder named after the domain (in this case news.ycombinator.com) and create the following –

  • 00-27-01.12-06-2013.html
  • 01-27-01.12-06-2013.png

Perfect: now I can collect data for natural language processing and keep a visual record of how each page looked in the wild.

In order for the screenshot components to work, you need to have phantomjs installed and on your path.
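A quick way to check that requirement up front, sketched with the standard command -v builtin:

```shell
# Report whether phantomjs is available on $PATH before relying on it.
if command -v phantomjs >/dev/null 2>&1; then
  echo "phantomjs: $(command -v phantomjs)"
else
  echo "phantomjs not found on PATH" >&2
fi
```

Dropping a check like this at the top of the script makes cron failures easier to diagnose than a silent missing screenshot.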

The source, as always, is on GitHub.