Capturing the source and a screenshot of a web page via cron
You can capture the source of a web page using
wget like this –
wget http://somesite/somepage.html -O somepage.html
However, I needed to do this periodically, capture a screenshot as well, and have the resulting filenames set to the date and time the captures were taken. Doing this as a one-liner would have been a great exercise in code golf, but a nightmare to debug.
To solve my problem I created a crude shell script, curle-capture, that does exactly what I needed it to.
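The script itself isn't reproduced here, but a minimal sketch of what such a capture script might look like follows. The function names, the screenshot.js helper, and the domain parsing are my assumptions, not the original curle-capture:

```shell
#!/bin/sh
# Sketch of a capture script. Usage: ./capture.sh <url>

# Strip the scheme and any path to get the domain, e.g.
# https://news.ycombinator.com/ -> news.ycombinator.com
domain_of() {
    printf '%s\n' "$1" | sed -e 's|^[a-z]*://||' -e 's|/.*$||'
}

# Timestamp shared by the source and screenshot filenames.
stamp() {
    date +%Y-%m-%d-%H%M%S
}

capture() {
    url="$1"
    dir="$(domain_of "$url")"
    name="$(stamp)"
    mkdir -p "$dir"
    # Grab the page source ...
    wget -q "$url" -O "$dir/$name.html"
    # ... and a screenshot via PhantomJS (screenshot.js is an assumed helper).
    phantomjs screenshot.js "$url" "$dir/$name.png"
}
```

Each run leaves a dated .html and .png pair under a directory named for the domain.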
It can be run from cron like this –
27 * * * * cd /home/jamiecurle/captures; ./capture.sh https://news.ycombinator.com/ > /dev/null 2>&1
And it will output into a folder named after the domain (in this case news.ycombinator.com), with each capture saved under the date and time it was taken.
Perfect, now I can collect data for natural language processing and have a visual representation of how it looked in the wild.
In order for the screenshot component to work, you need to have PhantomJS installed and on your PATH.
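PhantomJS takes screenshots by running a small JavaScript "rasterize" script that loads a URL and renders it to an image. A plausible sketch of such a helper (screenshot.js is my name for it, and the viewport size is a guess, not necessarily what the author used) could be written out like this:

```shell
# Check PhantomJS is on the PATH before cron tries to use it.
command -v phantomjs >/dev/null 2>&1 || echo "phantomjs not found on PATH" >&2

# Write a minimal rasterize script: load a URL, render it to an image file.
cat > screenshot.js <<'EOF'
var page = require('webpage').create(),
    args = require('system').args;   // args[1] = url, args[2] = output file

page.viewportSize = { width: 1280, height: 800 };

page.open(args[1], function (status) {
    if (status !== 'success') {
        console.log('failed to load ' + args[1]);
        phantom.exit(1);
    } else {
        // Give late-loading assets a moment to settle before rendering.
        window.setTimeout(function () {
            page.render(args[2]);
            phantom.exit(0);
        }, 200);
    }
});
EOF
```

It would then be invoked as phantomjs screenshot.js https://news.ycombinator.com/ out.png.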