If you’ve been reading for a while, you’ve probably noticed that I’m a big fan of working with as much information as I can get. And what’s the best way to get that information from websites when you’re not much of a programmer? Scraping.
There are loads of really good web scraping tools out there, many of which can be adapted for a variety of purposes. But which ones find their way into my day-to-day toolkit? That’s what today’s post is all about.
Scrapebox
No list of scrapers would be complete without arguably the greatest of them all: Scrapebox. The sheer volume of functionality that this tool offers is amazing. Whether you want to use it for data acquisition, keyword research, link prospecting or blasting blog comments (if you’re going to insist on that, at least be a bit creative with it), Scrapebox has you covered, and the ridiculously cheap price makes it a must-have in any digital marketer or data analyst’s toolbox.
I can’t say enough good things about Scrapebox – if you invest the time in learning how to use it properly and you’ve got some solid proxies, all the data you need can be yours. I’m not a Scrapebox affiliate, but if they had a programme, I’d be all over it. I can’t wait for version 2.
Outwit Hub Pro
If you’re into getting information from webpages at scale really fast, Outwit Hub Pro is my pick of the litter.
Outwit makes building a custom scraper easy, and it will work with almost any site, including ones that have dynamically generated content (AJAX pages and the like). Add the sheer speed with which it can harvest information from a few thousand URLs and it’s indispensable to me.
On the subject of speed, this is the scraper that puts the biggest load on a server due to how quickly it makes its calls, so if you’re scraping a lot from one website in particular, be nice and break your job up into a few runs or slow down your crawl. Or don’t. I’m not your boss.
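If you ever end up scripting a crawl yourself rather than using Outwit, the same "be nice" idea can be sketched in a few lines of Python. This is only a rough illustration, not how Outwit works internally: the function names, the delay values and the batch size are all my own assumptions, so tune them to the site you’re working with.

```python
import time
import urllib.request
from typing import Callable, Iterable, List


def fetch_with_urllib(url: str) -> str:
    """Download one page as text using only the standard library."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str] = fetch_with_urllib,
    delay_seconds: float = 1.0,        # assumed pause between requests
    batch_size: int = 100,             # assumed number of URLs per "run"
    batch_pause_seconds: float = 30.0, # assumed longer rest between runs
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each URL in turn, pausing between requests and between batches."""
    results: List[str] = []
    for index, url in enumerate(urls):
        if index > 0:
            # Take the longer pause at each batch boundary, the short one otherwise.
            at_batch_boundary = index % batch_size == 0
            sleep(batch_pause_seconds if at_batch_boundary else delay_seconds)
        results.append(fetch(url))
    return results
```

Calling crawl_politely(my_urls) with the defaults would wait a second between pages and half a minute after every hundred, which is roughly what breaking a job into a few runs gets you by hand.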
Kimonify
If you need to scrape information from pages that change regularly, or you need to pull that information into an application, you might be best off turning those pages into an API.
Kimonify by Kimono Labs is an innovative scraper that does just that through either a bookmarklet or a Chrome extension, and it’s brilliant. I use it on a lot of pages that change often and it’s never let me down.
Once you’ve got your API set up, you can pull it straight into Excel like I discussed here, call it regularly into Google Sheets using the handy IMPORTDATA function (more on that another time) or work it into your own app or page.
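For the "work it into your own app" route, here’s a minimal Python sketch of what calling a scraper-generated API might look like. I’m assuming the API returns JSON with a list of records under a single key; the key name "results" and both function names are placeholders of mine, not Kimono’s documented format, so check the response your own API actually gives you.

```python
import json
import urllib.request
from typing import Any, Dict, List


def fetch_json(url: str) -> Any:
    """Call the API endpoint and decode its JSON response."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def rows_from_payload(payload: Dict[str, Any], key: str = "results") -> List[Dict[str, Any]]:
    """Pull the list of scraped records out of the decoded response.

    The "results" default is an assumed key name, not a documented one.
    """
    rows = payload.get(key, [])
    if not isinstance(rows, list):
        raise ValueError(f"Expected a list under {key!r}, got {type(rows).__name__}")
    return rows
```

In practice you’d pass the output of fetch_json(your_api_url) into rows_from_payload and then loop over the rows, or hand them to whatever your app uses for storage.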
Kimonify is free for most levels of usage and is absolutely brilliant. Give it a go.
Import.io
If you like the sound of Outwit but don’t have the funds, or you fancy playing with Kimonify but would rather have a more varied feature set, Import.io is the tool for you.
With an ever-increasing range of features combined with a great user interface and a price tag of free for most applications, Import.io is the do-it-all scraper. It works on its own variation of the Chrome browser and it does a hell of a lot. Their email newsletter is well worth signing up for too.
Although it can be a bit on the clunky side at times, particularly with new features, the amount of functionality that the guys have crammed into this tool combined with the price tag makes this an essential addition to any data acquisition toolkit.
XPathonURL In SEO Tools For Excel
If you just need to get a bit of information from a page into a spreadsheet quickly, you can’t beat the XPathonURL function in the brilliant SEO Tools For Excel add-in. If you’re a regular reader, you’ll know that this is one of the tools that I can’t do without; so much so that I’ve gone pro. Again, I’m not an affiliate, just sharing one of the best tools out there.
Just open Excel, pop the URL into a cell, find the element you want pulled into your spreadsheet using your browser’s dev tools, type =XPathonURL(YOURURLCELL,"//YOURELEMENT") and hit enter. Note that the XPath goes inside straight double quotes, as with any text argument in an Excel formula. There’s a lot more that you can do with it, but for quick scrapes or working it into a template, you can’t beat it.
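If you’re curious what an XPath lookup is doing under the bonnet, here’s a small Python sketch of the same idea. It isn’t how SEO Tools For Excel is implemented; it just uses Python’s built-in XML parser, which only understands a limited subset of XPath and only copes with well-formed markup, so messy real-world pages would usually need a more forgiving HTML parser. The function name xpath_text is my own.

```python
import xml.etree.ElementTree as ET
from typing import List


def xpath_text(markup: str, path: str) -> List[str]:
    """Return the text of every element matching a (limited) XPath expression.

    Paths are relative to the root element, so write ".//p" rather than "//p".
    """
    root = ET.fromstring(markup)
    # itertext() gathers text from nested tags too, e.g. <p>Two <b>bold</b></p>.
    return ["".join(element.itertext()).strip() for element in root.findall(path)]


page = "<html><body><h1>Hello</h1><p>One</p><p>Two <b>bold</b></p></body></html>"
headings = xpath_text(page, ".//h1")
paragraphs = xpath_text(page, ".//p")
```

The path you’d copy out of your browser’s dev tools plays the same role here as it does in the Excel formula: it tells the parser which elements to pick out of the page.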
How About You?
As I’ve been moving more into the data analysis field, I’ve come across a lot of other scrapers and ways to get information from the web at scale, including rvest and Scrapy, but these are the ones that find their way into my day-to-day toolkit.
How about you? Do you have a particular favourite? Are there any others out there that I should check out? Drop me a line on Twitter or leave a comment and let me know.