Handling Data II: Scraping

On September 26, 2013, in Big Data, HR Technology, HRExaminer, by John Sumser

Vendors can make data acquisition painful. With precedents beginning back in the heyday of Monster’s run, the world’s sources of people data have tried to claim ownership because they had possession. It turns out that ownership really means ‘the person with the slimier lawyers’.

If you look back at our who owns data series, you’ll discover that ownership isn’t always what you think it is. As mentioned yesterday, it’s not just those high-volume vendors. Somehow, by using this Applicant Tracking System or that Talent Management System, your data can effectively become theirs at the push of a button.

It’s yours if you can use it and theirs if they can stop you.

(It’s astonishing but true. In our day of open architecture and APIs, plenty of legacy firms are making money by making data access tough, claiming special processes and handling. As is always the case, the ‘hold on tight’ approach yields the most resistance. It’s easy to find the info necessary to defeat even the toughest methods designed to keep you from your data.)

When you can’t get data reliably and easily (a report, predictable output, an RSS feed, an export file), you can always resort to the data hunter’s solution of choice: scraping. Fundamentally, anything that can be viewed in a browser can be ‘scraped’. That is, the data can be extracted with or without the help of the vendor. While it’s easy to increase the hassle for freeloaders (HR has a ton of these), paying customers can easily scrape and acquire the data they’ve paid for.

It’s worth taking the time to read up on scraping. Wikipedia has a great article, and there are plenty of easy-to-digest articles on the topic. What you need to understand is that any data you can view can be acquired using this technique. It is essentially a function of being on the web. While many user agreements prohibit the technique, enforcement is inconsistent and contradictory. It is unlikely that any firm will be punished for reacquiring its own data this way.

Indeed (recently sold for $1B and the best-known job board brand) built its business on scraping.

Here’s the compact tutorial:

  • If you can see a page on a web browser you can scrape it.
  • In Chrome (for example), the View>Developer>View Source Command shows the code for the page.
  • Scraping simply means copying that “source” and removing the stuff you don’t want.
  • It’s easy to automate.
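The steps above can be sketched in a few lines of Python. This is a minimal illustration, not anyone's production scraper: the page source, the class names (job, title, city), and the JobScraper and scrape_jobs names are all made up for the example. It uses only the standard library's html.parser and works on a string standing in for what View Source would show.

```python
from html.parser import HTMLParser

# Hypothetical page source, standing in for what "View Source" would show.
PAGE_SOURCE = """
<html><body>
  <div class="nav">Home | Jobs | Login</div>
  <ul>
    <li class="job"><span class="title">HR Analyst</span> <span class="city">Austin</span></li>
    <li class="job"><span class="title">Recruiter</span> <span class="city">Denver</span></li>
  </ul>
  <div class="footer">Copyright</div>
</body></html>
"""


class JobScraper(HTMLParser):
    """Keep the text of the 'title' and 'city' spans; drop everything else."""

    WANTED = {"title", "city"}  # assumed class names, specific to this sample page

    def __init__(self):
        super().__init__()
        self.jobs = []       # one dict per <li class="job">
        self._field = None   # the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "job":
            self.jobs.append({})
        elif tag == "span" and css_class in self.WANTED and self.jobs:
            self._field = css_class

    def handle_data(self, data):
        if self._field:
            self.jobs[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape_jobs(source):
    """Copy the source, keep the wanted fields, and return them as a list of dicts."""
    parser = JobScraper()
    parser.feed(source)
    parser.close()
    return parser.jobs


jobs = scrape_jobs(PAGE_SOURCE)
for job in jobs:
    print(job)
```

To automate this against a live page, the string could be replaced with the result of fetching a URL (for example with urllib.request.urlopen) on a schedule. The part that needs upkeep is the set of tags and class names, which breaks whenever the page layout changes.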

Scraping at scale (like Indeed’s massive process that scrapes most jobs in the world) involves fixing a lot of moving parts. Smaller companies (like Broadbean) are particularly good at processing and managing scrape feeds.

Scraping is a method of last resort (but proof that no data acquisition problem is insoluble). As Indeed did, it’s better to persuade the people whose data you are scraping to provide a predictable flow with an agreed-upon structure. When that is missing, you end up with a hard-to-control variable cost.

Like most of the techniques used to integrate data from a variety of sources, scraping means wrestling with the grunge. Getting any two piles of data in sync requires patience, attention to detail, and the ability to see beyond the obvious. And it means doing it over and over again.

Every time an employee record changes (name, benefits, bio details, address, title, dependents), those new bits flow into the various systems and spreadsheets that give leadership the data it requires. Every time a report is run or an analysis is made, the people on the job start with the premise that the first run is broken.
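One way to picture the "first run is broken" check is a simple comparison of two piles of records. The sketch below is an assumption-laden illustration: the diff_records function, the employee_id key, and the sample rows are all hypothetical, and real systems rarely share a clean common key.

```python
def diff_records(system_of_record, scraped, key="employee_id"):
    """Compare two lists of dicts and report what is missing, extra, or changed."""
    left = {row[key]: row for row in system_of_record}
    right = {row[key]: row for row in scraped}

    missing = sorted(left.keys() - right.keys())  # in the system, not in the scrape
    extra = sorted(right.keys() - left.keys())    # in the scrape, not in the system

    changed = {}
    for k in sorted(left.keys() & right.keys()):
        # For records present on both sides, list each field whose value differs.
        fields = {
            f: (left[k].get(f), right[k].get(f))
            for f in sorted(set(left[k]) | set(right[k]))
            if left[k].get(f) != right[k].get(f)
        }
        if fields:
            changed[k] = fields

    return {"missing": missing, "extra": extra, "changed": changed}


# Hypothetical sample data: one pile from the HR system, one from a scrape.
hris = [
    {"employee_id": 1, "name": "Ana Ruiz", "title": "Analyst"},
    {"employee_id": 2, "name": "Bo Chen", "title": "Recruiter"},
]
scraped = [
    {"employee_id": 2, "name": "Bo Chen", "title": "Senior Recruiter"},
    {"employee_id": 3, "name": "Cy Park", "title": "Manager"},
]

report = diff_records(hris, scraped)
print(report)
```

Running a comparison like this after every scrape turns "something looks off" into a specific list of records to chase down.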
