Last week, we finally launched a project that has been on my to-do list for the last two years: California’s first-ever searchable database of lobbying interests, which allows users to quickly see all of the special interest groups that lobbied any given bill in the state legislature.
It had been on my list for months to crank out this project before the end of the 2011 legislative session, which wraps up later this week. So when the most recent round of reports arrived in late July, we decided to tackle the project as quickly as we could.
Along the way, we learned a few lessons about data processing and product development, which I hope you will find useful:
1. Technology can only take you so far
In computer-assisted reporting land, this has been the mantra for as long as I can remember. Who knew that the same principle applied to data processing?
First, some background on California lobbyist disclosures. Each quarter, special interest groups are required to fill out a form outlining the bills and issues that they’re actively lobbying. Problem is, the form doesn’t collect that data in any kind of structured way. Most of the data we wanted was crammed into narrative text fields using dozens, if not hundreds, of different data entry conventions.
Assembly Bill 52, for example, might have been referred to in different disclosures as “AB52,” “AB: 52,” “Assembly 52,” “Assembly Bills: 32, 47, 52,” “California Assembly, including bills 41 and 52,” “Assembly insurance bill number 52” and dozens more.
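To give a sense of the problem, here’s a quick sketch of the kind of naive pattern matching you might try first: it catches the cleanest forms and whiffs on everything else. (The pattern and examples below are purely illustrative – this isn’t code we actually shipped.)

```python
import re

# A naive first pass: catch "AB52", "AB 52", "AB: 52" and similar clean forms.
# (Illustrative only -- not the approach we used.)
NAIVE_PATTERN = re.compile(r'\b(AB|SB)\s*:?\s*(\d{1,4})\b', re.IGNORECASE)

samples = [
    "AB52",
    "AB: 52",
    "Assembly 52",                                      # no "AB" prefix -- missed
    "Assembly Bills: 32, 47, 52",                       # list under one prefix -- missed
    "California Assembly, including bills 41 and 52",   # pure narrative -- missed
    "Assembly insurance bill number 52",                # pure narrative -- missed
]

for text in samples:
    matches = NAIVE_PATTERN.findall(text)
    print(f"{text!r:55} -> {matches}")
```

Run it and you’ll see the pattern grabs the first two examples and silently misses the other four – and real filings are far messier than this short list.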
We knew right away that there was no way we could write a program to account for all these permutations – especially given our time constraints – so we decided to rely on human labor instead. Specifically, we outsourced the bill parsing to Amazon’s Mechanical Turk service.
We’ve used MTurk with good success in the past, but we’d never tried it on anything of this scale before: About 7,000 reports containing more than 40,000 bills for the first six months of 2011 alone.
Anyone who has ever worked with human data entry clerks knows you do so at your own peril. Any time an unpredictable person, rather than a predictable algorithm, is the one parsing your data, you never know what kinds of strange errors you might run into.
Bottom line: You have to mistake-proof the process if you want usable results – something we did by using statistical sampling. We started by taking a handful of samples from the dataset and running them through MTurk one at a time. With each iteration, we learned something new about the Turkers’ data entry mistakes and built in frontend form validation to prevent them.
In the end, our tests showed that our final results came back with an error rate of between 2 and 7 percent – and most of those errors were easily fixable. It wasn’t perfect, but it wasn’t bad either. And you can’t beat the price: Including testing, it cost us about $300 to enter all of the reports from 2011. We expect maintaining the data will cost even less – around $75 a quarter.
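For the curious, the basic way to put a range like that on an error rate is a random spot check: pull a sample of the parsed records, check them by hand against the original filings, and put a confidence interval around what you find. Here’s a rough sketch of the arithmetic – the numbers are made up for illustration, and this is a generic version of the idea rather than our exact procedure.

```python
import math
import random

random.seed(42)  # so the made-up numbers below are reproducible

def spot_check_error_rate(sample_results):
    """Given a hand-checked random sample (True = the entry was wrong),
    return the observed error rate and a rough 95 percent confidence
    interval (normal approximation to the binomial)."""
    n = len(sample_results)
    errors = sum(sample_results)
    p = errors / n
    margin = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)

# Made-up stand-in for a real spot check: about 40,000 parsed bill mentions,
# roughly 4 percent of them wrong, and a human reviews 200 chosen at random.
parsed = [{"is_wrong": random.random() < 0.04} for _ in range(40000)]
checked = [row["is_wrong"] for row in random.sample(parsed, 200)]

rate, low, high = spot_check_error_rate(checked)
print(f"observed error rate: {rate:.1%} (95% CI roughly {low:.1%} to {high:.1%})")
```

With a 200-record sample the interval is fairly wide, which is part of why a range is a more honest answer than a single number.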
2. It takes data to make data
The real brilliance of Google isn’t its search algorithms, or its cool standalone products, or its money-machine ad network (though that’s pretty brilliant too) – it’s how the company uses its data to gather more data.
Every search, every e-mail stored in Gmail, every route mapped out on Google Maps, provides Google with valuable information. We wanted to apply the same principle on a smaller scale to this project.
To that end, we decided to record anonymized information about the bills people are searching. It won’t make us billionaires, but it will show us which bills people are buzzing about, which can in turn help inform our decisions about new products and even help generate stories.
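For anyone wondering what “anonymized” looks like in practice, the gist is: keep the bill and the timestamp, and reduce anything that identifies a visitor to a salted one-way hash. The sketch below is illustrative only – the field names and salt handling aren’t a description of our production code.

```python
import csv
import hashlib
import time

# Illustrative only: the salt and field names here are a sketch, not our real setup.
SALT = "rotate-me-periodically"

def anonymize(identifier: str) -> str:
    """Return a salted, one-way hash of a per-visitor identifier so we can
    count distinct searchers without storing who they are."""
    return hashlib.sha256((SALT + identifier).encode("utf-8")).hexdigest()[:16]

def log_search(bill_number: str, visitor_id: str, path: str = "search_log.csv") -> None:
    """Append one row per search: timestamp, bill searched, hashed visitor."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([int(time.time()), bill_number, anonymize(visitor_id)])

# Example: a visitor searches for AB 52.
log_search("AB 52", visitor_id="session-abc123")
```

The salt matters: without it, anyone with a list of candidate identifiers could reverse the hashes by brute force.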
This is the point where people cry “Privacy!” And understandably so – it’s a slippery slope. But first consider the benefits: One thing news applications do very well is generate traffic. Traffic isn’t just good for CPM; it’s also good – and arguably even more valuable – for the data it generates. Once your audience gets big enough, you can mine real intelligence from its collective action.
These days, if your website isn’t learning from its users, you’re bringing brass knuckles to a bazooka fight. The important thing is to gather that information responsibly, and to draw insight from the collective action of your users.
3. Start with a minimum viable product
Think about the lifecycle of a major newsroom project. An investigation might go on for months before it produces a single story. That story crescendos with a blowout, front-page hit. If you’re lucky, it resonates and generates follow-ups for months to come. If you’re not – maybe the timing is off, maybe the subject falls flat, or maybe Lindsay Lohan got arrested that day and bumped you below the fold – all your hard work lands with a thud, never to be heard from again.
That approach might work in investigations, where discretion is still important, but it’s far from the best way to develop news applications.
Lean Startup methodology has a concept called the Minimum Viable Product. It’s basically the simplest, earliest version of a product that you can still push out the door with a smile on your face.
The idea is to put the product in front of users quickly and to start gathering feedback, which you can then use to iterate and improve the product. No need to spend weeks building whiz-bang features before you even know whether people will use them.
We have big plans for our lobbying application, but we started simple. The interface is nothing more than a Google-type search box that returns a list of names. That’s been successful enough to generate dozens of pieces of feedback, which we’re using to help prioritize new features.
We’ll keep building it up that way as the project moves forward. The process minimizes waste, keeps our investment low and, in turn, makes it easier to walk away if the project one day stops being useful.