Since we launched our Politics Verbatim project a couple of years ago, I’ve been hung up on what should be a simple problem: How can we automate the extraction of quotes from news articles, so it doesn’t take a squad of bored-out-of-their-minds interns to keep track of what politicians say in the news?

You’d be surprised at how tricky this is. At first glance, it looks like something a couple of regular expressions could solve. Just find the text with quotes in it, then pull out the words in between! But what about “air quotes”? Or indirect quotes (“John said he hates cheeseburgers.”)? Suffice it to say, there are plenty of edge cases that make this problem harder than it looks.

When I took over management of the combined Center for Investigative Reporting/Bay Citizen technology team a couple of months ago, I encouraged everyone to have a personal project on the back burner – an itch they wanted to scratch either during slow work days or (in this case) on nights and weekends.

This is mine: the citizen-quotes project, an app that uses simple machine learning techniques to extract quotes, more than 40,000 of them, from every article that has run on The Bay Citizen since it launched in 2010. The goal was to build something that accounts for the limitations of the traditional method of solving quote extraction – regular expressions and pattern matching. And sure enough, it does a pretty good job.

The other hope is to help make these techniques more accessible. We’ve had great results using simple machine learning techniques for several projects this year, and we’re starting to realize that higher-order data science can bring real value to the practice of data journalism. The biggest barrier is that so much of it can be painfully opaque if you don’t have a background in math.

But stuff like this doesn’t have to be hard. And it has a place in journalism. We’ve thoroughly documented the project’s code for anyone who’s interested in learning, and here I’ll also offer up a few lessons and cautionary tales for any data journalist brave enough to run down the rabbit hole.

Regular expressions vs. machine learning

One practical lesson I’ve learned tinkering with machine learning over the last couple of years is that, applied correctly, classifiers can do a much better job of information extraction and pattern recognition than regular expressions. Don’t take my word for it – bitly’s Hilary Mason thinks so, too.

Here’s a simple example: address extraction. This is the regular expression EveryBlock used for extracting addresses from raw text when it was open-sourced a few years back. I’m sure it does a fantastic job, but wow – my deepest sympathies to whoever had to write it. It also has an obvious drawback: If a chunk of text doesn’t perfectly fit the pattern, it’s not going to be matched.

Machine learning offers a different approach: Train the computer to look at the constituent parts of an address and assign a probability that, taken together, they compose a valid address string.

Take something like 123 Fake St., Brattleboro, Vermont. We might tell the computer to note that it begins with numbers; that it contains a known street abbreviation, city and state; that the numbers are followed by a capitalized word; etc. Given a manually classified training set, the computer will learn which of those characteristics tend to indicate true address strings and assign a judgment accordingly.
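To make that concrete, here is a minimal Python sketch of what "noting characteristics" might look like. The function name address_features and the tiny lookup tables are assumptions made up for illustration; they are not taken from EveryBlock or from any real address parser.

```python
import re

# Hypothetical lookup tables; a real project would use far fuller lists.
STREET_ABBREVIATIONS = {"st", "ave", "blvd", "rd", "dr", "ln"}
CITIES = {"brattleboro", "oakland"}
STATES = {"vermont", "california"}


def address_features(text):
    """Describe a candidate string with a few yes/no characteristics."""
    tokens = re.findall(r"[A-Za-z]+|\d+", text)
    lowered = [t.lower() for t in tokens]
    return {
        "starts_with_number": bool(tokens) and tokens[0].isdigit(),
        "has_street_abbreviation": any(t in STREET_ABBREVIATIONS for t in lowered),
        "has_city": any(t in CITIES for t in lowered),
        "has_state": any(t in STATES for t in lowered),
        # True if any number is immediately followed by a capitalized word.
        "number_then_capitalized_word": any(
            a.isdigit() and b[:1].isupper() for a, b in zip(tokens, tokens[1:])
        ),
    }


features = address_features("123 Fake St., Brattleboro, Vermont")
```

None of these checks decides anything on its own. The point is that a classifier trained on hand-labeled examples can learn how much weight each one deserves.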

Not that pattern matching is bad. The best open-source work I’ve seen around quote extraction has come from regular expressions and pattern matching. There’s this awesome On the Record project that Alex Redstone, James Addison & Co. put together at the inaugural San Francisco News Hack Day last month; this blog post from a few years back about using Java-based LingPipe for the task; and some work from the RAVINE project at Carnegie Mellon University that relies on context-free grammars.

But quotes are a tricky business. Lots of things look like quotes that aren’t, and some things are more quote-like than others. The ideal approach would be able to account for some of that fuzziness in a way that pattern matching doesn’t.

Maximum entropy models

That’s the goal we had in mind with citizen-quotes. Rather than using concrete patterns, we came up with 12 common characteristics that often indicate quotes and used a subset of them to train an algorithm to tell the difference between quotes and non-quotes. Specifically, we used the implementation of a maximum entropy model classifier provided by Python’s NLTK package.

A maxent model is a supervised classifier that requires a couple of inputs to do its job: a set of hand-tagged training data to learn from and a series of characteristics, known as features, to help it distinguish between quotes and non-quotes. (For example, does the paragraph contain an attribution word like “said”?)

For training, we fed the algorithm a set of several hundred randomly selected paragraphs from our database of The Bay Citizen content, which we tagged by hand as being quotes or not.

For features, we developed a set of about a dozen, of which we ended up using six: Does the paragraph contain common attribution words (said, asked, etc.); does it contain quote marks; does it have a common attribution word within five words of a quote mark (“I love tacos,” Smith said.); how many words does it have in quotes (helps deal with the “air quotes” problem); what are the five words that fall immediately after a closed quote; and what is the last word in the paragraph. Each feature is represented by a function that takes paragraph text as an input and returns either a boolean or categorical variable. You can see them here.
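As a rough illustration of the six features described above, here is a sketch in Python. It is not the project's code: the attribution word list, the function names and the tokenizing rules are all assumptions, and "within five words" is approximated as token distance, with quote marks counted as tokens.

```python
import re

# Hypothetical attribution list; the project's real list is not reproduced here.
ATTRIBUTION_WORDS = {"said", "asked", "says", "told", "added"}

QUOTE_CHARS = '"\u201c\u201d'  # straight and curly double quotes
WORD = r"[\w'\u2019]+"         # words, allowing straight and curly apostrophes
TOKEN = re.compile(WORD + "|[" + QUOTE_CHARS + "]")
QUOTED_SPAN = re.compile('["\u201c][^"\u201d]*["\u201d]')


def words(text):
    """Lowercased words in the text, with punctuation dropped."""
    return re.findall(WORD, text.lower())


def quote_features(paragraph):
    """Return a dict of simple features for one paragraph."""
    tokens = TOKEN.findall(paragraph.lower())
    quote_positions = [i for i, t in enumerate(tokens) if t in QUOTE_CHARS]
    attribution_positions = [i for i, t in enumerate(tokens) if t in ATTRIBUTION_WORDS]
    first_span = QUOTED_SPAN.search(paragraph)
    after_quote = words(paragraph[first_span.end():])[:5] if first_span else []
    all_words = words(paragraph)
    return {
        "has_attribution": bool(attribution_positions),
        "has_quote_marks": bool(quote_positions),
        "attribution_near_quote": any(
            abs(a - q) <= 5 for a in attribution_positions for q in quote_positions
        ),
        # Raw count; a real feature might bin this into a few categories.
        "words_in_quotes": sum(len(words(s)) for s in QUOTED_SPAN.findall(paragraph)),
        "after_quote": " ".join(after_quote),
        "last_word": all_words[-1] if all_words else "",
    }


example = quote_features("\u201cI love tacos,\u201d Smith said.")
```

For the taco example, every quote-like signal fires: an attribution word, quote marks, the two close together, three quoted words, and a paragraph ending in "said." A paragraph with a single scare-quoted word would show quote marks but only one quoted word and no attribution.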

Using those inputs, NLTK’s maxent implementation applies your choice of optimization algorithm to learn how much weight each feature deserves, then uses those weighted features to determine whether an unseen paragraph should be classified as a quote. The math is pretty accessible. There’s a great explainer of the intuition here.
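To show the idea without leaning on any library, here is a small stand-in written in plain Python. It is not NLTK's implementation: for two classes a maxent model reduces to logistic regression, so this sketch fits one weight per (feature, value) pair with simple gradient ascent. The function names, the learning rate and the five training examples are all invented for illustration. NLTK's classifiers are trained on a similar list of (feature dictionary, label) pairs.

```python
import math
from collections import defaultdict


def predict(weights, features):
    """Probability that a paragraph with these features is a quote."""
    score = weights.get("bias", 0.0)
    score += sum(weights.get(pair, 0.0) for pair in features.items())
    score = max(-30.0, min(30.0, score))  # keep math.exp in a safe range
    return 1.0 / (1.0 + math.exp(-score))


def train_maxent(examples, epochs=200, learning_rate=0.5):
    """Fit a two-class maxent model by stochastic gradient ascent.

    `examples` is a list of (feature_dict, is_quote) pairs. Each
    (feature name, value) pair gets its own weight.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, is_quote in examples:
            error = (1.0 if is_quote else 0.0) - predict(weights, features)
            weights["bias"] += learning_rate * error
            for pair in features.items():
                weights[pair] += learning_rate * error
    return weights


# Made-up training data: only paragraphs with both signals are quotes.
training = [
    ({"has_attribution": True, "has_quote_marks": True}, True),
    ({"has_attribution": True, "has_quote_marks": True}, True),
    ({"has_attribution": False, "has_quote_marks": True}, False),
    ({"has_attribution": True, "has_quote_marks": False}, False),
    ({"has_attribution": False, "has_quote_marks": False}, False),
]

weights = train_maxent(training)
p_quote = predict(weights, {"has_attribution": True, "has_quote_marks": True})
p_banner = predict(weights, {"has_attribution": False, "has_quote_marks": True})
```

The output is a probability rather than a hard yes or no, which is what lets the classifier express how certain it is about a given paragraph.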

The end result is that a paragraph like this is classified with high certainty as a quote:

“The family and friends that we spoke to are shocked that he’s in this position,” Olmo said.

In a case like this, though, the classifier is much less certain about what to do, yet it still properly classifies the paragraph as a non-quote:

A young couple carried a large quilted banner that read, “Oakland Rise Up!” They sat on the floor, holding the sign, in front of the board members for the first half of the meeting.

Evaluating the algorithm over numerous random subsets of our training set, we find that it typically produces, at most, a handful of false positives or false negatives. And most of those are records that could probably be classified either way. Not bad. There have been very few glaring errors.
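A check along those lines might be sketched as follows. This is an assumption about the general shape of such an evaluation, not the project's actual test harness: the function names are hypothetical, and a fuller version would retrain the classifier on the remaining examples before scoring each subset.

```python
import random


def count_errors(classify, labeled_examples):
    """Return (false_positives, false_negatives) for classify(features) -> bool."""
    false_positives = false_negatives = 0
    for features, is_quote in labeled_examples:
        predicted = classify(features)
        if predicted and not is_quote:
            false_positives += 1
        elif is_quote and not predicted:
            false_negatives += 1
    return false_positives, false_negatives


def evaluate_on_random_subsets(classify, labeled_examples, subset_size, trials=10, seed=0):
    """Count errors on several random subsets of the hand-tagged data."""
    rng = random.Random(seed)  # seeded so repeated runs are comparable
    return [
        count_errors(classify, rng.sample(labeled_examples, subset_size))
        for _ in range(trials)
    ]


# Made-up labeled data and a deliberately naive stand-in classifier.
labeled = [
    ({"has_quote_marks": True}, True),
    ({"has_quote_marks": True}, False),   # e.g. a banner slogan in quote marks
    ({"has_quote_marks": False}, True),   # e.g. an indirect quote
    ({"has_quote_marks": False}, False),
]
results = evaluate_on_random_subsets(
    lambda f: f["has_quote_marks"], labeled, subset_size=4, trials=3
)
```

Keeping false positives and false negatives separate matters here, because a missed quote and a wrongly flagged banner are different kinds of mistakes to clean up.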

A quick note about pronoun co-referencing

Another tricky problem that we didn’t invest as much effort in solving was pronoun co-referencing – or the act of teaching the computer that the “he” in paragraph five of a story corresponds to the “John Smith” in paragraph one.

This being a demo, we went for the simplest possible solution: OpenCalais. You can see the code for our approach here. There’s not much to learn from it, other than that OpenCalais is a decent tool for the job. Most data journalists tend to use it primarily for tasks like named entity extraction, but it’s worth noting that it has other uses as well.

Possible applications

As I mentioned before, our work in this area was inspired by our Politics Verbatim project. At the time, we had interns spend an hour or two a day parsing and classifying quotes from news articles. Had we used a system like this, we would have been able to cut down significantly on manual labor. With some adjustments – and maybe some help from Mechanical Turk in dealing with low-certainty cases – this approach makes the task of tracking politicians’ printed statements a lot more scalable.

I’ve also long been fascinated with the idea of extracting value from news archives. Quotes are a relatively large and clunky unit of value, but they still can help answer some interesting questions for a news organization. Who is quoted most often on different beats? How much real estate do we devote to quoting opposing sides of arguments? How representative are the people we quote of our community? How do colorful or inflammatory quotes and sources affect traffic or engagement?

Even the simple quote browser demo we wrote (launchable from the code on our GitHub page) could have some interesting applications. If you let it scroll by for a while, you’re bound to see some interesting quotes that will compel you to read the stories from which they came. Quotes provide a distinct, intimate and novel entry point into content, using the words of people in the community.

Citizen-quotes is just a small nights-and-weekends project that took a couple of weeks to build, but if you think the technology might be useful to a project you’re working on, let me know. We’d love to collaborate!

Chase Davis is the director of technology for California Watch and its parent organization, the Center for Investigative Reporting. He also writes about money and politics issues for California Watch. Chase previously worked as an investigative reporter at The Des Moines Register and the Houston Chronicle and is a founding partner of the media-technology firm Hot Type Consulting. He is a graduate of the Missouri School of Journalism.