We’ve had several problems pop up this year that have called for comparing a bunch of documents to a bunch of other documents, typically to find which ones are similar.

It’s a simple problem on its face, but a difficult one to scale. Comparing thousands of documents to one another can call for tens of millions of individual comparisons. Tens of thousands of documents can mean hundreds of millions or even billions of comparisons, assuming you want to compare everything to everything else.

Needless to say, the problem pretty quickly outgrows a single machine. So I spent some time over the weekend breaking it down into a MapReduce workflow so we could run it over an Amazon Elastic MapReduce cluster.

The process wasn’t without its headaches. Translating something into a MapReduce process seems a little unnatural — especially if you, like me, come from a background in relational databases. Thankfully, Tamer Elsayed, Jimmy Lin and Douglas W. Oard, of the University of Maryland, laid out a great method for solving pairwise document similarity using MapReduce.

I couldn’t find an implementation of their algorithm online, so I wrote one myself. I doubt it’s perfect, but it seems to work alright. Hopefully it gives someone else a head start. You can find it on our Github page.

The algorithm uses the term weighting scheme from the example in the paper, which isn’t especially useful. But it’s not too hard to swap in TF-IDF or another more useful comparison tool. The process is split into two parts, each of which is represented by a mapper and a reducer.

The first step is to break down your documents into an inverted index (it helps to run some preprocessing beforehand — stemming, punctuation and stopword removal, etc.) That’s what inv-index-mapper.py and inv-index-reducer.py are for.

The next step is to run pairwise comparisons using that index in order to create a similarity score for each set of documents. That’s where pairwise-mapper.py and pairwise-reducer.py come in.

I don’t claim to be an expert, so have a look at the code and feel free to correct my mistakes. My contact info is in the package README.

Creative Commons License

Republish our articles for free, online or in print, under a Creative Commons license.

Chase Davis is the director of technology for California Watch and its parent organization, the Center for Investigative Reporting. He also writes about money and politics issues for California Watch. Chase previously worked as an investigative reporter at The Des Moines Register and the Houston Chronicle and is a founding partner of the media-technology firm Hot Type Consulting. He is a graduate of the Missouri School of Journalism.