A resident peeks out her front door after hearing yelling in the courtyard at the Hacienda housing complex. Hacienda has paid security guards, but some have admitted that the place intimidates them.

Credit: Lacy Atkins/San Francisco Chronicle

After reporter Amy Julia Harris broke the story of squalid living conditions in one of the Bay Area’s worst public housing complexes, we wanted to visualize the deplorable environment that residents had lived in for years and that the Richmond Housing Authority had ignored.

Harris’ reporting gave us a way to do that. Her stories prompted an independent inspection of the Hacienda housing complex. Inspectors surveyed the conditions in each occupied apartment. Once the review was completed, Harris got a copy of the results.

While the inspection report contained the most complete list of problems in the housing project to date, the information came on paper and was not structured in a way that allowed meaningful analysis. So we used several techniques to create the data we wanted.

Structuring unstructured data

The reports from Hacienda gave us the apartment number and a list of problems inspectors found. To digitize the document, I scanned it and used optical character recognition to make the text computer-readable, then loaded the text into a database. Each apartment number began with the floor it was on, followed by the two-digit unit number. For example, unit 525 stood for floor 5, apartment 25. I wrote a Python script to group the problems by floor. This allowed me to see how many units on each floor were affected.
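The original script is not reproduced here, but the numbering scheme described above can be illustrated with a minimal sketch. The function names and the sample apartment numbers are hypothetical, and the sketch assumes every apartment number is a string of digits whose last two digits are the unit.

```python
from collections import defaultdict


def split_unit_number(unit_number: str) -> tuple[int, str]:
    """Split an apartment number such as '525' into (floor, unit)."""
    digits = unit_number.strip()
    if not digits.isdigit() or len(digits) < 3:
        raise ValueError(f"unexpected apartment number: {unit_number!r}")
    # The last two digits are the unit; everything before them is the floor.
    return int(digits[:-2]), digits[-2:]


def units_per_floor(unit_numbers):
    """Count the distinct units that appear on each floor."""
    floors = defaultdict(set)
    for number in unit_numbers:
        floor, unit = split_unit_number(number)
        floors[floor].add(unit)
    return {floor: len(units) for floor, units in sorted(floors.items())}


# Hypothetical apartment numbers, for illustration only.
sample = ["101", "102", "525", "525", "601"]
print(units_per_floor(sample))  # {1: 2, 5: 1, 6: 1}
```

Keeping the unit as a string preserves the leading zero in units such as 01.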

The biggest problem with the data was that the reports weren’t in an easy-to-use format. The inspection results were basically big blobs of text in all capital letters separated by line breaks.

The inspection reports for the Hacienda public housing complex in Richmond, Calif., came on paper in a format that wasn’t easy for data analysis.
Credit: /CIR

Using the line breaks as my guide, I wrote a second script to break up the descriptions of problems into individual chunks and associate them back to the unit to which they belonged. The data went from this:


To this:

floor: 1
unit: 01
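As a rough illustration of that second step, the sketch below splits one unit's block of text on line breaks and attaches the resulting chunks to the unit. It is not the script used for the project: the function name, the record layout and the sample text are all assumptions, and the sample text is made up rather than taken from the inspection report.

```python
def split_problems(unit_number: str, blob: str) -> dict:
    """Turn one unit's block of inspection text into a structured record."""
    problems = [
        line.strip()
        for line in blob.splitlines()
        if line.strip()  # skip blank lines between chunks
    ]
    return {
        "floor": int(unit_number[:-2]),  # leading digit(s) are the floor
        "unit": unit_number[-2:],        # last two digits are the unit
        "problems": problems,
    }


# Made-up text standing in for the OCR output; not from the real report.
blob = "OUTLET COVER MISSING\n\nWINDOW DOES NOT LOCK\n"
record = split_problems("101", blob)
print(record)
```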

Finally, to classify each problem, we created a rubric based generally on the categories that the U.S. Department of Housing and Urban Development uses, such as electrical, vermin and plumbing. Categorizing often required making a decision, so we convened a group of reporters to discuss each problem and assign a category. This created the final dataset we used to power the graphic:

floor: 1
unit: 01
    category: 'electrical'
    category: 'door-windows'
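A dataset in this shape can be tallied by floor and category with a few lines of Python. The sketch below is only an illustration: the record layout (a "categories" list per unit) and the sample values are assumptions, not the project's actual schema or findings.

```python
from collections import Counter

# Hypothetical records shaped like the example above; values are illustrative.
records = [
    {"floor": 1, "unit": "01", "categories": ["electrical", "door-windows"]},
    {"floor": 1, "unit": "02", "categories": ["plumbing"]},
    {"floor": 2, "unit": "01", "categories": ["electrical"]},
]


def category_counts_by_floor(records):
    """Return {floor: Counter of category -> number of problems}."""
    counts = {}
    for record in records:
        floor_counter = counts.setdefault(record["floor"], Counter())
        floor_counter.update(record["categories"])
    return counts


counts = category_counts_by_floor(records)
for floor, counter in sorted(counts.items()):
    print(floor, dict(counter))
```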

Extracting information from the paper inspection reports was a pain, but it didn’t stop us from doing the project.

Too often, we assume that an agency that gives us only paper is being obstinate and refusing to offer the data digitally. Sometimes that’s true, but sometimes no other data exists. So it’s worth working with what you have and finding creative ways to get the data you want. (In our case, Richmond officials ended up sending us the digital version about two weeks before we published.)

To manually enter the data, we used the administrative interface provided by the Django Web framework, which makes data entry much easier. Entering the Hacienda data, with 108 units and an average of four problems per unit, took about a day. (Just make sure you have someone check your work!)

Building the graphic

Once we had the inspection data, we needed to build the actual graphic. Our initial idea was to use 3-D graphics to create a rotating model of the building. I contacted an architect, who told me that many buildings built in the last 10 to 20 years have 3-D models available. Hacienda was built in 1966, so I needed to find an alternative approach.

While I was experimenting with the 3-D idea, I started prototyping a 2-D version using scalable vector graphics. That format allowed me to easily animate the building and edit the shape of the building because SVG files, like HTML files, are Web documents that can be manipulated with CSS and JavaScript. We ended up using a 2-D graphic that had 3-D characteristics. So let’s just call that semi-3-D.
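Because an SVG is just markup, each shape can carry an id and a class for a stylesheet or script to target. The sketch below is a simplified illustration and not the Hacienda graphic itself: it uses Python's standard library to generate one rectangle per floor, and the six floors, dimensions, ids and class name are arbitrary assumptions.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # emit <svg xmlns="..."> without a prefix


def build_outline(floors: int, width: int = 200, floor_height: int = 20) -> str:
    """Return SVG markup with one addressable <rect> per floor."""
    svg = ET.Element(
        f"{{{SVG_NS}}}svg",
        {"width": str(width), "height": str(floors * floor_height)},
    )
    for floor in range(1, floors + 1):
        # Floor 1 is drawn at the bottom, so y shrinks as floors go up.
        y = (floors - floor) * floor_height
        ET.SubElement(
            svg,
            f"{{{SVG_NS}}}rect",
            {
                "id": f"floor-{floor}",  # hypothetical id scheme
                "class": "floor",        # hypothetical class name
                "x": "0",
                "y": str(y),
                "width": str(width),
                "height": str(floor_height),
            },
        )
    return ET.tostring(svg, encoding="unicode")


markup = build_outline(floors=6)
print(markup)
```

When markup like this is embedded in a page, a CSS selector such as `#floor-3` or `.floor` can restyle a floor, and JavaScript can select the same elements to animate them.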

To create a semi-3-D view of Hacienda, we used Google Earth to find a building perspective that would illustrate the building’s shape and layout.
Credit: /CIR

To create that semi-3-D view of Hacienda, I had to find a building perspective that would provide a sense of the building’s shape and layout. I used Google Earth to locate Hacienda and experimented with the perspective until I found one that worked.

This involved using Google Earth's built-in perspective tools to navigate around the building and find a bird's-eye view. For inspiration, I used Dribbble, an online network where designers share their ideas, and searched for other infographics of buildings.

After finding the right perspective, I grabbed my legacy tools – a pencil and paper – and traced the building from my computer screen. From there, I scanned the image and used Inkscape to create the SVG outline of the building.

Inkscape, which is similar to Adobe Illustrator, is free software for editing SVG files. It’s simple to use and allows you to create Web graphics from almost any image source. To learn more about tracing images in Inkscape, check out the online tutorial.

After some experimenting in Inkscape, I created the SVG we used for the final project: 

This outline of Hacienda was created with Inkscape, which allows you to create Web graphics from almost any image source.
Credit: /CIR

Journey through Hacienda took about two months from conception to launch. I tried a lot of new techniques in this graphic, including 3-D modeling, digitizing paper documents and animation. While we didn’t use all these techniques for this app, we’ll be able to apply them to future projects.


Aaron Williams is a news applications developer for Reveal, focusing on front-end development, data analysis and data visualization. He previously served as a web producer for the Los Angeles Times and received a bachelor's degree in journalism from San Francisco State University. Williams is based in Reveal's Emeryville, California, office.