When I took over leadership of our newly created data team earlier this year, one of the first things I wanted to do was bring some sanity to how we build, deploy and host our news applications.
Put another way, I wanted to undo the Gordian knot I tied when I first came to California Watch more than two years ago. At the time, I didn’t care about configuring servers. I didn’t even care about building news applications – I was a reporter, and I wanted to report. So one day, to get it out of the way as fast as possible, I set up the mother of all awful web servers, with dev and production both on one box.
Blasphemy, I know. How we’ve made it more than two years without an epic meltdown is still a mystery to me. For better or worse, now that I’ve accepted my fate by moving into a full-time technology gig, I can’t afford to be indifferent anymore.
Enter Heroku.
If you’re not familiar, Heroku is the leading platform-as-a-service stack, the one that has been all the rage among Rails developers over the last few years. The company added Django support late last year, which made it worth a look. To oversimplify: it lets you host apps without worrying about configuring or maintaining servers. Once the initial setup is out of the way, deploying an app is as simple as a git push.
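For the uninitiated, a first deployment looks something like this – a minimal sketch using the Heroku commands of the moment, with a hypothetical app:

```
# Create a new Heroku app on the Cedar stack, then deploy it
heroku create --stack cedar
git push heroku master

# Open the newly deployed app in a browser
heroku open
```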
As it turns out, Heroku is also a great fit for the classic news applications use case: small, self-contained apps with relatively predictable traffic patterns, built and run by developers who either a.) don’t have a ton of sysadmin experience; or b.) wish they didn’t have a ton of sysadmin experience because they would rather be building apps instead.
A baseline configuration is free – and, configured correctly, more than enough to run a news app during non-peak traffic. Apps can scale horizontally for traffic bursts (like launch day) at a very reasonable cost.
You can read better guides elsewhere about the service, its pricing and how to get started. This post is intended to outline a few specific lessons we learned using Heroku for news apps – a move that so far has turned out to be an excellent decision.
Pick a WSGI setup
Heroku’s docs recommend running Django apps using gunicorn – a speedy, lightweight WSGI server that fields requests passed straight to your dynos by Heroku’s routing mesh.
Its launch parameters can be defined in a Procfile – basically a list of the processes Heroku should run whenever the app is deployed or a dyno is restarted. For our setup, we’re running gunicorn with 10 asynchronous workers using gevent, as shown below.
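We haven’t published our exact Procfile, but a setup like ours might look something like this (the project module name is hypothetical):

```
web: gunicorn myproject.wsgi:application --workers=10 --worker-class=gevent
```

The web process type is what Heroku routes HTTP traffic to; everything after the colon is an ordinary gunicorn invocation, and the gevent worker class assumes gevent is in your requirements file.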
Load testing found that configuration to be fine for our purposes, but you could just as easily use a server like Tornado or CherryPy.
Run your own database server
This isn’t a knock on Heroku’s dedicated Postgres instances, which actually have some very cool features. But a price tag starting at $200 a month was a little steep for our liking, so we configured our own (substantially cheaper) EC2 instance instead.
A few lessons here: First, use a connection pooler (we went with the ultra-lightweight pgbouncer). With this configuration, dyno performance is more likely to be bounded by an offsite database than its own CPU or RAM. Second, a quick once-over with pgtune can really boost your performance and reliability.
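On the Django side, pointing an app at the pooler instead of directly at Postgres is just a settings change. A minimal sketch, assuming pgbouncer is listening on its default port (6432) on your EC2 box – the host and database names here are hypothetical:

```python
import os

# settings.py (production): route queries through pgbouncer, not Postgres itself
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'myapp',
        'USER': 'myapp',
        'PASSWORD': os.environ.get('DATABASE_PASSWORD', ''),
        'HOST': 'db.example.com',  # the EC2 database box
        'PORT': '6432',            # pgbouncer's default port
    }
}
```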
Finally – and this is important – make sure you’re using EC2. Heroku is built on top of EC2, and its primary nodes are located in Amazon’s US-East region. Parking your database server in the same region will dramatically cut down on latency issues. At first we tried running our database server on Rackspace Cloud, which is based in Texas. It didn’t turn out well.
Stress testing our new setup using ab and Blitz, we had no problem clearing several hundred requests per second with minimal delays and no drops – more than enough to handle significant off-peak traffic.
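If you want to run a similar test, ab makes for a quick first pass (the URL here is a placeholder):

```
# 1,000 total requests, 100 at a time, against the app's homepage
ab -n 1000 -c 100 http://myapp.herokuapp.com/
```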
These guides were huge in helping us figure this out.
Get to know buildpacks
A nearly deal-breaking concern I had when we first started looking at Heroku was its lack of GeoDjango support. The system-level software on a generic dyno is pretty sparse – basically just enough to let you run a bare-bones Django project. If you want to install anything else – GEOS, GDAL and Proj.4, for example – you need to understand Heroku buildpacks.
Buildpacks allow you to provision a dyno with any additional software you need. It would be nice if the process of creating buildpacks were a bit more intuitive, but our approach was to precompile GEOS, GDAL and Proj.4 specifically for the Heroku environment using the Heroku bash console, then store those binaries on S3, where our buildpack can fetch them.
No GeoDjango buildpack existed when we started looking at Heroku, so we wrote our own. We’ve got a half-dozen apps running on it now, and it works fine for our purposes. If you want to hack on it and improve it, it’s on our GitHub.
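If you do want to hack on it, the core concepts are small: a buildpack is just three executables, and pointing an app at a custom one is a single config variable. A sketch (the repo URL below is a placeholder, not necessarily ours):

```
bin/detect   # decides whether this buildpack applies to a given app
bin/compile  # installs dependencies, e.g. fetching prebuilt GEOS/GDAL/Proj.4 binaries from S3
bin/release  # emits metadata, such as default process types

# Point an app at a custom buildpack
heroku config:add BUILDPACK_URL=git://github.com/example/heroku-buildpack-geodjango.git
```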
Any cache helps
This one should go without saying. Every Heroku app comes with 5 MB of free cache space if you install the memcached addon. That doesn’t sound like much, but do it anyway. Just be aware that you’ll need to make some changes to your app’s production settings in order for it to work.
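What those changes look like depends on your client library. With the memcache addon and a client like django-pylibmc – one common pairing at the time – the production settings come down to a few lines. A sketch; check the addon’s docs for the exact environment variables it sets:

```python
import os

# Production cache settings for Heroku's memcache addon, which exposes
# its server list and credentials as environment variables
CACHES = {
    'default': {
        'BACKEND': 'django_pylibmc.memcached.PyLibMCCache',
        'LOCATION': os.environ.get('MEMCACHE_SERVERS', 'localhost:11211'),
        'BINARY': True,  # the addon's SASL auth requires the binary protocol
    }
}
```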
And while you’re at it, serve your static media through S3.
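django-storages and boto make that a near-trivial settings change, too. A minimal sketch (the bucket name is hypothetical):

```python
import os

# Serve uploaded media out of S3 via django-storages
DEFAULT_FILE_STORAGE = 'storages.backends.s3boto.S3BotoStorage'
AWS_ACCESS_KEY_ID = os.environ['AWS_ACCESS_KEY_ID']
AWS_SECRET_ACCESS_KEY = os.environ['AWS_SECRET_ACCESS_KEY']
AWS_STORAGE_BUCKET_NAME = 'myapp-media'
```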
Use New Relic
There’s a funny thing about Heroku’s free single-dyno setup: namely, your dyno automatically idles down after periods of inactivity to conserve resources. That would be fine, except that when someone hits your idled dyno, its web server can take 10 to 20 seconds to spool up again. Not good.
It pains me to say this, but you can prevent this problem with a regular uptime monitoring service like New Relic, which you can install as a Heroku add-on. On top of monitoring your app’s performance, which can help you decide when/how to scale it, New Relic will also ping your app once a minute to be sure it’s up and running – keeping its dyno and web server alive in the process. Here’s how you set it up.
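The short version: install the add-on, put the Python agent in your requirements file and deploy. The plan name below was the entry-level tier when we set this up; check the current add-on listing before copying it:

```
heroku addons:add newrelic:standard
```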
The reason it pains me is that it feels like cheating. Apps that run on multiple dynos don’t face the same idling issues, but it’s not worth an extra $35 a month (per app!) just to spare the handful of people still visiting our retired apps from spool-up delays.
A big part of me wishes we could pay Heroku a small amount to keep our single-dyno apps from idling, but for now New Relic uptime monitoring is the next best option.
Wrapping up
Heroku is only one part of our new infrastructure setup – which borrows a couple pages from other news apps shops around the country. On one end of the continuum are static apps, baked out to HTML and hosted on S3; in the middle is Heroku; and on the far end are larger special-needs apps, like our Politics Verbatim project, that continue to run on dedicated servers in the Rackspace Cloud.
It’s all built around a philosophy that our developers and I have better things to do than tune servers all day, so we want to cut out as much sysadmin work as possible. Glitches happen. Systems erode. The more apps we build, the more we have to worry about something breaking.
Or at least we did. Hopefully not anymore.