So you’ve heard that tuning the Ruby garbage collector is essential for Rails performance, but all those tuning parameters in the REE documentation look scary and indecipherable? Well, I took an afternoon and played around with Birdstack‘s GC config, and it turns out that not only is it actually really important to tune those parameters, they’re not really that scary.
First, some terminology:
Ruby Heap: A collection of ~20-40 byte “slots” that keep track of every object allocated in a Ruby program
C Heap: Memory allocated by Ruby for storing data for your objects if the slot is too small. If you’re storing an integer, it’ll probably fit in the slot. If you’re storing a string containing the entirety of your latest novel, Ruby will need a slot to keep track of the object and some space on the C heap to store your bestseller.
Garbage Collection: Unlike C/C++, Ruby takes care of deallocating objects for you. It doesn’t do this as soon as you stop using an object, it waits and does period sweeps through the Ruby heap (the collection of slots), figures out what objects aren’t being used and then deletes them.
Ruby Enterprise Edition: A distribution of MRI 1.8.x with all sorts of nice patches. To tune Ruby’s garbage collector, you’ll need to be running REE.
If you want to know more about the Ruby garbage collector, go read the Engine Yard article on MRI Memory Allocation. Also, you should really take a look at the REE documentation to see more about these parameters. And if you really want to know what’s going on, just go read the gc.c source. It’s really not too complicated.
OK, so REE gives us a bunch of knobs to turn for GC and memory allocation, and a lot of diagnostic output. The first step is to turn on that output so we can decide what knobs to fiddle with. Here’s how I set up my Rails app to give me an approximate idea of what GC activity happens with every request.
GitHub Gist: http://gist.github.com/464624
Once you’ve got that set up, get your app running in a staging environment that closely mimics your production environment, and start up an instance of your app with GC logging turned on. I found that I needed to use just ‘script/server’ with WEBrick instead of Passenger or I wouldn’t get anything in my logs. Probably something with how Passenger handles file descriptors when forking; I didn’t feel like debugging it. You might also be interested in a my post on setting machine-wide Ruby GC defaults.
I started with these environment variables:
RUBY_GC_STATS=1
RUBY_GC_DATA_FILE=/tmp/ruby_gc.txt
And here’s what I got for my first request:
***
Request completed: http://localhost:8080/people
GC collections: 7
GC time (us): 563772
Bytes allocated since last GC run: 1856079
Bytes allocated since last request: 43486971
Number of allocations since last request: 887968
Live objects: 558106
Total allocated objects since interpreter start: 2557208
HEAP[ 0]: size= 10000
HEAP[ 1]: size= 20001
HEAP[ 2]: size= 38001
HEAP[ 3]: size= 70402
HEAP[ 4]: size= 128722
HEAP[ 5]: size= 233698
HEAP[ 6]: size= 422655
***
Woah, after only 1 request, we’ve done 7 GC runs, allocated 7 Ruby heaps, and spent over half a second doing nothing but garbage collecting. To be fair, this includes loading WEBrick, and I’m on a resource-starved virtual machine on my laptop, so things will be a bit different once we’re in full production. But this is still pretty horrible!
Let’s start with the easy optimization: heap allocations. After 1 request, we’ve got 558,106 objects, so it’s pretty silly to stick with the default heap size of 10,000 and make Ruby allocate more heaps when it runs out of room, so let’s up that to 600,000 and try again.
RUBY_GC_STATS=1
RUBY_GC_DATA_FILE=/tmp/ruby_gc.txt
RUBY_HEAP_MIN_SLOTS=600000
And here’s the request:
***
Request completed: http://localhost:8080/people
GC collections: 6
GC time (us): 502798
Bytes allocated since last GC run: 4013508
Bytes allocated since last request: 43482680
Number of allocations since last request: 887880
Live objects: 616405
Total allocated objects since interpreter start: 2557151
HEAP[ 0]: size= 600000
HEAP[ 1]: size= 610000
***
Well, that’s a bit better. We cut out 1 GC run, 5 Ruby heap allocations, and about 60ms of GC time. We ran over our 600,000 item heap limit (616,405 live objects), so let’s increase that to 650,000 for the next run. But that doesn’t explain why there were so many GC runs.
The next important tuning variable is RUBY_GC_MALLOC_LIMIT. By default, it’s set to 8000000, which means that every time we allocate more than 8 million bytes on the C heap (not the Ruby heap), we run GC. Well, on the first request, we allocated 43,482,680 bytes! Divide that by 8 million, and you get almost 5.5, so yup, 6 GC runs makes sense. But if we know that we’ll be allocating nearly 50 MB every request, let’s just increase that number.
RUBY_GC_STATS=1
RUBY_GC_DATA_FILE=/tmp/ruby_gc.txt
RUBY_HEAP_MIN_SLOTS=650000
RUBY_GC_MALLOC_LIMIT=50000000
Here’s the request:
***
Request completed: http://localhost:8080/people
GC collections: 3
GC time (us): 339061
Bytes allocated since last GC run: 2796396
Bytes allocated since last request: 43482627
Number of allocations since last request: 887879
Live objects: 582949
Total allocated objects since interpreter start: 2557151
HEAP[ 0]: size= 650000
***
Much better! We cut out 3 GC runs, and over 150ms of GC time. But why is it still doing 3 GC runs? Well, it turns out that was just because either WEBrick or Rails initialized a ton of objects on startup. If we go back in the logs and look at the GC runs, we can see that two of the runs happened before the request started, and each time the freelist had 0. That means the Ruby heap space really was out of room to store more objects.
Garbage collection started
objects processed: 0650000
live objects : 0403654
freelist objects : 0000000
freed objects : 0246346
…
Garbage collection started
objects processed: 0650000
live objects : 0428574
freelist objects : 0000000
freed objects : 0221426
…
***
Request starting: http://localhost:8080/people
***
Garbage collection started
objects processed: 0650000
live objects : 0489192
freelist objects : 0000000
freed objects : 0160808
GC runs themselves aren’t a bad thing: if you use a language with garbage collection, you’ll have to collect dead objects sooner or later. The larger your heap, the fewer times you’ll have to do garbage collection, but a large heap also means that doing GC will take a long time. I figure a reasonable goal is an average of one GC run per request cycle.
So let’s run a few more requests on this same instance:
***
Request completed: http://localhost:8080/people/cghawthorne
GC collections: 1
GC time (us): 97365
Bytes allocated since last GC run: 3556
Bytes allocated since last request: 5147220
Number of allocations since last request: 158437
Live objects: 505713
Total allocated objects since interpreter start: 2736920
HEAP[ 0]: size= 650000
***
…
***
Request completed: http://localhost:8080/people/cghawthorne/lists/3.html
GC collections: 0
GC time (us): 0
Bytes allocated since last GC run: 2175864
Bytes allocated since last request: 1620109
Number of allocations since last request: 43456
Live objects: 562918
Total allocated objects since interpreter start: 2794125
HEAP[ 0]: size= 650000
***
Hey, it worked out! About 1 GC run per request (we’re almost at the point of having to do another one, less than 100,000 objects to go) and no extra heap allocations.
There are several other settings you can tweak that have to do with determining the size of additionally allocated heaps. If your application’s memory size can occasionally grow, you might look into fine tuning them as well.
Update: If you’re curious, I collected some performance data from running Birdstack with and without these modifications.
Without GC tuning, the average VIRT size, as reported by ‘top’, of each Rails process was about 80mb shortly after starting. Over the course of a day, the average time to serve a page was 599ms.
After tuning, the average start size of the Rails processes was 120mb, and over a day, the average serve time was 581ms. After letting it run for another couple days, the average serve time dropped to 475ms.
So what does that mean? Well, the tunings didn’t actually improve response times on the site drastically. A lot of the variance can probably be attributed to different types of traffic. But, the changes certainly didn’t hurt things, and I’m willing to pay a few extra megabytes of RAM for a little more performance. I’m guessing the real problem here is that my server is very much IO-bound rather than CPU-bound. It lives on a virtual host with a limited amount of RAM and pretty slow disk access. So, every page to swap (thankfully not that many) and every disk hit the db has to do (quite a bit more) hurts. I’d be interested to see GC performance numbers for hosts that have better disk and memory resources. I know Twitter got a big performance boost with GC tuning.