Update #6: I have gotten absolutely nowhere with this with the Debian folks. It looks like we're going to have to take this directly to the kernel developers and see what they have to say. I posted to the linux-kernel mailing list hoping somebody will be able to help. We'll see how long it takes before they call me an idiot :) https://lkml.org/lkml/2014/5/14/243
Update #5: I have decided to take another look into this issue. The issue is with the 3.2+ kernels, as I can replicate it by installing a 3.2 kernel into Squeeze. This should make it easier to troubleshoot, since I can reproduce the problem and make it go away just by changing the kernel, which isolates the issue to the kernel itself. I filed a new bug report at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=742643 - Check it out and see if you're seeing the same issues as I am.
Update #4: No traction at the Debian bug report. We ended up rolling back to Squeeze. I am a big proponent of Debian, but I'm definitely a little bummed out right now. We may end up trying to build an Ubuntu 12.04 environment just to see if we run into the same issue.
Update #3: I got nowhere with the debian-users mailing list, so I submitted a bug report to Debian at http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=715269 - We'll see if anything comes of it.
Update #2: This morning I built the latest 2.6.32 kernel for Wheezy and it got rid of all the load issues, so if for some reason you have to be on Wheezy, that could be a way to go for you.
Update #1: I posted this to the debian-users mailing list, and we have a couple other people at this point who are having the same issue.
At YouVersion, we have been using Debian Squeeze to power our application servers. Personally, I'm a huge fan of Debian and have been for over 10 years now. Once Wheezy was officially released, we started preparing to upgrade. We built new AMIs for our developers, set up testing environments using salt-cloud, and built new versions of the components that help power our API, such as nginx, PHP, Python, Gearman and Twemproxy. Everything was going well until we put Wheezy in production.
Our plan was to upgrade only two boxes to Wheezy and then compare metrics to see how we were doing. The load on our application servers is normally between 0.5 and 1 under Squeeze. Under Wheezy, it was somewhere around 3, which troubled us greatly. Worse yet, the Wheezy boxes didn't hold up under our Sunday traffic levels: php-fpm just wasn't responding quickly enough, and monit had to restart php-fpm a few times before we took the boxes out of service.
During our troubleshooting, the first thing we noticed was that most of our stack (nginx, PHP, uWSGI/Python) was taking more virtual memory in Wheezy than in Squeeze. While this isn't necessarily a big deal, it could be under the right circumstances. We decided that instead of doing an in-place upgrade to Wheezy, we'd do a fresh install. Thankfully, SoftLayer makes this super easy to do through their portal, and we had a new app server loaded with a fresh OS in less than an hour. This got rid of the virtual memory issue, but our load still remained high. The worst part was that we couldn't easily attribute the high load to anything in particular. CPU usage was the same, memory usage was lower in Wheezy, and the I/O system checked out fine. The only difference we found was that our interrupts are much higher in Wheezy than in Squeeze. Specifically, "Rescheduling interrupts" and "timer" are through the roof on Wheezy compared to Squeeze.
If you're interested, you can check out what I found here
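For anyone who wants to make the same comparison on their own boxes, here is a rough Python sketch of one way to measure interrupt rates. It is not the tooling we used. It assumes the usual Linux /proc/interrupts layout (a header row of CPU columns, then one row per interrupt source), and the function names are made up for illustration. On x86 the rescheduling interrupts normally appear under the RES label, but check the description column on your own system.

```python
import time


def parse_interrupts(text):
    """Parse /proc/interrupts content into {label: (total_count, description)}."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row looks like: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        # The leading numeric fields are per-CPU counters; some rows
        # (for example ERR) carry fewer columns than there are CPUs.
        for field in fields[:ncpu]:
            if not field.isdigit():
                break
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        totals[label.strip()] = (sum(counts), description)
    return totals


def interrupt_rates(before, after, seconds):
    """Return {label: (interrupts_per_second, description)} between two snapshots."""
    rates = {}
    for label, (count, description) in after.items():
        previous = before.get(label, (0, description))[0]
        rates[label] = ((count - previous) / seconds, description)
    return rates


def sample(interval=5.0, path="/proc/interrupts"):
    """Take two snapshots `interval` seconds apart and return the rates."""
    with open(path) as f:
        before = parse_interrupts(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_interrupts(f.read())
    return interrupt_rates(before, after, interval)
```

Running `sample()` on a Squeeze box and a Wheezy box under similar traffic, then comparing the entries whose description mentions "Rescheduling" or "timer", gives a per-second figure that is easier to compare than the raw counters.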
We built a new server with a different board/CPU combination in hopes that the issue was somehow hardware related, but we saw the same numbers there.
Ultimately, we decided to reload one server with Squeeze and keep one with the fresh Wheezy install to see how it would hold up against our Sunday load. We use a custom C program that exports our HAProxy logs into JSON and ships them to Google's BigQuery service so we can easily and quickly query against them. After querying the average response time of all of our servers, we found that, despite the high load, our Wheezy box actually performed better than the rest of our app servers by about 3 ms. With the fresh reload, it was also able to stay up with no issues.
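If you don't have something like BigQuery handy, the same kind of comparison can be approximated locally. The sketch below is only an illustration, not our exporter or our query. It assumes one JSON object per log line, and the field names `server` and `response_ms` are hypothetical stand-ins for whatever your own export uses.

```python
import json
from collections import defaultdict


def average_response_ms(json_lines):
    """Return {server: mean response time in ms} from JSON log lines.

    Assumes each line is a JSON object with hypothetical fields
    "server" (backend name) and "response_ms" (response time).
    """
    totals = defaultdict(lambda: [0.0, 0])  # server -> [sum_ms, request_count]
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        entry = totals[record["server"]]
        entry[0] += record["response_ms"]
        entry[1] += 1
    return {server: total / count for server, (total, count) in totals.items()}
```

Passing an open log file straight in (`average_response_ms(open("haproxy.json"))`, with a file name of your choosing) yields one average per server, which is enough to spot a box that is a few milliseconds faster or slower than its peers.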
So, now we have a conundrum. From a performance perspective, we seem to be in good shape with Wheezy, but the high number of interrupts and the high load are causing us tremendous unease about rolling it out to production. Right now, we're not exactly sure what to do. The issue is bothering us so much that we're thinking about spending the time to build out a test stack on Ubuntu Precise to see if we see the same thing, since its kernel is closer to Wheezy's.