Note: this figure excludes decreases in end-to-end times, which are still waiting to be accurately measured.
With Runner managing all of our utilities, an awesome opportunity for logging was presented: the ability to create something like a distributed
ps. To take advantage of this, I wrote a "task hook" feature which passes task state to an external script. From there, I wrote a hook script which logs all of our data to an influxdb instance. With the influxdb hook in place, we can query to find out which jobs are currently running on hosts and what the results were of any jobs that have previously finished. We can also use it to detect rebooting.
Having this state information has been a real game changer with regards to understanding the pain points of our infrastructure, and debugging issues which arise. Here are a few of the dashboards I've been able to create:
* a started buildbot task generally indicates that a job is active on a machine *
* a global
* spikes in task retries almost always correspond to a infra new problem, seeing it here first allows us to fix it and cut down on job backlogs *
* we reboot after certain kinds of tests and anytime a job fails, thus testers reboot a lot more often *
Costs/Time Saved Calculations
To calculate "time saved" I used influxdb data to figure the time between a reboot and the start of a new round of tasks. Once I had this figure, I subtracted the total number of completed buildbot tasks from the number of reboots over a given period, then multiplied by the average reboot gap period. This isn't an exact method; but gives a ballpark idea of how much time we're saving.
The data I'm using here was taken from a single 24 hour hour period (01/22/15 - 01/23/15). Spot checks have confirmed that this is representative of a typical day.
I used Mozilla's AWS billing statement from December 2014 to calculate the average cost of spot/non-spot instances per hour:
(non-spot) cost: $6802.03 time: 38614hr avg: $0.18/hr
(spot) cost: $14277.72 time: 875936hr avg: $0.02/hr
Finding opex/capex is not easy, however, I did discover the price of adding 200 additional OSX machines in 2015. Based on that, each mac's capex would be just over $2200.
To calculate the "dollars saved" I broke the time saved into AWS (spot/non-spot) and OSX then multiplied it by the appropriate dollar/hour ratio. The results being: $6621.10 per year for AWS and a bit over 20 macs worth of increased throughput, valued at just over $44,000.
You can see all of my raw data, queries, and helper scripts at this github repo: https://github.com/mrrrgn/build-no-reboots-data
Why Are We Only Saving 40%?
The short answer: not rebooting still breaks most test jobs. Turning off reboots without cleanslate resulted in nearly every test failing (thanks to ports being held onto by utilities used in previous jobs, lack of free memory, etc...). However, even with processes being reset, some types of state persist between jobs in places which are proving more difficult to debug and clean. Namely, anything which interacts with a display server.
To take advantage of the jobs which area already working, I added a task "post_flight.py," which decides whether or not to reboot a system after each runner loop. The decision is based partly on some "blacklists" for job/host names which always require a reboot, and partly on whether or not the previous test/build completed successfully. For instance, if I want all linux64 systems to reboot, I just add ".*linux64.*" to the hostname blacklist; if I want all mochi tests to coerce a reboot I add ".*mochitest.*" to the job name blacklist.
Via blacklisting, I've been able to whittle away at breaking jobs in a controlled manner. Over time, as I/we figure out how to properly clean up after more complicated jobs I should be able to remove them from the blacklist and increase our savings.
Why Not Use Containers?
First of all, we have to support OSX and Windows (10-XP), where modern containers are not really an option. Second, there is a lot of technical inertia behind our buildbot centric model (nearly a decade's worth to be precise). That said, a new container centric approach to building and testing has been created: task cluster. Another big part of my work will be porting some of our current builds to that system.
What About Windows
If you look closely at the runner dashboard screenshots you'll notice a "WinX" legend entry, but no line. It's also not included in my cost savings estimates. The reason for this, is that our windows puppet deployment is still in beta; while runner works on Windows, I can't tweak it. For now, I've handed runner deployment off to another team so that we can at least use it for logging. For the state of that issue see: bug 1055794
Of course, continuing effort will be put into removing test types from the "blacklists," to further decrease our reboot percentage. Though, I'm also exploring some easier wins which revolve around optimizing our current suite of runner tasks: using less frequent reboots to perform expensive cleanup operations in bulk (i.e. only before a reboot), decreasing end-to-end times, etc...
Concurrent to runner/no reboots I'm also working on containerizing Linux build jobs. If this work can be ported to tests it will sidestep the rebooting problem altogether -- something I will push to take advantage of asap.
Trying to reverse the entropy of a machine which runs dozens of different job types in random order is a bit frustrating; but worthwhile in the end. Every increase in throughput means more money for hiring software engineers instead of purchasing tractor trailers of Mac Minis.