Linux Poetry's Mozilla Blog<br/>Mozilla related blogposts.<br/>en-us, Sat, 02 Jan 2016 00:06:47 -0000<br/><br/><b>Better Secret Injection with TaskCluster Secrets</b><br/><br/>Many secret injection services simply store a private key and encrypt data for users. The users then add those encrypted payloads to a job, and the payload is decrypted at run time using the private key associated with their account. I see a few problems with this system:<ul><li>Secret usage isn't tracked.</li><li>Secrets exist in an environment even when they aren't being used.</li><li>It encourages secrets to be stored in contexts (environment variables and flat files) which are prone to leakage.</li></ul>Today we deployed <a href="">TaskCluster Secrets</a>, a new service I've been working on for the last two weeks which stores encrypted JSON payloads on behalf of TaskCluster clients. I'm excited about this service because it's going to form the foundation for a new method of secret injection which solves all of the problems listed above.<br/><br/><center><b>How does it work?</b></center><br/>In TaskCluster Secrets, each submitted payload (encrypted at rest) is associated with an OAuth scope. The scope also defines which actions a user may make against the secret. For example, to write a secret named '<code>someNamespace:foo</code>' you'd need an OAuth scope '<code>secrets:set:someNamespace:foo</code>,' to read it you'd need '<code>secrets:get:someNamespace:foo</code>,' and so on.<br/><br/>By tying each secret to a scope, we're able to generate an interesting workflow for access from within tasks. In short, we can generate and inject temporary credentials with read-only access. 
This forces secrets to be accessed via the API and yields the following benefits:<ul><li>Secrets are only brought into a task as they are requested.</li><li>Every secret access is logged via the API server.</li><li>Secrets are only revealed when called by name, instead of just <code>env.</code></li></ul>What's more, we can store the temporary OAuth credentials in an HTTP proxy running alongside a task instead of within it, so that even the credentials are not exposed by default. This way someone could have a snapshot of your task at startup and not gain access to any private data. \o/<br/><br/><center><b>Case Study: Setting/Getting a secret for TaskCluster GitHub jobs</b></center><br/>1.) Submit a write request to the TaskCluster GitHub API: <code>PUT<b>myOrg</b>/repo/<b>myRepo</b>/key/<b>myKey</b> {githubToken: xxxx, secret: {password: xxxx}, expires: '2015-12-31'}</code><br/><br/>2.) The GitHub OAuth token is used to verify org/repo ownership. Once verified, we generate a temporary account and write the following secret on behalf of our repo owner: <code>myOrg/myRepo:myKey {secret: {password: xxxx}, expires: '2015-12-31'}</code><br/><br/>3.) CI jobs are launched alongside HTTP proxies which will attach an OAuth header to outgoing requests. The attached token will have the scope <code>secrets:get:myOrg/myRepo:*</code>, which allows any secret set by the owner of the <i>myOrg/myRepo</i> repository to be accessed. <br/><br/>4.) Within a CI task, a secret may be retrieved by simple HTTP calls such as: <code>curl $SECRETS_PROXY_URL/v1/secret/myOrg/myRepo:myKey</code><br/><br/>Easy, secure, and 100% logged.<br/><br/><b>GitHub Has Landed</b><br/><br/><a href="">TaskCluster based CI</a> has landed for Mozilla developers. One can begin using the system today by simply dropping a <code>.taskcluster.yml</code> file into the base of their repository. 
For an example configuration file and other documentation, please see: <a href=""></a><br/><br/>To get started ASAP, steal <a href="">this config file</a> and replace the <code>npm install . && npm test</code> section with whatever commands will run your project's test suite. :)<br/><br/><center><blockquote class="twitter-tweet" lang="en"><p lang="en" dir="ltr">TC-GH ci is live: now any Mozillian can hook into TaskCluster by placing a single config file in their repo <a href="">#DocsSoon</a> <a href=""></a></p>&mdash; Morgan Phillips (@mrrrgn) <a href="">August 26, 2015</a></blockquote><br/></center><br/><b><code>git push origin taskcluster</code></b><br/><br/>If you've been around Mozilla during the past two years, you've probably heard talk about <a href="">TaskCluster</a> - the fancy task execution engine that a few awesome folks built for B2G - which will be used to power our entire CI/Build infrastructure soon<sup>ish</sup>.<br/><br/>Up until now, there have been a few <a href="">ways</a> for developers to schedule custom CI jobs in TaskCluster, but nothing completely general. However, today I'd like to give a sneak peek at a project I've been working on to make the process of running jobs on our infra <i>extremely</i> simple: <a href="">TaskCluster GitHub</a>.<br/><br/><center><b>Why Should You Care?</b></center><br/>1.) <u>The service watches an entire organization at once</u>: if your project lives in github/mozilla, drop a <code><a href="">.taskclusterrc</a></code> file in the base of your repository and the jobs will <i>just start running after each pull request</i> - dead simple.<br/><br/>2.) <u>TaskCluster will give you more control over the platform/environment</u>: you can choose your own docker container by default, but thanks to the <a href="">generic worker</a> we'll also be able to run your jobs on Windows XP-10 and OSX.<br/><br/>3.) 
<u>Expect integration with other Mozilla services</u>: For a Mozilla developer, using this service over Travis or Circle should make sense, since it will continue to evolve and integrate with our infrastructure over time.<br/><br/><center><b>It's Not Ready Yet: Why Did You Post This? :(</b></center><br/>Because today the prototype is working, and I'm very excited! I also feel that there's no harm in spreading the word about what's coming. <br/><br/>When this goes into production I'll do something more significant than a blog post to let developers know they can start using the system. In the meantime, here it is handling a replay of <a href="">this</a> pull request. \o/ <sub><i>note:</i> The finished version will do nice things, like automatically leave a comment with a link to the job and its status.</sub><br/><br/><a href=""><img src="" width="650px"></img></a><br/><br/><b>Serve Developers</b><br/><br/>The neatest thing about release engineering is the fact that our pipeline forms the primary bridge between users and developers. On one end, we maintain the CI infrastructure that engineers rely on for thorough testing of their code, and, on the other end, we build stable releases and expose them for the public to download. Being in this position means that we have the opportunity to impact the experiences of both contributors and users by improving our systems (it also makes working on them a lot of fun).<br/><br/>Lately, I've become very interested in improving the developer experience by bringing our CI infrastructure closer to contributors. In short, I would like developers to have access to the same environments that we use to test/build their code. 
This will make it:<ul><li>easier to run tests locally</li><li>easier to set up a dev environment</li><li>easier to reproduce bugs (especially environment dependent bugs)</li></ul><img src="" width="600"></img><br/><sub>[The release pipeline from 50,000ft]</sub><br/><br/><center><b>How?</b></center><br/>The first part of my plan revolves around integrating release engineering's CI system with a tool that developers are already using, <a href="">mach</a>, starting with a utility called <a href="">mozbootstrap</a> -- a system that detects its host operating system and invokes a package manager for installing all of the libraries needed to build Firefox desktop or Firefox for Android.<br/><br/>The first step here was to make it possible to automate the bootstrapping process (see bug: <a href="">1151834</a> "allow users to bootstrap without any interactive prompts"), and then integrate it into the standing up of our own systems. Luckily, at the moment I'm also porting some of our Linux builds from <a href="">buildbot</a> to <a href="">TaskCluster</a> (see bug: <a href="">1135206</a>), which necessitates scrapping our old chroot based build environments in favor of docker containers. This fresh start has given me the opportunity to begin this transition <a href="">painlessly</a>.<br/><br/>This simple change alone strengthens the interface between RelEng and developers, because now we'll be using the same packages (on a given platform). It also means that our team will be actively maintaining a tool used by contributors. I think it's a huge step in the right direction!<br/><br/><center><b>What platforms/distributions are you supporting?</b></center><br/>Right now, I'm only focusing on Linux, though in the future I expect to support OSX as well. 
The bootstrap utility supports several distributions (Debian/Ubuntu/CentOS/Arch), though I've been trying to base all of release engineering's new docker containers on Ubuntu 14.04 -- as such, I'd consider this our <i>canonical</i> distribution. Our old builders were based on CentOS, so it would have been slightly easier to go with that platform, but I'd rather support the platform that the majority of our contributors are using.<br/><br/><center><b>What about developers who don't use Ubuntu 14.04, and/or have a bizarre environment?</b><blockquote class="twitter-tweet" lang="en"><p lang="en" dir="ltr">My new dream in life: integrate <a href="">@docker</a> into Mozilla&#39;s build system / developer toolchain <a href=""></a> -- no more fighting w/deps</p>&mdash; Morgan Phillips (@mrrrgn) <a href="">March 2, 2015</a></blockquote> </center><br/>One fabulous side effect of using TaskCluster is that we're forced to create docker containers for running our jobs; in fact, they even <a href="">live in mozilla-central</a>. That being the case, I've started a conversation around integrating our docker containers into mozbootstrap, giving it the option to pull down a releng docker container in lieu of bootstrapping a host system.<br/><br/>On my own machine, I've been mounting my src directory inside of a builder and running <code>./mach build</code>, then <code>./mach run</code> within it. All of the source, object files, and executables live on my host machine, but the actual building takes place in a black box. 
This is a very tidy development workflow that's easy to replicate and automate with a few bash functions [which releng should also write/support].<br/><br/><img src="" width="600px"></img><br/><sub>[A simulation of how I'd like to see developers interacting with our docker containers.]</sub><br/><br/>Lastly, as the final nail in the coffin of hard to reproduce CI bugs, I'd like to make it possible for developers to run our TaskCluster based test/build jobs on their local machines, either from mach or from a new utility that lives in <a href="">/testing</a>.<br/><br/>If you'd like to follow my progress toward creating this brave new world -- or heckle me in bugzilla comments -- check out these tickets:<ul><li><a href="">Bug 1135206 - [tracker] Drive FF Desktop builds from TaskCluster</a></li><br/><li><a href="">Bug 1133877 - Combine MozBoot with docker build-environments (Docker images based on Mozilla's in house builders)</a></li></ul>And this repo, where I keep my scratch work:<ul><li><a href=""></a></li></ul><br/><br/><b>Whoop, Whoop: Pull Up!</b><br/><br/>Since December 1<sup>st</sup> 1975, by FAA mandate, no plane has been allowed to fly without a "Ground Proximity Warning System" (GPWS) or one of its successors.<sup><a href="">[1]</a></sup> For good reason too, as it's been figured that 75% of the fatalities just one year prior (1974) could have been prevented using the system.<sup><a href="">[2]</a></sup><br/><br/>In a slew of case studies, reviewers reckoned that a GPWS may have prevented crashes by giving pilots additional time to act before they smashed into the ground. Often, the GPWS's signature "Whoop, Whoop: Pull Up!" 
would have sounded a full fifteen seconds before any other alarms triggered.<sup><a href="">[3]</a></sup><br/><br/>Instruments like this are indispensable to aviation because pilots operate in an environment outside of any realm where human intuition is useful. Lacking augmentation, our bodies and minds are simply not suited to the task of flying airliners. <br/><br/>For the same reason, thick layers of instrumentation and early warning systems are necessary for managing technical infrastructure. Like pilots, without proper tooling, system administrators often plow their vessels into the earth....<br/><br/><center><b>The St. Patrick's Day Massacre</b></center><br/>Case in point, on Saint Patrick's Day we suffered two outages which could have likely been avoided via some additional alerts and a slightly modified deployment process. <br/><br/>The first outage was caused by the accidental removal of a variable from a config file which one of our utilities depends on. Our utilities are all managed by a dependency system called <a href="">runner</a>, and when any task fails the machine is prevented from doing work until it succeeds. This all-or-nothing behavior is correct, but should not lead to closed trees....<br/><br/>On our runner dashboards, the whole event looked like this (the smooth decline on the right is a fix being rolled out with ansible):<br/><img src="" width="600px"></img><br/><br/>The second, and most severe, outage was caused by an insufficient wait time between retries upon failing to pull from our mercurial repositories. <br/><br/>There was a temporary disruption in service, and a large number of slaves failed to clone a repository. 
When this herd of machines began retrying the task it became the equivalent of a DDoS attack.<br/><br/>From the repository's point of view, the explosion looked like this:<br/><img src="" width="600px"></img><br/><br/>Then, from runner's point of view, the retrying task:<br/><img src="" width="600px"></img><br/><br/>In both of these cases, despite having the data (via runner logging), we missed the opportunity to catch the problem before it caused system downtime. Furthermore, especially in the first case, we could have avoided the issue even earlier by testing our updates and rolling them out gradually.<br/><br/><center><b>Avoiding Future Massacres</b></center><br/>After these fires went out, I started working on a RelEng version of the Ground Proximity Warning System, to keep us from crashing in the future. Here's the plan:<br/><br/>1.) <a href="">Bug 1146974 - Add automated alerting for abnormally high retries (in runner).</a><br/><br/>In both of the above cases, we realized that things had gone amiss based on job backlog alerts. The problem is, once we have a large enough backlog to trigger those alarms, we're already hosed. <br/><br/>The good news is, the backlog is preceded by a spike in runner retries. Setting up better alerting here should buy us as much as an extra hour to respond to trouble.<br/><br/>We're already logging all task results to influxdb, but, alerting via that data requires a custom nagios script. Instead of stringing that together, I opted to write runner output to syslog where it's being aggregated by <a href="">papertrail</a>. <br/><br/>Using papertrail, I can grep for runner retries and build alarms from the data. Below is a screenshot of our runner data in the papertrail dashboard:<br/><br/><img src="" width="600px"></img><br/><br/>2.) <a href="">Add automated testing, and tiered roll-outs to golden ami generation</a><br/><br/>Finally, when we update our slave images the new version is not rolled out in a precise fashion. 
Instead, as old images die (3 hours after the new image releases) new ones are launched on the latest version. Because of this, every deploy is an all-or-nothing affair. <br/><br/>By the time we notice a problem, almost all of our hosts are using the bad image and rolling back becomes a huge pain. We also do rollbacks by hand. Nein, nein, nein.<br/><br/>My plan here is to launch new instances with a weighted chance of picking up the latest AMI. As we become more confident that things aren't breaking -- by monitoring the runner logs in papertrail/influxdb -- we can increase the percentage.<br/><br/>The new process will work like this:<ul>00:00 - new AMI generated<br/><br/>00:01 - new slaves launch with a 12.5% chance of taking the latest version.<br/><br/>00:45 - new slaves launch with a 25% chance of taking the latest version.<br/><br/>01:30 - new slaves launch with a 50% chance of taking the latest version.<br/><br/>02:15 - new slaves launch with a 100% chance of taking the latest version.</ul>Lastly, if we want to roll back, we can just lower the percentage down to zero while we figure things out. This also means that we can create sanity checks which roll back bad AMIs without any human intervention whatsoever. <br/><br/>The intention being, any failure within the first 90 minutes will trigger a rollback and keep the doors open....<br/><br/><b>Gödel, Docker, Bach: Containers Building Containers</b><br/><br/>As Docker continues to mature, many organizations are striving to run as much of their infrastructure as possible within containers. Of course, this investment results in a lot of docker-centric tooling for deployment, development, etc... <br/><br/>Given that, I think it makes a lot of sense for docker containers themselves to be built within other docker containers. Otherwise, you'll introduce a needless exception into your automation practices. 
Boo to that!<br/><br/>There are a few ways to run docker from within a container, but here's a neat way that leaves you with access to your host's local images: just mount the docker socket and binary from your host system.<br/><br/><i>** note: in cases where arbitrary users can push code to your containers, this would be a dangerous thing to do **</i><ul><code><br/>docker run -i -t \<br/>-v /var/run/docker.sock:/run/docker.sock \<br/>-v $(which docker):/usr/bin/docker \<br/>ubuntu:latest /bin/bash<br/><br/># inside the container: docker-cli requires this library<br/># (you could mount it from the host as well if you like)<br/>apt-get install libapparmor-dev<br/></code></ul><br/>Et voila!<br/><br/><img width=600px src=""></img><br/><br/><b>RelEng Containers: Build Firefox Consistently (For A Better Tomorrow)</b><br/><br/>From time to time, Firefox developers encounter errors which only appear on our build machines. Meaning -- after they've likely already failed numerous times to coax the failure from their own environment -- they must resort to requesting RelEng to pluck a system from our infrastructure so they can use it for debugging: we call this a <a href="">slave loan</a>, and they happen frequently.<br/><br/><i>Case in point: <a href="">bug #689291</a></i><br/><br/>Firefox is a huge open source project: slave loans can never scale enough to serve our community. So, this weekend I took a whack at solving this problem with <a href="">Docker</a>. 
So far, five [of an eventual fourteen] containers have been published, which replicate the following aspects of our in house build environments:<ul><li>OS (CentOS 6)</li><li>libraries (yum is pointed at our in house repo)</li><li>compilers/interpreters/utilities (installed to /tools)</li><li>ENV variables</li></ul>As usual, you can find my scratch work on GitHub: <a href="">mozilla/build-environments</a><br/><br/><img width="600px" src=""></img><br/><img width="600px" src=""></img><br/><br/><b>What Are These Environments Based On?</b><br/><br/>For a long time, builds have taken place inside of chroots built with <a href="">Mock</a>. We have <a href="">three bare bones mock configs</a> which I used to bake some base platform images: <ul><a href="">mozilla-centos6-x86_64-android</a><br/><a href="">mozilla-centos6-x86_64</a><br/><a href="">mozilla-centos6-i386</a></ul>On top of our Mock configs, we further specialize build chroots via build scripts powered by <a href="">Mozharness</a>. The specifications of each environment are laid out in these <a href="">mozharness configs</a>. To make use of these, <a href="">I wrote a simple script</a> which converts a mozharness config into a Dockerfile. <br/><br/>The environments I've published so far:<ul><a href="">releng_base_linux_64</a><br/><a href="">releng_base_linux_32</a></ul>The next step, before I publish more containers, will be to write some documentation for developers so they can begin using them for builds with minimal hassle. Stay tuned!<br/><br/><b>-r never <sub>part deux</sub></b><br/><br/>In my <a href="/blog/15/">last post</a>, I wrote about how <a href="">runner</a> and <a href="">cleanslate</a> were being leveraged by Mozilla RelEng to try to eliminate the need for rebooting after each test/build job -- thus reclaiming a good deal of wasted time. Since then, I've had the opportunity to outfit all of our hosts with better logging, and collect live data which highlights the progress that's been made. 
It's been bumpy, but the data suggests that we have reduced reboots (across all tiers) by around 40% -- freeing up over 72,000 minutes of compute time per day, with an estimated savings of $51,000 per year.<br/><br/><i>Note: this figure excludes decreases in end-to-end times, which are still waiting to be accurately measured.</i><br/><br/><b>Collecting Data</b><br/><br/>With Runner managing all of our utilities, an awesome opportunity for logging was presented: the ability to create something like a distributed <code>ps</code>. To take advantage of this, <a href="">I wrote a "task hook"</a> feature which passes task state to an external script. From there, <a href="">I wrote a hook script</a> which logs all of our data to an influxdb instance. With the influxdb hook in place, we can query to find out which jobs are currently running on hosts and what the results were of any jobs that have previously finished. We can also use it to detect rebooting.<br/><br/>Having this state information has been a real game changer with regards to understanding the pain points of our infrastructure, and debugging issues which arise. Here are a few of the dashboards I've been able to create:<br/><br/><sub>* a started buildbot task generally indicates that a job is active on a machine *</sub><br/><a href=""><img src="" width="600px"></img></a><br/><br/><sub>* a global <code>ps</code>! 
*</sub><br/><a href=""><img src="" width=600px></img></a><br/><br/><sub>* spikes in task retries almost always correspond to a new infra problem, seeing it here first allows us to fix it and cut down on job backlogs *</sub><br/><a href=""><img src="" width="600px"></img></a><br/><br/><sub>* we reboot after certain kinds of tests and anytime a job fails, thus testers reboot a lot more often *</sub><br/><a href=""><img src="" width="600px"></img></a><br/><br/><b>Costs/Time Saved Calculations</b><br/><br/>To calculate "time saved" I used influxdb data to figure the time between a reboot and the start of a new round of tasks. Once I had this figure, I subtracted the number of reboots from the total number of completed buildbot tasks over a given period, then multiplied by the average reboot gap period. This isn't an exact method, but it gives a ballpark idea of how much time we're saving. <br/><br/>The data I'm using here was taken from a single 24 hour period (01/22/15 - 01/23/15). Spot checks have confirmed that this is representative of a typical day.<br/><br/><img src="" width="600px"></img><br/><br/>I used Mozilla's AWS billing statement from December 2014 to calculate the average cost of spot/non-spot instances per hour:<br/><br/><code>(non-spot) cost: $6802.03 time: 38614hr avg: $0.18/hr<br/><br/>(spot) cost: $14277.72 time: 875936hr avg: $0.02/hr</code><br/><br/>Finding opex/capex is not easy; however, I did discover the price of adding 200 additional OSX machines in 2015. Based on that, each mac's capex would be just over $2200.<br/><br/>To calculate the "dollars saved" I broke the time saved into AWS (spot/non-spot) and OSX then multiplied it by the appropriate dollar/hour ratio. 
The results being: $6621.10 per year for AWS and a bit over 20 macs worth of increased throughput, valued at just over $44,000.<br/><br/>You can see all of my raw data, queries, and helper scripts at this github repo: <a href=""></a><br/><br/><b>Why Are We Only Saving 40%?</b><br/><br/>The short answer: not rebooting still breaks most test jobs. Turning off reboots without cleanslate resulted in nearly every test failing (thanks to ports being held onto by utilities used in previous jobs, lack of free memory, etc...). However, even with processes being reset, some types of state persist between jobs in places which are proving more difficult to debug and clean. Namely, <a href="">anything which interacts with a display server.</a><br/><br/>To take advantage of the jobs which are already working, I added a <a href="">task</a> which decides whether or not to reboot a system after each runner loop. The decision is based partly on some "blacklists" for job/host names which always require a reboot, and partly on whether or not the previous test/build completed successfully. For instance, if I want all linux64 systems to reboot, I just add ".*linux64.*" to the hostname blacklist; if I want all mochi tests to coerce a reboot I add ".*mochitest.*" to the job name blacklist.<br/><br/>Via blacklisting, I've been able to whittle away at breaking jobs in a controlled manner. Over time, as I/we figure out how to properly clean up after more complicated jobs I should be able to remove them from the blacklist and increase our savings.<br/><br/><b>Why Not Use Containers?</b><br/><br/>First of all, we have to support OSX and Windows (10-XP), where modern containers are not really an option. Second, there is a lot of technical inertia behind our buildbot centric model (nearly a decade's worth to be precise). That said, a new container centric approach to building and testing has been created: <a href="">TaskCluster</a>. 
Another big part of my work will be porting some of our current builds to that system. <br/><br/><b>What About Windows?</b><br/><br/>If you look closely at the runner dashboard screenshots you'll notice a "WinX" legend entry, but no line. It's also not included in my cost savings estimates. The reason for this is that our Windows puppet deployment is still in beta; while runner works on Windows, I can't tweak it. For now, I've handed runner deployment off to another team so that we can at least use it for logging. For the state of that issue see: <a href="">bug 1055794</a><br/><br/><b>Future Plans</b><br/><br/>Of course, continuing effort will be put into removing test types from the "blacklists," to further decrease our reboot percentage. I'm also exploring some easier wins which revolve around optimizing our current suite of runner tasks: using less frequent reboots to perform expensive cleanup operations in bulk (i.e. only before a reboot), decreasing end-to-end times, etc... <br/><br/>Concurrently with the runner/no-reboots work, I'm also containerizing Linux build jobs. If this work can be ported to tests it will sidestep the rebooting problem altogether -- something I will push to take advantage of asap.<br/><br/>Trying to reverse the entropy of a machine which runs dozens of different job types in random order is a bit frustrating, but worthwhile in the end. Every increase in throughput means more money for hiring software engineers instead of purchasing tractor trailers of Mac Minis.<br/><br/><b>-r never</b><br/><br/>Over the past month I've worked on achieving the effects of a reboot without actually doing one. Sort of a "virtual" reboot. This isn't a usual optimization, but in Mozilla's case it's likely to create a huge impact on performance.<br/><br/>Mozilla build/test infrastructure is complex. The jobs can be expensive and messy. So messy that, for a while now, machines have been rebooted after completing tasks to ensure that environments remain fresh. 
<br/><br/>This strategy works marvelously at preventing unnecessary failures, but wastes a lot of resources. In particular, with reboots taking something like two minutes to complete, and at around 100k jobs per day, we spend a whopping 200,000 minutes of machine time per day on rebooting. That's nearly five months - <i>yikes</i>!<sup>1</sup><br/><br/>Yesterday I began rolling out these "virtual" reboots for all of our Linux hosts, and it seems to be working well [edit: after a few rollbacks]. By next month I should also have it turned on for OSX and Windows machines. <br/><br/><img width="400px" src=""></img><img width="200px" src=""></img><br/><br/><b>What does a "virtual" reboot look like?</b><br/><br/>For starters [pun intended], each job requires a good amount of setup and teardown, so a sort of init system is necessary. To achieve this a utility called <a href="">runner</a> has been created. <i>Runner is a project that manages starting tasks in a defined order. If tasks fail, the chain can be retried, or halted.</i> Many tasks that once lived in /etc/init.d/ are now managed by runner, including buildbot itself.<br/><br/><img src=""></img><br/><br/>Among runner's tasks are various scripts for cleaning up temporary files, starting/restarting services, and also a utility called <a href="">cleanslate</a>. Cleanslate resets a user's running processes to a previously recorded state.<br/><br/>At boot, cleanslate takes a snapshot of all running processes; then, before each job, it kills any processes (by name) which weren't running when the system was fresh. 
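<br/><br/>The snapshot-and-kill idea can be sketched roughly as follows. This is not cleanslate's actual code: the function names, the pid-to-name dictionaries, and the example process names are all made up for illustration, and a real tool would read the process table from the OS and send signals itself.

```python
def take_snapshot(processes):
    """Record the set of process names running on a fresh system.

    `processes` maps pid -> process name; it is a hypothetical stand-in
    for the real OS process table.
    """
    return frozenset(processes.values())


def pids_to_kill(snapshot, processes):
    """Return the pids whose process name was absent from the boot snapshot."""
    return sorted(pid for pid, name in processes.items() if name not in snapshot)


# Illustrative data only: the process table at boot, and again after a messy job.
at_boot = {1: "init", 210: "sshd", 305: "runner"}
after_job = {1: "init", 210: "sshd", 305: "runner", 4120: "Xvfb", 4133: "python"}

snapshot = take_snapshot(at_boot)
stale = pids_to_kill(snapshot, after_job)
print(stale)  # [4120, 4133]
```

The sketch only works out which pids are stale by comparing names against the snapshot; cleanslate itself goes on to kill them.<br/><br/>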
This particular utility is key to maintaining stability and may be extended in the future to enforce other kinds of system state as well.<br/><br/><img src=""></img><br/><br/>The end result is this:<br/><br/><i>old work flow</i><br/><br/><code>Boot + init -> Take Job -> Reboot (2-5 min)</code><br/><br/><i>new work flow</i><br/><br/><code>Boot + Runner -> Take Job -> Shutdown Buildslave <br/>(runner loops and restarts slave)</code><br/><br/><sub><br/>[1] What's more, this estimate does not take into account the fact that jobs run faster on a machine that's already "warmed up."</sub><br/><br/><b>Note on Deterministic Builds</b><br/><br/>Since I joined Mozilla&#39;s Release Engineering team I&#39;ve had the opportunity to put my face into a firehose of interesting new knowledge and challenges. Maintaining a release pipeline for binary installers and updates used by a substantial portion of the Earth&#39;s population is a whole other kind of beast from ops roles where I&#39;ve focused on serving some kind of SaaS or internal analytics infrastructure. It&#39;s really exciting!<br/><br/>One of the most interesting problems I&#39;ve seen getting attention lately is <i>deterministic builds</i>, that is, builds that produce the same sequence of bytes from source on a given platform at any time.<br/><br/><b>What good are deterministic builds?</b><br/><br/>For starters, they aid in detecting &quot;<a href="">Trusting Trust</a>&quot; attacks. That&#39;s where a compromised compiler produces malicious binaries from perfectly harmless source code via replacing certain patterns during compilation. It sort of defeats the whole security advantage of open source when you download binaries, right?<br/><br/>Luckily for us users, a fellow named David A. Wheeler rigorously proved a method for circumventing this class of attacks altogether via a technique he coined "<a href="">Diverse Double-Compiling</a>" (DDC). 
The gist of it is, you compile a project&#39;s source code with a <i>trusted</i> tool chain then compare a hash of the result with some potentially malicious binary. If the hashes match you&#39;re safe.<br/><br/>DDC also detects the less clever scenario where an adversary patches otherwise open source code during the build process and serves up malwareified packages. In either case, it&#39;s easy to see that this works if <i>and only if</i> builds are deterministic.<br/><br/>Aside from security, they can also help projects that support many platforms take advantage of <a href="">cross building</a> with less stress. That is, one could compile arm packages on an x86_64 host then compare the results to a native build and make sure everything matches up. This can be a huge win for folks who want to cut back on infrastructure overhead.<br/><br/><b>How can I make a project more deterministic?</b><br/><br/>One bit of good news is, most compilers are already pretty deterministic (on a given platform). Take hello.c for example:<br/><br/><code>#include &lt;stdio.h&gt;<br/><br/>int main() {<br/>&nbsp;&nbsp;&nbsp;&nbsp;printf(&quot;Hello World!&quot;);<br/>}</code><br/><br/>Compile that a million times and take the md5sum. Chances are you'll end up with a million identical md5sums. Scale that up to a million lines of code, and there&#39;s no reason why this won&#39;t hold true.<br/><br/>However, take a look at this doozy:<br/><br/><code>#include &lt;stdio.h&gt;<br/><br/>int main() {<br/>&nbsp;&nbsp;&nbsp;&nbsp;printf(&quot;Hello from %s! @ %s&quot;, __FILE__, __TIME__);<br/>}</code><br/><br/>Having timestamps and other platform specific metadata baked into source code is a huge no-no for creating deterministic builds. 
Compile that a million times, and you&#39;ll likely get a million different md5sums.<br/><br/>In fact, in an attempt to make Linux builds more deterministic, all __TIME__ macros <a href=>were removed</a> and the makefile specifies a compiler option (-Werror=date-time) that turns any use of them into an error.<br/><br/>Unfortunately, removing all traces of such metadata in a mature code base could be all but impossible. However, a fantastic tool called <a href=>gitian</a> will allow you to compile projects within a virtual environment where timestamps and other metadata are controlled.<br/><br/><i>Definitely check gitian out and consider using it as a starting point.</i><br/><br/>Another trouble spot to consider is static linking. Here, unless you&#39;re careful, determinism sits at the mercy of third parties. Be sure that your build system has access to identical libraries from anywhere it may be used. Containers and pre-baked VMs seem like a good choice for fixing this issue, but remember that you could also be passing around a tainted compiler!<br/><br/>Scripts that automate parts of the build process are also a potent breeding ground for non-deterministic behaviors. Take this Python snippet for example:<br/><br/><code>import os<br/><br/>with open('manifest', 'w') as manifest:<br/>&nbsp;&nbsp;&nbsp;&nbsp;for dirpath, dirnames, filenames in os.walk(&quot;.&quot;):<br/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for filename in filenames:<br/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;manifest.write(&quot;{}\n&quot;.format(filename))</code><br/><br/>The problem here is that os.walk will not always yield filenames in the same order. :(<br/><br/>One also has to keep in mind that certain data structures become very dangerous in such scripts. 
Consider this pseudo-Python that auto-generates some sort of source code in a compiled language:<br/><code><br/>weird_mapping = dict(file_a=99, file_b=1)<br/>things_in_a_set = set([thing_a, thing_b, thing_c])<br/>for k, v in weird_mapping.items():<br/>&nbsp;&nbsp;&nbsp;&nbsp;... generate some code ...<br/>for thing in things_in_a_set:<br/>&nbsp;&nbsp;&nbsp;&nbsp;... generate some code ...</code><br/><br/>A pattern like this would dash any hope that your project had of being deterministic because it makes use of unordered data structures.<br/><br/><i>Beware of unordered data structures in build scripts and/or <b>sort</b> all the things before writing to files.</i><br/><br/>Enforcing determinism from the beginning of a project's life cycle is the ideal situation, so I would highly recommend incorporating it into CI flows. When a developer submits a patch, it should include a hash of their latest build. If the CI system builds and the hashes don't match, reject that non-deterministic code! :)<br/><br/><b>EOF</b><br/><br/>Of course, this hardly scratches the surface of why deterministic builds are important, but I hope this is enough for a person to get started on. It&#39;s a very interesting topic with lots of fun challenges that need solving. :) If you&#39;d like to do some further reading, I&#39;ve listed a few useful sources below.<br/><br/><a href=></a><br/><br/><a href=></a><br/><br/><a href=></a>
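<br/><br/>To make the "sort all the things" advice concrete, here is a minimal sketch of the manifest script from this post, reworked so its output does not depend on the order in which os.walk happens to return entries. The helper names (<code>write_manifest</code>, <code>sha256_of</code>) are made up for illustration, and the sketch assumes the manifest is written somewhere outside the directory being walked so that it never lists itself.

```python
import hashlib
import os


def write_manifest(root, manifest_path):
    """Write every file path under root, one per line, in sorted order."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Sorting dirnames in place also makes the traversal order stable.
        dirnames.sort()
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            # Store paths relative to root with "/" separators so the
            # manifest does not depend on where the tree was checked out.
            paths.append(os.path.relpath(full, root).replace(os.sep, "/"))
    # Binary mode avoids platform-specific newline translation.
    with open(manifest_path, "wb") as manifest:
        for path in sorted(paths):
            manifest.write((path + "\n").encode("utf-8"))


def sha256_of(path):
    """Return the hex SHA-256 digest of a file's contents."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()
```

Running <code>write_manifest</code> twice on the same tree and comparing the two <code>sha256_of</code> results is a small-scale version of the CI check described above: if the digests ever differ, something non-deterministic has crept in.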