Go Builder Plan
This doc is http://golang.org/s/builderplan
Discussion thread: https://groups.google.com/forum/#!topic/golang-dev/sdFD0-2Ed8k

Background & Problem

This document is about a refresh of the Go buildbot infrastructure. There are two main components. The dashboard at http://build.golang.org/ owns the data about what has been built and what needs to be built, and records logs of failures. It currently runs on App Engine and probably could continue to do so. The second major bit is all the builders themselves. These are currently scattered all over the place on a mix of physical and virtual machines, often on desks in houses. They sit and poll the dashboard, awaiting instructions on something to build.
Here’s an incomplete list of the problems with the current Go buildbot infrastructure:
- The builders are currently configured manually and inconsistently, with no notes on how to reproduce them.
- If a random build fails, the developer responsible has no easy way to get into that environment (or an identical one) to debug interactively.
- The init system used on each builder is undefined: screen session? nohup? runsit? upstart? systemd?
- Stray spinning processes are often found from failed builds (especially for nacl-*). This is related to the lack of a good init system and isolation.
- Stray system updater processes are often found (especially on Ubuntu hosts), using notable amounts of CPU while doing nothing useful.
- Too many builds are flaky.
- Builds are slow (4-15 minutes).
- The Dashboard running on App Engine can’t check for new commits to the Go or gccgo repos itself, due to the sandbox.
- We have no trybot mechanism to speculatively build & test mailed changes, or in-development changes requested by developers.
This proposal seeks to address all of these issues.
In summary:
- All builder environments are described by a Dockerfile and/or Vagrantfile (or whatever). No machines are configured by hand. A tool must generate the whole environment, and the configuration must be in the repo, under go.tools/dashboard/env/*
- All builds run in a totally hermetic environment that is created for each build and destroyed afterwards. For the Linux and NaCl builds, we now use Docker containers under CoreOS running on Google Compute Engine (GCE). This is the first part that has been implemented.
- The “dashboard/builder” code will be slowly gutted and might be fully killed off.
- The new “coordinator” server finds work to do (from the dashboard or wherever) and spawns Docker containers or VMs to complete builds.
- The Coordinator will also take over reporting to the dashboard, so things like temporary failures and flaky builds can be diagnosed. We can try builds multiple times and learn what is flaky.
- The Coordinator will also find work from “trybot” requests and mailed CLs.
- The Coordinator will take over polling the Go & gccgo repos to find new work, and tell the App Engine dashboard.
- We’ll have the official Coordinator, which might only accept trybot requests from trusted developers (for security reasons), but anybody can run their own on GCE.
- We’ll shard builds. Instead of running “all.bash” and having it take 3-4 minutes (for normal Linux builds) or 15 minutes (for linux-amd64-race), we can just run “make.bash”, clone the filesystem, and then spawn a few dozen Docker containers or VMs running the tests in parallel. The Coordinator can then merge their output.
- The Coordinator currently only runs on a single VM (we use a 16-core one, but any size works). We’ll update it to work with CoreOS’s etcd+fleet and use a bunch of machines. The first one (a smaller instance) can run always, monitor the build queue depth, and dynamically start & stop big 16-core ones as needed to change capacity. CoreOS boots quickly. The first small instance can be the status webserver / RPC server too.
- We then provide a CLI tool to let developers run trybot builds (with just make.bash on all platforms, or also tests, or specific tests), and even provide a mode where the environment is ready and they can then ssh in (the tool can run ssh for them, or spit out the command+key to use).
- gccgo is moving to this too. cmang@ to write Dockerfiles.
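The sharding idea above (split the tests, run the pieces in parallel, merge the output) can be sketched roughly as follows. This is only an illustration, not the Coordinator's code: the names shard, runShards, merge and fakeRun are made up for this example, and fakeRun stands in for whatever actually starts a Docker container or VM and runs the tests in it.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// shard splits items into at most n groups, assigning them round-robin.
func shard(items []string, n int) [][]string {
	if n <= 0 || len(items) == 0 {
		return nil
	}
	if n > len(items) {
		n = len(items)
	}
	out := make([][]string, n)
	for i, it := range items {
		out[i%n] = append(out[i%n], it)
	}
	return out
}

// result holds the output of one shard (hypothetical type for this sketch).
type result struct {
	index  int
	output string
	ok     bool
}

// runShards runs every shard concurrently using run and returns the
// results indexed by shard number, so merging is deterministic.
func runShards(shards [][]string, run func(tests []string) (string, bool)) []result {
	results := make([]result, len(shards))
	var wg sync.WaitGroup
	for i, s := range shards {
		wg.Add(1)
		go func(i int, s []string) {
			defer wg.Done()
			out, ok := run(s)
			// Each goroutine writes only its own slot, so no lock is needed.
			results[i] = result{index: i, output: out, ok: ok}
		}(i, s)
	}
	wg.Wait()
	return results
}

// merge concatenates shard output in shard order and reports whether
// every shard succeeded.
func merge(results []result) (string, bool) {
	var b strings.Builder
	allOK := true
	for _, r := range results {
		fmt.Fprintf(&b, "--- shard %d ---\n%s", r.index, r.output)
		if !r.ok {
			allOK = false
		}
	}
	return b.String(), allOK
}

// fakeRun is a stand-in for running tests in a container or VM.
// It pretends every test passes.
func fakeRun(tests []string) (string, bool) {
	var b strings.Builder
	for _, t := range tests {
		fmt.Fprintf(&b, "ok %s\n", t)
	}
	return b.String(), true
}

func main() {
	tests := []string{"fmt", "net/http", "os", "runtime", "strings"}
	out, ok := merge(runShards(shard(tests, 2), fakeRun))
	fmt.Print(out)
	fmt.Println("all shards ok:", ok)
}
```

Round-robin assignment is the simplest possible split; a real scheduler would probably want to balance shards by expected test duration instead.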
Non-Linux, Non-x86 builds

For non-x86 builds, we can use qemu full-system simulation, even simulating different types of ARM. Experiments show that this is slow, but not unreasonably slow, especially with the aforementioned sharding.

For Windows, we throw out some of our principles and accept that the VM is manually configured once, and then snapshotted at a point where it starts up, looks at its environment (the GCE instance metadata) and does a single build. The Coordinator then boots a new Windows VM to do a single build. This costs a minimum of 10 minutes of VM time, or about $0.05 USD per build. That’s fine and worth the cost for being hermetic. Details are in https://code.google.com/p/go/issues/detail?id=8640

For FreeBSD, see https://code.google.com/p/go/issues/detail?id=8639

For Darwin, we won’t do anything for now. Ideas welcome. Using Darwin VM hosting seems more painful than it’s worth, I think.

Solaris is community-supported. But running some open source Solaris variant on this is an option too.
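As a rough illustration of the Windows flow above, where a freshly booted VM looks at its GCE instance metadata to learn what to build, here is a minimal sketch of reading one instance attribute. The metadata path and the Metadata-Flavor header are the standard GCE metadata server conventions; the attribute name builder-rev, the helper names, and the fake server are assumptions made up for this example so that it runs without a real VM. On an actual instance the base URL would be http://metadata.google.internal plus attrPath.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// attrPath is the GCE metadata server path for custom instance attributes.
const attrPath = "/computeMetadata/v1/instance/attributes/"

// fetchAttr reads a single instance attribute from the metadata server
// rooted at base (base must end with attrPath).
func fetchAttr(client *http.Client, base, key string) (string, error) {
	req, err := http.NewRequest("GET", base+key, nil)
	if err != nil {
		return "", err
	}
	// The metadata server rejects requests that lack this header.
	req.Header.Set("Metadata-Flavor", "Google")
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("metadata %q: %s", key, resp.Status)
	}
	b, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(b)), nil
}

// newFakeMetadata starts a local stand-in for the metadata server,
// used here only so the sketch can run anywhere.
func newFakeMetadata(attrs map[string]string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Metadata-Flavor") != "Google" {
			http.Error(w, "missing Metadata-Flavor header", http.StatusForbidden)
			return
		}
		val, ok := attrs[strings.TrimPrefix(r.URL.Path, attrPath)]
		if !ok {
			http.NotFound(w, r)
			return
		}
		fmt.Fprint(w, val)
	}))
}

func main() {
	// "builder-rev" is a hypothetical attribute the Coordinator might set.
	srv := newFakeMetadata(map[string]string{"builder-rev": "abc123"})
	defer srv.Close()

	rev, err := fetchAttr(srv.Client(), srv.URL+attrPath, "builder-rev")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("would build revision", rev)
}
```

Passing the base URL in as a parameter keeps the lookup testable off GCE; the snapshotted VM would run the same function against the real metadata server at startup and then do its single build.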
This doc isn’t very detailed, but that’s the rough outline. Code is at go.tools/dashboard/…
The Dockerfile environments are at, e.g.: https://code.google.com/p/go/source/browse?repo=tools#hg%2Fdashboard%2Fenv%2Flinux-x86-base
And the Coordinator is at: https://code.google.com/p/go/source/browse/dashboard/coordinator/main.go?repo=tools
The Coordinator is currently running at: http://126.96.36.199/ (if it’s down, that’s because there’s nothing to do and it’s not running, or the infrastructure to start it back up has failed and I’ll probably fix it soon)