The hidden cost of building

Thanks to EJ Ciramella for this thought-provoking post. There'll be a Build Doctor T-shirt on its way to him soon.

In this down economy, irrespective of the size of the company involved, people want to save money, limit costs and increase the throughput of their systems. One area of savings is the build and continuous integration environment.

Of the smattering of continuous integration servers available to a company's release engineering staff, many offer build buttons that aren't controlled by ACLs. Without this control point, anyone in any department attached to the corporate network can click off a build, regardless of the readiness of the code within source control.

Where I work, we've recently dropped CruiseControl for Hudson. A few colleagues have come by since the rollout asking where the build button is, because they don't have access. When pressed about why they'd like to manually spin a build, the resounding answer has been "to see what it looks like in Hudson". This is exactly the situation we're trying to avoid, and exactly the subject I'll try to illuminate in this article.

Before diving any deeper, there are a few things I think release engineering at any company must understand. Each company has a unique workflow, from project concept to design, to implementation, to release, and off into support mode. What works great at one place may or may not work at another. There are industry best practices and white papers aplenty, but if you find it difficult to follow any of these at your company, the best approach in most cases is to take what you can from those documents and carefully plan an evolutionary (not revolutionary) process to reach a tailored solution.

With all of this in mind, let's cover the various steps and the unseen costs of performing builds.

Here is a high-level list of things that happen when we spin a release-type build.

  • SCM label

Initially, we labeled first and asked questions later. The mindset was that if the build failed, a developer could sync (we use Perforce) to a label and get exactly what the build server had used to generate the failure. Since the Perforce plugin for Hudson doesn't operate in this manner (one of a few ways in which we ended up altering the plugin to suit us), we switched to labeling only if the build passes. Since very few developers ever took the time to sync by the labels, the failed-build labels were just a waste of space. Either way, depending on the size of the codebase getting the label attached, there is disk space and memory consumed on the Perforce server.
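As a concrete illustration, here is a minimal sketch of "label only on success", assuming the build step reports a pass/fail result; the depot path, changelist and label naming scheme are hypothetical placeholders for your own conventions.

```python
# Minimal sketch: apply a Perforce label only when the build passes.
import subprocess

DEPOT_PATH = "//depot/project/..."   # hypothetical depot path
CHANGELIST = "12345"                 # the changelist the build synced to

def label_if_green(build_ok: bool, build_number: int) -> None:
    if not build_ok:
        return  # failed builds get no label, saving space on the Perforce server
    label = f"build-{build_number}"  # hypothetical label naming scheme
    # 'p4 tag' creates the label if needed and applies it to the given revisions.
    subprocess.run(
        ["p4", "tag", "-l", label, f"{DEPOT_PATH}@{CHANGELIST}"],
        check=True,
    )
```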

  • Build node (CPU/RAM/HD)

I'm sure that the visitors of this site are savvy enough to understand the beauty and flexibility that come with a distributed build system. I use the term "distributed" loosely here, as any given build isn't farming out parts of its compilation, just individual jobs (in Hudson parlance) or, in some cases, individual Maven modules. By the time a build reaches the queue stage, the job has consumed an executor within the cluster. In our current cluster, we have three slaves and a master. The master and two of the slaves are only allowed one executor each; the third slave has six. If a build is forced to run on one of the "singleton" nodes, that node becomes unavailable until the build completes. Thankfully, our build times are short (the longest is 20 minutes), but I think the very speed with which a build fires mentally cheapens the process. Don't misinterpret this as "let's artificially inflate build times", but keep it in mind when a large refactoring of the build process and associated scripts yields a massive time saving (heck, why not push back to get more testing done in that same time frame?).
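If you want to know whether a manually fired build is about to eat one of those scarce "singleton" executors, the Hudson/Jenkins remote JSON API can tell you before anyone clicks. The sketch below is assumption-laden: the master URL is made up, and the `computer`, `idle` and `numExecutors` fields should be checked against what /computer/api/json actually returns on your own server.

```python
# Rough sketch: count idle executors via the Hudson/Jenkins remote API.
import json
import urllib.request

HUDSON_URL = "http://hudson.example.com"   # hypothetical master URL

def free_executors() -> int:
    with urllib.request.urlopen(f"{HUDSON_URL}/computer/api/json") as resp:
        data = json.load(resp)
    free = 0
    for node in data.get("computer", []):
        # 'idle' is reported when none of the node's executors are busy;
        # a finer-grained count would inspect each executor entry.
        if node.get("idle"):
            free += node.get("numExecutors", 0)
    return free

if __name__ == "__main__":
    print(f"{free_executors()} executor(s) currently idle")
```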

  • Client spec updates (Perforce Server)

One of the great features of Perforce is that the server is where the "what you have" list is maintained. I've seen the arguments for working offline and what happens if a refactoring occurs, yadda-yadda-yadda. But in large-scale corporate environments, many institutional services will be unavailable in that mode anyway.

When a person syncs a project (in Perforce terms, but essentially a directory of directories and files), the server updates its records to reflect what the user ends up with. The same is true of the build process. When we set up Hudson, our SCM configuration choices left us with Hudson managing the client specs. This means that, in some cases, there is one client spec for each node for each job. Even if only one is being updated, that is still data being written to the server (again).
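To get a feel for how many of these Hudson-managed client specs have piled up, you can list them straight off the server. The sketch below assumes the clients share a recognizable naming prefix; the prefix is hypothetical, and the `p4 clients` output is parsed loosely.

```python
# Small sketch: count the client specs Hudson has created on the Perforce server.
import subprocess

PREFIX = "hudson-"   # hypothetical naming convention for Hudson-managed clients

def hudson_clients() -> list[str]:
    out = subprocess.run(["p4", "clients"], capture_output=True, text=True, check=True)
    names = []
    for line in out.stdout.splitlines():
        # 'p4 clients' lines look like: Client <name> <date> root <root> '<description>'
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "Client" and parts[1].startswith(PREFIX):
            names.append(parts[1])
    return names

if __name__ == "__main__":
    clients = hudson_clients()
    print(f"{len(clients)} Hudson-managed client specs on the server")
```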

  • Artifact storage (both live and backup)

Now the build is finished and we're going to retrieve the artifacts for storage. Where I work, the build artifacts are stored in a few ways (all on various NetApp slices). They are as follows:

  • Deployable units - These are the actual applications that we push to our various deployment environments. Because of a facilitating Maven project that allows cross-application dependency validation, some of these artifacts are stored on a generalized NetApp slice (to allow us to keep our artifact storage continuous-integration agnostic - who knows when we'll switch from Hudson to something else) and replicated to Archiva to allow people to reference certain bits as dependencies from POM files.
  • Libraries - These are effectively the building blocks of the deployable units. The final destination of libraries is Archiva (or your repository manager of choice).

Essentially, there are two places where things are stored - Archiva and our "buildartifacts" mount - and to reiterate, both are NetApp mounts. There is a backup mechanism that keeps hourly, daily, monthly and yearly restore points. All of this takes up space, but if we ever had a complete system failure, we could very quickly return to business as usual.
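To put a rough number on the storage side, here is some back-of-the-envelope arithmetic using the artifact sizes quoted further down in this piece; the builds-per-day and retention figures are made-up inputs to swap for your own.

```python
# Back-of-the-envelope sketch: storage cost of unnecessary builds.
ARTIFACT_MB = 280 + 139      # deployable unit + web-content artifact (zipped)
BUILDS_PER_DAY = 10          # hypothetical: how many of these are manual re-spins?
RETENTION_COPIES = 3         # live copy plus a couple of backup restore points (rough)

daily_gb = ARTIFACT_MB * BUILDS_PER_DAY * RETENTION_COPIES / 1024
print(f"~{daily_gb:.1f} GB/day of artifact storage, before Archiva replication")
```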

  • Potential deployment of said artifact

Now that the build is done and someone within the organization has chosen to deploy and test it, they may deploy it (or put a dependency on it and trigger an application build - all with no changes) to a given stack for testing. One of our typical artifacts is close to 280 MB zipped. To successfully deploy and test it, the artifact is extracted on at least two servers, and a 139 MB web-content artifact is typically deployed at the same time (as we have to keep these things in sync). Deployments back up the previous deployment (keeping just a few) prior to extracting the new item.
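Here is a minimal sketch of that deploy step, assuming a zipped artifact, a backup directory and a retention count of "just a few"; every path and number is a hypothetical placeholder rather than our actual tooling.

```python
# Sketch: back up the previous deployment, prune old backups, extract the new artifact.
import shutil
import time
import zipfile
from pathlib import Path

DEPLOY_DIR = Path("/opt/app/current")    # hypothetical deployment target
BACKUP_DIR = Path("/opt/app/backups")
KEEP_BACKUPS = 3                         # "just a few" previous deployments

def deploy(artifact_zip: Path) -> None:
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    if DEPLOY_DIR.exists():
        # Archive the previous deployment before replacing it.
        stamp = time.strftime("%Y%m%d-%H%M%S")
        shutil.make_archive(str(BACKUP_DIR / f"previous-{stamp}"), "zip", DEPLOY_DIR)
        shutil.rmtree(DEPLOY_DIR)
    # Prune backups beyond the retention count, oldest first.
    backups = sorted(BACKUP_DIR.glob("previous-*.zip"), key=lambda p: p.stat().st_mtime)
    for old in backups[:-KEEP_BACKUPS]:
        old.unlink()
    DEPLOY_DIR.mkdir(parents=True)
    with zipfile.ZipFile(artifact_zip) as zf:
        zf.extractall(DEPLOY_DIR)
```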

  • Testing requirement of artifact (with no changes)

Once deployed, now comes the tax on the squishy bits (humans) if you have no automated integration, smoke or load testing. And if you do have all those tiers of testing, you'll be consuming each one along the way. Couldn't everything be doing something more productive than re-testing a deploy that contains no changes?
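One way to spare the squishy bits is simply to remember what has already been tested: record a checksum of the last artifact that passed the suite and skip the re-run when the bits are identical. This is a sketch of the idea rather than anything from the setup described above, and the marker-file location is invented.

```python
# Sketch: skip re-testing an artifact whose exact bits have already passed the suite.
import hashlib
from pathlib import Path

LAST_GOOD = Path("/var/lib/ci/last-tested.sha1")   # hypothetical marker file

def already_tested(artifact: Path) -> bool:
    digest = hashlib.sha1(artifact.read_bytes()).hexdigest()
    return LAST_GOOD.exists() and LAST_GOOD.read_text().strip() == digest

def record_tested(artifact: Path) -> None:
    # Call this only after the full test suite has passed against the artifact.
    LAST_GOOD.write_text(hashlib.sha1(artifact.read_bytes()).hexdigest())
```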

  • Other

There are other fiddly bits that happen as part of each build, like sending emails, various cleanup steps and so on, not to mention the can of worms that is dependency generation (if Build-A happens, Build-B needs to be re-spun to pick up the change in Build-A, ad infinitum, ad nauseam). What if the build artifact actually has to be transferred to another country for deployment? Transferring 280 MB of data while people are trying to sync an as-yet-unproxied Perforce project or retrieve dependencies from Archiva is not a good way to spend your workday.
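The transfer cost alone is easy to underestimate, so here is some quick arithmetic for moving that 280 MB artifact over a WAN link; the link speeds are illustrative assumptions, not measurements.

```python
# Rough arithmetic: how long a 280 MB artifact takes at a few assumed link speeds.
ARTIFACT_MB = 280
for mbit_per_s in (2, 10, 45):               # hypothetical WAN capacities
    seconds = ARTIFACT_MB * 8 / mbit_per_s   # MB -> megabits, divided by link speed
    print(f"{mbit_per_s:>3} Mbit/s -> ~{seconds / 60:.0f} min per transfer")
```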

Everything above consumes hardware resources, time and space that could be reserved for a legitimate build or better testing, from CPU time to memory allocation to the various stopping points and distribution mechanisms. If people go ahead and deploy and try manual testing, now we're talking about consuming one or more human beings as well. I covered a single large application above with the quoted sizes, but in fact this Maven module generates another 120 MB artifact that is stored in Archiva, which is consumed by another deployable unit that is 338 MB zipped.

This is why it's best to limit or prevent people from firing builds manually. I've taken the tack that if we find people spinning unnecessary builds, we'll revoke their privileges within the Hudson matrix ACL settings. And I'm not opposed to taking away further permissions, forcing more people to rely on the polling aspect of Hudson.

(image care of swimparallel)
