Amazon ECS at The Climate Corporation: Using Amazon ECR and Multiple Accounts for Isolated Regression Testing

This is a post from Nathan Mehl, a Senior Staff Engineer, on the Operations Engineering team at The Climate Corporation.

Like many large users of AWS, The Climate Corporation uses multiple AWS accounts in order to provide strict isolation between a production environment and various pre-production environments (testing, staging, etc.). Applications are continuously deployed from a build/CI system (Jenkins CI) into a testing environment, and then, after either automated or manual testing (depending on the application and team), are “promoted” into a staging/pre-production environment where full regression/load tests are done, and then finally promoted again into production.

But when it came time to evaluate the use of Amazon EC2 Container Registry (Amazon ECR) and Amazon EC2 Container Service (Amazon ECS), our multi-account deployment presented a challenge: if the unit of deployment is a container, and there are potentially multiple full-fledged independent accounts where that container could be running, how do we safely and conveniently manage the lifecycle of an individual container image as it makes its way through the testing pipeline and into staging and production?

One obvious approach that we quickly discarded was simply having an independent ECR deployment in each account. The problem with this approach is that it would be extremely difficult to know the current deployment state of any given image without building and maintaining an external state-tracking system that could easily fall out of sync with ground truth in ECR. Also, “promoting” an image between environments would require copying it from one account’s ECR registry to another’s: in addition to being slow, this would require careful construction of inter-account IAM policies to let a single client pull from ECR in account A and then push to account B.

We chose to have a single “ECR registry of record” in the AWS account that hosts our Jenkins continuous integration system: Jenkins builds create the container images and push them to ECR. An out-of-band scheduled AWS Lambda function process iterates over the list of ECR repositories and applies a registry access policy that allows all of our other accounts read access.

The next question was: how to track the state of any given container image as it moved through the different accounts/environments on its way to production? Again, we could build an external state-tracking system, but after some thought we realized that we could use the tag metadata offered by the v2.2 Docker Manifest specification and the ECR put_image API operation to provide the same information.

To track the state of the container image through the continuous deployment pipelines, we use multiple Docker tags to ensure that the image’s current state is implicit in the current state of its repository in ECR. Each image has a unique tag applied to it at creation time, what we call the BUILD TAG. The build tag is treated as immutable: once created, none of our other tooling alters it. The build tag ties the image back to a particular build in Jenkins and to a particular git hash, for example for build #11 of the “ecs-demo” project:

Image Tag Image Digest
2016.08.24T17.13.38Z.5ad95f2-ecs-demo-11 sha256:749a1f13eff516cc4fcbc2a9a28ea1685440a1fbf29b

But as our tooling “promotes” the image from our testing environment to staging to production, it adds or updates additional tags—which we call ENVIRONMENT TAGS—pointing at the same image.

When a new image is built in our Jenkins continuous integration server (usually because a git pull request has been approved and merged to the master branch of that project’s repository), a common build script takes care of building the image, applying the build tag and the “testing” environment tag, and then pushing the image and both of those tags to ECR and then kicking off the deployment into our testing environment. A tool called, simply enough, “promote” can be used either by automated processes or humans to move an image from the testing environment to the staging, or from there to production.

For example, if build #11 had made it through the testing and staging environments but had not yet hit production, there might be two images in play, but five tags:

Image Tag Image Digest
2016.08.24T17.13.38Z.5ad95f2-ecs-demo-11 sha256:749a1f13eff516cc4fcbc2a9a28ea1685440a1fbf29b
2016.08.24T17.13.38Z.5ad95f2-ecs-demo-10 sha256:62985ec242857128fa0acea55e3c760e85594d6a2868
testing sha256:749a1f13eff516cc4fcbc2a9a28ea1685440a1fbf29b
staging sha256:749a1f13eff516cc4fcbc2a9a28ea1685440a1fbf29b
production sha256:62985ec242857128fa0acea55e3c760e85594d6a2868

Later on, when build #11 is promoted to production but build #12 has hit the testing environment, the “testing” tag is applied to the same image as the build tag for build #12:

Image Tag Image Digest
2016.08.24T17.13.38Z.5ad95f2-ecs-demo-12 sha256:c79b3e5b3459eb6f0d08a26eb304b8b70235d2eb7622
2016.08.24T17.13.38Z.5ad95f2-ecs-demo-11 sha256:749a1f13eff516cc4fcbc2a9a28ea1685440a1fbf29b
testing sha256:c79b3e5b3459eb6f0d08a26eb304b8b70235d2eb7622
staging sha256:749a1f13eff516cc4fcbc2a9a28ea1685440a1fbf29b
production sha256:749a1f13eff516cc4fcbc2a9a28ea1685440a1fbf29b

The key here is that each image can potentially have multiple tags pointing at it, and the state of those image-to-tag mappings tells us where the image is in the CD pipeline.

So how do we manage all of these tags in ECR?

Tying this all together requires a bit of code: the Docker manifest format does not allow for the same tag to be associated with multiple images, and that’s a good thing as otherwise we would have no way of enforcing the uniqueness of a tag-to-hash mapping.

But that means that in order to find out what that mapping is, we need to iterate over the list of images in the repository using ECR’s batch_get_image API and then derive the mapping of build tag to environment tag: if two tags point to the same image digest, we can infer that they are both tagging the same image:

import re
from collections import defaultdict

BUILD_TAG_RE = re.compile(r'^\d{4}\.\d{2}\.\d{2}T\d{2}\.\d{2}\.\d{2}Z')
ENVS = frozenset(['testing', 'staging', 'production'])
REGISTRY = ''
ECR = '%s.dkr.us-west-2.amazonaws.com' % REGISTRY

def get_tags_by_image(all_images):
    """ Iterate over the output of batch_get_image; return a dictionary
        that maps image digest hashes to a set of image tags."""
    images = defaultdict(set)
    for image in all_images:
        images[image['imageId']['imageDigest']].add(
            image['imageId']['imageTag'])
    return images

def get_equivalent_tags(all_images):
    """ Iterate over the output of batch_get_image; return a dictionary
        that maps environment tag names to build tag names. """
    equivs = {}
    tags_by_image = get_tags_by_image(all_images)
    for tags in tags_by_image.itervalues():
        # any given tag can only apply to one image, but
        # an image may have multiple tags. So each possible
        # tag will only ever appear one time in one of the images
        # dict's value lists. Index those lists by the env tag.
        if len(tags) > 1:
            env_tags = [x for x in tags if x in ENVS]
            build_tags = [x for x in tags if BUILD_TAG_RE.match(x)]
            for env in env_tags:
                for build in build_tags:
                    # this is safe because `env` will only ever
                    # occur once in tags_by_image.values()
                    equivs[env] = build
    return equivs

def get_all_images(ecr, repo, tag_digest_list):
    """ Query ECR for all image manifests for the images listed
        in tag_digest_list. Return the 'images' section of the
        response; warn if there are errors. 
    """
    all_images = ecr.batch_get_image(
        registryId=REGISTRY,
        repositoryName=repo,
        imageIds=tag_digest_list
    )
    if all_images['failures']:
        # not even sure this could ever happen as we're
        # feeding it directly from ecr.list_images()...
        log.warning(
            'Some Docker images could not be fetched:\n %s',
            all_images['failures'])
    return all_images['images']

def get_digests_and_tags(ecr, repo):
    """ Iterate over the paginated output of ecr.list_images for
        our repository; return a list of
        {"imageTag": "foo", "imageDigest": "bar"} dicts
    """
    try:
        response_images = []
        paginator = ecr.get_paginator('list_images')
        for page in paginator.paginate(
                registryId=REGISTRY,
                filter={'tagStatus': 'TAGGED'},
                repositoryName=repo):
            response_images.extend(page['imageIds'])
    except Exception as e:
        log.fatal(
            'Failed to fetch images for {0} from ECR'.format(repo))
        raise e
    return response_images

def main(argv):
    repo = sys.argv[1]
    creds = get_creds()
    ecr = boto3.client('ecr')
    tag_digest_list = get_digests_and_tags(ecr, repo)
    all_images = get_all_images(ecr, repo, tag_digest_list)
    equivs = get_equivalent_tags(all_images)
    print json.dumps(equivs, indent=2)

This code dumps out a JSON object with the current mapping of environment tag to build tag:

$ list_tags.py oe/ecs-demo
{
"staging": "2016.08.24T17.13.38Z.5ad95f2-ecs-demo-11",
"production": "2016.08.24T17.13.38Z.5ad95f2-ecs-demo-11",
"testing": "2016.08.24T17.13.38Z.5ad95f2-ecs-demo-12"
}

The last problem to solve is updating the tags as the image moves along the pipeline. We could, of course, use the standard Docker tools to download the image, tag it, and push it back up:

$ eval $(aws ecr get-login)
$ docker pull oe/ecs-demo:2016.08.24T17.13.38Z.5ad95f2-ecs-demo-12
$ docker tag oe/ecs-demo:2016.08.24T17.13.38Z.5ad95f2-ecs-demo-12 oe/ecs-demo:staging
$ docker push oe/ecs-demo:staging

But pulling down a multi-gigabyte container image in order to change a few bytes worth of tag data is slow and time-consuming. Surely, there’s a better way?

As it happens, there is! The new v2.2 manifest format for Docker images finally separated the tag text from the secure hash of the image layers, and the most recent version of the ECR API lets us push up new image manifests and specify tags to apply to them as part of the request via the ecr.put_image API. All this means that we can easily create new manifests with the same contents but different tags, without having to actually pull down the layers themselves.

So, after we know the current mapping, updating the image’s tags is a matter of finding out the SHA256 hash that the build tag currently maps to, grabbing the manifest for that hash from ECR, and pushing the manifest back up to ECR using the ECR put_image API and setting a new tag there. Building on the code above:

def push_new_image(ecr, repo, manifest, dst_env_tag):
    """ Attach the desired tag to an existing image in our
        repository by creating a new image with the exact
        same manifest but a different tag name.
    """
    response = ecr.put_image(
        registryId=REGISTRY,
        repositoryName=repo,
        imageManifest=manifest,  # same manifest
        imageTag=dst_env_tag) # new tag!
    return response['image']['imageId']

def get_img_manifest(build_tag, all_images):
    """ Iterate over the output of ecr.batch_get_image();
        return the manifest of the first image with
        an ImageTag matching build_tag.
    """
    for image in all_images:
        if image['imageId']['imageTag'] == build_tag:
             return image['imageManifest']
    raise Exception('Manifest Not Found!')

def promote_tag(ecr, repo, equivs, all_images, src_env_tag, dst_env_tag):
    """ Update the state of our Docker repo, telling it that the destination
        environment tag should now be associated with the same image as the
        source environment tag.
    """
    # "equivs" is the mapping of environment tags to build tags, from
    # get_equivalent_tags(), above
    build_tag = equivs[src_env_tag]
    manifest = get_img_manifest(build_tag, all_images)
    response = push_new_image(ecr, repo, manifest, dst_env_tag)
    log.info('Created new image: \n%s', response)

This code:

  1. Finds the build tag that corresponds to the same image hash as our source environment tag, by using the “equivs” dict that uses get_equvalent_tags() built in the previous code example.
  2. Gets the image manifest for that build tag.
  3. Pushes that manifest back up to ECR using put_image, but using the imageTag attribute to attach the name of the destination environment to the manifest.

After this process is done, it’s reflected in the mappings for any new client that comes along to look, for example, if we promoted build #12 from testing to staging:

$ list_tags.py oe/ecs-demo
{
  "staging": "2016.08.24T17.13.38Z.5ad95f2-ecs-demo-12",
  "production": "2016.08.24T17.13.38Z.5ad95f2-ecs-demo-11",
  "testing": "2016.08.24T17.13.38Z.5ad95f2-ecs-demo-12"
}

Of course, that doesn’t actually deploy any code, it just moves the tags around to reflect what the deployment system has done. How that part works might be material for another post.

Conclusion

By using the ecr.put_image() API and ECR support for the v2.2 Docker Manifest format, we’ve implemented state tracking for our containers using nothing but the Docker tagging system. There’s no external state database to keep synced up: the intended state of each image can be found by using ecr.batch_get_image().

Senior Engineering Manager at The Climate Corporation

Posted in Engineering

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: