S3 Distributed Version Restore

At The Climate Corporation, we work with Amazon Web Services a lot. As you can imagine, weather data is very plentiful. We keep lots of data in Amazon’s Simple Storage Service (S3), which is basically a great big filesystem in the sky, and in Mandoline, our array data store, which can use S3 as a backend. It can take some effort to deal with so much data, but we make that effort willingly, since we need it for important things like predicting fertilizer needs for fields.

To help manage all this data, we use S3’s advanced features, such as versioning. When versioning is enabled, every time a file in S3 is updated or deleted, the old version remains stored in S3 as a backup, and we can access it with a special API call. Then, if we accidentally overwrite or delete a file, we can make an API call to list all that file’s versions, choose one, and then use another API call to restore that old version.

However, at Climate, we often manipulate our weather data with automated tools. That means that if we mess up, we can mess up in a really big way. With millions of files and billions of versions, the API for listing and restoring old versions might take weeks to restore them all. In order to fix mistakes made by automated tools, at large scale, we need another automated tool, one that can restore lots of files in a hurry. Unfortunately, as far as we know, there wasn’t one provided.

So, we’ve made just that automated tool, S3DistVersions, to let us restore lots of files efficiently. For it to perform well, it needs lots of simple parallelism, so we used Hadoop MapReduce, a parallel batch processing system. Because its primary task is to access Amazon S3, we designed it to work well inside Elastic MapReduce (EMR), Amazon’s on-demand Hadoop service.

We based the interface for S3DistVersions on that of S3DistCp, a batch tool for copying data in S3. Given a time and a location in S3, S3DistVersions will restore all files there to their state at the specified time. (Because each file has its own list of versions, you can’t specify a version to restore them all to. Instead, we find it easier to specify a time, and S3DistVersions selects the appropriate version for each file.) It takes simple arguments:

java -jar S3DistVersions-0.1.0.jar
  --src s3://mybucket/mypath
  --prefixes s3://mybucket/list-of-s3-prefixes.txt
  --restore-time 2014-01-01T14:00:00+07:00


The hardest thing to deal with was obtaining parallelism while working with S3’s versioning API. Suppose we wanted to restore all the files under the location s3://mybucket/mypath. Unfortunately, S3 only permits access to the versions of files under a location via a (serial) paging API, so we have to make a long sequence of API calls. Sure, each request might return a thousand versions, but it would take a long time to get through a million files and all their versions at that rate. In order to restore those files in a useful amount of time, we need to access the version information in parallel.

(Note: We can’t just start by listing the files under s3://mybucket/mypath because S3’s file-listing API won’t show deleted files. Instead, we have to use the version-listing API.)

To get parallelism while listing versions, we ask the user to provide a list of “prefixes”–the beginnings of the filenames that appear under s3://mybucket/mypath. Then, we can make requests for versions in those separate places in parallel. (For instance, given prefixes /a/1 and /b/2, we can scan for versions under s3://mybucket/mypath/a/1 and s3://mybucket/mypath/b/2 in parallel.) We need that parallelism in order to get through our billions of versions in a reasonable amount of time; for smaller restorations, it’s perfectly fine to omit the list of prefixes. And of course, parallelism is easy in the restore step, so the prefix list is only needed for parallelism in listing versions.

In practice, we find that it’s not hard to use a prefix list like this. We have our files spread out in many different subdirectories, so we use that list of subdirectories as our prefixes.

We designed this program to follow the recommended advice for partitioning keys. S3 keeps track of filenames (keys) in an index that is spread across many different Amazon computers in sequential chunks. It’s sort of like the way conferences have registration tables organized by segments of the alphabet (e.g. ‘A-C’, ‘D-F’, …) to distribute the load. Just as you wouldn’t want to have all the people with ‘A’ names come in at the same time, we don’t want our program to access the list of files in sequential order. Therefore, when S3DistVersions interacts with S3 files or lists their versions, it first shuffles the filenames or filename prefixes. That way, it spreads the load across Amazon’s S3 index cluster.


First of all, for this project, as for most things we do, we used Clojure, a programming language that’s based on Lisp and runs on the Java Virtual Machine (JVM). Those parents give it simplicity and speed, plus access to a large range of useful libraries. It’s also an opinionated language that prefers a functional, immutable style. We like that opinion; for instance, functional code is easier to reason about in bulk, since we can be confident it doesn’t have weird side effects. That also happens to work great with parallel data processing such as MapReduce.

We also often use Cascalog, a MapReduce tool for Clojure. However, here we wanted tight control over keys, grouping, and whether work happens in mappers or reducers, so we tried Parkour, which had some exceptions in EMR. Ultimately, we used clojure-hadoop. It isn’t as elegant as Cascalog in some ways, but it does give us the control we want.


Now, we are confident that when we mess up lots of files in S3, we can fix the problem at scale. We are pleased to release S3DistVersions, and we hope that it will help others work with S3’s versioning more easily. We welcome contributions and feedback, so please let us know how it works for you.

Tagged with: , , , ,
Posted in Engineering, Software Engineering

A Summary of Kyle Kingsbury’s Distributed Systems Workshop

Last week The Climate Corporation invited Kyle Kingsbury — inventor of Jepsen and distributed-systems-expert-about-town — to lead a workshop on distributed computing theory and practice at our San Francisco office.

Kyle’s work focuses on the necessity of theory to distributed computing practice — it’s a task that’s almost impossible to get right on a first try, and you need all the theoretical assistance you can get. You can see empirical application of distributed computing theory to real-world system in his series of Jepsen blog posts.

This post will follow a rough outline of Kyle’s workshop, starting with theoretical definitions and concepts, and ending with applying those definitions and concepts to discuss and evaluate distributed computing practice. Enjoy!

What is a Distributed System?

Kyle began the discussion with the most fundamental question: what does it mean when we call a system “distributed”? His definition — any system made up of parts that interact slowly or unreliably — is familiar, but by noting that the definition of “slow” is highly dependent on context, he opens up the “distributed” label to a wide range of systems that don’t immediately appear to be distributed. This could include:

  • NUMA architectures
  • making plans via SMS
  • Point-of-sale terminals
  • mobile applications
  • communicating threads or processes on a single computer
  • and of course, computers communicating via TCP/UDP over a LAN or the internet.

If the inter-part communication is, say, an order of magnitude slower than the intra-part communication, then you have a distributed system.

Asynchronous Networks

Since all networks in the real world take time to propagate information (and that information can be arbitrarily delayed or lost), understanding timing (and uncertainty of timing) is crucial to being able to reason about the behavior of the system. An important tool Kyle introduced for that purpose is the light cone diagram (by analogy to relativity and spacetime).

two threads accessing a shared variable

Two threads accessing a shared variable.

Because the operations take time to execute (and return a result), the ordering of those operations is ambiguous in this example. Thread 2 can know the time it initiated the read and the time it received the result, but it can’t know the time the read operation was applied on the memory — it could return either the old or new value.


We want to be able to define an ordering for events. However, keeping track of time in a distributed system is difficult. There are a number of types of clocks used for ordering events in a distributed system.

Wall clocks. Nodes track “actual” (human) time, likely by talking with an NTP (Network Time Protocol) server. However, in practice the synchronization doesn’t work well enough to track a precise enough time. The hardware frequently drifts, sometimes by centuries, not to mention that POSIX time is not monotonic.

Lamport clocks. Each node has a counter that it increments with each state transition. The counter is included with every message and, on receiving a message, a node sets its counter to the one included on the message, if greater than its own. This gives a partial ordering of events. If event b is causally dependent on event a, then C(a) < C(b).

Vector clocks. Vector clocks generalize Lamport clocks. The clock is a vector of counters, one for each node. When executing an operation, a node increments its counter in the vector, and again includes the clock with every message to other nodes. When receiving a message, a node takes the pairwise maximum of its clock and the one in the message. Orders more events than Lamport clocks.

Dotted version vectors. “Better vector clocks.” Still a partial ordering, but orders more events.

GPS & Atomic clocks. Much better than NTP. Gives a globally distributed total orders on the scale of milliseconds, which means you can perform one operation (that’s allowed to conflict) per uncertainty window. But it’s really expensive.

Consistency Models

As software developers, we expect the systems we program against to behave correctly. But there are many possible definitions of correct behavior, which we call consistency models.

Each consistency model defines what sequence of operations on a state machine are valid. In the diagram below, a parent-child relationship between two models means any valid sequence of operations in the parent is valid in the child. That is, requirements are stricter as you move up the tree, which makes it easier to reason about the system in question, but incurs performance and availability penalties.

A hierarchy of consistency models.

A hierarchy of consistency models.

Kyle himself has an excellent write-up of these consistency models on his blog, but I’ll give a brief description of a few here.

Linearizability – There appears to be a single global state where operations are applied linearly.

Sequential consistency – Writes appear in the same order to all readers. (But a read may not reflect the “current” state.)

Causal consistency – Only the order of *causally related* operations is linear. Unrelated (concurrent) operations may appear in any order to readers.

Linearizability/Strong serializability is the “golden standard” of consistency models. But there are performance and availability tradeoffs as you move up the tree. It is impossible to have total availability of a linearizable (or even sequentially consistent) system.

Convergent Replicated Data Types

CRDTs are a class of eventually consistent data types that work by supporting an idempotent, commutative and associative merge operation. The operation merges the histories of two (diverging) versions of the data structure. Some thorough and accessible examples of CRDTs can be found in Kyle’s project here.

Building Consensus

To achieve stricter consistency models in a distributed system, it is necessary to achieve consensus on certain facts among a majority of nodes. The Paxos algorithm is the notoriously difficult to understand gold standard of consensus algorithms, introduced by Leslie Lamport in his 1970 paper, The Part-Time Parliament.

Since it’s publication, it has inspired a family of similar algorithms with optimizations in performance or accessibility, including Multi-Paxos, Fast Paxos and Generalized Paxos.

In Search of an Understandable Consensus Algorithm (Ongaro & Ousterhout, 2014) is a recent publication describing the Raft consensus algorithm that, like it says on the tin, is intended to be (more) easily understood than Paxos, while retaining similar guarantees and performance characteristics.

A Pattern Language

Kyle’s number one piece of advice for building a distributed system is don’t do it if you don’t have to. Maybe your problem can be solved by getting a bigger box, or tolerating some downtime.

Number two is to use someone else’s distributed system. Pay Amazon to do it, or at least use battle-tested software.

Number three is don’t fail. You can build really reliable hardware and networks at the cost of moving slowly, buying expensive hardware and hiring really good talent.

But, if your system runs long enough, it will eventually fail, so accept it gracefully.

Take frequent backups. Done correctly, they give you sequential consistency. When taking a backup of a database, understand the transactional semantics. (For example, copying filesystem state of a MySQL database may not reflect a consistent transactional state at every point in time, or filesystem may change during backup process. Better to use mysqldump which preserves database semantics.)

Use immutable values when possible. Data that never changes is trivial to store because it does not require coordination.

If you require mutable values, instead use mutable identities (pointers to immutable values). Can fit a huge number of mutable pointers in a linearizable RDBMS while immutable values are stored in S3 or other eventually consistent stores. This is similar to how Datomic works.

Some operations can return fractional results or tolerate fractional availability. (E.g., search results, video or voice call quality, analytics, map tiles.) This lets you gracefully degrade the app experience while trying to recover, rather than suddenly crashing. Feature flags that turn off application features in response to system health metrics can be used to the same end.

Service owners should write client libraries for their services, and the libraries should include mock I/O to allow consumers to test locally without having to set up (and keep in sync) a local instance of the dependency. Include the library version in the request headers so you can track uptake of new versions and go after the slowpokes.

Particularly important when moving from monolithic to distributed architectures is to test everything, all the time. This includes correctness invariants of course, but also performance invariants. Simulate network failures, performance test against large data sets. Test against production data sets since they are always different from test data.

Always bound the size of queues in your application, or you can get in a situation where work takes arbitrarily long to complete. Monitor enqueue and dequeue rates on the queue (throughput) as well as the time it takes a message to be worked after being enqueued (latency).


Thanks to Sebastian Galkin for organizing the event, Dylan Pittman for his notes and diagrams, and of course, Kyle Kingsbury for his time, energy and expertise.

Posted in Engineering

Multidimensional Arrays with Mandoline

At the Climate Corporation, we have a great demand for storing large amounts of raster-based data, and an even greater demand to retrieve small amounts of it quickly.  Mandoline is our distributed, immutable, versioned database for storing big multidimensional data.  We use it for storing weather data, elevation data, satellite imagery, and other kinds of data.  It is one of the core systems that we use in production.


What can Mandoline do for me?

Mandoline can store your multidimensional array data in a versionable way that doesn’t bloat your storage.  When you don’t know what your query pattern is, or when you want to preserve past versions of your data, Mandoline may be the solution for you.

Some more details

  • It’s a clojure library.
  • Using the Mandoline library, we have built services both to access and to ingest data in a distributed fashion.  The reasons for doing it this way is so that our scientists have easy and language agnostic access to the data, and so that it plays well with existing scientific tools, like netCDF libraries and applications in any language.
  • When we say distributed, we mean that you can have distributed reads and writes from different machines to your dataset.
  • Mandoline uses swappable backends, and can save actual data to different backends.  In production, it currently runs on Amazon’s DynamoDB.  For testing purposes, we can either use the sqlite or in memory backends.
  • We want to expand the backend offerings to other databases like Cassandra and HBase.
  • Mandoline takes advantage of shared structure to make immutability and versions possible.

For more of an introduction to Mandoline (formerly known as Doc Brown), here is a video of the talk I gave at Clojure/West 2014.



Mandoline was primarily written by Brian Davis, Alice Liang, and Sebastian Galkin.  Brian Davis and Steve Kim have been the main push to open source Mandoline for use with the general public.

Posted in Engineering

Service Oriented Authorization – Part 3

As a follow up to Part 1 and Part 2 here is the video of the talk I gave at RailsConf last month.

Disclaimer: This is my first conference talk, ever!

Slides available on Speakerdeck https://speakerdeck.com/lumberj/authorization-in-a-service-oriented-environment

Tagged with: , , ,
Posted in Engineering

Rake tasks from a Rubygem

While recently working on IronHide, specifically the CouchDB Adapter, I needed a way to have the gem expose custom Rake tasks to an end user’s application.

For example, to make working with the IronHide CouchDB Adapter easier, I wanted to allow users to upload existing rules from disk to a remote CouchDB database.

The task could be something as simple as:

$ bundle exec rake iron_hide:load_rules\['/absolute/path/file.json',''\]

where the two arguments are the path to the JSON file and the remote CouchDB database.


After thinking for a moment, I knew that I’ve come across at least one gem that when included, exposes Rake tasks that I can use in my project, Bundler. So, I started looking through the source and found this:https://github.com/bundler/bundler/blob/master/lib/bundler/gem_helper.rb

Essentially, you’d like an application owner to be able to “install” a few Rake tasks at their convenience. Borrowing from the Bundler project:

# iron_hide/couchdb_tasks.rb
module IronHide
  class CouchDbTasks
    include Rake::DSL if defined? Rake::DSL

    def install_tasks
      namespace:iron_hide do
        desc 'Load rules from local JSON file to remote CouchDB server'
        task :load_rules, :file, :db_server do |t, args|


Now, in my application’s Rakefile I just need to add this line:

# Rakefile

Et voila!

$ bundle exec rake -T
#=> rake iron_hide:load_rules[file,db_server]  # Load rules from local JSON
Tagged with: , ,
Posted in Engineering

Indexing Polygons in Lucene with Accuracy

Today we share a post by Lucene Spatial contributor and guest author David Smiley.  David explains storing and querying polygon shapes in Lucene, including some brand new features to improve accuracy of spatial queries that The Climate Corporation helped get into the latest version.

Apache Lucene is a Java toolkit that provides a rich set of search capabilities like keyword search, query suggesters, relevancy, and faceting. It also has a spatial module for searching and sorting with geometric data using either a flat-plane model or a spherical model. For the most part, the capabilities therein are leveraged to varying degrees by Apache Solr and ElasticSearch–the two leading search servers based on Lucene.

Advanced spatial

The most basic and essential spatial capability is the ability to index points specified using latitude and longitude, and then be able to search for them based on a maximum distance from a given query point—often a user’s location, or the center of a map’s viewport. Most information-retrieval systems (e.g. databases) support this.

Most relational databases, such as PostGIS, are mature and have advanced spatial support. I loosely define advanced spatial as the ability to both index polygons and query for them by another polygon. Outside of relational databases, very few information-retrieval systems have advanced spatial support. Some notable members of the advanced spatial club are Lucene, Solr, Elasticsearch, MongoDB and Apache CouchDB. In this article we’ll be reviewing advanced spatial in Lucene.

How it works…

This article describes how polygon indexing & search works in some conceptual detail with pointers to specific classes. If you want to know literally how to use these features in Lucene, then first know you should be familiar with Lucene. Find a basic tutorial on that if you aren’t familiar with it. Then you should read the documentation for the Lucene-spatial API, and review the SpatialExample.java source which shows some basic usage. Next, read the API Javadocs referenced here within the article for relevant classes. If instead you want to know how to use it from Solr and ElasticSearch, this isn’t the right article, not to mention that the new accurate indexing addition isn’t yet hooked into those search servers yet, as of this writing.

Grid indexing with PrefixTrees

From Lucene 4 through 4.6, the only way to index polygons is to use Lucene-spatial’s PrefixTreeStrategy which is an abstract implementation of SpatialStrategy — the base abstraction for how to index and search spatially. It has two subclasses, TermQueryPrefixTreeStrategy and RecursivePrefixTreeStrategy (RPT for short) — I always recommend the latter. The indexing scheme is fundamentally based on a model in which the world is divided into grid squares (AKA cells), and is done so recursively to get almost any amount of accuracy required. Each cell, is indexed in Lucene with a byte string that has the parent cell’s byte string as a prefix — hence the name PrefixTree.

The PrefixTreeStrategy uses a SpatialPrefixTree abstraction that decides how to divide the world into grid-squares and what the byte encoding looks like to represent each grid square. There are two implementations:

  • geohash: This is strictly for latitude & longitude based coordinates. Each cell is divided into 32 smaller cells in an 8×4 or 4×8 alternating grid.
  • quad: Works with any configurable range of numbers. Each cell is divided into 4 smaller cells in a straight-forward 2×2 grid.

We’ll be using a geohash based PrefixTree for the examples in this article. For further information about geohashes, to include a convenient table of geohash length to accuracy, go to Wikipedia.

Indexing points

Lets now see what the indexed bytes look like in the geohash SpatialPrefixTree grid for indexing a point. We’ll use Boston Massachusetts: latitude 42.358, longitude -71.059. If we go to level 5, then we have a grid cell that covers the entire area ± 2.4 kilometers from its center (about 19 square kilometers in this case):

DRT2Y cell

This will index as the following 5 “terms” in the underlying Lucene index:


Terms are the atomic byte sequences that Lucene indexes on a per-field basis, and they map to lists of matching documents (i.e. records); in this case, a document representing Boston.

If we attempt to index any other hypothetical point that also falls in the cell’s boundary it will also be indexed using the same terms, effectively coalescing them into the same for search purposes. Note that the original numeric literals are returned in search results since they are internally stored separately.

To get more search accuracy we need to take the PrefixTree to more levels. This easily scales: there aren’t that many terms to index per point. If you want accuracy of about a meter then level 11 will do it, translating to 11 indexed terms. The search algorithms to efficiently search grid squares indexed in this way are interesting but it’s not the focus of this article.

Indexing other shapes

The following screen-captures depict the leaf cells indexed for a polygon of Massachusetts to a grid level of 6 (±0.61km from center):

Massachusetts, gridded MA zoomed in

The biggest cells here are at level 4, and the smallest are at 6. A leaf cell is a cell that is indexed at no smaller resolution, either because it doesn’t cross the shape’s edge, or it does but the accuracy is capped (configurable). There are 5,208 leaf cells here. Furthermore, when the shape is indexed, all the parent (non-leaf) cells get indexed too which adds another 339 cells that go all the way up to coarsest cell “D”. The current indexing scheme calls for leaf cells to be indexed both with and without a leaf marker — an additional special ‘+’ byte. The grand total number of terms to index is 10,755 terms. That’s a lot! Needless to say, we won’t list them here.

(Note: the extra indexed term per leaf will get removed in the future, once the search algorithms are adapted for this change. See LUCENE-4942.)

Theoretically, to get more accuracy we “just” need to go to lower grid levels, just as we do for points. But unless the shape is small relative to the desired accuracy, it doesn’t scale; it’s simply infeasible with this indexing approach to represent a region covering more than a few square kilometers to meter level accuracy. Each additional PrefixTree level multiplies the number of indexed terms compared to the previous level. The Massachusetts example went to level 6 and produced 5208 leaf cells, but if it had just gone to level 5 then there would have been only 463 leaf cells. With each additional level, there will be on average 32 / 2 = 16 times as many (for geohash) cells of the previous level. For a quad-tree it’s 4 / 2 = 2 times as many, but it also takes more than one added level of a quad tree to achieve the same desired accuracy, so it’s effectively the same no matter how many leaves are in the tree.

Summary: unlike a point shape, which has a linear relationship between accuracy and the number of terms, polygon shapes are exponential to the power of the numbers of levels. This clearly doesn’t scale for high-accuracy requirements.

Serialized shapes

Lucene 4.7 introduced a new Lucene-spatial SpatialStrategy called SerializedDVStrategy (SDV for short). It serializes each document’s shape vector geometry itself into a part of the Lucene Index called DocValues. At query time, candidate results are verified as true or false matches, by retrieving the geometry by doc-id on-demand, deserializing, and comparing to the query shape.

As of this writing, there aren’t any adapters for ElasticSearch or Solr yet. The Solr feature request is tracked as SOLR-5728, and is probably going to arrive in Solr 4.9. The adapters will likely maintain in-memory caches of deserialized shapes to speed up execution.

Using both strategies

It’s critical to understand that SerializedDVStrategy is not a PrefixTree index; if used alone, SerializedDVStrategy will brute-force and deserialize compare all documents, O(N) complexity, and have terrible spatial performance.

In order to minimize the number of documents that SerializedDVStrategy has to see, you should put a faster SpatialStrategy like RecursivePrefixTreeStrategy in front of it. And you can now safely dial-down the configurable accuracy of RPT to return more false-positives since SDV will filter them out. This is done with the distErrPct option, which is roughly the fraction of a shape’s approximate radius, and it defaults to 0.025 (2.5%). Given the Massachusetts polygon, RPT arrived at level 6. If distErrPct is changed to 0.1 (10%), my recommendation when used with SDV, the algorithm chose level 5 for Massachusetts, which had about 10% as many cells as level 6.

Performance and accuracy — the holy grail

What I’m most excited about is the prospect of further enhancing the PrefixTree technique such that it is able to differentiate matching documents between those that are a guaranteed matches versus those that need to be verified against the serialized geometry. Consider the following shape, a circle, acting as the query shape:

Gridded circle

If any grid cell that isn’t on the edge matches a document, then the PrefixTree guarantees it’s a match; there’s no point in checking such documents against SerializedDVStrategy (SDV). This insight should lead to a tremendous performance benefit since a small fraction of matching documents, often zero, will need to be checked against SDV. Performance and accuracy! This feature is being tracked with LUCENE-5579.

One final note about accuracy — it is only as accurate as the underlying geometry. If you’ve got a polygon representation of an area with only 20 vertices, and another with 1000 for the same area, then more vertices are likely to be a more precise reflection of the border of the underlying area. Secondly, you might want to work in a projected 2D coordinate system instead of latitude & longitude. Lucene-spatial / Spatial4j / JTS doesn’t help in this regard, it’s up to your application to convert the coordinates with something such as Proj4j. The main shortcoming of this approach is that a point-radius (circle) query is going to be in 2D even if you might have wanted a geodesic (surface of a sphere) implementation that normally happens when you use latitudes & longitudes. But if you are projecting then the distorted difference between the two is generally fairly limited if the radius isn’t large and if the area isn’t near the poles.


Thanks to the Climate Corporation for supporting my recent improvements for accurate spatial queries!

Posted in Engineering

Self Hosted Gem Server with Jenkins and S3

On my team, we like to be able to keep our applications light. We use several internal libraries to manage things like distributed request tracing, authorization, configuration management, etc.

Isolating this reusable, generic code into a library keeps our application code concise and manageable. It also allows us to test changes to the library in isolation (not to mention keeping our tests fast).

The server

Most of these gems are very specific, and it wouldn’t make much sense to make them public. So, we decided the best approach was to use our own, internally accessible gem server.

A gem server can really just be a set of static files — nothing fancy.

    # http://guides.rubygems.org/command-reference/#gem_generate_index
    # Given a base directory of *.gem files, generate the index files for a gem server directory
    $ gem generate_index --directory=GEMS_DIR

And just like that, we’ve generated the static files we need for our gem server.

Using S3

We use S3 for plenty of things here at Climate. We’d rather not have another server to have to support, and S3 is a perfectly fine place to host static files. So, we just turn an S3 bucket into an internally accessible endpoint (i.e., through our internal DNS routing on our network).

Amazon has instructions for setting up a bucket as a static file server.

Tying it all together: Automated testing and build

Now, we have our S3 bucket setup to behave like a static file server. This is the workflow we want:

  1. Make some changes. Commit. Push. Pull request.
  2. Pull request and code review
  3. Merge pull request
  4. Manually trigger a build of the new gem (which automatically runs tests) 4a. If tests pass, deploy the packaged gem to our gem server
  5. Run bundle update MY_NEW_GEM! to update our project


We use Jenkins to automate our builds here at Climate. Depending on the git server you use (we’re in the process of migrating to Atlassian Stash), Jenkins integration is a bit different. I won’t talk too much about the Jenkins configuration specifics, but this is basically what happens:

We have two Jenkins jobs: the first job is the build and the second one is the update server job. The reason behind this is to allow concurrent builds. When the first job completes, it triggers the second.

We run each of these jobs in a Linux container. Docker is a nice project designed to make that process easy. In our containers we make sure we have the ruby and s3cmd in the path.


Our build job is a parametrized Jenkins build. We pass in a parameter, which is the PROJECT_DIR, or the directory relative to some root where Jenkins can find the gem we want to build. We keep all our gems in the same repo for simplicity.

Jenkins will check out the gems repo, and build the specified gem. This is the build script that Jenkins will execute, which is essentially equivalent to:

    $ ./build.sh ~/gems_repo/my_new_gem
    #!/usr/bin/env bash
    # Exit 1 if any command fails
    set -e
    if [ "$#" -ne 1 ]; then
        echo "PROJECT_DIR not specified"
        echo "Usage : `basename $0` <PROJECT_DIR>"
        exit 1
    GEMS_ROOT=<root directory of gems repo>
    echo "Building gems in PROJECT_DIR $PROJECT_DIR"
    # Check that PROJECT_DIR is in the path relative to GEMS_ROOT
    if ! [ -d "$GEMS_ROOT/$PROJECT_DIR" ]; then
      echo "Error: PROJECT_DIR does not exist"
      exit 1
    ## Go to Gem project
    # Create a build number
    # year.month.day.hour.minute.second.sha1
    export GEM_BUILD=$(echo -n $(date +%Y.%m.%d.%H.%M.%S).;echo $(git rev-parse --short=7 HEAD))

    # Find the gemspec to build
    GEM_SPEC=$(find $GEMS_ROOT/$PROJECT_DIR -type f -name *.gemspec)
    echo "Building gem from gemspec $GEM_SPEC"
    # Bundle, run tests, and build gem
    bundle install
    bundle exec rake test --trace
    gem build $GEM_SPEC
    # Find target gem. Prune search to exclude vendor
    TARGET_GEM=$(find $WB_ROOT/$PROJECT_DIR -type f -not -path "*vendor/*" -name *.gem)
    echo "Uploading gem $TARGET_GEM to gem server"
    # Deploy (updating the Gem server index is left to another job)
    s3cmd put $TARGET_GEM s3://my-gems-s3-bucket

One thing to note about the build process is that it takes care of build versioning automatically for us. The way we handle this in each of our gems is:

    # my_new_new/lib/my_new_gem/version.rb
    module MyNewGem
      MAJOR = "0"
      MINOR = "2"
      BUILD = ENV["GEM_BUILD"] || "DEV"
      VERSION = [MAJOR, MINOR, BUILD].join(".")

Another thing to keep in mind is we always want to run our tests (you are writing tests, right?). This depends on our gem having a rake task named test. This is fairly simple:

# Rakefile
require 'bundler/gem_tasks'
require 'rake/testtask'

Rake::TestTask.new do |t|
    t.libs << 'spec'
    t.test_files = FileList['spec/**/*_spec.rb']
    t.verbose = true

desc 'Run tests'
task :default => :test
Update gem server

The last step of our build job is to push the new *.gem file to the server. We now need to update the set of static files Bundler uses to retrieve the available gems from the server. This is fairly simple. Here’s the script for that job:

    #!/usr/bin/env bash

    # Updates the index of available Gems on the S3 gem server
    # Exit 1 if any command fails
    set -e
    # CD to Gem project directory
    cd $GEM_ROOT/applications/gems
    # Create a placeholder directory
    export GEM_SERVER_DIR=./gemserver
    if [ ! -d $GEM_SERVER_DIR ]
      mkdir -p $GEM_SERVER_DIR/gems
      mkdir $GEM_SERVER_DIR/quick
    # Install any dependencies in the Gems project
    # We require the builder gem
    # See: http://rubygems.org/gems/builder
    bundle install
    # Get existing files from Gem server
    s3cmd get --recursive s3://gem-server/ $GEM_SERVER_DIR
    # Generate new static files
    # See: http://guides.rubygems.org/command-reference/#gem_generate_index
    # "Update modern indexes with gems added since the last update"
    gem generate_index --directory $GEM_SERVER_DIR
    # Sync the index files to the server. No need to sync anything in gems/ 
    s3cmd sync . --exclude 'gems/*' s3://my-gem-server


And, that’s it! We have automated our build/deploy proces for our own, internal Rubygems. Feel free to reach out to me with any questions.

I also highly recommend learning how Bundler and Rubygems work.

Tagged with: , ,
Posted in Engineering