A Summary of Kyle Kingsbury’s Distributed Systems Workshop

Last week The Climate Corporation invited Kyle Kingsbury — inventor of Jepsen and distributed-systems-expert-about-town — to lead a workshop on distributed computing theory and practice at our San Francisco office.

Kyle’s work focuses on the necessity of theory to distributed computing practice — these systems are almost impossible to get right on the first try, and you need all the theoretical assistance you can get. You can see empirical application of distributed computing theory to real-world systems in his series of Jepsen blog posts.

This post will follow a rough outline of Kyle’s workshop, starting with theoretical definitions and concepts, and ending with applying those definitions and concepts to discuss and evaluate distributed computing practice. Enjoy!

What is a Distributed System?

Kyle began the discussion with the most fundamental question: what does it mean when we call a system “distributed”? His definition — any system made up of parts that interact slowly or unreliably — is familiar, but by noting that the definition of “slow” is highly dependent on context, he opens up the “distributed” label to a wide range of systems that don’t immediately appear to be distributed. This could include:

  • NUMA architectures
  • making plans via SMS
  • point-of-sale terminals
  • mobile applications
  • communicating threads or processes on a single computer
  • and of course, computers communicating via TCP/UDP over a LAN or the internet.

If the inter-part communication is, say, an order of magnitude slower than the intra-part communication, then you have a distributed system.

Asynchronous Networks

Since all networks in the real world take time to propagate information (and that information can be arbitrarily delayed or lost), understanding timing (and uncertainty of timing) is crucial to being able to reason about the behavior of the system. An important tool Kyle introduced for that purpose is the light cone diagram (by analogy to relativity and spacetime).

Two threads accessing a shared variable.

Because the operations take time to execute (and return a result), the ordering of those operations is ambiguous in this example. Thread 2 can know the time it initiated the read and the time it received the result, but it can’t know the time the read operation was applied on the memory — it could return either the old or new value.

Clocks

We want to be able to define an ordering for events. However, keeping track of time in a distributed system is difficult. There are a number of types of clocks used for ordering events in a distributed system.

Wall clocks. Nodes track “actual” (human) time, usually by synchronizing with an NTP (Network Time Protocol) server. In practice, though, synchronization is rarely precise enough: hardware clocks drift, sometimes by centuries, and POSIX time is not monotonic to begin with.
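
To see why monotonicity matters, here is a minimal Python sketch (mine, not from the workshop): a duration measured with the wall clock can be corrupted when NTP steps the clock, while a monotonic clock only moves forward.

    import time

    def duration_wall(work):
        # Wall-clock time: NTP (or an operator) can step this backwards,
        # so the measured duration can come out negative or wildly wrong.
        start = time.time()
        work()
        return time.time() - start

    def duration_monotonic(work):
        # Monotonic time only moves forward; it is safe for measuring
        # durations, but says nothing about "human" time.
        start = time.monotonic()
        work()
        return time.monotonic() - start

    print(duration_monotonic(lambda: sum(range(1_000_000))))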

Lamport clocks. Each node has a counter that it increments with each state transition. The counter is included with every message, and on receiving a message a node sets its counter to the maximum of its own value and the one in the message. This gives a partial ordering of events: if event b is causally dependent on event a, then C(a) < C(b).
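
A minimal sketch of that discipline in Python (my illustration, using the common formulation that also advances the counter past the received value):

    class LamportClock:
        def __init__(self):
            self.counter = 0

        def tick(self):
            # Called on every local state transition.
            self.counter += 1
            return self.counter

        def send(self):
            # Stamp an outgoing message with the current counter.
            return self.tick()

        def receive(self, remote_counter):
            # Take the larger counter, then advance past it, so the receive
            # event is ordered after the corresponding send event.
            self.counter = max(self.counter, remote_counter) + 1
            return self.counter

If a is the send of a message and b is its receipt, this guarantees C(a) < C(b), which is exactly the causal ordering described above.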

Vector clocks. Vector clocks generalize Lamport clocks. The clock is a vector of counters, one for each node. When executing an operation, a node increments its own counter in the vector, and again includes the clock with every message to other nodes. When receiving a message, a node takes the pairwise maximum of its clock and the one in the message. Vector clocks order more events than Lamport clocks.
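
A sketch of the merge and comparison rules (Python; representing the clock as a dict of node name to counter is my choice):

    def increment(clock, node):
        # A vector clock is a map of node -> counter.
        clock = dict(clock)
        clock[node] = clock.get(node, 0) + 1
        return clock

    def merge(a, b):
        # Pairwise maximum, applied when a message carrying clock b arrives.
        return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

    def happened_before(a, b):
        # a -> b iff every component of a is <= b and at least one is strictly less.
        nodes = set(a) | set(b)
        return (all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
                and any(a.get(n, 0) < b.get(n, 0) for n in nodes))

Two clocks for which happened_before is false in both directions describe concurrent events, which is how vector clocks surface conflicting updates.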

Dotted version vectors. “Better vector clocks.” Still a partial ordering, but orders more events.

GPS & Atomic clocks. Much better than NTP. They give a globally distributed total order with uncertainty on the scale of milliseconds, which means you can perform one operation (that’s allowed to conflict) per uncertainty window. But they’re really expensive.

Consistency Models

As software developers, we expect the systems we program against to behave correctly. But there are many possible definitions of correct behavior, which we call consistency models.

Each consistency model defines which sequences of operations on a state machine are valid. In the diagram below, a parent-child relationship between two models means any valid sequence of operations in the parent is also valid in the child. That is, requirements get stricter as you move up the tree, which makes it easier to reason about the system in question, but incurs performance and availability penalties.

A hierarchy of consistency models.

Kyle himself has an excellent write-up of these consistency models on his blog, but I’ll give a brief description of a few here.

Linearizability – There appears to be a single global state where operations are applied linearly.

Sequential consistency – Writes appear in the same order to all readers. (But a read may not reflect the “current” state.)

Causal consistency – Only the order of *causally related* operations is linear. Unrelated (concurrent) operations may appear in any order to readers.

Linearizability/Strong serializability is the “gold standard” of consistency models. But there are performance and availability tradeoffs as you move up the tree: it is impossible to have total availability of a linearizable (or even sequentially consistent) system.

Convergent Replicated Data Types

CRDTs are a class of eventually consistent data types that work by supporting an idempotent, commutative and associative merge operation. The operation merges the histories of two (diverging) versions of the data structure. Some thorough and accessible examples of CRDTs can be found in Kyle’s project here.
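
The simplest example is a grow-only counter; a Python sketch (mine, not from Kyle's project) shows why the merge converges:

    class GCounter:
        """Grow-only counter CRDT: one slot per node, merged by pairwise max."""

        def __init__(self, node_id):
            self.node_id = node_id
            self.counts = {}

        def increment(self, amount=1):
            self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

        def value(self):
            return sum(self.counts.values())

        def merge(self, other):
            # Pairwise max is idempotent, commutative and associative, so
            # replicas can exchange state in any order, any number of times,
            # and still converge on the same value.
            for node, count in other.counts.items():
                self.counts[node] = max(self.counts.get(node, 0), count)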

Building Consensus

To achieve stricter consistency models in a distributed system, it is necessary to achieve consensus on certain facts among a majority of nodes. The Paxos algorithm is the notoriously difficult to understand gold standard of consensus algorithms, first described by Leslie Lamport in his paper The Part-Time Parliament (published in 1998).

Since its publication, it has inspired a family of similar algorithms with optimizations for performance or accessibility, including Multi-Paxos, Fast Paxos and Generalized Paxos.

In Search of an Understandable Consensus Algorithm (Ongaro & Ousterhout, 2014) is a recent publication describing the Raft consensus algorithm that, like it says on the tin, is intended to be (more) easily understood than Paxos, while retaining similar guarantees and performance characteristics.

A Pattern Language

Kyle’s number one piece of advice for building a distributed system is don’t do it if you don’t have to. Maybe your problem can be solved by getting a bigger box, or tolerating some downtime.

Number two is to use someone else’s distributed system. Pay Amazon to do it, or at least use battle-tested software.

Number three is don’t fail. You can build really reliable hardware and networks at the cost of moving slowly, buying expensive hardware and hiring really good talent.

But, if your system runs long enough, it will eventually fail, so accept it gracefully.

Take frequent backups. Done correctly, they give you sequential consistency. When taking a backup of a database, understand its transactional semantics. (For example, copying the filesystem state of a MySQL database may not capture a consistent transactional state, and the filesystem may change during the backup. It’s better to use mysqldump, which preserves database semantics.)

Use immutable values when possible. Data that never changes is trivial to store because it does not require coordination.

If you require mutable values, instead use mutable identities (pointers to immutable values). You can fit a huge number of mutable pointers in a linearizable RDBMS while the immutable values themselves live in S3 or other eventually consistent stores. This is similar to how Datomic works.
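
A toy sketch of the pattern in Python, with plain dicts standing in for the linearizable pointer store and the eventually consistent blob store:

    import hashlib
    import json

    blob_store = {}     # stands in for S3: immutable, content-addressed values
    pointer_store = {}  # stands in for a small table in a linearizable RDBMS

    def put_value(value):
        # Immutable values are stored under their content hash; writing the
        # same value twice is harmless, so no coordination is needed here.
        blob = json.dumps(value, sort_keys=True).encode()
        key = hashlib.sha256(blob).hexdigest()
        blob_store[key] = blob
        return key

    def update(identity, value):
        # Only this tiny pointer write needs to be linearizable.
        pointer_store[identity] = put_value(value)

    def read(identity):
        return json.loads(blob_store[pointer_store[identity]])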

Some operations can return fractional results or tolerate fractional availability. (E.g., search results, video or voice call quality, analytics, map tiles.) This lets you gracefully degrade the app experience while trying to recover, rather than suddenly crashing. Feature flags that turn off application features in response to system health metrics can be used to the same end.
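
A sketch of what that degradation can look like in code (Python; the flag store and the two code paths are hypothetical stand-ins):

    import time

    FEATURE_FLAGS = {"search.full_results": True}  # flipped by ops or health checks

    def full_search(query):
        # Stand-in for the expensive, full-fidelity path.
        time.sleep(0.01)
        return [f"{query}: full result {i}" for i in range(10)]

    def cached_search(query):
        # Stand-in for a cheap, degraded path (stale cache, fewer results).
        return [f"{query}: cached result {i}" for i in range(3)]

    def search(query):
        if not FEATURE_FLAGS["search.full_results"]:
            return cached_search(query)
        try:
            return full_search(query)
        except Exception:
            # Degrade rather than fail outright.
            return cached_search(query)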

Service owners should write client libraries for their services, and the libraries should include mock I/O to allow consumers to test locally without having to set up (and keep in sync) a local instance of the dependency. Include the library version in the request headers so you can track uptake of new versions and go after the slowpokes.
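
A sketch of that shape of client library (Python; the service, header name and transports are illustrative assumptions, not a real library):

    import json
    import urllib.request

    CLIENT_VERSION = "1.4.2"

    class MockTransport:
        """Canned responses so consumers can test without a running service."""
        def request(self, method, path, headers):
            return {"status": 200, "body": {"mocked": True, "path": path}}

    class HttpTransport:
        def __init__(self, base_url):
            self.base_url = base_url

        def request(self, method, path, headers):
            req = urllib.request.Request(self.base_url + path, method=method, headers=headers)
            with urllib.request.urlopen(req) as resp:
                return {"status": resp.status, "body": json.loads(resp.read())}

    class WidgetClient:
        def __init__(self, transport):
            self.transport = transport

        def get_widget(self, widget_id):
            # The version header lets the service owner track client uptake.
            headers = {"X-Client-Version": CLIENT_VERSION}
            return self.transport.request("GET", f"/widgets/{widget_id}", headers)

Tests construct WidgetClient(MockTransport()); production wires in HttpTransport with the real base URL.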

Particularly important when moving from monolithic to distributed architectures is to test everything, all the time. This includes correctness invariants of course, but also performance invariants. Simulate network failures, performance test against large data sets. Test against production data sets since they are always different from test data.

Always bound the size of queues in your application, or you can get in a situation where work takes arbitrarily long to complete. Monitor enqueue and dequeue rates on the queue (throughput) as well as the time it takes a message to be worked after being enqueued (latency).
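
Both practices fit in a few lines with Python's standard library (a sketch of the idea, not production code):

    import queue
    import threading
    import time

    work_queue = queue.Queue(maxsize=1000)  # bounded: back-pressure instead of unbounded growth
    counters = {"enqueued": 0, "dequeued": 0}
    lock = threading.Lock()

    def produce(payload):
        # A full queue raises queue.Full after the timeout, surfacing overload
        # to the caller instead of silently accumulating work.
        work_queue.put((time.monotonic(), payload), timeout=1.0)
        with lock:
            counters["enqueued"] += 1

    def consume():
        enqueue_time, payload = work_queue.get()
        latency = time.monotonic() - enqueue_time  # time the message waited in the queue
        with lock:
            counters["dequeued"] += 1
        return payload, latency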

Acknowledgements

Thanks to Sebastian Galkin for organizing the event, Dylan Pittman for his notes and diagrams, and of course, Kyle Kingsbury for his time, energy and expertise.


Multidimensional Arrays with Mandoline

At the Climate Corporation, we have a great demand for storing large amounts of raster-based data, and an even greater demand to retrieve small amounts of it quickly.  Mandoline is our distributed, immutable, versioned database for storing big multidimensional data.  We use it for storing weather data, elevation data, satellite imagery, and other kinds of data.  It is one of the core systems that we use in production.

https://github.com/TheClimateCorporation/mandoline

What can Mandoline do for me?

Mandoline can store your multidimensional array data in a versionable way that doesn’t bloat your storage.  When you don’t know what your query pattern is, or when you want to preserve past versions of your data, Mandoline may be the solution for you.

Some more details

  • It’s a clojure library.
  • Using the Mandoline library, we have built services both to access and to ingest data in a distributed fashion.  The reason for doing it this way is so that our scientists have easy, language-agnostic access to the data, and so that it plays well with existing scientific tools, like netCDF libraries and applications in any language.
  • When we say distributed, we mean that you can have distributed reads and writes from different machines to your dataset.
  • Mandoline uses swappable backends, and can save actual data to different backends.  In production, it currently runs on Amazon’s DynamoDB.  For testing purposes, we can use either the SQLite or in-memory backends.
  • We want to expand the backend offerings to other databases like Cassandra and HBase.
  • Mandoline takes advantage of shared structure to make immutability and versions possible.

For more of an introduction to Mandoline (formerly known as Doc Brown), here is a video of the talk I gave at Clojure/West 2014.

 

Acknowledgements

Mandoline was primarily written by Brian Davis, Alice Liang, and Sebastian Galkin.  Brian Davis and Steve Kim have been the main push to open source Mandoline for use with the general public.


Service Oriented Authorization – Part 3

As a follow-up to Part 1 and Part 2, here is the video of the talk I gave at RailsConf last month.

Disclaimer: This is my first conference talk, ever!

Slides are available on Speakerdeck: https://speakerdeck.com/lumberj/authorization-in-a-service-oriented-environment


Rake tasks from a Rubygem

While recently working on IronHide, specifically the CouchDB Adapter, I needed a way to have the gem expose custom Rake tasks to an end user’s application.

For example, to make working with the IronHide CouchDB Adapter easier, I wanted to allow users to upload existing rules from disk to a remote CouchDB database.

The task could be something as simple as:

$ bundle exec rake iron_hide:load_rules\['/absolute/path/file.json','http://127.0.0.1:5984/rules'\]

where the two arguments are the path to the JSON file and the remote CouchDB database.

BUNDLER

After thinking for a moment, I knew that I’d come across at least one gem that, when included, exposes Rake tasks I can use in my project: Bundler. So, I started looking through the source and found this: https://github.com/bundler/bundler/blob/master/lib/bundler/gem_helper.rb

Essentially, you’d like an application owner to be able to “install” a few Rake tasks at their convenience. Borrowing from the Bundler project:

# iron_hide/couchdb_tasks.rb
module IronHide
  class CouchDbTasks
    include Rake::DSL if defined? Rake::DSL

    def install_tasks
      namespace :iron_hide do
        desc 'Load rules from local JSON file to remote CouchDB server'
        task :load_rules, :file, :db_server do |t, args|
          ...
        end
      end
    end
  end
end

IronHide::CouchDbTasks.new.install_tasks

Now, in my application’s Rakefile I just need to add this line:

# Rakefile
require 'iron_hide/couchdb_tasks'

Et voilà!

$ bundle exec rake -T
#=> rake iron_hide:load_rules[file,db_server]  # Load rules from local JSON

Indexing Polygons in Lucene with Accuracy

Today we share a post by Lucene Spatial contributor and guest author David Smiley.  David explains storing and querying polygon shapes in Lucene, including some brand new features to improve accuracy of spatial queries that The Climate Corporation helped get into the latest version.

Apache Lucene is a Java toolkit that provides a rich set of search capabilities like keyword search, query suggesters, relevancy, and faceting. It also has a spatial module for searching and sorting with geometric data using either a flat-plane model or a spherical model. For the most part, the capabilities therein are leveraged to varying degrees by Apache Solr and ElasticSearch–the two leading search servers based on Lucene.

Advanced spatial

The most basic and essential spatial capability is the ability to index points specified using latitude and longitude, and then be able to search for them based on a maximum distance from a given query point—often a user’s location, or the center of a map’s viewport. Most information-retrieval systems (e.g. databases) support this.

Many relational databases are mature and have advanced spatial support (PostgreSQL with PostGIS, for example). I loosely define advanced spatial as the ability to both index polygons and query for them by another polygon. Outside of relational databases, very few information-retrieval systems have advanced spatial support. Some notable members of the advanced spatial club are Lucene, Solr, Elasticsearch, MongoDB and Apache CouchDB. In this article we’ll be reviewing advanced spatial in Lucene.

How it works…

This article describes how polygon indexing & search works in some conceptual detail, with pointers to specific classes. If you want to know literally how to use these features, you should first be familiar with Lucene itself; find a basic tutorial if you aren’t. Then read the documentation for the Lucene-spatial API, and review the SpatialExample.java source, which shows some basic usage. Next, read the API Javadocs referenced within this article for the relevant classes. If instead you want to know how to use it from Solr and ElasticSearch, this isn’t the right article, not to mention that the new accurate indexing addition isn’t yet hooked into those search servers as of this writing.

Grid indexing with PrefixTrees

From Lucene 4 through 4.6, the only way to index polygons is to use Lucene-spatial’s PrefixTreeStrategy, which is an abstract implementation of SpatialStrategy — the base abstraction for how to index and search spatially. It has two subclasses, TermQueryPrefixTreeStrategy and RecursivePrefixTreeStrategy (RPT for short) — I always recommend the latter. The indexing scheme is fundamentally based on a model in which the world is divided into grid squares (AKA cells), and is done so recursively to get almost any amount of accuracy required. Each cell is indexed in Lucene with a byte string that has the parent cell’s byte string as a prefix — hence the name PrefixTree.

The PrefixTreeStrategy uses a SpatialPrefixTree abstraction that decides how to divide the world into grid-squares and what the byte encoding looks like to represent each grid square. There are two implementations:

  • geohash: This is strictly for latitude & longitude based coordinates. Each cell is divided into 32 smaller cells in an 8×4 or 4×8 alternating grid.
  • quad: Works with any configurable range of numbers. Each cell is divided into 4 smaller cells in a straightforward 2×2 grid.

We’ll be using a geohash based PrefixTree for the examples in this article. For further information about geohashes, including a convenient table of geohash length versus accuracy, see Wikipedia.

Indexing points

Let’s now see what the indexed bytes look like in the geohash SpatialPrefixTree grid for indexing a point. We’ll use Boston, Massachusetts: latitude 42.358, longitude -71.059. If we go to level 5, then we have a grid cell that covers the entire area ± 2.4 kilometers from its center (about 19 square kilometers in this case):

DRT2Y cell

This will index as the following 5 “terms” in the underlying Lucene index:

D, DR, DRT, DRT2, DRT2Y

Terms are the atomic byte sequences that Lucene indexes on a per-field basis, and they map to lists of matching documents (i.e. records); in this case, a document representing Boston.

If we attempt to index any other hypothetical point that also falls within the cell’s boundary, it will be indexed using the same terms, effectively coalescing the two points into the same terms for search purposes. Note that the original numeric literals are returned in search results since they are internally stored separately.

To get more search accuracy we need to take the PrefixTree to more levels. This easily scales: there aren’t that many terms to index per point. If you want accuracy of about a meter then level 11 will do it, translating to 11 indexed terms. The search algorithms to efficiently search grid squares indexed in this way are interesting but it’s not the focus of this article.
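
Generating the terms for a point is as simple as taking every prefix of the cell's geohash; a quick Python sketch (not Lucene's actual code):

    def prefix_terms(geohash):
        """Terms indexed for a point cell, e.g. 'DRT2Y' ->
        ['D', 'DR', 'DRT', 'DRT2', 'DRT2Y']."""
        return [geohash[:i] for i in range(1, len(geohash) + 1)]

    print(prefix_terms("DRT2Y"))  # the Boston example above, at level 5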

Indexing other shapes

The following screen-captures depict the leaf cells indexed for a polygon of Massachusetts to a grid level of 6 (±0.61km from center):

Massachusetts, gridded; a zoomed-in view of the same grid.

The biggest cells here are at level 4, and the smallest are at 6. A leaf cell is a cell that is indexed at no smaller resolution, either because it doesn’t cross the shape’s edge, or it does but the accuracy is capped (configurable). There are 5,208 leaf cells here. Furthermore, when the shape is indexed, all the parent (non-leaf) cells get indexed too which adds another 339 cells that go all the way up to coarsest cell “D”. The current indexing scheme calls for leaf cells to be indexed both with and without a leaf marker — an additional special ‘+’ byte. The grand total number of terms to index is 10,755 terms. That’s a lot! Needless to say, we won’t list them here.

(Note: the extra indexed term per leaf will get removed in the future, once the search algorithms are adapted for this change. See LUCENE-4942.)

Theoretically, to get more accuracy we “just” need to go to deeper grid levels, just as we do for points. But unless the shape is small relative to the desired accuracy, this doesn’t scale; it’s simply infeasible with this indexing approach to represent a region covering more than a few square kilometers to meter-level accuracy. Each additional PrefixTree level multiplies the number of indexed terms compared to the previous level. The Massachusetts example went to level 6 and produced 5,208 leaf cells, but if it had stopped at level 5 there would have been only 463 leaf cells. With each additional level, there will be on average 32 / 2 = 16 times as many cells (for geohash) as at the previous level. For a quad tree it’s 4 / 2 = 2 times as many, but it also takes more than one added quad-tree level to achieve the same desired accuracy, so the two work out to be roughly equivalent.

Summary: unlike a point shape, where the number of terms grows linearly with the number of levels, for polygon shapes the number of terms grows exponentially with the number of levels. This clearly doesn’t scale for high-accuracy requirements.

Serialized shapes

Lucene 4.7 introduced a new Lucene-spatial SpatialStrategy called SerializedDVStrategy (SDV for short). It serializes each document’s vector geometry into a part of the Lucene index called DocValues. At query time, candidate results are verified as true or false matches by retrieving the geometry by doc-id on demand, deserializing it, and comparing it to the query shape.

As of this writing, there aren’t any adapters for ElasticSearch or Solr yet. The Solr feature request is tracked as SOLR-5728, and is probably going to arrive in Solr 4.9. The adapters will likely maintain in-memory caches of deserialized shapes to speed up execution.

Using both strategies

It’s critical to understand that SerializedDVStrategy is not a PrefixTree index; used alone, it will brute-force through all documents, deserializing each geometry and comparing it to the query shape (O(N) complexity), and spatial performance will be terrible.

In order to minimize the number of documents that SerializedDVStrategy has to see, you should put a faster SpatialStrategy like RecursivePrefixTreeStrategy in front of it. And you can now safely dial down the configurable accuracy of RPT to return more false positives, since SDV will filter them out. This is done with the distErrPct option, which is roughly a fraction of a shape’s approximate radius; it defaults to 0.025 (2.5%). Given the Massachusetts polygon, RPT arrived at level 6. With distErrPct changed to 0.1 (10%), my recommendation when used with SDV, the algorithm chose level 5 for Massachusetts, which had about 10% as many cells as level 6.
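
Conceptually this is the classic filter-then-refine pattern; a small self-contained Python sketch (axis-aligned boxes standing in for real geometries, not the Lucene API):

    class Rect:
        """Toy shape: an axis-aligned box."""
        def __init__(self, x0, y0, x1, y1):
            self.x0, self.y0, self.x1, self.y1 = x0, y0, x1, y1

        def intersects(self, o):
            return not (o.x1 < self.x0 or o.x0 > self.x1 or
                        o.y1 < self.y0 or o.y0 > self.y1)

        def expand(self, pad):
            # The padded box plays the role of the coarse gridded approximation.
            return Rect(self.x0 - pad, self.y0 - pad, self.x1 + pad, self.y1 + pad)

    def search(query, docs, pad=1.0):
        # Stage 1 (RPT's role): cheap approximate filter, may admit false positives.
        candidates = [d for d, shape in docs.items()
                      if shape.expand(pad).intersects(query)]
        # Stage 2 (SDV's role): exact check against the stored geometry.
        return [d for d in candidates if docs[d].intersects(query)]

    docs = {"a": Rect(0, 0, 1, 1), "b": Rect(5, 5, 6, 6)}
    print(search(Rect(0.5, 0.5, 2, 2), docs))  # => ['a']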

Performance and accuracy — the holy grail

What I’m most excited about is the prospect of further enhancing the PrefixTree technique so that it can differentiate between documents that are guaranteed matches and those that need to be verified against the serialized geometry. Consider the following shape, a circle, acting as the query shape:

Gridded circle

If any grid cell that isn’t on the edge matches a document, then the PrefixTree guarantees it’s a match; there’s no point in checking such documents against SerializedDVStrategy (SDV). This insight should lead to a tremendous performance benefit, since only a small fraction of matching documents, often none, will need to be checked against SDV. Performance and accuracy! This feature is being tracked with LUCENE-5579.

One final note about accuracy — it is only as accurate as the underlying geometry. If you’ve got a polygon representation of an area with only 20 vertices, and another with 1,000 vertices for the same area, the one with more vertices is likely to be a more precise reflection of the underlying border. Secondly, you might want to work in a projected 2D coordinate system instead of latitude & longitude. Lucene-spatial / Spatial4j / JTS doesn’t help in this regard; it’s up to your application to convert the coordinates with something such as Proj4j. The main shortcoming of this approach is that a point-radius (circle) query is going to be in 2D, even if you wanted the geodesic (surface of a sphere) behavior you normally get when using latitudes & longitudes. But if you are projecting, the difference between the two due to distortion is generally fairly limited, provided the radius isn’t large and the area isn’t near the poles.

Acknowledgements

Thanks to the Climate Corporation for supporting my recent improvements for accurate spatial queries!


Self Hosted Gem Server with Jenkins and S3

On my team, we like to be able to keep our applications light. We use several internal libraries to manage things like distributed request tracing, authorization, configuration management, etc.

Isolating this reusable, generic code into a library keeps our application code concise and manageable. It also allows us to test changes to the library in isolation (not to mention keeping our tests fast).

The server

Most of these gems are very specific, and it wouldn’t make much sense to make them public. So, we decided the best approach was to use our own, internally accessible gem server.

A gem server can really just be a set of static files — nothing fancy.

    # http://guides.rubygems.org/command-reference/#gem_generate_index
    # Given a base directory of *.gem files, generate the index files for a gem server directory
    
    $ gem generate_index --directory=GEMS_DIR

And just like that, we’ve generated the static files we need for our gem server.

Using S3

We use S3 for plenty of things here at Climate. We’d rather not have another server to have to support, and S3 is a perfectly fine place to host static files. So, we just turn an S3 bucket into an internally accessible endpoint (i.e., through our internal DNS routing on our network).

Amazon has instructions for setting up a bucket as a static file server.

Tying it all together: Automated testing and build

Now we have our S3 bucket set up to behave like a static file server. This is the workflow we want:

  1. Make some changes. Commit. Push. Pull request.
  2. Pull request and code review
  3. Merge pull request
  4. Manually trigger a build of the new gem (which automatically runs tests)
  4a. If tests pass, deploy the packaged gem to our gem server
  5. Run bundle update MY_NEW_GEM! to update our project

Jenkins

We use Jenkins to automate our builds here at Climate. Depending on the git server you use (we’re in the process of migrating to Atlassian Stash), Jenkins integration is a bit different. I won’t talk too much about the Jenkins configuration specifics, but this is basically what happens:

We have two Jenkins jobs: the first job is the build and the second one is the update server job. The reason behind this is to allow concurrent builds. When the first job completes, it triggers the second.

We run each of these jobs in a Linux container. Docker is a nice project designed to make that process easy. In our containers we make sure we have ruby and s3cmd in the path.

Build

Our build job is a parametrized Jenkins build. We pass in a parameter, which is the PROJECT_DIR, or the directory relative to some root where Jenkins can find the gem we want to build. We keep all our gems in the same repo for simplicity.

Jenkins will check out the gems repo and build the specified gem. Invoking the build script that Jenkins executes is essentially equivalent to:

    $ ./build.sh ~/gems_repo/my_new_gem

The script itself:

    #!/usr/bin/env bash
    
    # Exit 1 if any command fails
    set -e
    
    if [ "$#" -ne 1 ]; then
        echo "PROJECT_DIR not specified"
        echo "Usage : `basename $0` <PROJECT_DIR>"
        exit 1
    fi
    
    PROJECT_DIR=$1
    GEMS_ROOT=<root directory of gems repo>
    
    echo "Building gems in PROJECT_DIR $PROJECT_DIR"
    
    # Check that PROJECT_DIR is in the path relative to GEMS_ROOT
    if ! [ -d "$GEMS_ROOT/$PROJECT_DIR" ]; then
      echo "Error: PROJECT_DIR does not exist"
      exit 1
    fi
    
    ## Go to Gem project
    cd $GEMS_ROOT/$PROJECT_DIR
    
    # Create a build number
    # year.month.day.hour.minute.second.sha1
    export GEM_BUILD=$(echo -n $(date +%Y.%m.%d.%H.%M.%S).;echo $(git rev-parse --short=7 HEAD))
    echo "GEM_BUILD $GEM_BUILD"

    # Find the gemspec to build
    GEM_SPEC=$(find $GEMS_ROOT/$PROJECT_DIR -type f -name "*.gemspec")
    echo "Building gem from gemspec $GEM_SPEC"
    
    # Bundle, run tests, and build gem
    bundle install
    bundle exec rake test --trace
    gem build $GEM_SPEC
    
    # Find target gem. Prune search to exclude vendor
    TARGET_GEM=$(find $GEMS_ROOT/$PROJECT_DIR -type f -not -path "*vendor/*" -name "*.gem")
    
    echo "Uploading gem $TARGET_GEM to gem server"
    # Deploy (updating the Gem server index is left to another job)
    s3cmd put $TARGET_GEM s3://my-gem-server

One thing to note about the build process is that it takes care of build versioning automatically for us. The way we handle this in each of our gems is:

    # my_new_gem/lib/my_new_gem/version.rb
    
    module MyNewGem
      MAJOR = "0"
      MINOR = "2"
      BUILD = ENV["GEM_BUILD"] || "DEV"
      VERSION = [MAJOR, MINOR, BUILD].join(".")
    end

Another thing to keep in mind is we always want to run our tests (you are writing tests, right?). This depends on our gem having a rake task named test. This is fairly simple:

# Rakefile
require 'bundler/gem_tasks'
require 'rake/testtask'

Rake::TestTask.new do |t|
    t.libs << 'spec'
    t.test_files = FileList['spec/**/*_spec.rb']
    t.verbose = true
end

desc 'Run tests'
task :default => :test

Update gem server

The last step of our build job is to push the new *.gem file to the server. We now need to update the set of static files Bundler uses to retrieve the available gems from the server. This is fairly simple. Here’s the script for that job:

    #!/usr/bin/env bash

    # Updates the index of available Gems on the S3 gem server
    
    # Exit 1 if any command fails
    set -e
    
    # CD to Gem project directory
    cd $GEM_ROOT/applications/gems
    
    # Create a placeholder directory
    export GEM_SERVER_DIR=./gemserver
    if [ ! -d $GEM_SERVER_DIR ]
      then
      mkdir -p $GEM_SERVER_DIR/gems
      mkdir $GEM_SERVER_DIR/quick
    fi
    
    # Install any dependencies in the Gems project
    # We require the builder gem
    # See: http://rubygems.org/gems/builder
    bundle install
    
    # Get existing files from Gem server
    s3cmd get --recursive s3://my-gem-server/ $GEM_SERVER_DIR
    
    # Generate new static files
    # See: http://guides.rubygems.org/command-reference/#gem_generate_index
    # "Update modern indexes with gems added since the last update"
    gem generate_index --directory $GEM_SERVER_DIR
    
    # Sync the index files to the server. No need to sync anything in gems/ 
    cd $GEM_SERVER_DIR
    s3cmd sync . --exclude 'gems/*' s3://my-gem-server

Conclusion

And, that’s it! We have automated our build/deploy process for our own internal Rubygems. Feel free to reach out to me with any questions.

I also highly recommend learning how Bundler and Rubygems work.


Astro-Algo: Astronomical Algorithms for Clojure

At The Climate Corporation, we have “sprintbaticals”, two-week projects where we can work on something a bit different. This post is about work done by Matt Eckerle during his recent sprintbatical.

Sometimes I want to know where the sun will be at a certain time. I don’t want to know approximately where the sun will be, in a close-enough-for-a-Facebook-app kind of way. I want to know precisely where the sun will be, in a which-picnic-table-at-the-beer-garden-will-be-in-the-sun-at-exactly-5pm-on-my-34th-birthday kind of way. Or, more to the point of my employment at The Climate Corporation, what zenith angle should I use to correct radiation measurements taken at location x and time t? (It turns out that sunlight is important for crop growth. Who knew?) For this sort of endeavor, I keep a copy of Jean Meeus’ Astronomical Algorithms at my desk at all times.

Using a book on astronomy to compute solar position may sound like overkill, but using various solar calculators over the years has given me a lot of perspective on what is considered “accurate” by non-astronomers. Even the NOAA solar calculator, which uses methods described in AA, doesn’t bother to consider the difference between dynamical time and universal time. This discrepancy of about 60 seconds seems to be obscured by the fact that they round output times to the nearest minute. And NOAA’s solar calculator is much, much, much better than most solar calculators out there.

Introducing Astro-Algo, Astronomical Algorithms for Clojure

Naturally, I wanted a better solar calculator to use with Clojure. So I wrote one in pure Clojure. And it seemed silly to write just a solar calculator library when I had a whole astronomy book at my disposal, so I wrote a library for astronomy and implemented the Sun as its first celestial body. I hope to get more bodies (starting with the Moon) into the library eventually, but here it is. To use Astro-Algo, add the following dependency to your Leiningen project: [astro-algo "0.1.2"]

The fine print

Astro-Algo is a Clojure library designed to implement computational methods described in Astronomical Algorithms by Jean Meeus. In this release, the following is complete:

  • date utils with reduction of time scales table from 1890 to 2016
  • celestial Body protocol to get equatorial coordinates and standard altitude
  • implementation of Body protocol for the Sun
  • function to get local coordinates for a body
  • function to get passages (rising, transit, setting) for a body

The solar position computations implemented in this library are considered low accuracy. They account for some effects of nutation, with simplifications described in AA Ch. 25. These calculations should be accurate to within 0.01 degrees of geometric position; however, larger differences may be observed due to atmospheric refraction, which is not accounted for except through the standard values for refraction at the horizon used in the calculation of passage times. Passages are calculated according to AA Ch. 15 by interpolating the time that the celestial body crosses the local meridian (for transit) or the standard altitude (for rising and setting). Standard altitudes for rising and setting are taken from the celestial body unless passages for twilight are desired, in which case the standard altitude for the specified twilight definition is used.

The core API

com.climate.astro-algo.bodies

protocol Body

equatorial-coordinates

Calculate apparent right ascension and declination of a body.
Input:
  this (Body instance)
  dt (org.joda.time.DateTime UTC)
Returns a map with keys:
  :right-ascension (radians west of Greenwich)
  :declination (radians north of the celestial equator)

English translation: Interface Body has method equatorial-coordinates, which returns the orientation of the Body relative to the earth at any time.

standard-altitude

Returns the "standard" altitude, i.e. the geometric altitude of the
center of the body at the time of apparent rising or setting
Input:
  this (Body instance)
Returns standard altitude (degrees)

English translation: Interface Body has method standard-altitude, which returns the angle above the horizon at which the body is considered to have risen or set. (This can be different for different kinds of celestial bodies.)

local-coordinates

Calculate local coordinates of a celestial body
Input:
  body (Body instance)
  dt (org.joda.time.DateTime UTC)
  lon - longitude (degrees East of Greenwich)
  lat - latitude (degrees North of the Equator)
Returns a map with keys:
  :azimuth (radians westward from the South)
  :altitude (radians above the horizon)

English translation: Function local-coordinates returns the orientation of a Body relative to an observer on earth at a specified longitude and latitude.

passages

Calculate the UTC times of rising, transit and setting of a celestial body
Input:
  body (Body instance)
  date (org.joda.time.LocalDate)
  lon - longitude (degrees East of Greenwich)
  lat - latitude (degrees North of the Equator)
  opts
    :include-twilight (one of "civil" "astronomical" "nautical" default nil)
    :precision - iterate until corrections are below this (degrees, default 0.1)
Returns a map with keys:
  :rising (org.joda.time.DateTime UTC)
  :transit (org.joda.time.DateTime UTC)
  :setting (org.joda.time.DateTime UTC)

English translation: Function passages returns the times that a Body will rise, transit and set given a date, longitude and latitude. You can optionally specify a twilight definition to include, which changes the angle below the horizon used for rising and setting. Optional precision allows you to tune the amount of interpolation before convergence.

Usage examples

(:require
  [com.climate.astro-algo.bodies :refer [local-coordinates passages]]
  [clj-time.core :as ct])
(:import
  [com.climate.astro_algo.bodies Sun])

; Suppose we want to get the cosine of the solar azimuth
; on 10/18/1989 00:04:00 (UTC) in San Francisco 
; (at the moment of the 1989 Loma Prieta earthquake).

; We begin by instantiating a Body (the Sun).
(let [sun (Sun.)]
  ; Now we declare the time/space coordinates we need
  ; for local-coordinates...
  (let [dt (ct/date-time 1989 10 18 0 4 0)
        lon -122.42 ; [degrees east of Greenwich]
        lat 37.77   ; [degrees north of the Equator]
        ; ...and get azimuth and altitude in radians.
        {:keys [azimuth altitude]} (local-coordinates sun dt lon lat)]
    ; To get the cosine of the zenith,
    ; we can take the sin of the altitude.
    (Math/sin altitude)))

; => 0.2585521627566685
; This represents the ratio of the number of photons hitting a horizontal
; surface to the number of photons hitting a surface facing directly at
; the sun. A negative value would indicate that the sun is geometrically
; beneath the horizon.

(let [sun (Sun.)]
  ; get civil daylength on 12/21/2012 at Stonehenge
  (let [date (ct/local-date 2012 12 21)
        lon -1.8262 ; [degrees east of Greenwich]
        lat 51.1788 ; [degrees north of the Equator]
        {:keys [rising transit setting]}
        (passages sun date lon lat
                  ; can be none, civil, astronomical, nautical
                  ; none is default
                  :include-twilight "civil"
                  ; set lower precision to sacrifice some accuracy
                  ; and speed up the computation
                  ; 0.1 is default (degrees)
                  :precision 1.0)]
    ; calculate daylength in hours
    (-> (ct/interval rising setting)
      ct/in-millis ; [milliseconds]
      (/ 3600000.))))  ; [hours]

; => 9.214338611111112
; If we're going to be there all day,
; we're going to need more than one bag of chips.

How does it stack up?

I did a spot check to compare the accuracy of Astro-Algo against NOAA’s calculator and a Java library for sunrise and sunset (called luckycat in the chart below).

Spot-check comparison of Astro-Algo, the NOAA calculator, and luckycat.

As you can see, we are making progress!

How can I learn more?

The complete API is documented at the astro-algo GitHub project.

Can I get involved?

Yes! We love pull requests. If you want to get involved, getting a copy of AA is highly recommended.

Thanks!

Special thanks to Bhaskar (Buro) Mookerji, who reviewed my code and gave great feedback. Big thanks also to Buro, Leon Barrett, and Steve Kim for reviewing this post.

On deck

This library could use a refactor for more efficient batch jobs (like calculating solar local coordinates for a bunch of locations at the same time). I’d also like to add the calculations for lunar position and phase soon. Of course, I might have to work on the beer garden picnic table problem first… :)
