Multidimensional Arrays with Mandoline

At the Climate Corporation, we have a great demand for storing large amounts of raster-based data, and an even greater demand to retrieve small amounts of it quickly.  Mandoline is our distributed, immutable, versioned database for storing big multidimensional data.  We use it for storing weather data, elevation data, satellite imagery, and other kinds of data.  It is one of the core systems that we use in production.

What can Mandoline do for me?

Mandoline can store your multidimensional array data in a versionable way that doesn’t bloat your storage.  When you don’t know what your query pattern is, or when you want to preserve past versions of your data, Mandoline may be the solution for you.

Some more details

  • It’s a clojure library.
  • Using the Mandoline library, we have built services both to access and to ingest data in a distributed fashion.  The reasons for doing it this way is so that our scientists have easy and language agnostic access to the data, and so that it plays well with existing scientific tools, like netCDF libraries and applications in any language.
  • When we say distributed, we mean that you can have distributed reads and writes from different machines to your dataset.
  • Mandoline uses swappable backends, and can save actual data to different backends.  In production, it currently runs on Amazon’s DynamoDB.  For testing purposes, we can either use the sqlite or in memory backends.
  • We want to expand the backend offerings to other databases like Cassandra and HBase.
  • Mandoline takes advantage of shared structure to make immutability and versions possible.

For more of an introduction to Mandoline (formerly known as Doc Brown), here is a video of the talk I gave at Clojure/West 2014.



Mandoline was primarily written by Brian Davis, Alice Liang, and Sebastian Galkin.  Brian Davis and Steve Kim have been the main push to open source Mandoline for use with the general public.

Posted in Engineering

Service Oriented Authorization – Part 3

As a follow up to Part 1 and Part 2 here is the video of the talk I gave at RailsConf last month.

Disclaimer: This is my first conference talk, ever!

Slides available on Speakerdeck

Tagged with: , , ,
Posted in Engineering

Rake tasks from a Rubygem

While recently working on IronHide, specifically the CouchDB Adapter, I needed a way to have the gem expose custom Rake tasks to an end user’s application.

For example, to make working with the IronHide CouchDB Adapter easier, I wanted to allow users to upload existing rules from disk to a remote CouchDB database.

The task could be something as simple as:

$ bundle exec rake iron_hide:load_rules\['/absolute/path/file.json',''\]

where the two arguments are the path to the JSON file and the remote CouchDB database.


After thinking for a moment, I knew that I’ve come across at least one gem that when included, exposes Rake tasks that I can use in my project, Bundler. So, I started looking through the source and found this:

Essentially, you’d like an application owner to be able to “install” a few Rake tasks at their convenience. Borrowing from the Bundler project:

# iron_hide/couchdb_tasks.rb
module IronHide
  class CouchDbTasks
    include Rake::DSL if defined? Rake::DSL

    def install_tasks
      namespace:iron_hide do
        desc 'Load rules from local JSON file to remote CouchDB server'
        task :load_rules, :file, :db_server do |t, args|

Now, in my application’s Rakefile I just need to add this line:

# Rakefile

Et voila!

$ bundle exec rake -T
#=> rake iron_hide:load_rules[file,db_server]  # Load rules from local JSON
Tagged with: , ,
Posted in Engineering

Indexing Polygons in Lucene with Accuracy

Today we share a post by Lucene Spatial contributor and guest author David Smiley.  David explains storing and querying polygon shapes in Lucene, including some brand new features to improve accuracy of spatial queries that The Climate Corporation helped get into the latest version.

Apache Lucene is a Java toolkit that provides a rich set of search capabilities like keyword search, query suggesters, relevancy, and faceting. It also has a spatial module for searching and sorting with geometric data using either a flat-plane model or a spherical model. For the most part, the capabilities therein are leveraged to varying degrees by Apache Solr and ElasticSearch–the two leading search servers based on Lucene.

Advanced spatial

The most basic and essential spatial capability is the ability to index points specified using latitude and longitude, and then be able to search for them based on a maximum distance from a given query point—often a user’s location, or the center of a map’s viewport. Most information-retrieval systems (e.g. databases) support this.

Most relational databases, such as PostGIS, are mature and have advanced spatial support. I loosely define advanced spatial as the ability to both index polygons and query for them by another polygon. Outside of relational databases, very few information-retrieval systems have advanced spatial support. Some notable members of the advanced spatial club are Lucene, Solr, Elasticsearch, MongoDB and Apache CouchDB. In this article we’ll be reviewing advanced spatial in Lucene.

How it works…

This article describes how polygon indexing & search works in some conceptual detail with pointers to specific classes. If you want to know literally how to use these features in Lucene, then first know you should be familiar with Lucene. Find a basic tutorial on that if you aren’t familiar with it. Then you should read the documentation for the Lucene-spatial API, and review the source which shows some basic usage. Next, read the API Javadocs referenced here within the article for relevant classes. If instead you want to know how to use it from Solr and ElasticSearch, this isn’t the right article, not to mention that the new accurate indexing addition isn’t yet hooked into those search servers yet, as of this writing.

Grid indexing with PrefixTrees

From Lucene 4 through 4.6, the only way to index polygons is to use Lucene-spatial’s PrefixTreeStrategy which is an abstract implementation of SpatialStrategy — the base abstraction for how to index and search spatially. It has two subclasses, TermQueryPrefixTreeStrategy and RecursivePrefixTreeStrategy (RPT for short) — I always recommend the latter. The indexing scheme is fundamentally based on a model in which the world is divided into grid squares (AKA cells), and is done so recursively to get almost any amount of accuracy required. Each cell, is indexed in Lucene with a byte string that has the parent cell’s byte string as a prefix — hence the name PrefixTree.

The PrefixTreeStrategy uses a SpatialPrefixTree abstraction that decides how to divide the world into grid-squares and what the byte encoding looks like to represent each grid square. There are two implementations:

  • geohash: This is strictly for latitude & longitude based coordinates. Each cell is divided into 32 smaller cells in an 8×4 or 4×8 alternating grid.
  • quad: Works with any configurable range of numbers. Each cell is divided into 4 smaller cells in a straight-forward 2×2 grid.

We’ll be using a geohash based PrefixTree for the examples in this article. For further information about geohashes, to include a convenient table of geohash length to accuracy, go to Wikipedia.

Indexing points

Lets now see what the indexed bytes look like in the geohash SpatialPrefixTree grid for indexing a point. We’ll use Boston Massachusetts: latitude 42.358, longitude -71.059. If we go to level 5, then we have a grid cell that covers the entire area ± 2.4 kilometers from its center (about 19 square kilometers in this case):

DRT2Y cell

This will index as the following 5 “terms” in the underlying Lucene index:


Terms are the atomic byte sequences that Lucene indexes on a per-field basis, and they map to lists of matching documents (i.e. records); in this case, a document representing Boston.

If we attempt to index any other hypothetical point that also falls in the cell’s boundary it will also be indexed using the same terms, effectively coalescing them into the same for search purposes. Note that the original numeric literals are returned in search results since they are internally stored separately.

To get more search accuracy we need to take the PrefixTree to more levels. This easily scales: there aren’t that many terms to index per point. If you want accuracy of about a meter then level 11 will do it, translating to 11 indexed terms. The search algorithms to efficiently search grid squares indexed in this way are interesting but it’s not the focus of this article.

Indexing other shapes

The following screen-captures depict the leaf cells indexed for a polygon of Massachusetts to a grid level of 6 (±0.61km from center):

Massachusetts, gridded MA zoomed in

The biggest cells here are at level 4, and the smallest are at 6. A leaf cell is a cell that is indexed at no smaller resolution, either because it doesn’t cross the shape’s edge, or it does but the accuracy is capped (configurable). There are 5,208 leaf cells here. Furthermore, when the shape is indexed, all the parent (non-leaf) cells get indexed too which adds another 339 cells that go all the way up to coarsest cell “D”. The current indexing scheme calls for leaf cells to be indexed both with and without a leaf marker — an additional special ‘+’ byte. The grand total number of terms to index is 10,755 terms. That’s a lot! Needless to say, we won’t list them here.

(Note: the extra indexed term per leaf will get removed in the future, once the search algorithms are adapted for this change. See LUCENE-4942.)

Theoretically, to get more accuracy we “just” need to go to lower grid levels, just as we do for points. But unless the shape is small relative to the desired accuracy, it doesn’t scale; it’s simply infeasible with this indexing approach to represent a region covering more than a few square kilometers to meter level accuracy. Each additional PrefixTree level multiplies the number of indexed terms compared to the previous level. The Massachusetts example went to level 6 and produced 5208 leaf cells, but if it had just gone to level 5 then there would have been only 463 leaf cells. With each additional level, there will be on average 32 / 2 = 16 times as many (for geohash) cells of the previous level. For a quad-tree it’s 4 / 2 = 2 times as many, but it also takes more than one added level of a quad tree to achieve the same desired accuracy, so it’s effectively the same no matter how many leaves are in the tree.

Summary: unlike a point shape, which has a linear relationship between accuracy and the number of terms, polygon shapes are exponential to the power of the numbers of levels. This clearly doesn’t scale for high-accuracy requirements.

Serialized shapes

Lucene 4.7 introduced a new Lucene-spatial SpatialStrategy called SerializedDVStrategy (SDV for short). It serializes each document’s shape vector geometry itself into a part of the Lucene Index called DocValues. At query time, candidate results are verified as true or false matches, by retrieving the geometry by doc-id on-demand, deserializing, and comparing to the query shape.

As of this writing, there aren’t any adapters for ElasticSearch or Solr yet. The Solr feature request is tracked as SOLR-5728, and is probably going to arrive in Solr 4.9. The adapters will likely maintain in-memory caches of deserialized shapes to speed up execution.

Using both strategies

It’s critical to understand that SerializedDVStrategy is not a PrefixTree index; if used alone, SerializedDVStrategy will brute-force and deserialize compare all documents, O(N) complexity, and have terrible spatial performance.

In order to minimize the number of documents that SerializedDVStrategy has to see, you should put a faster SpatialStrategy like RecursivePrefixTreeStrategy in front of it. And you can now safely dial-down the configurable accuracy of RPT to return more false-positives since SDV will filter them out. This is done with the distErrPct option, which is roughly the fraction of a shape’s approximate radius, and it defaults to 0.025 (2.5%). Given the Massachusetts polygon, RPT arrived at level 6. If distErrPct is changed to 0.1 (10%), my recommendation when used with SDV, the algorithm chose level 5 for Massachusetts, which had about 10% as many cells as level 6.

Performance and accuracy — the holy grail

What I’m most excited about is the prospect of further enhancing the PrefixTree technique such that it is able to differentiate matching documents between those that are a guaranteed matches versus those that need to be verified against the serialized geometry. Consider the following shape, a circle, acting as the query shape:

Gridded circle

If any grid cell that isn’t on the edge matches a document, then the PrefixTree guarantees it’s a match; there’s no point in checking such documents against SerializedDVStrategy (SDV). This insight should lead to a tremendous performance benefit since a small fraction of matching documents, often zero, will need to be checked against SDV. Performance and accuracy! This feature is being tracked with LUCENE-5579.

One final note about accuracy — it is only as accurate as the underlying geometry. If you’ve got a polygon representation of an area with only 20 vertices, and another with 1000 for the same area, then more vertices are likely to be a more precise reflection of the border of the underlying area. Secondly, you might want to work in a projected 2D coordinate system instead of latitude & longitude. Lucene-spatial / Spatial4j / JTS doesn’t help in this regard, it’s up to your application to convert the coordinates with something such as Proj4j. The main shortcoming of this approach is that a point-radius (circle) query is going to be in 2D even if you might have wanted a geodesic (surface of a sphere) implementation that normally happens when you use latitudes & longitudes. But if you are projecting then the distorted difference between the two is generally fairly limited if the radius isn’t large and if the area isn’t near the poles.


Thanks to the Climate Corporation for supporting my recent improvements for accurate spatial queries!

Posted in Engineering

Self Hosted Gem Server with Jenkins and S3

On my team, we like to be able to keep our applications light. We use several internal libraries to manage things like distributed request tracing, authorization, configuration management, etc.

Isolating this reusable, generic code into a library keeps our application code concise and manageable. It also allows us to test changes to the library in isolation (not to mention keeping our tests fast).

The server

Most of these gems are very specific, and it wouldn’t make much sense to make them public. So, we decided the best approach was to use our own, internally accessible gem server.

A gem server can really just be a set of static files — nothing fancy.

    # Given a base directory of *.gem files, generate the index files for a gem server directory
    $ gem generate_index --directory=GEMS_DIR

And just like that, we’ve generated the static files we need for our gem server.

Using S3

We use S3 for plenty of things here at Climate. We’d rather not have another server to have to support, and S3 is a perfectly fine place to host static files. So, we just turn an S3 bucket into an internally accessible endpoint (i.e., through our internal DNS routing on our network).

Amazon has instructions for setting up a bucket as a static file server.

Tying it all together: Automated testing and build

Now, we have our S3 bucket setup to behave like a static file server. This is the workflow we want:

  1. Make some changes. Commit. Push. Pull request.
  2. Pull request and code review
  3. Merge pull request
  4. Manually trigger a build of the new gem (which automatically runs tests) 4a. If tests pass, deploy the packaged gem to our gem server
  5. Run bundle update MY_NEW_GEM! to update our project


We use Jenkins to automate our builds here at Climate. Depending on the git server you use (we’re in the process of migrating to Atlassian Stash), Jenkins integration is a bit different. I won’t talk too much about the Jenkins configuration specifics, but this is basically what happens:

We have two Jenkins jobs: the first job is the build and the second one is the update server job. The reason behind this is to allow concurrent builds. When the first job completes, it triggers the second.

We run each of these jobs in a Linux container. Docker is a nice project designed to make that process easy. In our containers we make sure we have the ruby and s3cmd in the path.


Our build job is a parametrized Jenkins build. We pass in a parameter, which is the PROJECT_DIR, or the directory relative to some root where Jenkins can find the gem we want to build. We keep all our gems in the same repo for simplicity.

Jenkins will check out the gems repo, and build the specified gem. This is the build script that Jenkins will execute, which is essentially equivalent to:

    $ ./ ~/gems_repo/my_new_gem
    #!/usr/bin/env bash
    # Exit 1 if any command fails
    set -e
    if [ "$#" -ne 1 ]; then
        echo "PROJECT_DIR not specified"
        echo "Usage : `basename $0` <PROJECT_DIR>"
        exit 1
    GEMS_ROOT=<root directory of gems repo>
    echo "Building gems in PROJECT_DIR $PROJECT_DIR"
    # Check that PROJECT_DIR is in the path relative to GEMS_ROOT
    if ! [ -d "$GEMS_ROOT/$PROJECT_DIR" ]; then
      echo "Error: PROJECT_DIR does not exist"
      exit 1
    ## Go to Gem project
    # Create a build number
    export GEM_BUILD=$(echo -n $(date +%Y.%m.%d.%H.%M.%S).;echo $(git rev-parse --short=7 HEAD))

    # Find the gemspec to build
    GEM_SPEC=$(find $GEMS_ROOT/$PROJECT_DIR -type f -name *.gemspec)
    echo "Building gem from gemspec $GEM_SPEC"
    # Bundle, run tests, and build gem
    bundle install
    bundle exec rake test --trace
    gem build $GEM_SPEC
    # Find target gem. Prune search to exclude vendor
    TARGET_GEM=$(find $WB_ROOT/$PROJECT_DIR -type f -not -path "*vendor/*" -name *.gem)
    echo "Uploading gem $TARGET_GEM to gem server"
    # Deploy (updating the Gem server index is left to another job)
    s3cmd put $TARGET_GEM s3://my-gems-s3-bucket

One thing to note about the build process is that it takes care of build versioning automatically for us. The way we handle this in each of our gems is:

    # my_new_new/lib/my_new_gem/version.rb
    module MyNewGem
      MAJOR = "0"
      MINOR = "2"
      BUILD = ENV["GEM_BUILD"] || "DEV"
      VERSION = [MAJOR, MINOR, BUILD].join(".")

Another thing to keep in mind is we always want to run our tests (you are writing tests, right?). This depends on our gem having a rake task named test. This is fairly simple:

# Rakefile
require 'bundler/gem_tasks'
require 'rake/testtask' do |t|
    t.libs << 'spec'
    t.test_files = FileList['spec/**/*_spec.rb']
    t.verbose = true

desc 'Run tests'
task :default => :test
Update gem server

The last step of our build job is to push the new *.gem file to the server. We now need to update the set of static files Bundler uses to retrieve the available gems from the server. This is fairly simple. Here’s the script for that job:

    #!/usr/bin/env bash

    # Updates the index of available Gems on the S3 gem server
    # Exit 1 if any command fails
    set -e
    # CD to Gem project directory
    cd $GEM_ROOT/applications/gems
    # Create a placeholder directory
    export GEM_SERVER_DIR=./gemserver
    if [ ! -d $GEM_SERVER_DIR ]
      mkdir -p $GEM_SERVER_DIR/gems
      mkdir $GEM_SERVER_DIR/quick
    # Install any dependencies in the Gems project
    # We require the builder gem
    # See:
    bundle install
    # Get existing files from Gem server
    s3cmd get --recursive s3://gem-server/ $GEM_SERVER_DIR
    # Generate new static files
    # See:
    # "Update modern indexes with gems added since the last update"
    gem generate_index --directory $GEM_SERVER_DIR
    # Sync the index files to the server. No need to sync anything in gems/ 
    s3cmd sync . --exclude 'gems/*' s3://my-gem-server


And, that’s it! We have automated our build/deploy proces for our own, internal Rubygems. Feel free to reach out to me with any questions.

I also highly recommend learning how Bundler and Rubygems work.

Tagged with: , ,
Posted in Engineering

Astro-Algo: Astronomical Algorithms for Clojure

At The Climate Corporation, we have “sprintbaticals”, two-week projects where we can work on something a bit different. This post is about work done by Matt Eckerle during his recent sprintbatical.

Sometimes I want to know where the sun will be at a certain time. I don’t want to know approximately where the sun will be, in a close-enough-for-a-Facebook-app kind of way. I want to know precisely where the sun will be, in a which-picnic-table-at-the-beer-garden-will-be-in-the-sun-at-exactly-5pm-on-my-34th-birthday kind of way. Or, more to the point of my employment at The Climate Corporation, what zenith angle should I use to correct radiation measurements taken at location x and time t? (It turns out that sunlight is important for crop growth. Who knew?) For this sort of endeavor, I keep a copy of Jean Meeus’ Astronomical Algorithms at my desk at all times.

IMG_20140331_150813Using a book on astronomy to compute solar position may sound like overkill, but using various solar calculators over the years has given me a lot of perspective on what is considered “accurate” by non-astronomers. Even the NOAA solar calculator,  which uses methods described in AA, doesn’t bother to consider the difference between dynamical time to universal time. This discrepancy of about 60 seconds seems to be obscured by the fact that they round output times to the nearest minute. And NOAA’s solar calculator is much, much, much better than most solar calculators out there.

Introducing Astro-Algo, Astronomical Algorithms for Clojure

Naturally, I wanted a better solar calculator to use with Clojure. So I wrote one in pure clojure. And it seemed silly to write just a solar calculator library when I had a whole astronomy book at my disposal, so I wrote a library for astronomy and implemented the Sun as the first celestial body of the library. I hope to get more bodies (starting with the Moon) into the library eventually, but here it is. To use Astro-Algo, add the following dependency to your Leiningen project: [astro-algo "0.1.2"]

The fine print

Astro-Algo is a Clojure library designed to implement computational methods described in Astronomical Algorithms by Jean Meeus. In this release, the following is complete:

  • date utils with reduction of time scales table from 1890 to 2016
  • celestial Body protocol to get equatorial coordinates and standard altitude
  • implementation of Body protocol for the Sun
  • function to get local coordinates for a body
  • function to get passages (rising, transit, setting) for a body

The solar position computations implemented in this library are considered low accuracy. They account for some effects of nutation with simplifications described in AA Ch. 25. These calculations should be accurate to within 0.01 degrees of geometric position, however larger differences may be observed due to atmospheric refraction, which is not accounted for other than the standard values for the effect of atmospheric refraction at the horizon used in the calculation of passage times. Passages are calculated according to AA Ch. 15 by interpolating the time that the celestial body crosses the local meridian (for transit) or the standard altitude (for rising and setting). Standard altitudes for rising and setting are taken from the celestial body unless the passages for twilight are desired, in which case the standard altitude for the specified twilight definition is used.

The core API


protocol Body


Calculate apparent right ascension and declination of a body.
  this (Body instance)
  dt (org.joda.time.DateTime UTC)
Returns a map with keys:
  :right-ascension (radians west of Greenwich)
  :declination (radians north of the celestial equator)

English translation: Interface Body has method equatorial-coordinates, which returns the orientation of the Body relative to the earth at any time.


Returns the "standard" altitude, i.e. the geometric altitude of the
center of the body at the time of apparent rising or setting
  this (Body instance)
Returns standard altitude (degrees)

English translation: Interface Body has method standard-altitude, which returns the angle above the horizon at which the body is considered to have risen or set. (This can be different for different kinds of celestial bodies.)


Calculate local coordinates of a celestial body
  body (Body instance)
  dt (org.joda.time.DateTime UTC)
  lon - longitude (degrees East of Greenwich)
  lat - latitude (degrees North of the Equator)
Returns a map with keys:
  :azimuth (radians westward from the South)
  :altitude (radians above the horizon)

English translation: Function local-coordinates returns the orientation of a Body relative to an observer on earth at a specified longitude and latitude.


Calculate the UTC times of rising, transit and setting of a celestial body
  body (Body instance)
  date (org.joda.time.LocalDate)
  lon - longitude (degrees East of Greenwich)
  lat - latitude (degrees North of the Equator)
    :include-twilight (one of "civil" "astronomical" "nautical" default nil)
    :precision - iterate until corrections are below this (degrees, default 0.1)
Returns a map with keys:
  :rising (org.joda.time.DateTime UTC)
  :transit (org.joda.time.DateTime UTC)
  :setting (org.joda.time.DateTime UTC)

English translation: Function passages returns the times that a Body will rise, transit and set given a date, longitude and latitude. You can optionally specify a twilight definition to include, which changes the angle below the horizon used for rising and setting. Optional precision allows you to tune the amount of interpolation before convergence.

Usage examples

  [com.climate.astro-algo.bodies :refer [local-coordinates passages]]
  [clj-time.core :as ct])
  [com.climate.astro_algo.bodies Sun])

; Suppose we want to get the cosine of the solar azimuth
; on 10/18/1989 00:04:00 (UTC) in San Francisco 
; (at the moment of the 1989 Loma Prieta earthquake).

; We begin by start by instantiating a Body (the Sun).
(let [sun (Sun.)]
  ; Now we declare the time/space coordinates we need
  ; for local-coordinates...
  (let [dt (ct/date-time 1989 10 18 0 4 0)
        lon -122.42 ; [degrees east of Greenwich]
        lat 37.77   ; [degrees north of the Equator]
        ; ...and get azimuth and altitude in radians.
        {:keys [azimuth altitude]} (local-coordinates sun dt lon lat)]
    ; To get the cosine of the zenith,
    ; we can take the sin of the altitude.
    (Math/sin altitude)))

; => 0.2585521627566685
; This represents the ratio of the number of photons hitting a horizontal
; surface to the number of photons hitting a surface facing directly at
; the sun. A negative value would indicate that the sun is geometrically
; beneath the horizon.

(let [sun (Sun.)]
  ; get civil daylength on 12/21/2013 at Stonehenge
  (let [date (ct/local-date 2012 12 21)
        lon -1.8262 ; [degrees east of Greenwich]
        lat 51.1788 ; [degrees north of the Equator]
        {:keys [rising transit setting]}
        (passages sun date lon lat
                  ; can be none, civil, astronomical, nautical
                  ; none is default
                  :include-twilight "civil"
                  ; set lower precision to sacrifice some accuracy
                  ; and speed up the computation
                  ; 0.1 is default (degrees)
                  :precision 1.0)]
    ; calculate daylength in hours
    (-> (ct/interval rising setting)
      ct/in-millis ; [milliseconds]
      (/ 3600000.))))  ; [hours]

; => 9.214338611111112
; If we're going to be there all day,
; we're going to need more than one bag of chips.

How does it stack up?

I did a spot check to compare the accuracy of Astro-Algo against NOAA’s calculator and java library for sunrise and sunset (called luckycat in the chart below).


As you can see, we are making progress!

How can I learn more?

The complete API is documented at the astro-algo GitHub project.

Can I get involved?

Yes! We love pull requests. If you want to get involved, getting a copy of AA is highly recommended.


Special thanks to Bhaskar (Buro) Mookerji, who reviewed my code and gave great feedback. Big thanks also to Buro, Leon Barrett, and Steve Kim for reviewing this post.

On deck

This library could use a refactor for more efficient batch jobs (like calculating solar local coordinates for a bunch of locations at the same time). I’d also like to add the calculations for lunar position and phase soon. Of course, I might have to work on the beer garden picnic table problem first… :)

Tagged with: , ,
Posted in Engineering

Claypoole: Threadpool tools for Clojure

At The Climate Corporation, we have “sprintbaticals”, two-week projects where we can work on something a bit different. This post is about work done by Leon Barrett during his recent sprintbatical.

At the Climate Corporation, we do a lot of resource-intensive scientific modeling, especially of weather and plant growth. We use parallelism, such as pmap, to speed that up whenever possible. We recently released a library, claypoole, that makes it easy to use and manage threadpools for such parallelism.

To use claypoole, add the Leiningen dependency [com.climate/claypoole "0.2.1"].


Basically, we just wanted a better pmap. Clojure’s pmap is pretty awesome, but we wanted to be able to control the number of threads we were using, and it was nice to get a few other bonus features. (Frankly, we were surprised that we couldn’t find such a library when we searched.)

Although the parallelism we need is simple, the structure of our computations is often relatively complex. We first compute some things, then make requests to a service, then process some other stuff, then … you get the picture. We want to be able to control our number of threads across multiple stages of work and multiple simultaneous requests.

Nevertheless, we don’t really need core.async’s asynchronous programming. Coroutines and channels are nice, but our parallelism needs don’t require their complexity, and we’d still have to manage the amount of concurrency we were using.

Similarly, reducers are great, but they’re really just oriented at CPU-bound tasks. We needed more flexibility than that.

Aside: So why do you need so many threads?

Like many of you, we’re consuming resources that have some ideal amount of parallelism: they have some maximum throughput, and trying to use more or less than that is ineffective. For instance, we want to use our CPU cores but not have too many context switches, and we want to amortize our network latency but not overload our backend services.

Consider using parallelism to amortize network latency. Each request we make has a delay (latency) before the server begins responding, plus a span of network transfer. If we just run serial network requests, we’ll see a timeline like this:


That means that we’re not actually making good use of our network bandwidth. In fact, the network is sitting idle for most of the time. Instead, with optimal parallelism, we’ll get much fuller usage of our bandwidth by having the latency period of the requests overlap.


The transfers may be individually somewhat slower because we’re sharing bandwidth, but on average we finish sooner. On the other hand, with too much parallelism, we’ll use our bandwidth well, but we’ll see our average total latency go up:


That’s why we want to be able to control how much parallelism we use.

How do I use it?

Just make a threadpool and use it in claypoole’s version of a parallel function like future, pmap, pcalls, and so on. We even made a parallel for.

(require '[com.climate.claypoole :as cp])
(cp/with-shutdown! [pool (cp/threadpool 4)]
  (cp/future pool (+ 1 2))
  (cp/pmap pool inc (range 10))
  (cp/pvalues pool (str "si" "mul") (str "ta" "neous"))
  (cp/pfor pool [i (range 10)]
    (* i (- i 2))))

They stream their results eagerly, so you don’t have to force them to be realized with something like doall as you would for .core.pmap. And, because they produce sequential streams of output and take sequential streams of input, you can chain them easily.

(->> (range 3)
     (cp/pmap pool inc)
     (cp/pmap pool #(* 2 %))
     (cp/pmap other-pool #(doto % log/info)))

Got anything cooler than that?

You don’t have to manage the threadpool at all, really. If you just need a temporary pool (and don’t care about the overhead of spawning new threads), you can just let the parallel function do it for you.

;; Instead of a threadpool, we just pass a number of threads (4).
(cp/pmap 4 inc (range 4))

To reduce latency, you can use unordered versions of these functions that return results in the order they’re completed.

;; This will probably return '(0 1 2), depending on how
;; the OS schedules our threads.
(cp/upfor 3 [i (reverse (range 3))]
    (Thread/sleep (* i 1000))
    (inc i)))

For instance, if we’re fetching and resizing images from the network, some images might be smaller and download faster, so we can start resizing them first.

(->> image-urls
     ;; Put the URL in a map.
     (map (fn [url] {:url url}))
     ;; Add the image data to the map.
     (cp/upmap network-pool
               #(assoc % :data
                       (-> % :url clj-http.client/get :body)))
     ;; Add the resized image to the map.
     (cp/upmap cpu-pool
               #(assoc % :resized (resize (:data %)))))

You can also have your tasks run in priority order. Tasks are chosen as threads become available, so the highest-priority task at any moment is chosen. (So, for instance, the first task submitted to a pool will run first, regardless of priority.)

(require '[com.climate.claypoole.priority :as cpp])
(cp/with-shutdown! [pool (cpp/priority-threadpool 4)]
  (let [;; These will mostly run last.
        xs (cp/pmap (cpp/with-priority pool 0) inc (range 10))
        ;; These will mostly run first.
        ys (cp/pmap (cpp/with-priority pool 10) dec (range 10))]

What’s next?

We don’t have particularly specific plans at this time. There are a number of interesting tricks to play with threadpools and parallelism. For instance, tools for ForkJoinPools could combine this work with reducers, support for web workers in Clojurescript would be nice, and there are many other such opportunities.

Send us your requests (and pull requests) on Github!

Where can I learn more?

A detailed README can be seen on the claypoole Github project.


Thanks to Sebastian Galkin of Climate Corp. and to Jason Wolfe of Prismatic, who helped with advice on API design decisions.

Tagged with: ,
Posted in Engineering

Get every new post delivered to your Inbox.

Join 298 other followers