Numba vs Cython: How to Choose

My name is Stephan, and I’m a scientist on the Climatology team at The Climate Corporation. We make extensive use of Python to build statistical weather models, and sometimes we need our code to be fast. Here’s how I choose between Numba and Cython, two of the best options for accelerating numeric Python code.

Most of the time, libraries like NumPy, SciPy and pandas, whose critical loops are already written in a compiled language like C, are enough to write fast scientific Python code. Unfortunately, sometimes you need to write your own loop in performance-critical paths of your code, and also unfortunately, loops in Python are painfully slow. This is where Numba and Cython come in: they both promise the ability to write the inner loop of your code in something that looks a lot like normal Python, but that runs about as fast as handwritten C.

Numba uses LLVM to power Just-In-Time compilation of array oriented Python code. Using Numba is usually about as simple as adding a decorator to your functions:

from numba import jit

@jit  # compile this function with LLVM the first time it is called
def numba_mean(x):
    total = 0
    for xi in x:
        total += xi
    return total / len(x)

You can supply optional type signatures, but they aren’t required for performant code: Numba infers the types the first time the function is called and compiles a specialized version on the fly.
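For instance, here is what an explicit signature might look like (a sketch; the signature string and the nopython flag are standard Numba options, but the lazily compiled version above is usually all you need):

from numba import jit

# Explicit signature: a float64 result computed from a 1-D float64 array.
@jit('float64(float64[:])', nopython=True)
def numba_mean_typed(x):
    total = 0.0
    for xi in x:
        total += xi
    return total / len(x)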

In contrast, Cython is a general-purpose tool (not just for array-oriented computing) that compiles Python into C extensions. To see impressive speedups, you need to manually add types:

# In a .pyx file, compiled with Cython:
def cython_mean(double[:] x):    # double[:] is a typed memoryview over a 1-D array
    cdef double total = 0
    for i in range(len(x)):
        total += x[i]
    return total / len(x)

When I benchmark this example, IPython’s %timeit reports that computing the mean of a 100,000 element array takes ~16 ms with the pure Python version, but only ~93 µs with Numba and ~96 µs with Cython.1
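For reference, here is roughly how such a benchmark can be run in IPython (a sketch: the array contents and the pure Python baseline are assumptions, and the timings in the comments are the ones quoted above):

import numpy as np

def python_mean(x):
    total = 0
    for xi in x:
        total += xi
    return total / len(x)

x = np.random.rand(100000)

%timeit python_mean(x)   # ~16 ms
%timeit numba_mean(x)    # ~93 µs (the first call triggers compilation)
%timeit cython_mean(x)   # ~96 µs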

This trivial example illustrates my broader experience with Numba and Cython: both are pretty easy to use, and result in roughly equivalently fast code. For similar results on a less contrived example, see this blog post by Jake VanderPlas.

The bottom line is that even though performance is why we reach for tools like Numba and Cython, it doesn’t provide a good basis for choosing one over the other. So here are the questions I ask myself when making that choice for my projects.

Will other people be deploying your code?

Cython is easier to distribute than Numba, which makes it a better option for user facing libraries. It’s the preferred option for most of the scientific Python stack, including NumPy, SciPy, pandas and Scikit-Learn. In contrast, there are very few libraries that use Numba. I know of two, both of which are basically in the experimental phase: Blaze and my project numbagg.

The main issue is that it can be difficult to install Numba unless you use Conda, which is a great tool, but not one everyone wants to use. In contrast, distributing a package with Cython-based C-extensions is almost miraculously easy. Cython is also a more stable and mature platform, whereas the features and performance of Numba are still evolving.

If you don’t need to distribute your code beyond your computer or your team (especially if you use Conda), then Numba can be a great choice. Otherwise, you should lean toward Cython.

Do you need advanced Python features or to use C-level APIs?

The features that Numba supports in the accelerated nopython mode are very limited. For example:

  • Numba only accelerates code that uses scalars or (N-dimensional) arrays. You can’t use built-in types like list or dict or your own custom classes.
  • You can’t allocate new arrays in accelerated code.
  • You can’t use recursion.

Some of these are design decisions; in other cases, these are being actively worked on.

In contrast, Cython can compile arbitrary Python code, and can even directly call C. The ability to “cythonize” an entire module written using advanced Python features and then only tweak the bottlenecks for speed can be really nice.

For example, switching to an extension type and calling C APIs directly can make for big differences in speed, even if you still rely on builtin Python types like lists or dictionaries. Writing something like cyordereddict in Numba would be nearly impossible.

Do you want to write code that works on N-dimensional arrays?

Suppose you want a function that takes several arguments and returns a scalar or array, depending on the dimensions of the provided arguments. For example, consider a function that averages two numbers:

def average(a, b):
    return 0.5 * (a + b)

One of the most powerful features of NumPy is that this simple function would work even if a or b are multi-dimensional arrays (tensors), by following broadcasting rules.
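For example, the same function accepts any mix of scalars and arrays (a minimal sketch):

import numpy as np

a = np.arange(6.0).reshape(2, 3)   # a 2x3 array
b = 10.0                           # a scalar
average(a, b)                      # broadcasts to a 2x3 array of averages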

Numba makes it easy to accelerate functions with broadcasting by simply adding the vectorize decorator. This produces universal functions (ufuncs) that automatically work (even preserving labels) on array-like data structures in the entire scientific Python ecosystem, including xray (my project) and pandas. In other cases, Numba can handle arbitrary dimensional input by using Just-In-Time compilation with jit or by creating generalized universal functions with guvectorize.
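Here is a sketch of what that looks like; the list of signature strings is one way to tell vectorize which types to compile for:

import numpy as np
from numba import vectorize

@vectorize(['float64(float64, float64)'])
def average(a, b):
    return 0.5 * (a + b)

average(1.0, 3.0)              # scalar inputs -> scalar result
average(np.arange(3.0), 1.0)   # array input -> array result, via broadcasting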

In contrast, generally speaking, your Cython functions will only work for input with a number of dimensions that you determine ahead of time (e.g., a 1D vector, but not a scalar or 2D array). It’s certainly possible to do this sort of thing with Cython, but it’s not easy, and you’ll need to get your hands dirty with the NumPy C-API. Keith Goodman has some nice examples in version 1.0 of bottleneck.

Still not sure?

When I’m not constrained by other concerns, I’ll try to make Numba work. Numba is usually easier to write for the simple cases where it works. You may still run into annoying limitations when you try to do complex things, but Numba has been getting a lot better, even just over the past few months (e.g., they recently added support for generating random numbers). At the end of the day, even if you ultimately can’t get things to work, you’ll still have idiomatic Python code that should be easy to accelerate with Cython.


This post is cross-posted to my personal website.


  1. numpy.mean is faster still, at ~60 µs, but here we’re pretending that we need to write our own custom function that is not already built in. It’s still impressive that we’re only 50% slower than highly tuned C. 

S3 Distributed Version Restore

At The Climate Corporation, we work with Amazon Web Services a lot. As you can imagine, weather data is very plentiful. We keep lots of data in Amazon’s Simple Storage Service (S3), which is basically a great big filesystem in the sky, and in Mandoline, our array data store, which can use S3 as a backend. It can take some effort to deal with so much data, but we make that effort willingly, since we need it for important things like predicting fertilizer needs for fields.

To help manage all this data, we use S3’s advanced features, such as versioning. When versioning is enabled, every time a file in S3 is updated or deleted, the old version remains stored in S3 as a backup, and we can access it with a special API call. Then, if we accidentally overwrite or delete a file, we can make an API call to list all that file’s versions, choose one, and then use another API call to restore that old version.
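S3DistVersions automates this workflow at scale; for a single file, the manual equivalent looks roughly like this with boto3 (the bucket and key names here are hypothetical):

import boto3

s3 = boto3.client('s3')
bucket, key = 'mybucket', 'mypath/weather-2014.nc'   # hypothetical names

# List every stored version of the file.
response = s3.list_object_versions(Bucket=bucket, Prefix=key)
for version in response.get('Versions', []):
    print(version['VersionId'], version['LastModified'], version['IsLatest'])

# "Restore" an older version by copying it over the current one.
old_version_id = response['Versions'][-1]['VersionId']   # e.g. the oldest listed version
s3.copy_object(Bucket=bucket, Key=key,
               CopySource={'Bucket': bucket, 'Key': key, 'VersionId': old_version_id})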

However, at Climate, we often manipulate our weather data with automated tools. That means that if we mess up, we can mess up in a really big way. With millions of files and billions of versions, listing and restoring old versions one file at a time through the API could take weeks. In order to fix mistakes made by automated tools, at large scale, we need another automated tool, one that can restore lots of files in a hurry. Unfortunately, as far as we know, no such tool existed.

So, we’ve made just that automated tool, S3DistVersions, to let us restore lots of files efficiently. For it to perform well, it needs lots of simple parallelism, so we used Hadoop MapReduce, a parallel batch processing system. Because its primary task is to access Amazon S3, we designed it to work well inside Elastic MapReduce (EMR), Amazon’s on-demand Hadoop service.

We based the interface for S3DistVersions on that of S3DistCp, a batch tool for copying data in S3. Given a time and a location in S3, S3DistVersions will restore all files there to their state at the specified time. (Because each file has its own list of versions, you can’t specify a version to restore them all to. Instead, we find it easier to specify a time, and S3DistVersions selects the appropriate version for each file.) It takes simple arguments:

java -jar S3DistVersions-0.1.0.jar \
  --src s3://mybucket/mypath \
  --prefixes s3://mybucket/list-of-s3-prefixes.txt \
  --restore-time 2014-01-01T14:00:00+07:00

Challenges

The hardest thing to deal with was obtaining parallelism while working with S3’s versioning API. Suppose we wanted to restore all the files under the location s3://mybucket/mypath. Unfortunately, S3 only permits access to the versions of files under a location via a (serial) paging API, so we have to make a long sequence of API calls. Sure, each request might return a thousand versions, but it would take a long time to get through a million files and all their versions at that rate. In order to restore those files in a useful amount of time, we need to access the version information in parallel.

(Note: We can’t just start by listing the files under s3://mybucket/mypath because S3’s file-listing API won’t show deleted files. Instead, we have to use the version-listing API.)

To get parallelism while listing versions, we ask the user to provide a list of “prefixes”–the beginnings of the filenames that appear under s3://mybucket/mypath. Then, we can make requests for versions in those separate places in parallel. (For instance, given prefixes /a/1 and /b/2, we can scan for versions under s3://mybucket/mypath/a/1 and s3://mybucket/mypath/b/2 in parallel.) We need that parallelism in order to get through our billions of versions in a reasonable amount of time; for smaller restorations, it’s perfectly fine to omit the list of prefixes. And of course, parallelism is easy in the restore step, so the prefix list is only needed for parallelism in listing versions.

In practice, we find that it’s not hard to use a prefix list like this. We have our files spread out in many different subdirectories, so we use that list of subdirectories as our prefixes.

We designed this program to follow Amazon’s recommended practices for partitioning keys. S3 keeps track of filenames (keys) in an index that is spread across many different Amazon computers in sequential chunks. It’s sort of like the way conferences have registration tables organized by segments of the alphabet (e.g. ‘A-C’, ‘D-F’, …) to distribute the load. Just as you wouldn’t want to have all the people with ‘A’ names come in at the same time, we don’t want our program to access the list of files in sequential order. Therefore, when S3DistVersions interacts with S3 files or lists their versions, it first shuffles the filenames or filename prefixes. That way, it spreads the load across Amazon’s S3 index cluster.
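A rough sketch of both ideas, shuffled prefixes and parallel per-prefix version listing, in Python (S3DistVersions itself does the equivalent inside Hadoop map tasks; the bucket and prefixes below are hypothetical):

import random
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client('s3')
bucket = 'mybucket'                                     # hypothetical
prefixes = ['mypath/a/1', 'mypath/b/2', 'mypath/c/3']   # user-supplied prefix list

def list_versions(prefix):
    # S3's version listing is a serial paging API, so each prefix is walked sequentially.
    paginator = s3.get_paginator('list_object_versions')
    versions = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        versions.extend(page.get('Versions', []))
    return versions

random.shuffle(prefixes)   # avoid hitting S3's key index in sequential order
with ThreadPoolExecutor(max_workers=8) as pool:
    all_versions = [v for vs in pool.map(list_versions, prefixes) for v in vs]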

Technology

First of all, for this project, as for most things we do, we used Clojure, a programming language that’s based on Lisp and runs on the Java Virtual Machine (JVM). Those parents give it simplicity and speed, plus access to a large range of useful libraries. It’s also an opinionated language that prefers a functional, immutable style. We like that opinion; for instance, functional code is easier to reason about in bulk, since we can be confident it doesn’t have weird side effects. That also happens to work great with parallel data processing such as MapReduce.

We also often use Cascalog, a MapReduce tool for Clojure. However, here we wanted tight control over keys, grouping, and whether work happens in mappers or reducers, so we tried Parkour, but we ran into exceptions with it on EMR. Ultimately, we used clojure-hadoop. It isn’t as elegant as Cascalog in some ways, but it does give us the control we want.

Conclusion

Now, we are confident that when we mess up lots of files in S3, we can fix the problem at scale. We are pleased to release S3DistVersions, and we hope that it will help others work with S3’s versioning more easily. We welcome contributions and feedback, so please let us know how it works for you.


A Summary of Kyle Kingsbury’s Distributed Systems Workshop

Last week The Climate Corporation invited Kyle Kingsbury — inventor of Jepsen and distributed-systems-expert-about-town — to lead a workshop on distributed computing theory and practice at our San Francisco office.

Kyle’s work focuses on the necessity of theory to distributed computing practice — it’s a task that’s almost impossible to get right on a first try, and you need all the theoretical assistance you can get. You can see empirical application of distributed computing theory to real-world systems in his series of Jepsen blog posts.

This post will follow a rough outline of Kyle’s workshop, starting with theoretical definitions and concepts, and ending with applying those definitions and concepts to discuss and evaluate distributed computing practice. Enjoy!

What is a Distributed System?

Kyle began the discussion with the most fundamental question: what does it mean when we call a system “distributed”? His definition — any system made up of parts that interact slowly or unreliably — is familiar, but by noting that the definition of “slow” is highly dependent on context, he opens up the “distributed” label to a wide range of systems that don’t immediately appear to be distributed. This could include:

  • NUMA architectures
  • making plans via SMS
  • point-of-sale terminals
  • mobile applications
  • communicating threads or processes on a single computer
  • and of course, computers communicating via TCP/UDP over a LAN or the internet.

If the inter-part communication is, say, an order of magnitude slower than the intra-part communication, then you have a distributed system.

Asynchronous Networks

Since all networks in the real world take time to propagate information (and that information can be arbitrarily delayed or lost), understanding timing (and uncertainty of timing) is crucial to being able to reason about the behavior of the system. An important tool Kyle introduced for that purpose is the light cone diagram (by analogy to relativity and spacetime).

Two threads accessing a shared variable.

Because the operations take time to execute (and return a result), the ordering of those operations is ambiguous in this example. Thread 2 can know the time it initiated the read and the time it received the result, but it can’t know the time the read operation was applied on the memory — it could return either the old or new value.

Clocks

We want to be able to define an ordering for events. However, keeping track of time in a distributed system is difficult. There are a number of types of clocks used for ordering events in a distributed system.

Wall clocks. Nodes track “actual” (human) time, typically by syncing with an NTP (Network Time Protocol) server. In practice, though, synchronization isn’t precise enough for ordering events: hardware clocks drift, sometimes by centuries, not to mention that POSIX time is not monotonic.

Lamport clocks. Each node has a counter that it increments with each state transition. The counter is included with every message and, on receiving a message, a node sets its counter to the one included on the message, if greater than its own. This gives a partial ordering of events. If event b is causally dependent on event a, then C(a) < C(b).

Vector clocks. Vector clocks generalize Lamport clocks. The clock is a vector of counters, one for each node. When executing an operation, a node increments its counter in the vector, and again includes the clock with every message to other nodes. When receiving a message, a node takes the pairwise maximum of its clock and the one in the message. Orders more events than Lamport clocks.

Dotted version vectors. “Better vector clocks.” Still a partial ordering, but orders more events.

GPS & Atomic clocks. Much better than NTP. They give a globally distributed total order with uncertainty on the scale of milliseconds, which means you can perform one operation (that’s allowed to conflict) per uncertainty window. But they’re really expensive.
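As a toy illustration of the update rules described above (a sketch, not production code; the usual Lamport formulation also increments the counter on receipt of a message):

class LamportClock:
    def __init__(self):
        self.counter = 0

    def tick(self):
        # Called on every local state transition.
        self.counter += 1
        return self.counter

    def receive(self, message_counter):
        # Take the larger of the two counters, then advance past it.
        self.counter = max(self.counter, message_counter) + 1
        return self.counter

def merge_vector_clocks(ours, theirs):
    # Vector clock merge: pairwise maximum, keyed by node id.
    return {node: max(ours.get(node, 0), theirs.get(node, 0))
            for node in set(ours) | set(theirs)}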

Consistency Models

As software developers, we expect the systems we program against to behave correctly. But there are many possible definitions of correct behavior, which we call consistency models.

Each consistency model defines what sequence of operations on a state machine are valid. In the diagram below, a parent-child relationship between two models means any valid sequence of operations in the parent is valid in the child. That is, requirements are stricter as you move up the tree, which makes it easier to reason about the system in question, but incurs performance and availability penalties.

A hierarchy of consistency models.

Kyle himself has an excellent write-up of these consistency models on his blog, but I’ll give a brief description of a few here.

Linearizability – There appears to be a single global state where operations are applied linearly.

Sequential consistency – Writes appear in the same order to all readers. (But a read may not reflect the “current” state.)

Causal consistency – Only the order of *causally related* operations is linear. Unrelated (concurrent) operations may appear in any order to readers.

Linearizability/strong serializability is the “gold standard” of consistency models. But there are performance and availability tradeoffs as you move up the tree: it is impossible to have total availability in a linearizable (or even sequentially consistent) system.

Convergent Replicated Data Types

CRDTs are a class of eventually consistent data types that work by supporting an idempotent, commutative and associative merge operation. The merge operation combines the histories of two (diverging) replicas of the data structure. Some thorough and accessible examples of CRDTs can be found in one of Kyle’s projects.
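One of the simplest CRDTs is a grow-only counter: each node increments only its own slot, and merge takes a pairwise maximum, which is idempotent, commutative and associative. A minimal sketch:

class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}   # node id -> how many increments that node has seen

    def increment(self):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + 1

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Pairwise maximum: merging in any order, any number of times, converges.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)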

Building Consensus

To achieve stricter consistency models in a distributed system, it is necessary to achieve consensus on certain facts among a majority of nodes. The Paxos algorithm is the notoriously difficult-to-understand gold standard of consensus algorithms, introduced by Leslie Lamport in his paper The Part-Time Parliament (published in 1998).

Since its publication, it has inspired a family of similar algorithms with optimizations in performance or accessibility, including Multi-Paxos, Fast Paxos and Generalized Paxos.

In Search of an Understandable Consensus Algorithm (Ongaro & Ousterhout, 2014) is a recent publication describing the Raft consensus algorithm that, like it says on the tin, is intended to be (more) easily understood than Paxos, while retaining similar guarantees and performance characteristics.

A Pattern Language

Kyle’s number one piece of advice for building a distributed system is don’t do it if you don’t have to. Maybe your problem can be solved by getting a bigger box, or tolerating some downtime.

Number two is to use someone else’s distributed system. Pay Amazon to do it, or at least use battle-tested software.

Number three is don’t fail. You can build really reliable hardware and networks at the cost of moving slowly, buying expensive hardware and hiring really good talent.

But, if your system runs long enough, it will eventually fail, so accept it gracefully.

Take frequent backups. Done correctly, they give you sequential consistency. When taking a backup of a database, understand the transactional semantics. (For example, copying the filesystem state of a MySQL database may not reflect a consistent transactional state, and the filesystem may change during the backup process. It’s better to use mysqldump, which preserves database semantics.)

Use immutable values when possible. Data that never changes is trivial to store because it does not require coordination.

If you require mutable values, instead use mutable identities (pointers to immutable values). You can fit a huge number of mutable pointers in a linearizable RDBMS while the immutable values themselves are stored in S3 or other eventually consistent stores. This is similar to how Datomic works.
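A sketch of that split, with in-memory dicts standing in for the two stores (a real system would use an RDBMS row for the pointer and S3 for the blob):

import hashlib
import json

blob_store = {}      # immutable values, addressed by content hash (e.g. S3)
pointer_store = {}   # small mutable identities (e.g. a linearizable RDBMS)

def put_value(value):
    # Immutable values can be written anywhere, any number of times, without coordination.
    blob = json.dumps(value, sort_keys=True).encode()
    key = hashlib.sha256(blob).hexdigest()
    blob_store[key] = blob
    return key

def update(identity, value):
    # Only this tiny pointer write needs to be linearizable.
    pointer_store[identity] = put_value(value)

def lookup(identity):
    return json.loads(blob_store[pointer_store[identity]])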

Some operations can return fractional results or tolerate fractional availability. (E.g., search results, video or voice call quality, analytics, map tiles.) This lets you gracefully degrade the app experience while trying to recover, rather than suddenly crashing. Feature flags that turn off application features in response to system health metrics can be used to the same end.

Service owners should write client libraries for their services, and the libraries should include mock I/O to allow consumers to test locally without having to set up (and keep in sync) a local instance of the dependency. Include the library version in the request headers so you can track uptake of new versions and go after the slowpokes.

Particularly important when moving from monolithic to distributed architectures is to test everything, all the time. This includes correctness invariants of course, but also performance invariants. Simulate network failures, performance test against large data sets. Test against production data sets since they are always different from test data.

Always bound the size of queues in your application, or you can get in a situation where work takes arbitrarily long to complete. Monitor enqueue and dequeue rates on the queue (throughput) as well as the time it takes a message to be worked after being enqueued (latency).
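A minimal sketch of both points using Python’s standard library (handle() is a hypothetical message handler):

import queue
import time

work = queue.Queue(maxsize=1000)   # bounded: the queue can never grow without limit

def enqueue(item):
    # Block briefly, then refuse work (queue.Full) rather than queue it forever.
    work.put((time.monotonic(), item), timeout=1)

def worker():
    while True:
        enqueued_at, item = work.get()
        latency = time.monotonic() - enqueued_at   # time spent waiting in the queue
        handle(item)                               # hypothetical handler
        work.task_done()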

Acknowledgements

Thanks to Sebastian Galkin for organizing the event, Dylan Pittman for his notes and diagrams, and of course, Kyle Kingsbury for his time, energy and expertise.


Multidimensional Arrays with Mandoline

At the Climate Corporation, we have a great demand for storing large amounts of raster-based data, and an even greater demand to retrieve small amounts of it quickly.  Mandoline is our distributed, immutable, versioned database for storing big multidimensional data.  We use it for storing weather data, elevation data, satellite imagery, and other kinds of data.  It is one of the core systems that we use in production.

https://github.com/TheClimateCorporation/mandoline

What can Mandoline do for me?

Mandoline can store your multidimensional array data in a versionable way that doesn’t bloat your storage.  When you don’t know what your query pattern is, or when you want to preserve past versions of your data, Mandoline may be the solution for you.

Some more details

  • It’s a Clojure library.
  • Using the Mandoline library, we have built services both to access and to ingest data in a distributed fashion. We did it this way so that our scientists have easy, language-agnostic access to the data, and so that it plays well with existing scientific tools, like netCDF libraries and applications in any language.
  • When we say distributed, we mean that you can have distributed reads and writes from different machines to your dataset.
  • Mandoline uses swappable backends, and can save the actual data to different storage systems. In production, it currently runs on Amazon’s DynamoDB. For testing purposes, we can use either the SQLite or the in-memory backend.
  • We want to expand the backend offerings to other databases like Cassandra and HBase.
  • Mandoline takes advantage of shared structure to make immutability and versioning possible.

For more of an introduction to Mandoline (formerly known as Doc Brown), here is a video of the talk I gave at Clojure/West 2014.

 

Acknowledgements

Mandoline was primarily written by Brian Davis, Alice Liang, and Sebastian Galkin.  Brian Davis and Steve Kim have been the main push to open source Mandoline for use with the general public.


Service Oriented Authorization – Part 3

As a follow-up to Part 1 and Part 2, here is the video of the talk I gave at RailsConf last month.

Disclaimer: This is my first conference talk, ever!

Slides are available on Speaker Deck: https://speakerdeck.com/lumberj/authorization-in-a-service-oriented-environment


Rake tasks from a Rubygem

While recently working on IronHide, specifically the CouchDB Adapter, I needed a way to have the gem expose custom Rake tasks to an end user’s application.

For example, to make working with the IronHide CouchDB Adapter easier, I wanted to allow users to upload existing rules from disk to a remote CouchDB database.

The task could be something as simple as:

$ bundle exec rake iron_hide:load_rules\['/absolute/path/file.json','http://127.0.0.1:5984/rules'\]

where the two arguments are the path to the JSON file and the remote CouchDB database.

BUNDLER

After thinking for a moment, I knew that I’d come across at least one gem that, when included, exposes Rake tasks I can use in my project: Bundler. So, I started looking through the source and found this: https://github.com/bundler/bundler/blob/master/lib/bundler/gem_helper.rb

Essentially, you’d like an application owner to be able to “install” a few Rake tasks at their convenience. Borrowing from the Bundler project:

# iron_hide/couchdb_tasks.rb
module IronHide
  class CouchDbTasks
    include Rake::DSL if defined? Rake::DSL

    def install_tasks
      namespace :iron_hide do
        desc 'Load rules from local JSON file to remote CouchDB server'
        task :load_rules, :file, :db_server do |t, args|
          # upload the rules in args[:file] to the CouchDB server at args[:db_server]
        end
      end
    end
  end
end

IronHide::CouchDbTasks.new.install_tasks

Now, in my application’s Rakefile I just need to add this line:

# Rakefile
require 'iron_hide/couchdb_tasks'

Et voilà!

$ bundle exec rake -T
#=> rake iron_hide:load_rules[file,db_server]  # Load rules from local JSON

Indexing Polygons in Lucene with Accuracy

Today we share a post by Lucene Spatial contributor and guest author David Smiley.  David explains storing and querying polygon shapes in Lucene, including some brand new features to improve accuracy of spatial queries that The Climate Corporation helped get into the latest version.

Apache Lucene is a Java toolkit that provides a rich set of search capabilities like keyword search, query suggesters, relevancy, and faceting. It also has a spatial module for searching and sorting with geometric data using either a flat-plane model or a spherical model. For the most part, the capabilities therein are leveraged to varying degrees by Apache Solr and ElasticSearch–the two leading search servers based on Lucene.

Advanced spatial

The most basic and essential spatial capability is the ability to index points specified using latitude and longitude, and then be able to search for them based on a maximum distance from a given query point—often a user’s location, or the center of a map’s viewport. Most information-retrieval systems (e.g. databases) support this.

Many relational databases, such as PostgreSQL with its PostGIS extension, are mature and have advanced spatial support. I loosely define advanced spatial as the ability to both index polygons and query for them by another polygon. Outside of relational databases, very few information-retrieval systems have advanced spatial support. Some notable members of the advanced spatial club are Lucene, Solr, Elasticsearch, MongoDB and Apache CouchDB. In this article we’ll be reviewing advanced spatial in Lucene.

How it works…

This article describes how polygon indexing & search works in some conceptual detail, with pointers to specific classes. If you want to know literally how to use these features in Lucene, you should first be familiar with Lucene itself; find a basic tutorial if you aren’t. Then read the documentation for the Lucene-spatial API, and review the SpatialExample.java source, which shows some basic usage. Next, read the API Javadocs referenced within this article for the relevant classes. If instead you want to know how to use it from Solr and ElasticSearch, this isn’t the right article, not to mention that the new accurate-indexing addition isn’t hooked into those search servers yet, as of this writing.

Grid indexing with PrefixTrees

From Lucene 4 through 4.6, the only way to index polygons is to use Lucene-spatial’s PrefixTreeStrategy, which is an abstract implementation of SpatialStrategy — the base abstraction for how to index and search spatially. It has two subclasses, TermQueryPrefixTreeStrategy and RecursivePrefixTreeStrategy (RPT for short) — I always recommend the latter. The indexing scheme is fundamentally based on a model in which the world is divided into grid squares (AKA cells), and is done so recursively to get almost any amount of accuracy required. Each cell is indexed in Lucene with a byte string that has the parent cell’s byte string as a prefix — hence the name PrefixTree.

The PrefixTreeStrategy uses a SpatialPrefixTree abstraction that decides how to divide the world into grid-squares and what the byte encoding looks like to represent each grid square. There are two implementations:

  • geohash: This is strictly for latitude & longitude based coordinates. Each cell is divided into 32 smaller cells in an 8×4 or 4×8 alternating grid.
  • quad: Works with any configurable range of numbers. Each cell is divided into 4 smaller cells in a straight-forward 2×2 grid.

We’ll be using a geohash based PrefixTree for the examples in this article. For further information about geohashes, including a convenient table of geohash length to accuracy, see Wikipedia.

Indexing points

Let’s now see what the indexed bytes look like in the geohash SpatialPrefixTree grid for indexing a point. We’ll use Boston, Massachusetts: latitude 42.358, longitude -71.059. If we go to level 5, then we have a grid cell that covers the entire area ±2.4 kilometers from its center (about 19 square kilometers in this case):

DRT2Y cell

This will index as the following 5 “terms” in the underlying Lucene index:

D, DR, DRT, DRT2, DRT2Y

Terms are the atomic byte sequences that Lucene indexes on a per-field basis, and they map to lists of matching documents (i.e. records); in this case, a document representing Boston.

If we attempt to index any other hypothetical point that also falls within the cell’s boundary, it will be indexed using the same terms, effectively coalescing the two points for search purposes. Note that the original numeric literals are still returned in search results, since they are stored separately internally.
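Conceptually, the indexed terms for a cell are just the prefixes of its geohash. Lucene handles this internally, but the idea is as simple as the following toy sketch (not the Lucene API):

def prefix_terms(geohash):
    # 'drt2y' -> ['d', 'dr', 'drt', 'drt2', 'drt2y']
    return [geohash[:i] for i in range(1, len(geohash) + 1)]

prefix_terms('drt2y')   # the 5 terms indexed for the Boston cell above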

To get more search accuracy we need to take the PrefixTree to more levels. This scales easily: there aren’t that many terms to index per point. If you want accuracy of about a meter, then level 11 will do it, translating to 11 indexed terms. The search algorithms that efficiently search grid squares indexed in this way are interesting, but they’re not the focus of this article.

Indexing other shapes

The following screen-captures depict the leaf cells indexed for a polygon of Massachusetts to a grid level of 6 (±0.61km from center):

Massachusetts, gridded; a zoomed-in view of the grid.

The biggest cells here are at level 4, and the smallest are at 6. A leaf cell is a cell that is indexed at no smaller resolution, either because it doesn’t cross the shape’s edge, or it does but the accuracy is capped (configurable). There are 5,208 leaf cells here. Furthermore, when the shape is indexed, all the parent (non-leaf) cells get indexed too which adds another 339 cells that go all the way up to coarsest cell “D”. The current indexing scheme calls for leaf cells to be indexed both with and without a leaf marker — an additional special ‘+’ byte. The grand total number of terms to index is 10,755 terms. That’s a lot! Needless to say, we won’t list them here.

(Note: the extra indexed term per leaf will get removed in the future, once the search algorithms are adapted for this change. See LUCENE-4942.)

Theoretically, to get more accuracy we “just” need to go to deeper grid levels, just as we do for points. But unless the shape is small relative to the desired accuracy, this doesn’t scale; it’s simply infeasible with this indexing approach to represent a region covering more than a few square kilometers to meter-level accuracy. Each additional PrefixTree level multiplies the number of indexed terms compared to the previous level. The Massachusetts example went to level 6 and produced 5,208 leaf cells, but if it had stopped at level 5 there would have been only 463 leaf cells. With each additional level there will be, on average, 32 / 2 = 16 times as many cells (for geohash) as at the previous level. For a quad tree it’s 4 / 2 = 2 times as many, but it also takes more than one added quad-tree level to achieve the same accuracy, so the growth is effectively the same for either tree type.

Summary: unlike a point shape, for which the number of terms grows linearly with accuracy, polygon shapes produce a number of terms that grows exponentially with the number of levels. This clearly doesn’t scale for high-accuracy requirements.

Serialized shapes

Lucene 4.7 introduced a new Lucene-spatial SpatialStrategy called SerializedDVStrategy (SDV for short). It serializes each document’s vector geometry into a part of the Lucene index called DocValues. At query time, candidate results are verified as true or false matches by retrieving the geometry by doc-id on demand, deserializing it, and comparing it to the query shape.

As of this writing, there aren’t any adapters for ElasticSearch or Solr yet. The Solr feature request is tracked as SOLR-5728, and is probably going to arrive in Solr 4.9. The adapters will likely maintain in-memory caches of deserialized shapes to speed up execution.

Using both strategies

It’s critical to understand that SerializedDVStrategy is not a PrefixTree index; if used alone, SerializedDVStrategy will brute-force its way through every document, deserializing and comparing each one (O(N) complexity), and spatial performance will be terrible.

In order to minimize the number of documents that SerializedDVStrategy has to see, you should put a faster SpatialStrategy like RecursivePrefixTreeStrategy in front of it. And you can now safely dial-down the configurable accuracy of RPT to return more false-positives since SDV will filter them out. This is done with the distErrPct option, which is roughly the fraction of a shape’s approximate radius, and it defaults to 0.025 (2.5%). Given the Massachusetts polygon, RPT arrived at level 6. If distErrPct is changed to 0.1 (10%), my recommendation when used with SDV, the algorithm chose level 5 for Massachusetts, which had about 10% as many cells as level 6.

Performance and accuracy — the holy grail

What I’m most excited about is the prospect of further enhancing the PrefixTree technique so that it can differentiate matching documents that are guaranteed matches from those that need to be verified against the serialized geometry. Consider the following shape, a circle, acting as the query shape:

Gridded circle

If any grid cell that isn’t on the edge matches a document, then the PrefixTree guarantees it’s a match; there’s no point in checking such documents against SerializedDVStrategy (SDV). This insight should lead to a tremendous performance benefit, since only a small fraction of matching documents, often none, will need to be checked against SDV. Performance and accuracy! This feature is being tracked as LUCENE-5579.

One final note about accuracy — it is only as accurate as the underlying geometry. If you’ve got a polygon representation of an area with only 20 vertices, and another with 1,000 vertices for the same area, then the one with more vertices is likely to be a more precise reflection of the border of the underlying area. Secondly, you might want to work in a projected 2D coordinate system instead of latitude & longitude. Lucene-spatial / Spatial4j / JTS doesn’t help in this regard; it’s up to your application to convert the coordinates with something such as Proj4j. The main shortcoming of this approach is that a point-radius (circle) query will be computed in 2D, even if you wanted the geodesic (surface-of-a-sphere) behavior you normally get when you work in latitudes & longitudes. But if you are projecting, the distortion between the two is generally fairly limited, provided the radius isn’t large and the area isn’t near the poles.

Acknowledgements

Thanks to the Climate Corporation for supporting my recent improvements for accurate spatial queries!
