MapReduce for Large Geodatasets


Here’s an interesting video in which Google describes how they use MapReduce to build connectivity in their street data. In ESRI terminology, this is how they clean and build topology using parallel processing. They also briefly mention using it to render map tiles.

They don’t go into detail, but apparently those of us outside Google could do this sort of thing using Hadoop on Amazon EC2.
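Since the video doesn't go into detail, here is a rough guess at what the connectivity step might look like when phrased as map and reduce functions. This is a single-process sketch in plain Python, not Google's method and not Hadoop code. The function names, the `precision` snapping parameter, and the sample segments are all made up for illustration.

```python
from collections import defaultdict

def map_segment(seg_id, coords, precision=6):
    """Map step: emit (endpoint, segment id) for both ends of a segment.

    Rounding to `precision` decimals is an assumed way of snapping
    nearly identical endpoints to the same key.
    """
    for x, y in (coords[0], coords[-1]):
        yield (round(x, precision), round(y, precision)), seg_id

def reduce_node(endpoint, seg_ids):
    """Reduce step: every segment sharing an endpoint meets at one node."""
    return endpoint, sorted(set(seg_ids))

def build_connectivity(segments):
    """Run map, shuffle (group by key) and reduce in one process."""
    groups = defaultdict(list)
    for seg_id, coords in segments.items():
        for key, value in map_segment(seg_id, coords):
            groups[key].append(value)
    return dict(reduce_node(k, v) for k, v in groups.items())

# Hypothetical street segments: three of them meet at (1.0, 0.0).
segments = {
    "a": [(0.0, 0.0), (1.0, 0.0)],
    "b": [(1.0, 0.0), (1.0, 1.0)],
    "c": [(1.0, 0.0), (2.0, 0.0)],
}
nodes = build_connectivity(segments)
```

On a real cluster the grouping in `build_connectivity` would be done by the framework's shuffle phase, and the map and reduce functions would run on many machines at once.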

A challenge with tile caches is keeping them up to date with the vectors they depict. Here is how ESRI does it. I think ESRI needs to allow us to scale tile generation across a large number of CPUs the way Google does. The licensing model needs to allow this. It seems like open source geo software on a paid AMI could be coupled with Hadoop on EC2 to do this.

Once that happens, an agency like a state data center could rebuild tile caches on EC2/S3 nightly from, for example, a statewide vector layer of parcel maps.
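To make the nightly rebuild idea a bit more concrete, here is a minimal sketch of how it could be split into map and reduce steps: the map step assigns each changed feature to the tiles its bounding box touches, and the reduce step collects the features for each tile so that the tile can be re-rendered independently. It assumes the common Web Mercator XYZ tile scheme, which may not match a given tile cache, and all function names and sample parcels are hypothetical. Rendering itself is left out.

```python
import math
from collections import defaultdict

def lonlat_to_tile(lon, lat, zoom):
    """Convert lon/lat in degrees to XYZ tile indices (Web Mercator)."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so that points on the edge of the world stay in range.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)

def map_feature(feature_id, bbox, zoom):
    """Map step: emit (tile, feature id) for every tile the bbox touches."""
    min_lon, min_lat, max_lon, max_lat = bbox
    x0, y1 = lonlat_to_tile(min_lon, min_lat, zoom)  # tile y grows southward
    x1, y0 = lonlat_to_tile(max_lon, max_lat, zoom)
    for x in range(x0, x1 + 1):
        for y in range(y0, y1 + 1):
            yield (zoom, x, y), feature_id

def reduce_tile(tile, feature_ids):
    """Reduce step: one render job per tile (rendering not shown)."""
    return tile, sorted(feature_ids)

def dirty_tiles(changed, zoom):
    """Group changed features by tile, as the shuffle phase would."""
    groups = defaultdict(list)
    for feature_id, bbox in changed.items():
        for tile, fid in map_feature(feature_id, bbox, zoom):
            groups[tile].append(fid)
    return dict(reduce_tile(t, ids) for t, ids in groups.items())

# Hypothetical parcels edited since the last rebuild: (W, S, E, N) in degrees.
changed = {
    "parcel-1": (-10.0, -10.0, 10.0, 10.0),
    "parcel-2": (5.0, 5.0, 6.0, 6.0),
}
jobs = dirty_tiles(changed, zoom=1)
```

Because each tile ends up as its own reduce key, only tiles touched by an edit need to be rebuilt, and they can be rebuilt on as many machines as are available.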

I’ve heard rebuilding a geodatabase topology for the nationwide census takes over 24 hours. I bet a MapReduce approach would be much faster for this too.


2 comments so far

  1. Matt Blackler on

    Running large batch processes in parallel with MapReduce, such as those involving a large number of FeatureToRaster executions, seems like a very interesting opportunity. For a specific project at work, we are creating multiple national-scale rasters at 100m resolution from 50ish vector layers, some of which are buffered, and it can take up to 17 or 18 hours to execute!

    But each of the layers could be processed in parallel, which would immediately bring that 17 or 18 hour process down to about 20 minutes, assuming each layer takes the same time to process and gets its own machine. Quite an improvement in turnaround time!

    Further to this, some rather tricky programming could distribute the processing of each layer even further, so that subsets of a shapefile could be sent to numerous servers for rasterization and then combined at the end. A kind of nested MapReduce, if you will.

    So many opportunities, so little spare time at work to implement them!

  2. Kirk Kuykendall on

    Hey Matt –

    This sounds great, are the results online somewhere?

    “subsets of a shapefile” … hmmm, this would be easier if we went back to tiled vectors (anyone remember Librarian?). Tiled vectors are considered harmful last time I checked though. Maybe with a compelling benefit that might change.

