Archive for the ‘Amazon EC2’ Category

Parallel GeoProcessing

A large city here in Texas has done a highly detailed sidewalk inventory. In addition, they’ve created “missing sidewalk” segments representing places where sidewalks could be constructed. To prioritize construction, they are having me write a program that scores each missing sidewalk segment based on a variety of factors. Many of these factors involve proximity to other things like bus stops, office buildings, etc. While I don’t know the name for this, it seems like a common pattern for GIS modeling: for each sidewalk, buffer it, search for nearby things, and compute a score, which is basically just an application of Tobler’s first law of geography.
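Here’s a minimal sketch of that scoring loop, assuming Shapely 2.x geometries and an STRtree spatial index. The feature weights and buffer distance are made-up placeholders, not values from the actual project:

```python
from shapely.geometry import LineString, Point
from shapely.strtree import STRtree

# Hypothetical weights per feature class and search distance (not from the real project).
WEIGHTS = {"bus_stop": 3.0, "office": 1.5}
BUFFER_FT = 300

def score_segment(segment, features, index):
    """Buffer a missing-sidewalk segment and sum the weights of nearby features."""
    search_area = segment.buffer(BUFFER_FT)
    score = 0.0
    for i in index.query(search_area):        # candidate features from the spatial index
        geom, kind = features[i]
        if search_area.intersects(geom):      # confirm actual proximity, not just bbox overlap
            score += WEIGHTS.get(kind, 1.0)
    return score

# Tiny example: one segment, two features, only one of them within 300 ft.
features = [(Point(50, 10), "bus_stop"), (Point(400, 400), "office")]
index = STRtree([geom for geom, _ in features])
segment = LineString([(0, 0), (100, 0)])
print(score_segment(segment, features, index))   # -> 3.0
```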

The problem is that this is very slow. To make it faster, I’m considering parallelization – divide up the problem across many machines, each scoring sidewalks in a different area of the city. Once each area is complete, merge together the results from each area.

One machine would be the “master”; the rest would be scorers. The master would divide the city into rectangular areas and put job requests into an Amazon Simple Queue Service (SQS) queue. Each scorer machine would read this queue, score the area described by the job, write its results to the URL specified in the job, and post a result message to a second queue (also specified in the job). The master would read this result queue, fetch the results using the URL, and append them to the master result.
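A rough sketch of that queue plumbing, using boto3; the queue URLs, message fields, and S3 result locations below are all invented for illustration:

```python
import json
import boto3

sqs = boto3.client("sqs")
JOB_QUEUE = "https://sqs.us-east-1.amazonaws.com/123456789012/sidewalk-jobs"        # hypothetical
RESULT_QUEUE = "https://sqs.us-east-1.amazonaws.com/123456789012/sidewalk-results"  # hypothetical

def master_enqueue(areas):
    """Master: one job message per rectangular area of the city."""
    for i, bbox in enumerate(areas):
        sqs.send_message(
            QueueUrl=JOB_QUEUE,
            MessageBody=json.dumps({
                "job_id": i,
                "bbox": bbox,                                       # (xmin, ymin, xmax, ymax)
                "result_url": f"s3://my-bucket/results/{i}.json",   # hypothetical result location
                "result_queue": RESULT_QUEUE,
            }),
        )

def scorer_loop(score_area, upload):
    """Scorer: pull a job, score its area, upload results, announce completion."""
    while True:
        resp = sqs.receive_message(QueueUrl=JOB_QUEUE, MaxNumberOfMessages=1, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            job = json.loads(msg["Body"])
            results = score_area(job["bbox"])       # this is the part that hits the shared geodatabase
            upload(job["result_url"], results)      # e.g. write JSON to S3
            sqs.send_message(
                QueueUrl=job["result_queue"],
                MessageBody=json.dumps({"job_id": job["job_id"], "result_url": job["result_url"]}),
            )
            sqs.delete_message(QueueUrl=JOB_QUEUE, ReceiptHandle=msg["ReceiptHandle"])
```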

Since all scorer processes hit the same geodatabase, I suspect the geodatabase would become the bottleneck. Suppose I had an unlimited number of scorer machines at my disposal. Doubling their number would not cut execution time in half, but I wonder what the factor would be. What is the optimal area size? Certainly a tiny area with just a few sidewalks doesn’t make sense, but neither does a huge area. How can we determine the optimal size?
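One way to guess at the factor is Amdahl’s law: if some fraction of the work (the shared geodatabase I/O, the final merge) stays serial, the speedup flattens out no matter how many scorers are added. A back-of-envelope sketch, where the serial fraction is purely a guess:

```python
def amdahl_speedup(n_scorers, serial_fraction):
    """Upper bound on speedup when serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_scorers)

# If, say, 10% of the time is serialized behind the geodatabase (a made-up number),
# doubling from 8 to 16 scorers only improves things from about 4.7x to about 6.4x.
for n in (1, 2, 4, 8, 16, 32):
    print(n, round(amdahl_speedup(n, 0.10), 1))
```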

There is really no reason the master process needs to be on the same side of the firewall as the scorer processes. This means it should be possible to write a Web Service that allows 3rd parties to submit scoring jobs. For example, realtors could score each available property based on proximity to other features relevant to a particular homebuyer.
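A hedged sketch of what such a submission endpoint might look like; Flask is just one option, and the request fields and in-memory job store are invented:

```python
import uuid
from flask import Flask, request, jsonify

app = Flask(__name__)
JOBS = {}  # in-memory stand-in for real job tracking

@app.route("/jobs", methods=["POST"])
def submit_job():
    """Accept a scoring job from a third party: an area of interest plus scoring weights."""
    spec = request.get_json()
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"bbox": spec["bbox"], "weights": spec.get("weights", {}), "status": "queued"}
    # In the real system this would be pushed onto the SQS job queue described above.
    return jsonify({"job_id": job_id}), 202

@app.route("/jobs/<job_id>", methods=["GET"])
def job_status(job_id):
    return jsonify(JOBS.get(job_id, {"status": "unknown"}))
```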

What I’m proposing here sure sounds a lot like what Hadoop does. I wonder if we will ever see the day when we can use something like Hadoop for geoprocessing with ArcObjects on .NET?

Geoprocessing Sandboxes as a Service

Though I haven’t used it in production, I really like the ESRI geoprocessing server concept. ESRI’s approach involves authoring, publishing and consuming. However, from what I can tell, it requires authors to do their work inside the firewall.

I contend that authoring from outside the firewall would have business value.

For example, say I’m a data vendor with a very large geodatabase. My business model involves selling CPU cycles, data I/O, memory usage, etc. I’m too busy keeping my data accurate and current to mess with writing all sorts of geoprocessing models. Instead, I would like to provide a platform where others author the models and I simply charge them for the resources used.

Users would create an account with their credit card, similar to Amazon Web Services. Once they have an account, they could create .NET assemblies and submit them as jobs, perhaps via REST. The server would run the assembly within a sandbox, measuring database I/O, CPU cycles, memory usage, etc., and apply charges to the user’s account accordingly. By running inside the firewall, the assembly has much more efficient access to the DBMS, so that much more interesting things can be done.
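A very rough sketch of the metering side, assuming submitted jobs run as separate processes and are billed on CPU time and peak memory. The pricing rates and the example command are invented, and the sandboxing itself is out of scope here:

```python
import resource
import subprocess

# Invented rates, purely for illustration.
RATE_PER_CPU_SECOND = 0.001   # dollars
RATE_PER_MB_PEAK    = 0.0001  # dollars

def run_metered_job(command):
    """Run a submitted job in a child process and bill for the resources it used."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(command, check=True, timeout=3600)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)

    cpu_seconds = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    peak_mb = after.ru_maxrss / 1024.0            # ru_maxrss is reported in KB on Linux
    charge = cpu_seconds * RATE_PER_CPU_SECOND + peak_mb * RATE_PER_MB_PEAK
    return {"cpu_seconds": cpu_seconds, "peak_mb": peak_mb, "charge": round(charge, 4)}

# e.g. run_metered_job(["dotnet", "UserModel.dll"])   # hypothetical submitted assembly
```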

To make it easier for authors, I would provide Visual Studio plugins to assist with developing assemblies. Perhaps they could submit source code to the service and let the server compile it. Or maybe something like the new dynamic lookup would let me test against a small local sample database and then, once it works, submit it to the server. Or better yet, maybe a thin-client IDE that allows editing code on the server itself. Maybe ESRI could web-enable ModelBuilder?

The submitted job could be configured to stay alive, and the user could then, in turn, sell use of his service to his own subscribers. This would be similar to WeoGeo‘s concept, but with services instead of data.

In a way, this would be crowdsourcing the development of geoprocessing models. Doing something like this would allow ESRI to bring in more developers (authors) of ArcGIS Server applications. Right now the cost of ArcGIS Server is just too high, both in terms of license costs and administration. Perhaps Amazon will spatially enable SimpleDB so that, in conjunction with EC2, authors of services will have a low-cost platform.

MapReduce for Large Geodatasets


Here‘s an interesting video where Google describes how they use MapReduce to build connectivity in their street data. In ESRI terminology, this is how they clean and build topology using parallel processing. They also briefly mention using it to render map tiles.

They don’t go into detail, but apparently those of us outside Google could do this sort of thing using Hadoop on Amazon EC2.
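To make that concrete, here is a hedged sketch of the core idea as a Hadoop Streaming job in Python: the mapper emits each road segment keyed by its snapped endpoint coordinates, and the reducer groups segments that share an endpoint, which is the essence of building connectivity. The input field layout and snapping tolerance are assumptions, not Google’s actual scheme:

```python
#!/usr/bin/env python
# mapper.py -- input lines assumed to be: segment_id <tab> x1,y1 <tab> x2,y2
import sys

SNAP = 0.5  # assumed snapping tolerance in map units

def snap(coord):
    x, y = (float(v) for v in coord.split(","))
    return f"{round(x / SNAP) * SNAP:.1f},{round(y / SNAP) * SNAP:.1f}"

for line in sys.stdin:
    seg_id, start, end = line.rstrip("\n").split("\t")
    # Key each segment by both of its snapped endpoints.
    print(f"{snap(start)}\t{seg_id}")
    print(f"{snap(end)}\t{seg_id}")
```

And the matching reducer:

```python
#!/usr/bin/env python
# reducer.py -- Hadoop Streaming delivers mapper output sorted by key,
# so all segments sharing an endpoint arrive together.
import sys
from itertools import groupby

def key_of(line):
    return line.split("\t", 1)[0]

for node, lines in groupby(sys.stdin, key=key_of):
    segs = sorted({line.rstrip("\n").split("\t")[1] for line in lines})
    if len(segs) > 1:
        # These segments meet at this node: a connectivity (junction) record.
        print(f"{node}\t{','.join(segs)}")
```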

A challenge with tile caches is keeping them up to date with the vectors they depict. Here is how ESRI does it. I think ESRI needs to let us scale tile generation across a large number of CPUs the way Google does, and the licensing model needs to allow this. It seems like open-source geo software on a paid AMI could be coupled with Hadoop on EC2 to do this.
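The bookkeeping part, at least, is straightforward: given an edited feature’s bounding box, work out which tiles it touches at each level and only re-render those, fanning the re-render jobs out across machines. A sketch using standard web-Mercator (slippy map) tile arithmetic; the example bounding box is made up:

```python
import math

def lonlat_to_tile(lon, lat, zoom):
    """Standard web-Mercator (slippy map) tile index for a lon/lat at a zoom level."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.log(math.tan(lat_rad) + 1.0 / math.cos(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

def dirty_tiles(bbox, zoom):
    """All tiles at `zoom` touched by an edited feature's bounding box (lon/lat degrees)."""
    min_lon, min_lat, max_lon, max_lat = bbox
    x0, y1 = lonlat_to_tile(min_lon, min_lat, zoom)   # tile y grows southward
    x1, y0 = lonlat_to_tile(max_lon, max_lat, zoom)
    return [(zoom, x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)]

# A made-up parcel edit near Austin, TX; each (z, x, y) key becomes a re-render job
# that could be spread across many machines, SQS-style.
print(dirty_tiles((-97.75, 30.26, -97.73, 30.28), 15))
```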

Once that happens, an agency like a state data center could rebuild tile caches on EC2/S3 nightly from, for example, a statewide vector layer of parcel maps.

I’ve heard that rebuilding geodatabase topology for the nationwide census data takes over 24 hours. I bet a MapReduce approach would be much faster for this too.