Parallel GeoProcessing

A large city here in Texas has done a highly detailed sidewalk inventory. In addition, they've created "missing sidewalk" segments representing places where sidewalks could be constructed. To prioritize construction, they are having me write a program that scores each missing sidewalk segment based on a variety of factors. Many of these factors involve proximity to other things, like bus stops, office buildings, etc. While I don't know the name for this, it seems like a common pattern for GIS modeling: for each sidewalk, buffer it, search for nearby things, and compute a score. It's basically just an application of Tobler's first law of geography.
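Roughly, the per-segment scoring boils down to something like the sketch below. (It's just a sketch: the feature-class handles, buffer distance, and weight are placeholders, and the real program combines several factor layers, not just bus stops.)

```csharp
// Sketch of the buffer-then-search scoring pattern (ArcObjects, C#).
// Assumes ArcObjects is already licensed/initialized and that the caller
// has opened the feature classes; bufferDistance and weightPerStop are
// placeholder parameters, not the real scoring model.
using ESRI.ArcGIS.Geodatabase;
using ESRI.ArcGIS.Geometry;

public static class SidewalkScoring
{
    public static double Score(IFeature sidewalk, IFeatureClass busStops,
                               double bufferDistance, double weightPerStop)
    {
        // Buffer the missing-sidewalk segment (Tobler: nearby things matter most).
        ITopologicalOperator topo = (ITopologicalOperator)sidewalk.Shape;
        IGeometry searchArea = topo.Buffer(bufferDistance);

        // Spatial query for features that fall within the buffer.
        ISpatialFilter filter = new SpatialFilterClass();
        filter.Geometry = searchArea;
        filter.GeometryField = busStops.ShapeFieldName;
        filter.SpatialRel = esriSpatialRelEnum.esriSpatialRelIntersects;

        double score = 0.0;
        IFeatureCursor cursor = busStops.Search(filter, true);
        while (cursor.NextFeature() != null)
        {
            score += weightPerStop; // a real model would also weight by distance, stop type, etc.
        }
        System.Runtime.InteropServices.Marshal.ReleaseComObject(cursor);
        return score;
    }
}
```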

The problem is that this is very slow. To make it faster, I'm considering parallelization: divide the problem across many machines, each scoring sidewalks in a different area of the city, then merge the results once every area is complete.

One machine would be the "master"; the rest would be scorers. The master would divide the city into rectangular areas and put job requests into an Amazon Simple Queue Service (SQS) queue. Each scorer machine would read from this queue, score the area described by the job, write its results to the URL prescribed by the job, and then put a job-result message into a second queue (also prescribed by the job). The master would read this result queue, fetch the results using the URL, and append them to the master result.
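The job and job-result messages could be simple XML; something like this strawman (the element names and values are just placeholders, not a final protocol):

```xml
<!-- strawman job request the master drops into the SQS job queue -->
<ScoreJob>
  <JobId>17</JobId>
  <!-- rectangular area of the city this scorer is responsible for -->
  <Extent xmin="2089000" ymin="13650000" xmax="2094000" ymax="13655000" />
  <!-- where the scorer should write its scored segments -->
  <ResultUrl>http://example.com/results/job17.xml</ResultUrl>
  <!-- queue where the scorer should post its job-result message -->
  <ResultQueue>sidewalk-score-results</ResultQueue>
</ScoreJob>

<!-- strawman job-result message the scorer puts in the result queue -->
<ScoreJobResult>
  <JobId>17</JobId>
  <Status>Complete</Status>
  <ResultUrl>http://example.com/results/job17.xml</ResultUrl>
  <SegmentsScored>482</SegmentsScored>
</ScoreJobResult>
```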

Since all scorer processes would be hitting the same geodatabase, I suspect the geodatabase would become the bottleneck. Suppose I had an unlimited number of scorer machines at my disposal: doubling their number would not cut execution time in half, but I wonder what the factor would be. What is the optimal area size? Certainly a tiny area with just a few sidewalks doesn't make sense, but neither does a huge area. How can we determine the optimal size?
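One crude way to frame the scaling question, treating the shared geodatabase as a serial resource (Amdahl-style): if a fraction s of each job's time is spent waiting on the geodatabase, then with N scorers

    speedup ≈ 1 / (s + (1 - s) / N)

So if, say, 20% of a job is geodatabase I/O, 8 machines only buy about a 3.3x speedup, and no number of machines gets you past 5x. The same logic hints at the area-size tradeoff: areas should be big enough that per-job overhead (queue round trips, uploading results) is small compared to the scoring work, but small enough that there are many more jobs than machines, so the work stays balanced and one straggler area doesn't dominate the run.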

There is really no reason the master process needs to be on the same side of the firewall as the scorer processes. This means it should be possible to write a Web Service that allows 3rd parties to submit scoring jobs. For example, realtors could score each available property based on proximity to other features relevant to a particular homebuyer.

What I'm proposing here sure sounds a lot like what Hadoop does. I wonder if we will ever see the day when we can use something like Hadoop for geoprocessing with ArcObjects on .NET?

5 comments so far

  1. Scott Crawford on

    Funny you should mention this today. I’ve built a couple of systems like this lately. In fact I’m working on one now, which is why I stumbled onto your site. I’ve got a memory leak in an ArcObjects pdf generation service.

    I work for a small company that provides a framework for parallel processing called Appistry EAF (Enterprise Application Fabric). So far, I’ve built a batch geocoder that can process 50K addresses/second, and a linearly scalable PDF map generator (at least once I fix this leak!). Next up is Network Analyst.

    I’m deploying ArcGIS Engine Runtime onto our EAF platform (on windows). I build these “coarse-grained” GIS services in Java or C#. EAF provides automatic provisioning of the code to all available servers, load balances all of the requests in software, and provides automatic fail-over of failed requests.

    You are right about the geodatabase being a bottleneck, which is why so far I’ve used flat files and I deploy them to all of the servers in the fabric.

    I’m presenting at the ESRI FedUC next week.

  2. Dave on

    Very interesting. I would recommend using not only SQS (great idea), but also EC2 for your processing, sending the results back via SQS (bandwidth to SQS from within EC2 is free now), or even using the new Amazon distributed DB service and simply dumping the results out when you're done.

    Regardless, by sending results back via a queue you avoid a DB bottleneck. Use XML or YAML to define your own protocol for the jobs and the job results.

  3. Kirk Kuykendall on

    @Scott – wow that’s good geocoding throughput. Wish I could make it to FedUC, planning on going to PUG instead. EAF certainly looks interesting.

    After thinking more about the geodb bottleneck, I wonder if, instead of having each scorer node hit a geodb on the LAN, each scorer could create an in-memory workspace for its assigned area of the city. This assumes querying an in-memory workspace is faster, which I still need to verify.
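    If it pans out, creating the scratch workspace itself is the easy part; something like the sketch below (loading the clipped data into it, e.g. with IFeatureDataConverter, is what I'd still need to work out):

    ```csharp
    // Sketch: create an in-memory scratch workspace on a scorer node (ArcObjects).
    // Assumes ArcObjects is initialized/licensed; copying the node's clipped
    // feature data into the workspace is not shown.
    using ESRI.ArcGIS.DataSourcesGDB;
    using ESRI.ArcGIS.esriSystem;
    using ESRI.ArcGIS.Geodatabase;

    public static class ScratchWorkspace
    {
        public static IWorkspace Create()
        {
            IWorkspaceFactory factory = new InMemoryWorkspaceFactoryClass();
            IWorkspaceName wsName = factory.Create("", "scorerScratch", null, 0);
            return (IWorkspace)((IName)wsName).Open();
        }
    }
    ```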

    Good luck with the memory leak … are you calling IExporter.Cleanup?

    @Dave I've got a lot of existing .NET code I'd like to use, so I'm not currently planning on pursuing EC2. While I'm not a Microsoft fanatic, I do see them building a huge datacenter here in San Antonio. I also read about using WCF for high-performance computing: http://www.topxml.com/rbnews/WSCF/WCF/re-67968_New-HPC-WCF-Feature—-Use-Compute-Cluster-as-Scalable-SOA-Platform.aspx

    To me this means they must be planning a response to EC2 soon. If that doesn't happen I will look more closely at EC2 and brush the dust off those Java manuals.

  4. Scott Crawford on

    What is PUG?

    The in-memory workspace idea sounds good. For my map exporter service, I’m loading the mapdocument into memory (one per core), and for each service invocation, I simply do the feature query, zoom, and export.

    EAF has a feature called task affinity, which could help with in-memory workspace usage for apps like the one you describe. It basically allows jobs to land on machines based on the resources available on each machine. You could imagine starting with a central data repository (a geodb, for example). As jobs land on servers, they fetch the data and cache it locally. Then, as new jobs come in, if the data is already cached somewhere, the fabric can favor the server that already has some or all of it. Another possible solution would be something like Oracle Coherence (distributed data caching).

    I did find the leak (sort of). The previous version of my service actually loaded the mapdoc for each invocation (I started with a VB.Net sample that I found), but it never closed the mapdoc. Huge leak. Funny, I would have thought the .NET garbage collector would have released all of that… there were no more references to any of it in VB. That said, even with closing the doc (and IExporter.Cleanup() was already being called), there was still a slow leak. Thankfully, I wanted to preload the map once and keep it loaded anyway. Once I made that change, the leak went away. My next step was going to be invoking the exporter as a separate exe (and I'm glad I don't have to do that).

    I now have a linearly scalable PDF/JPEG map service, currently running on 10 AOpen boxes, each with a dual-core processor. The geocoding system was beefier: 16 Dell 1U servers, each with dual processors and dual cores (64 cores total).

  5. Kirk Kuykendall on

    Hey Scott –
    PUG is the Petroleum User Group meeting.
    http://www.esri.com/events/pug/index.html

    Yeah, I can see how deciding how long to keep something loaded in memory might be tricky. I guess Google's search engine faces the same issue. I've read somewhere (forget where) that few searches actually result in a disk access; most data is already cached in memory somewhere. I wonder how they do it.

    Kirk

