A large city here in Texas has done a highly detailed sidewalk inventory. In addition, they’ve created “missing sidewalk” segments representing places where sidewalks could be constructed. In order to prioritize construction, they are having me write a program that scores each missing sidewalk segment based on a variety of factors. Many of these factors involve proximity to other things like bus stops, office buildings, etc. While I don’t know a name for it, this seems like a common pattern in GIS modeling: for each sidewalk, buffer it, search for nearby things, and compute a score. Basically, it’s just an application of Tobler’s first law of geography.
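To make that concrete, here is a minimal sketch of the buffer-and-score step in Python with Shapely (2.x, where STRtree.query returns indices). The buffer distance, feature layers, and weights are all made up for illustration:

```python
# A minimal sketch of the buffer-and-score pattern, assuming Shapely 2.x.
# The feature data, buffer distance, and weights are hypothetical.
from shapely.geometry import LineString, Point
from shapely.strtree import STRtree

BUFFER_METERS = 400                            # hypothetical walking-distance buffer
WEIGHTS = {"bus_stop": 3.0, "office": 1.5}     # hypothetical factor weights

def score_segment(segment, features, kinds, tree):
    """Buffer a missing-sidewalk segment and sum the weights of nearby features."""
    search_area = segment.buffer(BUFFER_METERS)
    hits = tree.query(search_area)             # candidate indices from the spatial index
    return sum(
        WEIGHTS[kinds[i]]
        for i in hits
        if search_area.intersects(features[i])  # confirm the candidate really is in range
    )

# Toy data: one segment, two point features
features = [Point(100, 50), Point(900, 900)]
kinds = ["bus_stop", "office"]
tree = STRtree(features)
segment = LineString([(0, 0), (200, 0)])
print(score_segment(segment, features, kinds, tree))  # 3.0: only the bus stop is in range
```

The spatial index matters here: without it, every segment would be compared against every feature in the city.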
The problem is that this is very slow. To make it faster, I’m considering parallelization: divide the problem across many machines, each scoring sidewalks in a different area of the city, then merge the results from each area once they’re all complete.
One machine would be the “master”; the rest would be scorers. The master would divide the city into rectangular areas and put job requests into an Amazon Simple Queue Service (SQS) queue. Each scorer machine would read from this queue, score the area described by the job, write its results to the URL prescribed by the job, and then post a completion message to a results queue (also prescribed by the job). The master would read the results queue, fetch each result from its URL, and append it to the master result.
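Here is a rough sketch of that protocol with boto3. The queue URLs, the tiling, and score_area() are placeholder stand-ins; in reality score_area() would do the buffer-and-search against the geodatabase:

```python
# A minimal sketch of the master/scorer job protocol over SQS, using boto3.
# Queue URLs, the tiling scheme, and score_area() are hypothetical.
import json
import boto3

sqs = boto3.client("sqs")
JOB_QUEUE = "https://sqs.us-east-1.amazonaws.com/123456789012/score-jobs"        # hypothetical
RESULT_QUEUE = "https://sqs.us-east-1.amazonaws.com/123456789012/score-results"  # hypothetical

def score_area(bbox, result_url):
    """Hypothetical: buffer-and-score every sidewalk in bbox, write output to result_url."""
    ...

def master_enqueue(tiles):
    """Master: post one job message per rectangular tile of the city."""
    for tile in tiles:  # tile = (xmin, ymin, xmax, ymax)
        job = {"bbox": tile,
               "result_url": f"s3://scores/{tile[0]}_{tile[1]}.json",  # where the scorer writes
               "result_queue": RESULT_QUEUE}
        sqs.send_message(QueueUrl=JOB_QUEUE, MessageBody=json.dumps(job))

def scorer_loop():
    """Scorer: pull a job, score its area, announce the result, delete the job."""
    while True:
        resp = sqs.receive_message(QueueUrl=JOB_QUEUE,
                                   MaxNumberOfMessages=1, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            job = json.loads(msg["Body"])
            score_area(job["bbox"], job["result_url"])
            sqs.send_message(QueueUrl=job["result_queue"],
                             MessageBody=json.dumps({"done": job["result_url"]}))
            sqs.delete_message(QueueUrl=JOB_QUEUE,
                               ReceiptHandle=msg["ReceiptHandle"])
```

One nice property of this shape: a scorer that dies mid-job never deletes its message, so SQS hands the job to another scorer after the visibility timeout.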
Since all scorer processes are hitting the same geodatabase, I suspect the geodatabase would become the bottleneck. Suppose I had an unlimited number of scorer machines at my disposal. Doubling their number would not cut execution time in half, but I wonder what the factor would be. And what is the optimal area size? Certainly a tiny area with just a few sidewalks doesn’t make sense, but neither does a huge area. How can we determine the optimal size?
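One crude way to reason about it is to treat the geodatabase as a serialized resource, Amdahl-style. A back-of-the-envelope model (all constants invented, not measured) shows the speedup flattening once the database’s share of the work dominates:

```python
# A back-of-the-envelope scaling model, not a benchmark. All constants are made up.
# t_cpu: per-tile compute that parallelizes cleanly across scorers
# t_db:  per-tile geodatabase time that serializes at the shared database
# t_fix: fixed per-tile overhead (queue round trip, job setup)
def run_time(n_machines, n_tiles, t_cpu=10.0, t_db=2.0, t_fix=1.0):
    parallel = n_tiles * (t_cpu + t_fix) / n_machines  # spread across scorers
    serial = n_tiles * t_db                            # bottlenecked at the geodatabase
    return max(parallel, serial)

base = run_time(1, 100)
for n in (2, 4, 8, 16, 32):
    print(n, round(base / run_time(n, 100), 2))
# -> 2.0, 4.0, then flattens at 5.5 = (t_cpu + t_fix) / t_db
```

In this toy model the speedup tops out near (t_cpu + t_fix) / t_db no matter how many scorers you add. The fixed per-tile overhead t_fix is the term that argues against tiny areas (more tiles means more queue round trips), while huge areas hurt load balancing at the tail end of the run. The optimal size sits somewhere in between and is probably best found empirically on a sample of the city.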
There is really no reason the master process needs to be on the same side of the firewall as the scorer processes. This means it should be possible to write a Web Service that allows 3rd parties to submit scoring jobs. For example, realtors could score each available property based on proximity to other features relevant to a particular homebuyer.
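A sketch of what that front door could look like, here as a small Flask app (the endpoint shape and payload are hypothetical, and authentication is omitted entirely):

```python
# A minimal sketch of a job-submission web service, using Flask and boto3.
# The endpoint, payload fields, and queue URL are hypothetical; auth is omitted.
import json
import uuid
import boto3
from flask import Flask, request, jsonify

app = Flask(__name__)
sqs = boto3.client("sqs")
JOB_QUEUE = "https://sqs.us-east-1.amazonaws.com/123456789012/score-jobs"  # hypothetical

@app.route("/jobs", methods=["POST"])
def submit_job():
    """Accept a third-party scoring job and hand it to the master's queue."""
    spec = request.get_json()  # e.g. {"bbox": [...], "weights": {...}}
    job_id = str(uuid.uuid4())
    sqs.send_message(QueueUrl=JOB_QUEUE,
                     MessageBody=json.dumps({"id": job_id, **spec}))
    return jsonify({"id": job_id, "status": "queued"}), 202

if __name__ == "__main__":
    app.run(port=8080)
```

The interesting part is that third parties could submit their own weights, so the same scoring machinery serves the city’s sidewalk prioritization and the realtor’s homebuyer profile alike.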
What I’m proposing here sure sounds a lot like what Hadoop does. I wonder if we will ever see the day when we can use something like Hadoop for geoprocessing with ArcObjects on .NET?