Archive for the 'parallel geoprocessing' Category

Microsoft Dryad Presentation at Google

This is an interesting video: Microsoft presenting Dryad research to Google. Dryad is claimed to be a “superset” of MapReduce.

It seems like the complexity of Dryad could be hidden behind a ModelBuilder-like interface.

I wonder how valid it is to watch this video with ModelBuilder terminology in mind, replacing the word “vertex” with “tool” and “channel” with “intermediate data”. Note that, like earlier versions of geoprocessing, Dryad cannot handle cycles (loops).
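
To make the vertex/tool analogy concrete, here is a minimal Python sketch of a model as a directed acyclic graph. It is not Dryad's or ModelBuilder's API; the tool names are made up for illustration. Each key is a tool (a "vertex"), and each dependency stands for an intermediate dataset (a "channel"). The standard library's `graphlib` gives a valid run order and refuses a model that contains a loop.

```python
from graphlib import TopologicalSorter, CycleError

def execution_order(model):
    """Return one valid run order for a model given as {tool: set of upstream tools}.

    Raises CycleError if the model contains a loop.
    """
    return list(TopologicalSorter(model).static_order())

def has_cycle(model):
    """True if the model cannot be run as a DAG because it loops back on itself."""
    try:
        execution_order(model)
    except CycleError:
        return True
    return False

# Hypothetical model: tool names are illustrative only.
model = {
    "buffer_sidewalks": set(),
    "select_bus_stops": set(),
    "intersect": {"buffer_sidewalks", "select_bus_stops"},
    "summarize_scores": {"intersect"},
}

order = execution_order(model)
```

Tools with no dependency between them (here the buffer and the selection) are the ones a scheduler would be free to run in parallel.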

Spatial Data in Hypertable?

Realizing that the geodatabase would be the bottleneck in a parallel geoprocessing cluster, I read a bit on how Google manages its BigTable, and found Hypertable, a recently announced open source project modeled after BigTable.

Hypertable’s sponsor is Zvents, a company that specializes in local search. If they’re doing local search they must be into location-based services (LBS). I wonder if anyone there is looking into spatially enabling Hypertable? Their sample queries indicate they can do a lot with temporal envelopes (timestamps). It seems like this could be extended to handle spatial minimum bounding rectangles (MBRs).
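
The reason the extension seems plausible is that a timestamp range query is a one-dimensional overlap test, and an MBR query is the same test applied on two axes. The sketch below is plain Python and says nothing about Hypertable's actual query interface; the `MBR` type and function names are my own.

```python
from typing import NamedTuple

class MBR(NamedTuple):
    """Minimum bounding rectangle; field names are illustrative."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

def ranges_overlap(a_min, a_max, b_min, b_max):
    """Closed-interval overlap in one dimension (the timestamp case)."""
    return a_min <= b_max and b_min <= a_max

def mbr_intersects(a, b):
    """Two rectangles intersect when their ranges overlap on both axes."""
    return (ranges_overlap(a.xmin, a.xmax, b.xmin, b.xmax)
            and ranges_overlap(a.ymin, a.ymax, b.ymin, b.ymax))
```

How efficiently a store could answer such a query would depend on how rows are keyed, which this sketch does not address.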

Why GeoProcessing with ArcObjects .NET

This is just to follow up on Sean Gillies’ comments.

why you’d want to put proprietary per-server-licensed software in the mix – when the point of Hadoop is to leverage the combination of commodity hardware and open source – escapes me.

I know a lot of places that maintain large geodatabases. I’m thinking I could write ArcGIS Engine applications that would listen to queues for job requests and run them. The Engine licenses would only be $500 per seat (not per app). Also, a lot of sites have spare floating licenses they aren’t using at night. Scaling an existing ArcObjects-based app so that it runs in parallel seems a logical next step to more fully utilize these resources.

One thing that .NET has that Java is missing, as far as I can tell, is System.CodeDom.Compiler. This would allow a job to include source code that each node would download, compile, and run.
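
To show the shape of the idea without ArcObjects, here is a rough Python analogue: the job carries source text, and the node compiles it at run time and calls an agreed entry point. This uses Python's built-in `compile` and `exec`, not System.CodeDom.Compiler, and the `run` entry-point convention is an assumption of mine.

```python
def load_job(source, entry_point="run"):
    """Compile job source text and return its entry-point function.

    Note: exec runs the submitted code with the node's full privileges,
    so this is only safe for trusted source.
    """
    namespace = {}
    code = compile(source, "<job>", "exec")
    exec(code, namespace)
    return namespace[entry_point]

# Hypothetical job payload: computes the area of a rectangular extent.
job_source = """
def run(area):
    xmin, ymin, xmax, ymax = area
    return (xmax - xmin) * (ymax - ymin)
"""

run = load_job(job_source)
```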

I’m using the term geoprocessing here in the general sense - code that processes geodatasets located in a geodatabase without crossing a firewall.

Imagine a website where you submit a job containing C# code that you wrote. For example: create a list of the top 100 properties for sale anywhere in the US, ranked by a score. Compute the score as an inverse-distance-weighted sum: 1/miles^2 for the nearest Starbucks, plus 3/miles^2 for each Home Depot or Lowe’s. Put the result at this URL (an Amazon S3 bucket). The master node would split this up and run it on multiple scoring machines, combine the results, and put them into the S3 bucket.

Since we want the top 100 overall, the final ranking is something the master node can only determine after every scoring node has completed. So the job would include two different code chunks - one for the master, and the other for the scorers.
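
Here is a minimal Python sketch of those two chunks (the real thing would be C# against a geodatabase). It assumes coordinates are planar and already in miles, and it adds a distance floor, `MIN_MILES`, so that a store right next door does not cause a division by zero. Both are my assumptions, as are all the function names.

```python
import heapq
import math

MIN_MILES = 0.05  # assumed floor; not part of the original scoring rule

def miles(a, b):
    """Planar distance between two (x, y) points already expressed in miles."""
    return max(math.dist(a, b), MIN_MILES)

def score_property(prop, starbucks, hardware_stores):
    """1/d^2 for the nearest Starbucks plus 3/d^2 for every hardware store."""
    score = 0.0
    if starbucks:
        nearest = min(miles(prop, s) for s in starbucks)
        score += 1.0 / nearest ** 2
    score += sum(3.0 / miles(prop, h) ** 2 for h in hardware_stores)
    return score

def scorer_top_n(properties, starbucks, hardware_stores, n=100):
    """Scorer chunk: return this node's best n (score, property_id) pairs."""
    scored = ((score_property(xy, starbucks, hardware_stores), pid)
              for pid, xy in properties.items())
    return heapq.nlargest(n, scored)

def master_merge(partial_results, n=100):
    """Master chunk: merge every scorer's partial list into a global top n."""
    return heapq.nlargest(n, (pair for part in partial_results for pair in part))
```

Each scorer only needs to send back its own top n, because any property in the global top n must also be in the top n of the node that scored it.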

I can’t imagine anyone would ever take the effort to publish a traditional geoprocessing service that does this. Maybe geoprocessing isn’t the right word; maybe we should call it geocompiling, since we are sending it uncompiled source. Or maybe it would make more sense to have a domain-specific language that the master node compiles into IL. More later.

Parallel GeoProcessing

A large city here in Texas has done a highly detailed sidewalk inventory. In addition, they’ve created “missing sidewalk” segments representing places where sidewalks could be constructed. To prioritize construction, they are having me write a program that scores each missing sidewalk segment based on a variety of factors. Many of these factors involve proximity to other things like bus stops, office buildings, etc. While I don’t know the name for this, it seems like a common pattern in GIS modeling: for each sidewalk, buffer it, search for nearby things, and compute a score. It is basically just an application of Tobler’s first law of geography.

The problem is that this is very slow. To make it faster, I’m considering parallelization - divide up the problem across many machines, each scoring sidewalks in a different area of the city. Once each area is complete, merge together the results from each area.

One machine would be the “master”; the rest would be scorers. The master would divide up the city into rectangular areas and put job requests into an Amazon Simple Queue Service queue. Each scorer machine would read this queue, score the area described by the job, write the results to the URL prescribed by the job, and post a result message to a different queue (also prescribed by the job). The master would read this second queue, fetch the results using the URL, and append them to the master result.
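
The message flow can be sketched in a single process, with Python's `queue.Queue` standing in for the two SQS queues and a dict standing in for whatever sits behind the result URLs. None of this is the SQS API; the message fields (`area`, `result_url`) and function names are hypothetical, and the result queue is passed in directly rather than named in the job.

```python
import queue

def tile_extent(xmin, ymin, xmax, ymax, cols, rows):
    """Split a rectangular extent into cols x rows rectangular job areas."""
    dx = (xmax - xmin) / cols
    dy = (ymax - ymin) / rows
    return [(xmin + c * dx, ymin + r * dy, xmin + (c + 1) * dx, ymin + (r + 1) * dy)
            for r in range(rows) for c in range(cols)]

def master_submit(jobs, extent, cols, rows):
    """Master: enqueue one job per tile, each naming where its results should go."""
    tiles = tile_extent(*extent, cols, rows)
    for i, tile in enumerate(tiles):
        jobs.put({"area": tile, "result_url": f"results/tile-{i}"})
    return len(tiles)

def scorer_run(jobs, results, storage, score_area):
    """Scorer: drain the job queue, store each result, then announce it."""
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return
        storage[job["result_url"]] = score_area(job["area"])
        results.put({"result_url": job["result_url"]})

def master_collect(results, storage, expected):
    """Master: read result messages, fetch each result, and append it."""
    merged = []
    for _ in range(expected):
        message = results.get()
        merged.extend(storage[message["result_url"]])
    return merged

# Usage with a placeholder scoring function that just echoes its area.
jobs, results, storage = queue.Queue(), queue.Queue(), {}
n = master_submit(jobs, (0, 0, 10, 10), cols=2, rows=2)
scorer_run(jobs, results, storage, lambda area: [area])
merged = master_collect(results, storage, n)
```

In a real deployment several scorers would run `scorer_run` concurrently on different machines; here one call drains the queue so the example stays deterministic.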

Since all scorer processes are hitting against the same geodatabase, I suspect the geodatabase would become the bottleneck. Suppose I had an unlimited number of scorer machines at my disposal. Doubling the number of them would not cut execution time in half, but I wonder what the factor would be? What is the optimal area size? Certainly having a tiny area with just a few sidewalks doesn’t make sense, but neither would a huge area. How can we determine optimal size?
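
One rough way to reason about the doubling question is Amdahl's law, which is a model I am borrowing here, not something I have measured. It assumes some fixed fraction of the work (for example, time spent waiting on the shared geodatabase) cannot be spread across scorers. The `serial_fraction` parameter is hypothetical and would have to come from profiling.

```python
def speedup(scorers, serial_fraction):
    """Amdahl's law: speedup over one scorer when a fraction of work is serialized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / scorers)

def doubling_factor(scorers, serial_fraction):
    """How much faster the job gets when the scorer count doubles."""
    return speedup(2 * scorers, serial_fraction) / speedup(scorers, serial_fraction)
```

Under this model, if 20% of the work were serialized, going from 4 to 8 scorers would give about a 1.33x improvement rather than 2x, and no number of scorers would ever beat 5x overall. It says nothing about the optimal area size, which also depends on per-job overhead.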

There is really no reason the master process needs to be on the same side of the firewall as the scorer processes. This means it should be possible to write a Web Service that allows 3rd parties to submit scoring jobs. For example, realtors could score each available property based on proximity to other features relevant to a particular homebuyer.

What I’m proposing here sure sounds a lot like what Hadoop does. I wonder if we will ever see the day when we can use something like Hadoop for geoprocessing with ArcObjects on .NET?

Geoprocessing Sandboxes as a Service

Though I haven’t used it in production, I really like the ESRI geoprocessing server concept. ESRI’s approach involves authoring, publishing and consuming. However, from what I can tell, it requires authors to do their work inside the firewall.

I contend that authoring from outside the firewall would have business value.

For example, say I’m a data vendor with a very large geodatabase. My business model involves selling CPU cycles, data I/O, memory usage, etc. I’m too busy keeping my data accurate and current to mess with writing all sorts of geoprocessing models. Instead, I would like to provide a platform where others author the models and I simply charge them for resources used.

Users would create an account with their credit card, similar to Amazon Web Services. Once they have an account they may create .NET assemblies and submit them as a job, perhaps via REST. The server runs the assembly within a sandbox, measuring database I/O, CPU cycles, memory usage, etc., and applies charges to the user’s account accordingly. By running within the firewall, the assembly has more efficient access to the DBMS, so that much more interesting things may be done.
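
The metering half of this can be sketched in a few lines of Python. This only measures a job; it does not isolate it, so it is not a sandbox, and it does not count database I/O. The price list in `RATES` is invented purely for illustration.

```python
import time
import tracemalloc

# Hypothetical price list; no actual rates are proposed in this post.
RATES = {"cpu_second": 0.10, "peak_mb": 0.01}

def run_metered(job, *args):
    """Run a job callable and return (result, usage) with CPU seconds and peak MB."""
    tracemalloc.start()
    cpu_start = time.process_time()
    try:
        result = job(*args)
    finally:
        # Record usage even if the job raised, then stop tracing.
        cpu_seconds = time.process_time() - cpu_start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    usage = {"cpu_second": cpu_seconds, "peak_mb": peak_bytes / 1_000_000}
    return result, usage

def charge(usage, rates=RATES):
    """Total charge for a usage record under the given rates."""
    return sum(usage[key] * rates[key] for key in rates)
```

A real service would need process-level or OS-level isolation around the submitted assembly, with the metering done from outside the job rather than in the same process.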

To make it easier for authors, I would provide Visual Studio plugins to assist the development of assemblies. Perhaps they submit source code to the service and let the server compile it. Or maybe something like the new dynamic lookup would allow them to test with a small local sample database, then submit to the server once they got it working. Or better yet, maybe a thin-client IDE that allows editing of code on the server itself. Maybe ESRI could web-enable ModelBuilder?

The submitted job could be configured to stay alive, and the user could then, in turn, sell use of this service to their own subscribers. This would be similar to WeoGeo’s concept but with services instead of data.

In a way, this would be crowdsourcing the development of geoprocessing models. Doing something like this would allow ESRI to bring in more developers (authors) of ArcGIS Server applications. Right now the cost of ArcGIS Server is just too high, both in terms of license costs and administration. Perhaps Amazon will spatially enable SimpleDB so that, in conjunction with EC2, authors of services will have a low-cost platform.