Archive for the ‘parallel geoprocessing’ Category

Microsoft Dryad Presentation at Google

This is an interesting video: Microsoft presenting Dryad research to Google. Dryad is claimed to be a “superset” of MapReduce.

It seems like the complexity of Dryad could be hidden behind a ModelBuilder-like interface.

I wonder how valid it is to watch this video with ModelBuilder terminology in mind, replacing the word “vertex” with “tool” and “channel” with “intermediate data”. Note that, like earlier versions of geoprocessing, Dryad cannot handle cycles (loops).
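To make the vertex/tool analogy concrete, here is a minimal Python sketch of running a model as an acyclic graph. It is not Dryad's or ModelBuilder's API: `run_model`, the tool names, and the dictionary layout are all made up for illustration. Each "tool" runs once its upstream tools have produced their "intermediate data", and a model containing a loop is rejected.

```python
from graphlib import TopologicalSorter, CycleError  # Python 3.9+


def run_model(tools, inputs_of):
    """Run every tool after the tools it depends on (hypothetical helper).

    tools:     dict of tool name -> function taking a list of upstream outputs
    inputs_of: dict of tool name -> list of upstream tool names
    Raises CycleError if the model contains a loop.
    """
    # Materialize the order first so a cycle is reported before any tool runs.
    order = list(TopologicalSorter(inputs_of).static_order())
    intermediate = {}  # the "channels": outputs keyed by the producing tool
    for name in order:
        upstream = [intermediate[dep] for dep in inputs_of.get(name, [])]
        intermediate[name] = tools[name](upstream)
    return intermediate


# Toy model: load -> scale -> summarize (stand-ins for real geoprocessing tools).
tools = {
    "load": lambda upstream: [1, 2, 3],
    "scale": lambda upstream: [x * 10 for x in upstream[0]],
    "summarize": lambda upstream: sum(upstream[0]),
}
inputs_of = {"load": [], "scale": ["load"], "summarize": ["scale"]}
result = run_model(tools, inputs_of)
```

Because the execution order comes from a topological sort, independent tools could in principle be handed to different machines, which is roughly the appeal of the Dryad model.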


Spatial Data in HyperTable?


Realizing that the geodatabase would be the bottleneck in a parallel geoprocessing cluster, I read a bit about how Google manages its BigTable and found HyperTable, a recently announced open source project modeled after BigTable.

HyperTable’s sponsor is Zvents, a company that specializes in local search. If they’re doing local search, they must be into LBS. I wonder if anyone there is looking into spatially enabling HyperTable? Their sample queries indicate they can do a lot with temporal envelopes (timestamps). It seems like this could be extended to handle spatial minimum bounding rectangles (MBRs).
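The reason the extension seems plausible is that an MBR test is just the one-dimensional interval test applied once per axis. The sketch below illustrates that idea in plain Python; it does not use any HyperTable API, and the `Interval` and `MBR` types are assumptions made up for the example.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A closed 1-D range, e.g. a pair of timestamps (hypothetical type)."""
    start: float
    end: float


class MBR(NamedTuple):
    """A minimum bounding rectangle (hypothetical type)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intervals_overlap(a: Interval, b: Interval) -> bool:
    # Two closed ranges overlap when each one starts before the other ends.
    return a.start <= b.end and b.start <= a.end


def mbrs_intersect(a: MBR, b: MBR) -> bool:
    # A rectangle test is the interval test applied to x and to y.
    return (intervals_overlap(Interval(a.xmin, a.xmax), Interval(b.xmin, b.xmax))
            and intervals_overlap(Interval(a.ymin, a.ymax), Interval(b.ymin, b.ymax)))
```

Whether a range scan over such keys would be efficient in a BigTable-style store is a separate question that this sketch does not address.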

Why GeoProcessing with ArcObjects .NET

This is just to follow up on Sean Gillies’ comments.

why you’d want to put proprietary per-server-licensed software in the mix – when the point of Hadoop is to leverage the combination of commodity hardware and open source – escapes me.

I know a lot of places that maintain large geodatabases. I’m thinking I could write ArcGIS Engine applications that would listen to queues for job requests and run them. The Engine licenses would only be $500 per seat (not per app). Also, a lot of sites have spare floating licenses they aren’t using at night. Scaling an existing ArcObjects-based app so that it runs in parallel seems a logical next step to more fully utilize these resources.

One thing that .NET has that Java is missing, as far as I can tell, is System.CodeDom.Compiler. This would allow a job to include source code that each node would download, compile, and run.

I’m using the term geoprocessing here in the general sense – code that processes geodatasets located in a geodatabase without crossing a firewall.

Imagine a website where you send a job containing C# code that you wrote. For example: create a list of the top 100 properties for sale anywhere in the US, ranked by a score. Compute the score as an inverse distance weighted sum: 1/miles^2 for the nearest Starbucks, plus 3/miles^2 for each Home Depot or Lowe’s. Put the result at this URL (an Amazon S3 bucket). The master node would split the job up, run it on multiple scoring machines, combine the results, and put them into the S3 bucket.
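The score described above could look something like the following. This is only a sketch, written in Python rather than C# for brevity; the function names, the haversine distance, and the `min_miles` floor (which avoids dividing by zero when a store sits on the property) are my own assumptions, not part of any existing service.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def miles_between(a, b):
    """Great-circle (haversine) distance in miles between (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def score(prop, starbucks, hardware_stores, min_miles=0.1):
    """Inverse distance weighted score for one property (hypothetical helper).

    1/miles^2 for the nearest Starbucks, plus 3/miles^2 for each hardware store.
    """
    def weight(w, store):
        d = max(miles_between(prop, store), min_miles)  # assumed floor on distance
        return w / d ** 2

    # The nearest Starbucks is the one that contributes the largest weight.
    nearest = max((weight(1.0, s) for s in starbucks), default=0.0)
    return nearest + sum(weight(3.0, h) for h in hardware_stores)
```

A real job would presumably read the store locations from the geodatabase and use a spatial index instead of scanning every store for every property.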

Since we want the top 100, the final ranking is something the master node can only determine after every scoring node has completed. So the job would include two different code chunks – one for the master, and the other for the scorers.
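A rough sketch of those two chunks, again in Python for brevity and with made-up function names: each scorer returns only its local top N, and the master merges those short lists. This works because any property in the global top N must also be in the top N of whichever node scored it.

```python
import heapq


def scorer_top_n(properties, score_fn, n=100):
    """Scorer chunk: score one partition and keep only its best n.

    properties: dict of property id -> property record (hypothetical layout)
    Returns a list of (score, property id) pairs, best first.
    """
    return heapq.nlargest(n, ((score_fn(p), pid) for pid, p in properties.items()))


def master_top_n(partial_results, n=100):
    """Master chunk: merge the scorers' local lists into the global best n."""
    return heapq.nlargest(n, (pair for partial in partial_results for pair in partial))


# Toy run with two partitions, n=2, and the record itself used as the score.
part_a = scorer_top_n({"a": 1, "b": 5}, score_fn=lambda p: p, n=2)
part_b = scorer_top_n({"c": 3, "d": 4}, score_fn=lambda p: p, n=2)
top = master_top_n([part_a, part_b], n=2)
```

Each node ships at most N rows back to the master, so the merge step stays cheap no matter how many properties each scorer handled.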

I can’t imagine anyone would ever take the effort to publish a traditional geoprocessing service that does this. Maybe geoprocessing isn’t the right word; maybe we should call it geocompiling, since we are sending it uncompiled source. Or maybe it would make more sense to have the master node compile a domain specific language into IL. More later.