Archive for February, 2008

LambdaExpressions for Spatial Operators

I’ve been reading a bit about ExpressionTrees:

Expression trees were created in order to make the task of converting code such as a query expression into a string that can be passed to some other process and executed there. It is that simple. There is no great mystery here, no magic wand that needs to be waved. One simply takes code, converts it into data, and then analyzes the data to find the constituent parts that will be translated into a string that can be passed to another process. - Charlie Calvert.

Isn’t this essentially what ModelBuilder does?

I think I’ll back off from CodeDOM and focus on this instead.

It seems like it would be interesting to create LambdaExpressions for spatial operators. I wonder if lambda expressions could be used to resurrect the DOCELL capability that existed in workstation ARC/INFO but was never ported to ArcGIS.
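To make the "code converted into data, then into a string" idea concrete, here is a minimal sketch. It uses Python’s standard ast module rather than .NET expression trees, purely for brevity. The function name to_where and the field names in the sample expression are made up for illustration, and only a few comparison and boolean operators are handled.

```python
import ast

# Map Python comparison nodes to SQL-style operators (illustrative subset).
_OPS = {ast.Gt: ">", ast.Lt: "<", ast.GtE: ">=", ast.LtE: "<=",
        ast.Eq: "=", ast.NotEq: "<>"}


def to_where(node):
    """Walk a parsed expression tree and emit a WHERE-clause string."""
    if isinstance(node, ast.Expression):
        return to_where(node.body)
    if isinstance(node, ast.BoolOp):
        joiner = " AND " if isinstance(node.op, ast.And) else " OR "
        return "(" + joiner.join(to_where(v) for v in node.values) + ")"
    if isinstance(node, ast.Compare) and len(node.ops) == 1:
        op = _OPS[type(node.ops[0])]
        return f"{to_where(node.left)} {op} {to_where(node.comparators[0])}"
    if isinstance(node, ast.Name):
        return node.id            # treated as a field name
    if isinstance(node, ast.Constant):
        return repr(node.value)   # numbers and quoted strings
    raise ValueError(f"unsupported expression node: {type(node).__name__}")


# The code is first turned into data (a tree), then the tree into a string.
tree = ast.parse("pop > 1000 and state == 'TX'", mode="eval")
where_clause = to_where(tree)
```

A spatial operator would presumably be one more node type in the same walk, translated into whatever predicate the target process understands.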

Petroleum User Group (PUG) 2008 Day 1 morning

Where are the Users?
While I worked on a project for EIA a few years ago, I can’t really say I’m a petroleum user. OK, I used way too much petroleum driving here in my gas guzzling truck. (My excuse: I work from a home office and rarely drive my truck except to haul soil so my wife can grow vegetables.)

“Adding technology to a bad process only adds speed to a very bad process”.
Jim Geringer, former Gov. of Wyoming, now with ESRI, at FedUC in Washington DC.

Everyone here at the PUG is very focused on adding speed to oil exploration and production. Everyone knows having all our economic processes rely on cheap petroleum is a bad idea. It’s not our job to worry about peak oil though - someone in Washington should be figuring that one out.

Enough politics, here’s the beef.

There are over 1700 people registered for the PUG, up from 1200 or so last year. This is probably the 6th PUG I’ve attended. The first was back in 1991, the last was in 2006. Each time there seem to be fewer actual end users (geologists, geophysicists, etc.) and more IT people. By show of hands it looks like more than half were first time attendees. I’d bet the proportion of new attendees that are actual end users is smaller than in the group as a whole.

Clint Brown started things off saying the meeting is “all about you, the user”. He then mentioned that 9.3 is focused on quality and, without mentioning Google, said 9.4 will focus on faster display in ArcGIS Server and ease of use. I admit, before I drove in I avoided ESRI’s link offering map and directions, and instead opted to go first to the hotel’s web page, where they use Virtual Earth.

While there may not be many users here, there is a lot of talk about users. Once upon a time presenters would say “I did this” or “I did that”. This year I hear a lot of “our users would like to do X”, or “users just want Y”. The end users I spoke with here seem to be overwhelmed by all the IT discussions. Unfortunately, John Calkins was not here today. I always liked his keen understanding of the users’ perspective.

OpenSpirit did a good demo showing how one can drag and drop data from other apps into ArcGIS Explorer, and how their custom task can “broadcast” messages to other apps on the desktop.

Job Tracking
I always thought JTX stood for Job Tracking Extension, but it is not an extension - it is a standalone application. Rob Brooks and Kim Kearns gave a good demo of how JTX provides something like a ModelBuilder UI for workflow, allowing analysis of High Consequence Areas surrounding gas pipelines. I wish more people were analyzing the consequences of relying on petroleum to begin with. Sorry, no more politics, I promise. I suspect JTX uses Windows Workflow Foundation, so I wonder how hard it would be to roll your own versus buying JTX.

9.3 Notes
None of the other EDN subscribers have received their beta disks yet either, so I’m not the only one wondering how ESRI is going to meet a May ship date for 9.3. Anyway, when it does come I won’t be able to blog about it, but until that happens, I suppose I can. It will have much better support for KML display, including regionated KML. Automated crash reporting will provide quicker turnaround for bug fixes. If you’ve ever waited for a support request to make it through the queue, you realize this is a big improvement. Clint says a lot of attention has been put into “re-architecting” OGC compliance.

Dal Hunter went through an Author, Serve and Use demo, then later a SharePoint demo. More details on that later.

Surveillance Becomes an Art Form

MIT’s NYTE project has caught the attention of New York’s Museum of Modern Art, who plan to exhibit some of the graphics. The data used for this graphic sure looks like what AT&T might use when profiling neighborhoods to track down terrorist cells.

MySQL vs. PostGIS

I’ve read some of Paul Ramsey’s response to the Timmy’s Telethon discussion. While this is very helpful, I would like to see discussion of MySQL vs. PostGIS.

As someone who’s just comparing these two toolsets, I feel like a shopper trying to choose between HD-DVD / Blu Ray (in Dec 2007).

Is there a study somewhere that compares these?

Spatial Data in HyperTable?

Realizing that the geodatabase would be the bottleneck in a parallel geoprocessing cluster, I read a bit on how Google manages its BigTable, and found HyperTable, a recently announced open source project modeled after BigTable.

HyperTable’s sponsor is Zvents, a company that specializes in local search. If they’re doing local search they must be into LBS. I wonder if anyone there is looking into spatially enabling HyperTable? Their sample queries indicate they can do a lot with temporal envelopes (timestamps); it seems like this could be extended to handle spatial minimum bounding rectangles (MBRs).
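The step from a temporal envelope to an MBR is small: an MBR is just two intervals, one per axis, and two MBRs intersect when both pairs of intervals overlap. A minimal sketch in Python follows. The class names Interval and MBR are my own for illustration and have nothing to do with HyperTable’s actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A closed 1-D range, e.g. a pair of timestamps or an x extent."""
    start: float
    end: float

    def overlaps(self, other):
        # Two closed intervals overlap when each starts before the other ends.
        return self.start <= other.end and other.start <= self.end


@dataclass(frozen=True)
class MBR:
    """A minimum bounding rectangle expressed as one interval per axis."""
    x: Interval
    y: Interval

    def intersects(self, other):
        # The same overlap test as the temporal case, applied to both axes.
        return self.x.overlaps(other.x) and self.y.overlaps(other.y)


a = MBR(Interval(0, 2), Interval(0, 2))
b = MBR(Interval(1, 3), Interval(1, 3))
c = MBR(Interval(5, 6), Interval(0, 2))
```

Here a.intersects(b) is true and a.intersects(c) is false, since c overlaps a in y but not in x.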

Why GeoProcessing with ArcObjects .NET

This is just to follow up on Sean Gillies’ comments.

why you’d want to put proprietary per-server-licensed software in the mix – when the point of Hadoop is to leverage the combination of commodity hardware and open source – escapes me.

I know a lot of places that maintain large geodatabases. I’m thinking I could write ArcGIS Engine applications that would listen to queues for job requests, and run them. The Engine licenses would only be $500 per seat (not per app). Also, a lot of sites have spare floating licenses they aren’t using at night. Scaling an existing ArcObjects-based app so that it runs in parallel seems a logical next step to more fully utilize these resources.

One thing that .NET has that Java is missing, as far as I can tell, is System.CodeDom.Compiler. This would allow a job to include source code that each node would download and run.

I’m using the term geoprocessing here in the general sense - code that processes geodatasets located in a geodatabase without crossing a firewall.

Imagine a website where you send it a job with C# code that you wrote. For example, create me a list of the top 100 properties available for sale anywhere in the US, ranked by a score. Determine the score as a sum of inverse distance weighted terms: 1/miles^2 for the nearest Starbucks, plus 3/miles^2 for each Home Depot or Lowe’s. Put the result of this at this URL (an Amazon S3 bucket). The master node would split this up and run it on multiple scoring machines, combine the results and put it into the S3 bucket.

Since we want the top 100, that is a task the master node would need to determine after each scoring node has completed. So the job would include two different code chunks - one for the master, and the other for the scorers.
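The two code chunks could look roughly like the sketch below. It is written in Python for brevity rather than the C# the job would actually carry. Every name in it is hypothetical, coordinates are assumed to already be planar and in miles, and the min_miles clamp is my own addition so that a property sitting on top of a store does not divide by zero.

```python
import heapq
import math


def distance_miles(a, b):
    """Planar distance between two (x, y) points already expressed in miles."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def score_property(prop, starbucks, hardware_stores, min_miles=0.1):
    """1/d^2 for the nearest Starbucks plus 3/d^2 for each hardware store.

    Assumes at least one Starbucks location is supplied.
    """
    nearest = min(distance_miles(prop, s) for s in starbucks)
    score = 1.0 / max(nearest, min_miles) ** 2
    for store in hardware_stores:
        score += 3.0 / max(distance_miles(prop, store), min_miles) ** 2
    return score


def scorer_chunk(properties, starbucks, hardware_stores, n=100):
    """Scorer-side code: return this node's local top n as (score, id) pairs."""
    scored = ((score_property(xy, starbucks, hardware_stores), pid)
              for pid, xy in properties.items())
    return heapq.nlargest(n, scored)


def master_chunk(partial_results, n=100):
    """Master-side code: merge every node's local top n into a global top n."""
    return heapq.nlargest(n, (item for part in partial_results for item in part))
```

Each scorer only needs to return its own top 100, because nothing outside a node’s local top 100 can appear in the global top 100. That keeps the master’s merge step small no matter how many properties each node scores.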

I can’t imagine anyone would ever take the effort to publish a traditional geoprocessing service that does this. Maybe geoprocessing isn’t the right word, maybe we should call it geocompiling, since we are sending it uncompiled source. Or maybe a domain-specific language that the master node compiles into IL would make more sense. More later.

Parallel GeoProcessing

A large city here in Texas has done a highly detailed sidewalk inventory. In addition, they’ve created “missing sidewalk” segments representing places where sidewalks could be constructed. In order to prioritize construction they are having me write a program that scores each missing sidewalk segment based on a variety of factors. Many of these factors involve proximity to other things like bus stops, office buildings etc. While I don’t know the name for this, it seems like this is a common pattern for GIS modeling: For each sidewalk, buffer it, search for nearby things and compute a score, basically just an application of Tobler’s 1st law of geography.

The problem is that this is very slow. To make it faster, I’m considering parallelization - divide up the problem across many machines, each scoring sidewalks in a different area of the city. Once each area is complete, merge together the results from each area.

One machine would be the “master”, the rest would be scorers. The master would divide up the city into rectangular areas and put job requests into an Amazon Simple Queue Service queue. Each scorer machine would read this queue, score the area described by the job, write the results to the URL prescribed by the job, and put a job result message in a different queue (also prescribed by the job). The master would read this queue, fetch the results using the URL, and append them to the master result.
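Here is a single-process sketch of that message flow, in Python. The standard library’s queue.Queue stands in for the two SQS queues and a plain dict stands in for URL-addressable storage. The message fields ("area", "result_url") and all function names are assumptions of mine, not anything SQS prescribes.

```python
import queue


def tile_extent(xmin, ymin, xmax, ymax, cols, rows):
    """Divide a rectangular extent into cols x rows rectangular areas."""
    dx = (xmax - xmin) / cols
    dy = (ymax - ymin) / rows
    return [(xmin + c * dx, ymin + r * dy, xmin + (c + 1) * dx, ymin + (r + 1) * dy)
            for r in range(rows) for c in range(cols)]


def scorer_loop(jobs, results, store, score_area):
    """What each scorer machine would do: read a job, score it, report back."""
    while not jobs.empty():
        job = jobs.get()
        store[job["result_url"]] = score_area(job["area"])  # write to the URL
        results.put({"result_url": job["result_url"]})      # job result message


def run_master(extent, cols, rows, score_area):
    jobs, results, store = queue.Queue(), queue.Queue(), {}
    # Master: one job request per rectangular area.
    for i, area in enumerate(tile_extent(*extent, cols, rows)):
        jobs.put({"area": area, "result_url": f"results/{i}"})
    # In a real deployment many scorer machines would run this concurrently.
    scorer_loop(jobs, results, store, score_area)
    # Master: read result messages, fetch each result, append to the total.
    merged = []
    while not results.empty():
        merged.extend(store[results.get()["result_url"]])
    return merged
```

A real version would also have to decide what to do with sidewalks that straddle an area boundary, which this sketch ignores.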

Since all scorer processes are hitting against the same geodatabase, I suspect the geodatabase would become the bottleneck. Suppose I had an unlimited number of scorer machines at my disposal. Doubling the number of them would not cut execution time in half, but I wonder what the factor would be? What is the optimal area size? Certainly having a tiny area with just a few sidewalks doesn’t make sense, but neither would a huge area. How can we determine optimal size?
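One rough way to put a number on that factor is Amdahl’s law. That is a model I am borrowing here, not something measured on this project. It assumes some fixed fraction of the work is effectively serialized (for example, waiting on the shared geodatabase) while the rest divides evenly across scorers. The function name and the 20% figure below are purely illustrative.

```python
def estimated_speedup(serial_fraction, scorers):
    """Amdahl's law: speedup over one machine when part of the work is serial."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if scorers < 1:
        raise ValueError("scorers must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / scorers)


# If 20% of the run were spent serialized on the geodatabase:
with_4 = estimated_speedup(0.2, 4)   # about 2.5x
with_8 = estimated_speedup(0.2, 8)   # about 3.3x, well short of double
```

Under this model the speedup can never exceed 1/serial_fraction however many scorers are added, so the serial fraction would be the thing to measure first.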

There is really no reason the master process needs to be on the same side of the firewall as the scorer processes. This means it should be possible to write a Web Service that allows 3rd parties to submit scoring jobs. For example, realtors could score each available property based on proximity to other features relevant to a particular homebuyer.

What I’m proposing here sure sounds a lot like what Hadoop does. I wonder if we will ever see the day when we can use something like Hadoop for geoprocessing with ArcObjects on .NET?

Geoprocessing Sandboxes as a Service

Though I haven’t used it in production, I really like the ESRI geoprocessing server concept. ESRI’s approach involves authoring, publishing and consuming. However, from what I can tell, it requires authors to do their work inside the firewall.

I contend that authoring from outside the firewall would have business value.

For example, say I’m a data vendor with a very large geodatabase. My business model involves selling cpu cycles, data i/o, memory usage etc. I’m too busy keeping my data accurate and current to mess with writing all sorts of geoprocessing models. Instead, I would like to provide a platform where others author the models and I simply charge them for resources used.

Users would create an account with their credit card, similar to Amazon Web Services. Once they have an account they may create .NET assemblies and submit them as a job, perhaps via REST. The server runs the assembly within a sandbox, measuring database i/o, cpu cycles, memory usage etc. and applies charges to the user’s account accordingly. By running within the firewall, the assembly has more efficient access to the DBMS so that much more interesting things may be done.
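The metering half of this is easy to sketch. The Python below measures CPU time and peak memory for a submitted job and turns them into a charge. It deliberately provides no isolation at all, so it is not a sandbox. The rates and names are made up, and database i/o is left out because measuring it depends on the DBMS.

```python
import time
import tracemalloc


def run_metered(job, cpu_rate=0.01, mb_rate=0.001):
    """Run a zero-argument callable and bill for CPU seconds and peak memory.

    cpu_rate is the (hypothetical) price per CPU second and mb_rate the
    price per MB of peak Python-level memory.
    """
    tracemalloc.start()
    cpu_start = time.process_time()
    try:
        result = job()
    finally:
        cpu_seconds = time.process_time() - cpu_start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    peak_mb = peak_bytes / (1024 * 1024)
    usage = {
        "cpu_seconds": cpu_seconds,
        "peak_mb": peak_mb,
        "charge": cpu_seconds * cpu_rate + peak_mb * mb_rate,
    }
    return result, usage


result, usage = run_metered(lambda: sum(range(1000)))
```

The hard part of the service would be the isolation and the i/o accounting, which on .NET would presumably involve running the submitted assembly with restricted permissions.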

To make it easier for authors, I would provide Visual Studio plugins to assist the development of assemblies. Perhaps they submit source code to the service, and let the server compile it. Or maybe something like the new dynamic lookup would allow me to test with a small local sample database, then once I got it working submit it to the server. Or better yet, maybe a thin-client IDE that allows editing of code on the server itself. Maybe ESRI could web-enable ModelBuilder?

The submitted job could be configured to stay alive, and the user could then, in turn, sell use of his service to his own subscribers. This would be similar to weogeo’s concept but with services instead of data.

In a way, this would be crowdsourcing the development of geoprocessing models. Doing something like this would allow ESRI to bring in more developers (authors) of ArcGIS server applications. Right now the cost of ArcGIS server is just too high both in terms of license costs and administration. Perhaps Amazon will spatially enable SimpleDB so that in conjunction with EC2 authors of services will have a low-cost platform.

Local GeoPolitics and EveryBlock

“All politics is local.”
-Tip O’Neill

With so much attention going on in the primaries, it’s easy to forget that most of what really matters is local. I’ve decided not to tell you who I think should be the next president. Instead, I’d like to encourage you to campaign for more local government transparency. We should be examining Everyblock, and its implications for local government.

A lot of the work done by Everyblock involves sifting through on-line documents for geotags. Everyblock says they are publishing data from city governments:

“building permits, crimes, restaurant inspections and more. In many cases, this information is already on the Web but is buried in hard-to-find government databases. In other cases, this information has never been posted online, and we’ve forged relationships with governments to make it available.”

What if we lobbied our local city governments to publish RSS (and GeoRSS) feeds for all of their activities? Why should Everyblock be given special treatment?

Not only would this greatly increase transparency, it would also allow us in the geocommunity more opportunities for doing value-added analysis. (No we’re not a gang (or special interest group) - we’re just a club.) City staff are already required to give public notification for lots of things like zoning variances. We just need to push them to do it with RSS.

Here in San Antonio the city has gone to great efforts to streamline the process a real estate developer must go through to gain approval of a project. I would really like to see a map of pending projects in my area, and get notification when the project’s status changes.

O/R Mapping for DataModels needed

It’s helpful that ESRI has provided data models for so many industries. Some of the data models, such as ArcHydro, also include tools that operate on the data.

But from what I can tell, there are no classes provided that address the so-called “object-relational impedance mismatch”.

This excerpt from Scott Ambler’s book describes the problem pretty well.

What I think we need are some baseclasses that do simple object-relational mapping. These baseclasses would be generated from a data model designer that would run within Visual Studio. In addition to generating XML schemas from which we could create geodatabases, the designer would also generate .NET code (C# or VB.NET) which we could then inherit from and override/extend to meet particular needs. Or maybe partial classes are the way to go, as Dave Bouwman describes here. Either way, we really need some way to create a library of .NET classes that work with each of these data models.

While COM will likely be around for a lot longer, I don’t see why we can’t use .NET classes to hide much of it. Implementation inheritance can be good - let’s allow programmers to leverage it. Maybe the goals of custom features could be addressed by this approach without requiring C++ ATL.

For example, instead of using a featurecursor to loop through a StreamGauge featureclass, I’d like to simply get a System.Collections.Generic collection of StreamGauges with getters/setters mapped to each field of the feature.
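The shape of what I am asking for can be shown in a few lines. The sketch below is Python rather than C# or VB.NET, and it uses plain dictionaries where a real implementation would read from a feature cursor. The StreamGauge field names are invented for illustration and are not taken from the ArcHydro data model.

```python
from dataclasses import dataclass, fields


@dataclass
class StreamGauge:
    """One typed object per feature, with an attribute per mapped field."""
    gauge_id: int
    name: str
    flow_cfs: float


def map_rows(cls, rows):
    """Turn cursor-style rows (field name -> value) into typed objects.

    Fields present in a row but not declared on the class are ignored.
    """
    names = [f.name for f in fields(cls)]
    return [cls(**{n: row[n] for n in names}) for row in rows]


# Stand-in for what a feature cursor would return.
rows = [
    {"gauge_id": 1, "name": "Upper Creek", "flow_cfs": 12.5, "SHAPE": None},
    {"gauge_id": 2, "name": "Lower Creek", "flow_cfs": 40.0, "SHAPE": None},
]
gauges = map_rows(StreamGauge, rows)
total_flow = sum(g.flow_cfs for g in gauges)
```

A generated base class would carry the mapping itself, and a hand-written subclass or partial class would add behavior such as validation, without the calling code ever touching a cursor.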
