Archive for the '10.0 Wishlist' Category

Faster Calculations in ArcMap

Somebody has some splaining to do.

I like to think of myself as a patient person, not in a hurry. However, sometimes when I use the Field Calculator in ArcMap, my patience is challenged.

I have a featureclass of sidewalks related to a street centerline network: 284,000 sidewalks and 54,000 street centerlines. When I calculate a field on the sidewalks, setting it equal to a field from the street network, it takes several hours. That’s in a file geodatabase with the key field indexed. Somebody has some explaining to do.

Last time I read ESRI’s EULA, I recall it prohibiting publication of benchmark statistics, so I have not included precise numbers. Looks like they are following Oracle’s policy.

Unlike Oracle, ArcGIS does not allow me to determine what execution plan it is using (see EXPLAIN PLAN). I’m quite certain it always uses the same plan, which in this case is a very poor one.

To make it faster, I cache the key/value pairs from the street attribute table into memory using a System.Collections.Hashtable. This takes about 1.4 seconds. I then open a cursor on the sidewalks table and loop through it, looking up the value from the hashtable using the key field value from the sidewalk featureclass. This takes less than a minute.
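
The idea can be sketched in a few lines. This is a minimal illustration in Python rather than ArcObjects, with plain dictionaries standing in for table rows and cursors; the field names (STREET_ID, STREET_NAME) and the sample rows are made up for the example.

```python
def build_lookup(street_rows, key_field, value_field):
    """Cache key/value pairs from the street table in a dict (a hash table)."""
    return {row[key_field]: row[value_field] for row in street_rows}


def assign_from_lookup(sidewalk_rows, lookup, key_field, target_field, default=None):
    """Set target_field on each sidewalk from the cached street values.

    Returns the number of sidewalks whose key had no matching street.
    """
    unmatched = 0
    for row in sidewalk_rows:
        key = row[key_field]
        if key in lookup:
            row[target_field] = lookup[key]
        else:
            row[target_field] = default
            unmatched += 1
    return unmatched


# Hypothetical sample data: two streets, three sidewalks (one orphaned).
streets = [
    {"STREET_ID": 1, "STREET_NAME": "Main St"},
    {"STREET_ID": 2, "STREET_NAME": "Elm St"},
]
sidewalks = [
    {"OBJECTID": 10, "STREET_ID": 1, "STREET_NAME": None},
    {"OBJECTID": 11, "STREET_ID": 2, "STREET_NAME": None},
    {"OBJECTID": 12, "STREET_ID": 9, "STREET_NAME": None},
]

lookup = build_lookup(streets, "STREET_ID", "STREET_NAME")
missing = assign_from_lookup(sidewalks, lookup, "STREET_ID", "STREET_NAME")
```

Each lookup costs roughly constant time, so the total work grows with the sum of the two table sizes rather than their product, which is consistent with hours dropping to under a minute.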

Interestingly, an update cursor is slower than a cursor created via IFeatureClass.Search. I think this applies only to a file geodatabase, though. On ArcSDE I believe an update cursor is generally faster, assuming proper rollback segment sizes are configured.

Maybe what ESRI should do is beef up the MemoryRelationshipClass so that it allows the user to examine and/or specify an execution plan. That way I could tell it to use a hashtable when it does the join, relieving me of having to roll my own.

For a good comparison of hashtables, sortedlists and dictionaries, see this post.

The shortcoming of this approach is that I can’t do field calculations, just simple assignment. Some day I’d like to try using CodeDOM to generate code from an expression the user has provided. It would need to substitute the field values for the field names. Since square brackets are used in C#, I’d need different field name delimiters.
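
Here is a rough sketch of that substitution step, in Python rather than CodeDOM. The !FIELD! delimiter, the function name compile_expression and the whitelist of allowed functions are all assumptions chosen for illustration. Evaluating user-supplied expressions this way is not safe against hostile input, so treat it as a sketch of the mechanics only.

```python
import re

# Hypothetical delimiter: !FIELD! instead of [FIELD], which would clash with indexers.
FIELD_TOKEN = re.compile(r"!(\w+)!")


def compile_expression(expression):
    """Turn 'round(!LENGTH_FT! * 0.3048, 1)' into a callable that takes a row dict.

    Returns the callable and the list of field names the expression references.
    """
    fields = FIELD_TOKEN.findall(expression)
    # Replace each !FIELD! token with a lookup into the current row.
    source = FIELD_TOKEN.sub(lambda m: "row[%r]" % m.group(1), expression)
    # Compile once; the code object is reused for every row.
    code = compile(source, "<field expression>", "eval")
    allowed = {"__builtins__": {}, "round": round, "abs": abs, "min": min, "max": max}

    def evaluate(row):
        return eval(code, allowed, {"row": row})

    return evaluate, fields


calc, used_fields = compile_expression("round(!LENGTH_FT! * 0.3048, 1)")
length_m = calc({"LENGTH_FT": 100.0})
```

Compiling the expression once and calling it per row keeps the per-row cost close to that of the simple assignment loop.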

LambdaExpressions for Spatial Operators

I’ve been reading a bit about ExpressionTrees:

Expression trees were created in order to make the task of converting code such as a query expression into a string that can be passed to some other process and executed there. It is that simple. There is no great mystery here, no magic wand that needs to be waved. One simply takes code, converts it into data, and then analyzes the data to find the constituent parts that will be translated into a string that can be passed to another process. - Charlie Calvert.

Isn’t this essentially what ModelBuilder does?

I think I’ll back off from CodeDOM and focus on this instead.
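
To make the "code converted to data, then translated to a string" idea concrete, here is a small sketch using Python's ast module in place of .NET expression trees. It turns a simple lambda into a SQL-style where clause. The function names and the handful of supported operators are my own choices for illustration, not anything LINQ or ArcObjects provides.

```python
import ast

# Comparison operators this sketch knows how to translate.
_OPS = {ast.Gt: ">", ast.GtE: ">=", ast.Lt: "<", ast.LtE: "<=", ast.Eq: "=", ast.NotEq: "<>"}


def to_where_clause(lambda_source):
    """Translate e.g. 'lambda f: f.WIDTH > 4 and f.STATUS == "OPEN"' to SQL text."""
    tree = ast.parse(lambda_source, mode="eval")  # code becomes data
    if not isinstance(tree.body, ast.Lambda):
        raise ValueError("expected a lambda expression")
    return _translate(tree.body.body)


def _translate(node):
    """Walk the tree and emit a string for each constituent part."""
    if isinstance(node, ast.BoolOp):
        joiner = " AND " if isinstance(node.op, ast.And) else " OR "
        return "(" + joiner.join(_translate(v) for v in node.values) + ")"
    if isinstance(node, ast.Compare) and len(node.ops) == 1:
        left = _translate(node.left)
        right = _translate(node.comparators[0])
        return "%s %s %s" % (left, _OPS[type(node.ops[0])], right)
    if isinstance(node, ast.Attribute):
        return node.attr  # f.WIDTH -> WIDTH
    if isinstance(node, ast.Constant):
        if isinstance(node.value, str):
            return "'" + node.value.replace("'", "''") + "'"
        return repr(node.value)
    raise ValueError("unsupported expression: " + ast.dump(node))


clause = to_where_clause('lambda f: f.WIDTH > 4 and f.STATUS == "OPEN"')
```

The resulting string could be handed to another process, such as a query filter, which is the role the quote assigns to expression trees.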

It seems like it would be interesting to create LambdaExpressions for spatial operators. I wonder if lambda expressions could be used to resurrect the DOCELL capability that existed in workstation ARC/INFO but was never ported to ArcGIS.

Parallel GeoProcessing


A large city here in Texas has done a highly detailed sidewalk inventory. In addition, they’ve created “missing sidewalk” segments representing places where sidewalks could be constructed. In order to prioritize construction they are having me write a program that scores each missing sidewalk segment based on a variety of factors. Many of these factors involve proximity to other things like bus stops, office buildings, etc. While I don’t know the name for this, it seems like a common pattern for GIS modeling: for each sidewalk, buffer it, search for nearby things and compute a score. It is basically just an application of Tobler’s first law of geography.
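
The pattern can be sketched as follows, in plain Python with planar coordinates. The feature kinds, the weights and the linear distance decay are assumptions made up for illustration, and a real implementation would use a spatial index instead of testing every feature.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to the endpoints.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def score_segment(segment, features, weights, search_radius):
    """Sum the weights of nearby features, discounted linearly by distance.

    A feature within search_radius of the segment is what a buffer search
    of that radius would have returned.
    """
    a, b = segment
    score = 0.0
    for kind, location in features:
        d = point_segment_distance(location, a, b)
        if d <= search_radius:
            score += weights.get(kind, 0.0) * (1.0 - d / search_radius)
    return score


# Hypothetical inputs: one missing sidewalk and three nearby features.
missing_sidewalk = ((0.0, 0.0), (100.0, 0.0))
nearby = [("bus_stop", (50.0, 10.0)), ("office", (50.0, 300.0)), ("school", (120.0, 0.0))]
weights = {"bus_stop": 5.0, "office": 2.0, "school": 10.0}
total = score_segment(missing_sidewalk, nearby, weights, search_radius=100.0)
```

Closer features contribute more, which is the Tobler part; the office at 300 units falls outside the search radius and contributes nothing.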

The problem is that this is very slow. To make it faster, I’m considering parallelization - divide up the problem across many machines, each scoring sidewalks in a different area of the city. Once each area is complete, merge together the results from each area.

One machine would be the “master”; the rest would be scorers. The master would divide the city into rectangular areas and put job requests into an Amazon Simple Queue Service queue. Each scorer machine would read this queue, score the area described by the job, write the results to the URL prescribed by the job, and post a job result message to a different queue (also prescribed by the job). The master would read this result queue, fetch the results using the URL, and append them to the master result.
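
A single-machine sketch of that flow, with Python's queue.Queue and threads standing in for the SQS queues and the scorer machines. Results travel directly on the result queue here instead of via a URL, each sidewalk is reduced to a representative point so it falls in exactly one area, and all names are hypothetical.

```python
import queue
import threading


def tile_extent(xmin, ymin, xmax, ymax, nx, ny):
    """Divide an extent into nx * ny rectangular job areas."""
    w, h = (xmax - xmin) / nx, (ymax - ymin) / ny
    return [
        (xmin + i * w, ymin + j * h, xmin + (i + 1) * w, ymin + (j + 1) * h)
        for j in range(ny)
        for i in range(nx)
    ]


def scorer(jobs, results, sidewalks, score_fn):
    """Worker loop: take an area, score the sidewalks inside it, post the result."""
    while True:
        area = jobs.get()
        if area is None:  # sentinel: no more work
            break
        x0, y0, x1, y1 = area
        # Half-open intervals so a point on a shared edge is scored only once.
        scored = {
            sid: score_fn(x, y)
            for sid, (x, y) in sidewalks.items()
            if x0 <= x < x1 and y0 <= y < y1
        }
        results.put((area, scored))


def run_master(sidewalks, extent, nx, ny, workers, score_fn):
    """Queue one job per area, then merge the per-area results."""
    jobs, results = queue.Queue(), queue.Queue()
    areas = tile_extent(*extent, nx, ny)
    threads = [
        threading.Thread(target=scorer, args=(jobs, results, sidewalks, score_fn))
        for _ in range(workers)
    ]
    for t in threads:
        t.start()
    for area in areas:
        jobs.put(area)
    for _ in threads:
        jobs.put(None)
    merged = {}
    for _ in areas:
        _, scored = results.get()
        merged.update(scored)
    for t in threads:
        t.join()
    return merged


# Hypothetical data: four sidewalks, a 2 x 2 grid of areas, three scorers.
points = {1: (5, 5), 2: (15, 5), 3: (5, 15), 4: (15, 15)}
scores = run_master(points, (0, 0, 20, 20), 2, 2, 3, lambda x, y: x + y)
```

Because the scorers pull jobs rather than being assigned them, a slow area does not hold up the others, which is the main appeal of putting a queue between master and scorers.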

Since all scorer processes are hitting against the same geodatabase, I suspect the geodatabase would become the bottleneck. Suppose I had an unlimited number of scorer machines at my disposal. Doubling the number of them would not cut execution time in half, but I wonder what the factor would be? What is the optimal area size? Certainly having a tiny area with just a few sidewalks doesn’t make sense, but neither would a huge area. How can we determine optimal size?
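
One rough way to think about the scaling question is Amdahl's law. If some fraction of the total work is effectively serialized by the shared geodatabase, the speedup from n scorers is 1 / (s + (1 - s) / n), where s is that serial fraction. The sketch below assumes this simple model holds and uses s = 0.1 purely as an illustrative guess, not a measurement.

```python
def speedup(n_scorers, serial_fraction):
    """Amdahl's law: speedup over one scorer when part of the work is serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_scorers)


def gain_from_doubling(n_scorers, serial_fraction):
    """How much faster the job runs with 2n scorers than with n scorers."""
    return speedup(2 * n_scorers, serial_fraction) / speedup(n_scorers, serial_fraction)


# serial_fraction = 0.1 is an illustrative guess, not a measurement.
for n in (1, 2, 4, 8, 16):
    print(n, round(speedup(n, 0.1), 2), round(gain_from_doubling(n, 0.1), 2))
```

Under this model the benefit of doubling shrinks toward 1.0 as scorers are added, and the overall speedup can never exceed 1 / s. It says nothing about the optimal area size, which also depends on per-job overhead that the model ignores.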

There is really no reason the master process needs to be on the same side of the firewall as the scorer processes. This means it should be possible to write a Web Service that allows 3rd parties to submit scoring jobs. For example, realtors could score each available property based on proximity to other features relevant to a particular homebuyer.

What I’m proposing here sure sounds a lot like what Hadoop does. I wonder if we will ever see the day when we can use something like Hadoop for geoprocessing with ArcObjects on .NET?

Geoprocessing Sandboxes as a Service

Though I haven’t used it in production, I really like the ESRI geoprocessing server concept. ESRI’s approach involves authoring, publishing and consuming. However, from what I can tell, it requires authors to do their work inside the firewall.

I contend that authoring from outside the firewall would have business value.

For example, say I’m a data vendor with a very large geodatabase. My business model involves selling cpu cycles, data i/o, memory usage etc. I’m too busy keeping my data accurate and current to mess with writing all sorts of geoprocessing models. Instead, I would like to provide a platform where others author the models and I simply charge them for resources used.

Users would create an account with their credit card, similar to Amazon Web Services. Once they have an account they may create .NET assemblies and submit them as a job, perhaps via REST. The server runs the assembly within a sandbox, measuring database i/o, cpu cycles, memory usage etc. and applies charges to the user’s account accordingly. By running within the firewall, the assembly has more efficient access to the DBMS so that much more interesting things may be done.

To make it easier for authors, I would provide Visual Studio plugins to assist the development of assemblies. Perhaps they submit source code to the service, and let the server compile it. Or maybe something like the new dynamic lookup would allow me to test with a small local sample database, then once I got it working submit it to the server. Or better yet, maybe a thin-client IDE that allows editing of code on the server itself. Maybe ESRI could web-enable ModelBuilder?

The submitted job could be configured to stay alive, and the user could then, in turn, sell use of his service to his own subscribers. This would be similar to WeoGeo’s concept but with services instead of data.

In a way, this would be crowdsourcing the development of geoprocessing models. Doing something like this would allow ESRI to bring in more developers (authors) of ArcGIS server applications. Right now the cost of ArcGIS server is just too high both in terms of license costs and administration. Perhaps Amazon will spatially enable SimpleDB so that in conjunction with EC2 authors of services will have a low-cost platform.

O/R Mapping for DataModels needed

It’s helpful that ESRI has provided data models for so many industries. Some of the data models, such as ArcHydro, also include tools that operate on the data.

But from what I can tell, there are no classes provided that address the so-called “object-relational impedance mismatch”.

This excerpt from Scott Ambler’s book describes the problem pretty well.

What I think we need are some baseclasses that do simple object-relational mapping. These baseclasses would be generated from a data model designer that would run within Visual Studio. In addition to generating xml schemas from which we could create geodatabases, the designer would also generate .NET code (C# or VB.NET) which we could then inherit from and override/extend to meet particular needs. Or maybe partial classes are the way to go, as Dave Bouwman describes here. Either way, we really need some way to create a library of .NET classes that work with each of these data models.

While COM will likely be around for a lot longer, I don’t see why we can’t use .NET classes to hide much of it. Implementation inheritance can be good, so let’s allow programmers to use it. Maybe the goals of custom features could be addressed by this approach without requiring C++ ATL.

For example, instead of using a featurecursor to loop through a StreamGauge featureclass, I’d like to simply get a System.Collections.Generic collection of StreamGauges with getters/setters mapped to each field of the feature.
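
As a sketch of what such generated classes might look like, here is the shape of the idea in Python. The base class MappedFeature, the FieldProperty descriptor and the field names GAUGE_NAME and STAGE_FT are all hypothetical (they are not taken from ArcHydro), and plain dictionaries stand in for geodatabase rows.

```python
class FieldProperty:
    """Descriptor that maps an attribute to a named field of the wrapped row."""

    def __init__(self, field_name):
        self.field_name = field_name

    def __get__(self, instance, owner):
        if instance is None:
            return self
        return instance._row[self.field_name]

    def __set__(self, instance, value):
        instance._row[self.field_name] = value
        instance._dirty = True  # remember that the row needs to be stored


class MappedFeature:
    """Hypothetical base class; a designer would generate one subclass per featureclass."""

    def __init__(self, row):
        self._row = row
        self._dirty = False

    @classmethod
    def load_all(cls, rows):
        """Wrap every row so callers get objects instead of a cursor."""
        return [cls(row) for row in rows]


class StreamGauge(MappedFeature):
    # Generated getters/setters, one per field.
    name = FieldProperty("GAUGE_NAME")
    stage_ft = FieldProperty("STAGE_FT")

    # Hand-written behaviour added on top of the generated mapping.
    def is_flooding(self, flood_stage_ft):
        return self.stage_ft >= flood_stage_ft


rows = [
    {"GAUGE_NAME": "Upper Fork", "STAGE_FT": 3.0},
    {"GAUGE_NAME": "Lower Fork", "STAGE_FT": 12.5},
]
gauges = StreamGauge.load_all(rows)
flooding = [g.name for g in gauges if g.is_flooding(10.0)]
```

The calling code never mentions field indexes or cursors, which is the point: the mapping lives in one generated place and the domain logic lives in the subclass.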

MapReduce for Large Geodatasets


Here’s an interesting video where Google describes how they use MapReduce to build connectivity in their street data. In ESRI terminology, this is how they clean and build topology using parallel processing. They also briefly mention using it to render map tiles.
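
The video doesn't show code, so the following is only my guess at how connectivity could be phrased as a map step and a reduce step, run here in a single Python process. Snapping endpoints by rounding coordinates is an assumption, as are the function names and sample segments.

```python
from collections import defaultdict


def map_segment(segment_id, coords, precision=3):
    """Map step: emit (node_key, segment_id) for each end of a street segment."""
    for x, y in (coords[0], coords[-1]):
        # Rounding snaps nearly coincident endpoints to the same node key.
        yield (round(x, precision), round(y, precision)), segment_id


def reduce_node(node_key, segment_ids):
    """Reduce step: all segments that share an endpoint are connected there."""
    return node_key, sorted(set(segment_ids))


def build_connectivity(segments):
    """Run map, shuffle (group by key) and reduce in one process."""
    grouped = defaultdict(list)
    for segment_id, coords in segments.items():
        for key, value in map_segment(segment_id, coords):
            grouped[key].append(value)
    return dict(reduce_node(k, v) for k, v in grouped.items())


# Hypothetical street segments meeting at the node (1, 0).
streets = {
    "a": [(0, 0), (1, 0)],
    "b": [(1, 0), (1, 1)],
    "c": [(1, 0), (2, 0)],
}
nodes = build_connectivity(streets)
```

Since each map call looks at one segment and each reduce call looks at one node, both steps can be spread across as many machines as are available, with the framework handling the grouping in between.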

They don’t go into detail, but apparently those of us outside Google could do this sort of thing using Hadoop on Amazon EC2.

A challenge with tile caches is keeping them up to date with the vectors they depict. Here is how ESRI does it. I think ESRI needs to allow us to scale tile generation across a large number of cpus the way Google does. The licensing model needs to allow this. It seems like open source geo software on a paid AMI could be coupled with Hadoop on EC2 to do this.

Once that happens, an agency like a state data center could rebuild tile caches on EC2/S3 nightly from, for example, a statewide vector layer of parcel maps.

I’ve heard rebuilding a geodatabase topology for the nationwide census takes over 24 hours. I bet a MapReduce approach would be much faster for this too.

Linq to Geodatabase Provider

As suggested by Dave Bouwman and Ron Bruder, I’ve played around a bit with ArcGIS Diagrammer developed by Richie Carmichael. I like it. It would be great to see something like this offered as a fully supported product.

I’ve also watched some of the Linq to SQL videos at MSDN. Keep in mind that Linq is extensible, so while currently there are only a few providers, expect soon to see other Linq providers appearing, for example Linq to Google.

It sure would be cool if ESRI wrote a Linq to Geodatabase provider.

If ESRI doesn’t do something, I suspect many will opt to bypass ArcSDE altogether and access SQL Server directly via Linq to SQL, once it supports spatial types.

I suppose if ESRI did it right it would support a user experience similar to the one in this video, but in C# of course :). I suppose, too, it would involve writing a designer with a look and feel similar to ArcGIS Diagrammer.

Sourcecode in Geodatabase Prototype

I’ve written a proof of concept editor extension based on the ideas outlined in my previous post, plus some helpful feedback from Brian Flood (thanks!).

The solution, which includes an installer and test file gdb, has been uploaded to ArcScripts, right here.

The editor extension maintains a generic List of IExtensions. The extensions in this list are instantiated at OnStartEditing from source code in a table called SourceCode in the edit workspace.

There is also a command provided on a commandbar that allows you to browse and load a source .cs file into the SourceCode table, after verifying that it compiles without errors.
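
The load-and-instantiate cycle can be illustrated without ArcObjects. In this Python sketch an in-memory SQLite table stands in for the SourceCode table in the edit workspace, and compile/exec stand in for the C# compiler. The table columns, the convention that each source defines a class named Extension, and the sample source are assumptions for the example, not a description of the uploaded prototype.

```python
import sqlite3


def compiles(source, name="<source>"):
    """Return True if the source compiles without errors."""
    try:
        compile(source, name, "exec")
        return True
    except SyntaxError:
        return False


def store_source(conn, name, source):
    """Load a source file into the SourceCode table, but only if it compiles."""
    if not compiles(source, name):
        raise ValueError("%s does not compile; not stored" % name)
    conn.execute(
        "INSERT OR REPLACE INTO SourceCode (name, source) VALUES (?, ?)",
        (name, source),
    )


def load_extensions(conn):
    """At start of editing: compile each stored source and instantiate it."""
    extensions = []
    rows = conn.execute("SELECT name, source FROM SourceCode ORDER BY name")
    for name, source in rows:
        namespace = {}
        exec(compile(source, name, "exec"), namespace)
        extensions.append(namespace["Extension"]())  # assumed naming convention
    return extensions


# In-memory database standing in for the edit workspace.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SourceCode (name TEXT PRIMARY KEY, source TEXT)")

SAMPLE = '''
class Extension:
    def on_create_feature(self, feature):
        feature["STATUS"] = "NEW"
'''
store_source(conn, "defaults.py", SAMPLE)
loaded = load_extensions(conn)
```

Anything stored in the table runs with the full rights of the editing session, so who is allowed to write to that table matters as much as the code itself.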

Note that the “using” statements need to include a full path name to the assembly files referenced by the source code.

I’m wondering if this approach might be easier than class extensions as well as easier to maintain. I haven’t tried it, but I suppose it would be possible to have it work with shapefiles as well.

The potential uses of dynamic compilation are intriguing. I’d really like to try this in an IServerObjectExtension. More on that later.

Storing Code in a Geodatabase

It appears that the .NET 2.0 framework will be part of the standard install now for ArcGIS. This means it should be possible to dynamically compile code at run time using System.CodeDom.Compiler.

What if we could store code in the geodatabase that would be dynamically compiled at run time?

A major pain for working with ClassExtensions is that people who do not have the DLL installed on their machine are not able to even open featureclasses that have extensions. Perhaps it would be possible to store code in the database (optionally obfuscated) for a classextension.

Perhaps this approach could also be applied to support triggers. In ArcCatalog, I’d like to be able to right click on a field in a geodatabase table (or featureclass) and provide code that would be called for OnCreateFeature, OnDelete, etc. Behind the scenes the geodatabase would store this in a GDB_ table, and when a feature is added to a featureclass (or table) it would JIT compile and call the code. The GDB_ table could also provide a read/write property field (or fields) that would allow me to implement sequences.

I suppose it should be possible to write Visual Studio Addins to support editing code stored in a geodatabase.

I guess there’s really no reason the code would need to be C#, VB.NET or whatever. Maybe a simplified (domain specific) language could be provided that even DBAs could understand. For example, say I want to assign a county ID to a point whenever a point is created or updated, based on the county the point falls within. The C# code to accomplish this might be a bit overwhelming, so maybe a simpler language could be provided that expresses this. Of course ESRI would need to provide a compiler to create the CIL from the simplified language.
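
For a sense of what the county example involves underneath, here is a bare-bones sketch in Python. It assumes planar coordinates and counties made of a single ring with no holes, and the names (assign_county, COUNTY_ID, the sample counties) are made up. A real trigger would call the geodatabase's own spatial query instead of a hand-rolled test.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test; ring is a list of (x, y) vertices, closing edge implied."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Does a horizontal ray from (x, y) cross the edge between j and i?
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside


def assign_county(point, counties):
    """Trigger body: set COUNTY_ID from the first county containing the point."""
    for county_id, ring in counties.items():
        if point_in_polygon(point["x"], point["y"], ring):
            point["COUNTY_ID"] = county_id
            return county_id
    point["COUNTY_ID"] = None
    return None


# Hypothetical counties: two adjacent squares.
counties = {
    "A": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "B": [(10, 0), (20, 0), (20, 10), (10, 10)],
}
new_point = {"x": 15, "y": 5}
county = assign_county(new_point, counties)
```

A domain-specific language would let the DBA state only the rule (COUNTY_ID comes from the containing county) and leave everything in point_in_polygon to the platform.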

Maybe this approach would also allow custom features. Custom features were promoted at 8.0, but they never really worked as intended. AFAIK ArcFM is the only custom feature based solution in widespread use. What if the CLSID stored in the GDB_Objectclasses table were a key to another table that stored code? Instead of instantiating a COM class when the objectclass is opened, ArcGIS could JIT compile the code stored in the geodatabase. If this is possible, perhaps ESRI could provide us with a base Feature class that we could extend and whose methods we could override.

Update: I’ve written a prototype and uploaded here.