Archive for the 'Amazon EC2' Category

Amazon Dynamo

dynamo
This graph is from a paper on Amazon’s Dynamo. I suspect other months (not during shopping season) would look quite different.

With a name like “Elastic Compute Cloud”, I would expect the price for EC2 to reflect the supply of available compute capacity - isn’t this what elasticity of supply is all about? Currently pricing does not reflect time of use. I wonder how much performance degrades for EC2 users during Christmas shopping season.

Amazon relaxes the Consistency part of the DBMS ACID requirement in order to achieve availability. Maybe another rule could be relaxed, the one saying keys should not have any meaning beyond their use as an ID. If we did this, maybe Peano keys could be used, providing a spatially enabled Dynamo-like system.
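To make the idea of a meaningful key a bit more concrete, here is a minimal sketch of a bit-interleaved spatial key (the Z-order or Morton variant that GIS software often calls a Peano key). The function name `peano_key`, the 16-bit grid and the whole-globe extent are my own assumptions for illustration; nothing here comes from the Dynamo paper.

```python
def peano_key(lon, lat, bits=16):
    """Interleave the bits of a quantized lon/lat pair into one integer key."""
    # Quantize each coordinate onto a grid covering the whole globe.
    cells = (1 << bits) - 1
    x = int((lon + 180.0) / 360.0 * cells)
    y = int((lat + 90.0) / 180.0 * cells)
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return key
```

Nearby locations tend to share the high-order bits of the key, though not always, since the curve jumps at cell boundaries. A store would also have to keep keys in order (rather than hashing them) for that locality to be useful.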

Parallel GeoProcessing

parallel

A large city here in Texas has done a highly detailed sidewalk inventory. In addition, they’ve created “missing sidewalk” segments representing places where sidewalks could be constructed. In order to prioritize construction they are having me write a program that scores each missing sidewalk segment based on a variety of factors. Many of these factors involve proximity to other things like bus stops, office buildings etc. While I don’t know the name for this, it seems like this is a common pattern for GIS modeling: For each sidewalk, buffer it, search for nearby things and compute a score, basically just an application of Tobler’s 1st law of geography.
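The pattern can be sketched in a few lines. This is only an illustration under simplifying assumptions: each segment is reduced to its midpoint, the buffer is a plain radius in projected units, and the feature kinds, weights and linear distance decay are hypothetical rather than the city's actual scoring rules.

```python
import math

def score_segment(midpoint, features, radius, weights):
    """Sum distance-decayed weights of features within `radius` of a segment midpoint."""
    score = 0.0
    for kind, (fx, fy) in features:
        d = math.hypot(fx - midpoint[0], fy - midpoint[1])
        if d <= radius:
            # Linear decay: full weight at distance 0, zero at the buffer edge.
            score += weights.get(kind, 0.0) * (1.0 - d / radius)
    return score

# Hypothetical sample data: two bus stops nearby, one office out of range.
features = [("bus_stop", (0.0, 30.0)), ("office", (400.0, 0.0)), ("bus_stop", (50.0, 0.0))]
weights = {"bus_stop": 10.0, "office": 5.0}
print(score_segment((0.0, 0.0), features, radius=100.0, weights=weights))
```

Closer features count for more, which is the Tobler part. The loop over every feature is also exactly why it gets slow without a spatial index.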

The problem is that this is very slow. To make it faster, I’m considering parallelization - divide up the problem across many machines, each scoring sidewalks in a different area of the city. Once each area is complete, merge together the results from each area.

One machine would be the “master”; the rest would be scorers. The master would divide the city into rectangular areas and put job requests into an Amazon Simple Queue Service queue. Each scorer machine would read this queue, score the area described by the job, write the results to the URL prescribed by the job, and post a result message to a different queue (also prescribed by the job). The master would read this second queue, fetch the results using the URL, and append them to the master result.
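Here is a rough single-process sketch of that flow. Python's `queue.Queue` stands in for the two SQS queues and a plain dict stands in for whatever storage sits behind the result URLs; the message fields (`area`, `result_url`) and all function names are my own assumptions, not an SQS API.

```python
import queue

def make_jobs(extent, rows, cols):
    """Split (xmin, ymin, xmax, ymax) into rows*cols rectangular job areas."""
    xmin, ymin, xmax, ymax = extent
    dx, dy = (xmax - xmin) / cols, (ymax - ymin) / rows
    return [
        {"area": (xmin + c * dx, ymin + r * dy, xmin + (c + 1) * dx, ymin + (r + 1) * dy),
         "result_url": f"results/{r}_{c}.json"}  # hypothetical URL layout
        for r in range(rows) for c in range(cols)
    ]

def scorer(jobs, results, storage, score_area):
    """Drain the job queue: score each area, store the output, report where it is."""
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return
        storage[job["result_url"]] = score_area(job["area"])
        results.put({"result_url": job["result_url"]})

def master(extent, rows, cols, score_area):
    """Queue the jobs, let a scorer work through them, then merge the results."""
    jobs, results, storage = queue.Queue(), queue.Queue(), {}
    for job in make_jobs(extent, rows, cols):
        jobs.put(job)
    scorer(jobs, results, storage, score_area)  # one in-process scorer for the demo
    merged = []
    while not results.empty():
        merged.extend(storage[results.get()["result_url"]])
    return merged
```

In a real deployment each scorer would be its own machine polling the queue, and the master would not know or care how many of them there are.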

Since all scorer processes are hitting against the same geodatabase, I suspect the geodatabase would become the bottleneck. Suppose I had an unlimited number of scorer machines at my disposal. Doubling the number of them would not cut execution time in half, but I wonder what the factor would be? What is the optimal area size? Certainly having a tiny area with just a few sidewalks doesn’t make sense, but neither would a huge area. How can we determine optimal size?
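One way to get a first guess at that factor is Amdahl's law. The sketch below assumes the time spent waiting on the shared geodatabase behaves like a strictly serial fraction of the job, which is a simplification, and the 10% figure is a made-up value rather than a measurement.

```python
def speedup(n_scorers, serial_fraction):
    """Amdahl's law: speedup over one machine when part of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_scorers)

# Hypothetical case: 10% of the run is serialized on the geodatabase.
for n in (1, 2, 4, 8, 16):
    print(n, round(speedup(n, 0.1), 2))
```

Under that assumption, each doubling buys less than the one before it, and no number of scorers can do better than 1 / serial_fraction (10x in this case). Measuring the real serialized share would be the first step toward picking an area size.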

There is really no reason the master process needs to be on the same side of the firewall as the scorer processes. This means it should be possible to write a Web Service that allows 3rd parties to submit scoring jobs. For example, realtors could score each available property based on proximity to other features relevant to a particular homebuyer.

What I’m proposing here sure sounds a lot like what Hadoop does. I wonder if we will ever see the day when we can use something like Hadoop for geoprocessing with ArcObjects on .NET?

Geoprocessing Sandboxes as a Service

Though I haven’t used it in production, I really like the ESRI geoprocessing server concept. ESRI’s approach involves authoring, publishing and consuming. However, from what I can tell, it requires authors to do their work inside the firewall.

I contend that authoring from outside the firewall would have business value.

For example, say I’m a data vendor with a very large geodatabase. My business model involves selling CPU cycles, data I/O, memory usage etc. I’m too busy keeping my data accurate and current to mess with writing all sorts of geoprocessing models. Instead, I would like to provide a platform where others author the models and I simply charge them for resources used.

Users would create an account with their credit card, similar to Amazon Web Services. Once they have an account they may create .NET assemblies and submit them as a job, perhaps via REST. The server runs the assembly within a sandbox, measuring database I/O, CPU cycles, memory usage etc., and applies charges to the user’s account accordingly. Because the assembly runs inside the firewall, it has more efficient access to the DBMS, so much more interesting things can be done.
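The metering half of that idea is easy to sketch. The example below uses Python rather than .NET, only meters CPU time and peak memory (no database I/O), and does no sandboxing at all; the rates and the `run_metered` name are hypothetical.

```python
import time
import tracemalloc

# Hypothetical prices, chosen only to make the arithmetic visible.
RATE_PER_CPU_SECOND = 0.0001
RATE_PER_PEAK_MB = 0.00001

def run_metered(job, *args):
    """Run `job`, returning (result, usage dict, charge)."""
    tracemalloc.start()
    cpu_start = time.process_time()
    try:
        result = job(*args)
    finally:
        cpu_seconds = time.process_time() - cpu_start
        _, peak_bytes = tracemalloc.get_traced_memory()  # (current, peak)
        tracemalloc.stop()
    usage = {"cpu_seconds": cpu_seconds, "peak_mb": peak_bytes / 1e6}
    charge = (usage["cpu_seconds"] * RATE_PER_CPU_SECOND
              + usage["peak_mb"] * RATE_PER_PEAK_MB)
    return result, usage, charge
```

A real service would need to run each job in an isolated process with hard limits, since metering alone does nothing to stop a hostile or runaway job.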

To make it easier for authors, I would provide Visual Studio plugins to assist the development of assemblies. Perhaps they submit source code to the service, and let the server compile it. Or maybe something like the new dynamic lookup would allow an author to test against a small local sample database, then submit the assembly to the server once it works. Or better yet, maybe a thin-client IDE that allows editing of code on the server itself. Maybe ESRI could web-enable ModelBuilder?

The submitted job could be configured to stay alive, and the user could then, in turn, sell use of his service to his own subscribers. This would be similar to WeoGeo’s concept but with services instead of data.

In a way, this would be crowdsourcing the development of geoprocessing models. Doing something like this would allow ESRI to bring in more developers (authors) of ArcGIS Server applications. Right now the cost of ArcGIS Server is just too high, both in terms of license costs and administration. Perhaps Amazon will spatially enable SimpleDB so that, in conjunction with EC2, authors of services will have a low-cost platform.

MapReduce for Large Geodatasets


Here’s an interesting video where Google describes how they use MapReduce to build connectivity in their street data. In ESRI terminology, this is how they clean and build topology using parallel processing. They also briefly mention using it to render map tiles.
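To show the shape of such a job, here is a toy version in plain Python. It is my own guess at the general approach, not Google's pipeline from the video: the map step emits each segment's endpoints, a dict plays the role of the shuffle, and the reduce step lists the segments meeting at each node. Rounding coordinates is a naive stand-in for real snapping.

```python
from collections import defaultdict

def map_segment(seg_id, coords, precision=6):
    """Map step: emit (snapped endpoint, segment id) for both ends of a street segment."""
    for x, y in (coords[0], coords[-1]):
        yield (round(x, precision), round(y, precision)), seg_id

def reduce_node(node, seg_ids):
    """Reduce step: all segments sharing an endpoint are connected at that node."""
    return node, sorted(seg_ids)

def build_connectivity(segments):
    groups = defaultdict(list)  # stands in for the shuffle phase
    for seg_id, coords in segments.items():
        for node, sid in map_segment(seg_id, coords):
            groups[node].append(sid)
    return dict(reduce_node(n, ids) for n, ids in groups.items())

# Hypothetical sample: three segments meeting at (1.0, 0.0).
streets = {
    "A": [(0.0, 0.0), (1.0, 0.0)],
    "B": [(1.0, 0.0), (1.0, 1.0)],
    "C": [(1.0, 0.0), (2.0, 0.0)],
}
print(build_connectivity(streets)[(1.0, 0.0)])
```

Because each map call looks at one segment and each reduce call looks at one node, both steps can be spread over as many machines as are available.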

They don’t go into detail, but apparently those of us outside Google could do this sort of thing using Hadoop on Amazon EC2.

A challenge with tile caches is keeping them up to date with the vectors they depict. Here is how ESRI does it. I think ESRI needs to allow us to scale tile generation across a large number of CPUs the way Google does. The licensing model needs to allow this. It seems like open source geo software on a paid AMI could be coupled with Hadoop on EC2 to do this.

Once that happens, an agency like a state data center could rebuild tile caches on EC2/S3 nightly from, for example, a statewide vector layer of parcel maps.

I’ve heard rebuilding a geodatabase topology for the nationwide census takes over 24 hours. I bet a MapReduce approach would be much faster for this too.

WeoGeo is Finalist in Amazon Startup Challenge

WeoGeo, discussed in an earlier blog, has been named as a finalist in Amazon’s Startup Challenge.

I think they really deserve it. Wish them luck!

More GIS in the Cloud

clouds
From EnchantedLearning.

Peter Batty is looking into EC2 for his new venture:

… thinking seriously about using Amazon EC2 and S3 when we roll out, especially now that Amazon has added new “extra large” servers with 15 GB of memory, 8 “EC2 Compute Units” (4 virtual cores with 2 EC2 Compute Units each), and 1690 GB of instance storage, based on a 64-bit platform - these servers should work well for serious database processing.

Amazon has details on the new instance types Peter refers to here.

With such large amounts of memory available, it seems possible to build some really killer route finding services.

Microsoft is working on something similar to EC2. I just hope ESRI provides 64-bit, and a license policy that allows cloud deployment when Microsoft comes up with something.

In response to EC2 questions, Microsoft CTO Ray Ozzie said:

Amazon Web Services [are] … showing Web 2.0 startups that there might actually be something there with regard to this utility computing model. Whether it’s the right set of services exactly, or whether the way that they’ve designed them is exactly what matches the needs of those potential developers, there are some questions. But I think they’ve done the industry a service by beginning to open people’s [in other words, Microsoft's] eyes to the potential.

I don’t have any announcements at this point in time. But directionally, I think you could see in my presentation that we believe very heavily in this utility computing fabric concept; it’s the only way, even internally focused, it’s the only way we can get scale amongst all the properties we run internally. And I think it just makes sense to offer those services to developers and to enterprise customers over time.

Sounds like the same business case Bezos made for AWS.

It’s not so much that (Amazon Web Services) has something to do with selling books. It’s the inverse: Selling books has a lot to do with this.

Amazon as Infrastructure for OpenGeo Data?

Charlie Savage has a post worth reading. He says:

In addition to storing the tiles, you’d need machines to store the original vector data, render the tiles and serving them on the Web. And we haven’t even talked about imagery yet, which takes even more space. Even then, having the hardware isn’t enough. You’d also have to have the expertise to run it all, which perhaps is Google’s most treasured secret.

Lately I’ve been exploring Amazon Web Services. I wonder if a combination of S3, EC2 and SQS could provide the infrastructure. S3 and SQS are straightforward. My next step involves EC2: build an Amazon Machine Image (AMI) that runs OSGeo software. It looks like it will be a challenge. Has anyone done this?

The more compelling EC2 sites allow users to extract/transform/load and recombine their own data with others. Jamglue does it with audio. Very slick, check out the tour! Pixnate does it with imagery. Why can’t we do it with spatial data?

steve martin
I can’t help it, but every time I look into open source and run into a problem I think of Steve Martin’s epiphany in “The Jerk”: “Ahh, so it’s a profit making deal!”.

With that in mind, maybe the Amazon Web Services Start-up Challenge will get some open source geo folks busy building some spatially aware AMIs.

One possible idea for the challenge might be a site that lets users upload, georeference and publish their own spatial data. Maybe it could even be made available for purchase at an Amazon store?

Look at it from Amazon’s perspective: the other big players (Microsoft, Google, and Yahoo) are all heavily involved in geospatial data. Amazon is missing the boat. Here’s their chance to catch up.

Navteq and Tele Atlas provide spatial data for GPS receivers in the same way iTunes provides data for iPods. Since Amazon is now competing with iTunes by selling DRM-free MP3 files, maybe they could do something similar to compete with Tele Atlas/Navteq?

Of course the difficulty would be incentivizing data collectors to make the effort to match the quality offered by Navteq/Tele Atlas. Maybe Inrix’s smart dust concept could be extended? The coverage of the Inrix real-time data is disappointing; they need more data collectors. As city-wide WiFi becomes commonplace, WiFi-enabled GPS will follow. As traffic gets worse, demand for real-time traffic data will increase. Amazon could sell this data and give data collectors a discount. Every time you drive to work with your GPS on, you are collecting potentially valuable data. More on this later.

Tiling Tools

This news about a new chip called Tilera is interesting. Strange naming coincidence - it seems the chip could be useful for keeping map cache tiles updated.

I’ve heard that keeping map tile caches updated is a challenge with ArcGIS Server. I see that GenerateMapServerCache has a thread count property. I wonder how hard it would be to generalize this to spawn subprocesses on different chips.

Or more generally, would it be possible to write a tile cache generator that could run in Amazon EC2, writing tiles to S3?
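The bookkeeping side of such a generator is simple enough to sketch. The code below assumes the common Web Mercator XYZ tiling scheme (which is not necessarily how an ArcGIS Server cache is laid out) and a hypothetical `tiles/z/x/y.png` key layout in S3; rendering and uploading are left out entirely, and bounding boxes crossing the antimeridian are not handled.

```python
import math

def lonlat_to_tile(lon, lat, zoom):
    """Return the (x, y) index of the Web Mercator XYZ tile containing a point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so that lon = 180 or extreme latitudes stay inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)

def tiles_for_bbox(west, south, east, north, zoom):
    """Yield (zoom, x, y) for every tile touching the bounding box."""
    x0, y0 = lonlat_to_tile(west, north, zoom)  # tile rows count down from the north
    x1, y1 = lonlat_to_tile(east, south, zoom)
    for x in range(x0, x1 + 1):
        for y in range(y0, y1 + 1):
            yield zoom, x, y

def s3_key(zoom, x, y, prefix="tiles"):
    """Hypothetical S3 object key layout for one tile."""
    return f"{prefix}/{zoom}/{x}/{y}.png"
```

Each tile is independent of the others, so the list from `tiles_for_bbox` could be chopped into batches and handed to as many EC2 instances as the budget allows.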

Geography & Data Centers

With so many earthquakes near Silicon Valley, it probably wasn’t hard for Frank Robles of CityNAP to convince Google last year to store geographic data on more stable ground here in San Antonio.

Google is one of CityNAP’s first customers. CityNAP is a secondary site for storage of the geographical data that Google provides on the Internet.

Now Microsoft is building a huge data center here too.

Environmental Impacts

I like data centers. They add a lot to the tax base without adding much to the population. They do consume a lot of electricity though. Microsoft will be CPS Energy’s largest customer. I don’t understand why CPS doesn’t provide time-of-use billing. If electricity were sold at lower prices during off-peak hours, data centers would be motivated to implement Thermal Energy Storage (TES): specifically, making ice at night to provide cooling during the following day. Kilowatt-hours generated off-peak are a lot cleaner.

There’s interest in wind-generated electricity, but the wind blows strongest at night. Some have looked at TES for wind, but it seems like that job could be shifted to the data center itself.

Programmer Impacts

It appears this data center might also host Microsoft’s answer to Yahoo’s Electric Cloud. There must be a bad pun in this somewhere … every cloud has a silverlighting?

What I want to know is how I can perform geoprocessing against different layers in different databases in a data center without being killed by round-trips. It seems to me that if two different data vendors store their data in the same data center there should be a way to spatially join their data behind the firewall, relieving me from fetching each feature here to the client. Isn’t there some sort of protocol to support this?

Right now if I want a list of all points from vendor A, that fall within polygons from vendor B, I am stuck with lots of round trips, or locally caching and doing the overlay. (?)
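For reference, the overlay itself is not much code. Below is a brute-force sketch using the standard ray-casting test; the function names and the idea of returning only (point id, polygon id) pairs are my assumptions about what a server-side join might hand back, not an existing protocol.

```python
def point_in_polygon(pt, ring):
    """Ray-casting test: is `pt` inside the polygon ring (list of (x, y) vertices)?"""
    x, y = pt
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Count edges that a horizontal ray from pt toward +x crosses.
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside

def spatial_join(points, polygons):
    """Return (point id, polygon id) pairs; brute force, no spatial index."""
    return [(pid, gid)
            for pid, pt in points.items()
            for gid, ring in polygons.items()
            if point_in_polygon(pt, ring)]
```

If something like `spatial_join` ran next to both vendors' data, only the matching id pairs would need to cross the firewall instead of every feature.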

It seems like if a protocol existed, there would be a network effect such that geographic data would become more valuable - Vendor A and B could sell data at higher prices, thus attracting even more vendors to the marketplace.

If Microsoft intends to sell ads:

Once the service is formally launched, Microsoft will keep it free for light users but ask heavy users either to allow Microsoft to sell advertising space in exchange for unlimited usage or pay a nominal fee.

It seems like advertisers interested in geographically targeting an audience would be keen to pay for ads returned by multi-vendor spatial queries.

Taxes, Web 2.0

To put Web 2.0 to the test I decided to wait until the last minute to e-file my taxes. Using TurboTax online last night, I was surprised to experience only a few timeouts.

Update: others weren’t so lucky.

I wonder how they go about scaling their servers. What do they do with all the spare capacity during the rest of the year? I’ve heard EC2 is Amazon’s way of generating revenue during off-peak periods - I’d be interested in hearing how much EC2 performance degrades during Christmas shopping season.

Perhaps more importantly, what does TurboTax do with all the data they collect? They promise me confidentiality, and yet they implicate me and everyone else here in San Antonio as being stingy.

OK, so if they’ve decided publication of aggregated tax return data doesn’t violate confidentiality standards, why don’t they become a data vendor? It’d be interesting to see a map showing ZIP codes shaded by percent of last-minute filers. In looking through their products, I don’t see any mention of geodata.
