[OSM-dev] Distributed Data Store

Thu Jan 22 10:54:52 GMT 2009

Hi Scott,

Scott Shawcroft wrote:
> Stefan de Konink wrote:
>> - Admins don't want to maintain multiple systems
>> - The fear of anything new not developed by the devs (especially if it 
>> is not built in Ruby)
> Who are the admins for the systems?

Tom Hughes is a factor to take into account. All your base...

>  We're open to particular solutions 
> and if there is a bias towards Ruby we'd look closer at it.  However, it 
> may be that there is a better solution.  Who are the designated devs?

That is basically a 'free for all'. Read the SVN history and/or this 
list to find out which people are working on OSM. Personally I am 
working on a C implementation of the API. Other people tend to work on 
the official Ruby on Rails one.

> Also, Amazon WebServices could be used to have virtual machines instead 
> of real ones which need maintenance.

If Amazon wants to sponsor OSM, that is a great thing ;)

>> Technical problems might be more interesting:
>>
>> - Synchronization issues, even for a proxy solution; single or 
>> multiple write databases should distribute their data. Out of sync 
>> scenarios etc.
>> - Especially geo related issues, how to distribute a real geoquery.
> Totally, synchronization is important.  Simple partitioning wouldn't 
> have this problem but if multiple copies will be shared then we could 
> get into trouble.
> 
> I think the geo element is what makes this more interesting than the 
> standard data storage issue.

The main point is that OSM is by design not a GIS database. We can make 
it one, but the current features approach the dataset in a 'traditional' 
way. This is not bad per se, though some problems would lend themselves 
well to GIS solutions.

>>> We're interested in trying our hand at creating a better system for 
>>> storing OSM data.  We're interested in what kind of computing 
>>> resources to design for (how many machines) and whether we can get 
>>> access logs in order to test our implementation against.
>>
>> Regarding access logs, I ran into a long brick wall; it might be 
>> better to use a requester that just makes random requests. Sources 
>> are available for that.
> Well, randomness is probably not the best model.  I imagine that the 
> server's traffic patterns are also geo related.  For example, people are 
> more likely to work on areas they are near and areas on the earth in 
> daylight or evening are more likely to have those people accessing the 
> site.   Or perhaps a mapping party has a number of people working on the 
> same area all at once.  A simple geo partitioning would drive all of 
> this traffic to one particular server.  This simple access does work 
> better when retrieving data because it will utilize all the different 
> machines.

As Erik pointed out, the diffs will give you the writes. I think reads 
are more interesting.
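
Without access logs, one way to approximate geo-related read traffic is 
a requester that clusters its bounding boxes around a few hotspots 
instead of drawing them uniformly. The following is only a rough sketch 
under my own assumptions: the function names, the hotspot list, and the 
80/20 split are all hypothetical and not taken from any existing OSM 
tool or from real traffic figures.

```python
import random
from collections import Counter

def clustered_bboxes(n, hotspots, hot_fraction=0.8, spread=0.05,
                     size=0.01, seed=0):
    """Generate n small bounding boxes (min_lon, min_lat, max_lon, max_lat).

    A share `hot_fraction` of the boxes is placed near one of the given
    (lon, lat) hotspots; the rest is spread uniformly over the globe.
    All parameters are illustrative assumptions, not measured values.
    """
    rng = random.Random(seed)  # seeded so that test runs are repeatable
    boxes = []
    for _ in range(n):
        if hotspots and rng.random() < hot_fraction:
            lon0, lat0 = rng.choice(hotspots)
            lon = rng.gauss(lon0, spread)
            lat = rng.gauss(lat0, spread)
        else:
            lon = rng.uniform(-180.0, 180.0)
            lat = rng.uniform(-90.0, 90.0)
        # Clamp so the whole box stays inside valid WGS84 bounds.
        lon = max(-180.0, min(180.0 - size, lon))
        lat = max(-90.0, min(90.0 - size, lat))
        boxes.append((lon, lat, lon + size, lat + size))
    return boxes

def load_per_cell(boxes, cell_deg=10.0):
    """Count requests per grid cell, keyed by (column, row)."""
    counts = Counter()
    for min_lon, min_lat, _, _ in boxes:
        col = int((min_lon + 180.0) // cell_deg)
        row = int((min_lat + 90.0) // cell_deg)
        counts[(col, row)] += 1
    return counts
```

Feeding the generated boxes through load_per_cell shows how a mapping 
party style hotspot ends up in a single cell, which is the case where a 
simple geo partitioning sends all traffic to one server.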

>>> Also, we'd love to have OSM community members involved since we're 
>>> new to the organization.
>>>
>>> Lastly, I think we plan to donate our code to the community with the 
>>> hope that it is useful.
>>>
>>> What do you think?
>>
>> I would love to brainstorm with you :) I want to spend the next 
>> month on my MSc thesis about improving native geospatial support in 
>> MonetDB, and on getting the OSM data into it. It would of course be 
>> great if the ideas coming out of such a session can make it to State 
>> of the Map 2009.
>>
>> It would be good to point you at DBslayer (the standard implementation 
>> or the Cherokee one). It will balance requests, and with a better 
>> balancer it could do geobalancing too :)
> I'll have to take a look at it.  Existing solutions are good but we are 
> really looking at laying down some code too I think.

Creating, for example, a specific SQL-based scheduler that can handle 
partitions was something I was thinking about during the night:

http://code.google.com/p/cherokee/issues/detail?id=328
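
To make the partition idea a bit more concrete, here is a rough sketch 
of how a balancer could pick backends for a bounding box query. It is 
not what the issue above specifies and not DBslayer or Cherokee code: 
the fixed grid, the function names, and the backend names are 
assumptions made purely for illustration. Boxes crossing the 
antimeridian are not handled.

```python
def cells_for_bbox(min_lon, min_lat, max_lon, max_lat, cell_deg=90.0):
    """Return the (col, row) grid cells that a WGS84 bounding box overlaps."""
    cols = int(360.0 / cell_deg)
    rows = int(180.0 / cell_deg)
    # Clamp to the last cell so lon=180 or lat=90 does not fall off the grid.
    c0 = min(int((min_lon + 180.0) // cell_deg), cols - 1)
    c1 = min(int((max_lon + 180.0) // cell_deg), cols - 1)
    r0 = min(int((min_lat + 90.0) // cell_deg), rows - 1)
    r1 = min(int((max_lat + 90.0) // cell_deg), rows - 1)
    return [(c, r) for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]

def backends_for_bbox(bbox, backends, cell_deg=90.0):
    """Pick the backends holding the partitions a bounding box touches.

    Cells are assigned to backends round-robin by cell index; a real
    scheduler would more likely use a lookup table that can be rebalanced.
    """
    cols = int(360.0 / cell_deg)
    chosen = []
    for c, r in cells_for_bbox(*bbox, cell_deg=cell_deg):
        name = backends[(r * cols + c) % len(backends)]
        if name not in chosen:
            chosen.append(name)
    return chosen
```

A small box then goes to a single backend, while a box that straddles a 
cell border has to be sent to every backend involved and the results 
merged, which is where the real geoquery distribution problem starts.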

Stefan