[OSM-dev] Bounding Box

Frederik Ramm frederik at remote.org
Tue Feb 6 01:04:53 GMT 2007


Nick,

(I'm taking this over to dev, from talk; it is one of those "wouldn't we 
be better off with another indexing scheme" discussions.)

 > z)You can also add to that things like: Tendency for a query to flush 
 > other used data from a cache
 >
 > y) I/O taken to retrieve perhaps hundreds of millions of data points,
 > sort them, find duplicates, return whatever page

 > If we used a more efficient indexing scheme, the database may perhaps 
 > more easily determine the implication of the request from bounding 
 > box information.

 > To perform tests, get the mysql schema from SVN. Load it into mysql.
 > Get the planet dump then use planet2mysql to generate the OSM dataset
 > in mysql.

 > Download the API from SVN. Point the API at the database. Run queries.
 > Make changes to the database. Find some fabulous new way of increasing
 > the speed with unlimited bounding boxes and no server slowdown. give
 > us all the figures.

 > Convince everyone. I'll buy you a pint.

Let me first say that you have my highest regard for the sheer amount 
and detail of experimentation you have already done. I have followed 
this for a while now, and I can imagine how tired you must be getting 
of having someone pop up on a weekly basis and request that you use 
<insert name of arbitrary established product or technique> to make 
everything perfect.

I have experienced similar situations, though much less frequently of 
course, when I described the project and the database setup to friends, 
who invariably shrugged and said something like "well, there are 
databases that specialize in this kind of job, why don't they..." and 
so on, to which I of course always replied that the complexity is not 
to be underestimated, that tests and comparisons have been run, that no 
out-of-the-box PostGIS can match the current performance, etc. etc. etc.

Nonetheless I wonder why you are so keen on having everything on this 
one database server. In my eyes, this builds up complexity in a totally 
unnecessary way. Read requests are much more frequent than writes. As we 
are nearing completion of the globe [;-)] the percentage of writes will 
become smaller all the time. Is it not viable to separate reading from 
writing?

Firstly, we don't support transactions anyway. Nobody can be sure that 
data he has read in one instant is still unchanged when he uploads a 
modification seconds later. I don't think we ever will, and I don't 
think we need to; it's fine as it is. So it wouldn't matter if the data 
had been read from some "trickle-down" server that gets updates from 
the central machine, instead of directly from the central machine.

Secondly, we have such a distributed read-only scheme in effect already, 
albeit a very imperfect one: The planet file. The planet file is a 
lifeline for a multitude of worthwhile projects already - how else would 
I be able to draw a map of all railway lines in my country? - and people 
will not always be content with 7-day-old data.

A solution separating reads from writes would be indefinitely scalable, 
at the expense of minimal delays. You could have one "root" server at 
your site which serves bounding box requests as it does now, on a 
limited basis, plus you would build an interface by which a small number 
of "peers" all over the world would receive immediate updates. (Unsure 
here if actual mysql replication is advisable over the Internet. If not, 
choose something simple.) These peers would then be free to define 
their own access restrictions, e.g. allow larger bounding boxes or 
whatever. In the long run, we would see specialized peers popping up - 
some might only carry data within their local area, some might have a 
thematic filter and only carry railroads worldwide, some might have big 
funding and carry everything but make it available only to their 
organisation. Whatever. Peers could also be cascaded.

I firmly believe that ultimately, we will need something like this. 
Writing to the database is a whole different ballgame and as distributed 
writing is immensely complex, I'd stick to the central server for 
writing as long as possible.

It is here that I would like to see brain power invested: First, setting 
up one single machine doing only the read requests, at your site. Then, 
more machines, and "divide and conquer". I, for example, do not have the 
resources to run a full copy of all OSM data. But I could easily set 
aside a server that would carry the "Germany" bounding box and serve any 
and all read requests for the next year or so, and I believe so could 
others. That would take a huge strain away from your systems, and you 
will need the computing power to cope with write requests and things 
like anti-vandalism protection that you will have to implement over the 
coming years.

Of course, the mechanisms for selective data feeds have yet to be 
devised, and editors have to be modified to issue read requests to a 
configurable server while sending writes to the central server. There 
will be issues, such as peers carrying incomplete or erroneous data, 
for which solutions have to be found. I'm not saying it is easy, but I 
dare say it is the way forward.
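The editor-side change could be quite small. A sketch (host names and paths are made up, not the real API):

```python
# Hypothetical editor routing: GET requests (map download, element
# fetch) go to a configurable read mirror; uploads always go to the
# central server.
class OsmClient:
    def __init__(self, read_host, write_host="www.openstreetmap.org"):
        self.read_host = read_host
        self.write_host = write_host

    def url_for(self, method, path):
        # Reads may be served by any peer; writes must stay central.
        host = self.read_host if method == "GET" else self.write_host
        return "http://%s/api/%s" % (host, path)


client = OsmClient(read_host="osm-mirror.example.de")
client.url_for("GET", "map?bbox=8.3,48.9,8.5,49.1")  # goes to the mirror
client.url_for("PUT", "node/1234")                   # goes to the central server
```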

Any thought invested in how we can serve end-users with complex bounding 
boxes from our central server would be better used for devising a 
sophisticated scheme to distribute our data.

I don't want to appear all talk and no action, and I'd happily help with 
steps in that general direction, but frankly I don't know where to 
start. Would it be worthwhile for me to modify the current API to 
generate a changelog suitable for distributed replication? Or is the API 
about to be thrown away anyway? Or are there other constraints or 
objections that would keep you from using such a scheme? Is hardware a 
problem? I'd hate to waste time by programming something that's not in 
demand.
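For what it's worth, the changelog I have in mind would be nothing fancy - an append-only text file that peers can tail and replay. A sketch of one possible line format (entirely my invention, not anything that exists in SVN):

```python
# Hypothetical changelog format, one entry per line:
#   <timestamp> <action> <type> <id> <payload>
# where action is create/modify/delete and type is node/segment/way.
def parse_changelog_line(line):
    """Parse one changelog entry into a dict a peer can apply."""
    ts, action, elem_type, elem_id, rest = line.split(" ", 4)
    return {
        "timestamp": ts,
        "action": action,      # create / modify / delete
        "type": elem_type,     # node / segment / way
        "id": int(elem_id),
        "payload": rest,       # e.g. "lat lon" for a node
    }


entry = parse_changelog_line(
    "2007-02-06T01:04:53Z modify node 42 49.0015 8.3888")
```

A peer would fetch new lines since its last known timestamp, apply those matching its filter, and discard the rest.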

Bye
Frederik

-- 
Frederik Ramm  ##  eMail frederik at remote.org  ##  N49°00.09' E008°23.33'

