[OSM-dev] Bounding Box
Frederik Ramm
frederik at remote.org
Tue Feb 6 01:04:53 GMT 2007
Nick,
(I'm taking this over to dev, from talk; it is one of those "wouldn't we
be better off with another indexing scheme" discussions.)
> z) You can also add to that things like: Tendency for a query to flush
> other used data from a cache
>
> y) I/O taken to retrieve perhaps hundreds of millions of data points,
> sort them, find duplicates, return whatever page
> If we used a more efficient indexing scheme, the database may perhaps
> more easily determine the implication of the request from bounding
> box information.
> To perform tests, get the mysql schema from SVN. Load it into mysql.
> Get the planet dump then use planet2mysql to generate the OSM dataset
> in mysql.
> Download the API from SVN. Point the API at the database. Run queries.
> Make changes to the database. Find some fabulous new way of increasing
> the speed with unlimited bounding boxes and no server slowdown. give
> us all the figures.
> Convince everyone. I'll buy you a pint.
Let me first say that you have my highest regard for the sheer amount
and detail of experimentation you have already done. I have followed
this for a while now, and I can imagine how you must be getting tired
having someone pop up and request that you use <insert name of arbitrary
established product or technique> to make everything perfect on a weekly
basis.
I have experienced similar situations, much less frequently of course,
when I described the project and the database setup to friends, who
invariably shrugged and said something like "well, there are databases
that specialize in this kind of job, why don't they..." and so on, to
which I of course always replied that the complexity is not to be
underestimated, that tests and comparisons have been run, that no
out-of-the-box PostGIS can match the current performance, etc. etc. etc.
Nonetheless I wonder why you are so keen on having everything on this
one database server. In my eyes, this builds up complexity in a totally
unnecessary way. Read requests are much more frequent than writes. As we
are nearing completion of the globe [;-)] the percentage of writes will
become smaller all the time. Is it not viable to separate reading from
writing?
Firstly, we don't support transactions anyway. Nobody can be sure that
data he has read in one instant is still unchanged when he uploads a
modification seconds later. I don't think we ever will, and I don't think
we need to; it's fine as it is. So it wouldn't matter if the data had been
read from some "trickle-down" server that gets updates from the central
machine, instead of directly from the central machine.
Secondly, we have such a distributed read-only scheme in effect already,
albeit a very imperfect one: The planet file. The planet file is a
lifeline for a multitude of worthwhile projects already - how else would
I be able to draw a map of all railway lines in my country? - and people
will not always be content with 7-day-old data.
A solution separating reads from writes would be indefinitely scalable,
at the expense of minimal delays. You could have one "root" server at
your site which serves bounding box requests as it does now, on a
limited basis, plus you would build an interface by which a small number
of "peers" all over the world would receive immediate updates. (Unsure
here if actual mysql replication is advisable over the Internet. If not,
choose something simple.) These peers would then be free to define
their own access restrictions, e.g. allow larger bounding boxes or
whatever. In the long run, we would see specialized peers popping up -
some might only carry data within their local area, some might have a
thematic filter and only carry railroads worldwide, some might have big
funding and carry everything but make it available only to their
organisation. Whatever. Peers could also be cascaded.
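To sketch what such a peer-side filter might look like (all names and
field layouts here are made up for illustration, not an existing OSM
format):

```python
# Sketch of a filter a peer could apply to an incoming update feed.
# A change is assumed to be a dict with 'lat', 'lon' and a 'tags' dict;
# this is a hypothetical shape, not the real API.

def peer_accepts(change, bbox=None, required_tag=None):
    """Return True if this peer should store the change.

    bbox:         (min_lat, min_lon, max_lat, max_lon), or None for worldwide.
    required_tag: keep only elements carrying this tag key, e.g. 'railway'.
    """
    if bbox is not None:
        min_lat, min_lon, max_lat, max_lon = bbox
        if not (min_lat <= change['lat'] <= max_lat and
                min_lon <= change['lon'] <= max_lon):
            return False
    if required_tag is not None and required_tag not in change['tags']:
        return False
    return True

# A "railways only" peer, worldwide:
change = {'lat': 49.0, 'lon': 8.4, 'tags': {'railway': 'rail'}}
print(peer_accepts(change, required_tag='railway'))   # True
```

A local-area peer would pass a bbox instead of (or in addition to) a tag
filter; the point is that the filtering policy lives entirely at the
peer, not at the central server.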
I firmly believe that ultimately, we will need something like this.
Writing to the database is a whole different ballgame and as distributed
writing is immensely complex, I'd stick to the central server for
writing as long as possible.
It is here that I would like to see brain power invested: First, setting
up one single machine doing only the read requests, at your site. Then,
more machines, and "divide and conquer". I, for example, do not have the
resources to run a full copy of all OSM data. But I could easily set
aside a server that would carry the "Germany" bounding box and serve any
and all read requests for the next year or so, and I believe so could
others. That would take a huge strain away from your systems, and you
will need the computing power to cope with write requests and things
like anti-vandalism protection that you will have to implement over the
coming years.
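The routing decision for such a regional peer is simple: serve the
request locally only if the whole requested box lies inside the area the
peer carries, otherwise send the client to the root server. A minimal
sketch (the coordinates below are rough, hypothetical extents):

```python
# Illustrative routing check for a regional read-only peer.
# Bounding boxes are (min_lat, min_lon, max_lat, max_lon).

GERMANY = (47.2, 5.8, 55.1, 15.1)   # rough extent, made up for the example

def can_serve_locally(request_bbox, coverage=GERMANY):
    """True if the requested box is fully contained in this peer's coverage."""
    rmin_lat, rmin_lon, rmax_lat, rmax_lon = request_bbox
    cmin_lat, cmin_lon, cmax_lat, cmax_lon = coverage
    return (cmin_lat <= rmin_lat and cmin_lon <= rmin_lon and
            rmax_lat <= cmax_lat and rmax_lon <= cmax_lon)

print(can_serve_locally((48.0, 8.0, 49.0, 9.0)))    # True: inside Germany
print(can_serve_locally((40.0, -5.0, 41.0, -4.0)))  # False: refer to root
```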
Of course, the mechanisms for selected data feeds still have to be
devised, and editors have to be modified to issue read requests to a
configurable server while sending writes to the central server. There
will be issues like peers carrying incomplete or erroneous data for
which solutions have to be devised. I'm not saying it is easy, but I
dare say it is the way forward.
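On the editor side, the change amounts to little more than picking a
server per HTTP method. A sketch, with hypothetical URLs:

```python
# Hypothetical editor configuration: reads go to a user-chosen peer,
# writes always go to the central server. The URLs are placeholders.

READ_BASE = "http://peer.example.org/api"        # configurable per user
WRITE_BASE = "http://www.openstreetmap.org/api"  # fixed central server

def base_url(method):
    """Pick the server for an HTTP method: writes central, reads peer."""
    return WRITE_BASE if method in ("PUT", "POST", "DELETE") else READ_BASE

print(base_url("GET"))   # read from the configured peer
print(base_url("PUT"))   # write to the central server
```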
Any thought invested in how we can serve end-users with complex bounding
boxes from our central server would be better spent devising a
sophisticated scheme to distribute our data.
I don't want to appear all talk and no action, and I'd happily help with
steps in that general direction, but frankly I don't know where to
start. Would it be worthwhile for me to modify the current API to
generate a changelog suitable for distributed replication? Or is the API
about to be thrown away anyway? Or are there other constraints or
objections that would keep you from using such a scheme? Is hardware a
problem? I'd hate to waste time by programming something that's not in
demand.
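By "changelog" I mean something as simple as one record per write,
emitted in order, that a peer can replay. One possible shape (all field
names are a guess on my part, not an existing format):

```python
# Sketch of a replication changelog entry the central API could emit on
# every write. Peers would tail this log and apply entries in order.

import json
import time

def changelog_entry(action, element_type, element_id, payload):
    """Serialise one write as a self-describing, replayable record."""
    return json.dumps({
        "action": action,        # "create" | "modify" | "delete"
        "type": element_type,    # "node" | "segment" | "way"
        "id": element_id,
        "timestamp": int(time.time()),
        "payload": payload,      # full element state after the change
    })

entry = changelog_entry("modify", "node", 1234,
                        {"lat": 49.0, "lon": 8.4, "tags": {}})
```

Carrying the full element state (rather than a delta) keeps replay
idempotent, which matters when a peer reconnects after an outage and
replays entries it may already have seen.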
Bye
Frederik
--
Frederik Ramm ## eMail frederik at remote.org ## N49°00.09' E008°23.33'