[OSM-dev] Too many slow queries in db

Tue Sep 4 08:45:25 BST 2007

In message <20070904015357.A36428-100000 at xs2.xs4all.nl>
        Stefan de Konink <skinkie at xs4all.nl> wrote:

> On Tue, 4 Sep 2007, Tom Hughes wrote:
>
>> In message <20070904003537.I31320-100000 at xs2.xs4all.nl>
>>           Stefan de Konink <skinkie at xs4all.nl> wrote:
>>
>> > On Mon, 3 Sep 2007, Tom Hughes wrote:
>> >
>> > > Seriously, if you've got a magic way to make a database query
>> > > a 100 million row table in the blink of an eyelid I'll be only
>> > > to happy to deploy it.
>> >
>> > By prioritizing queries?
>>
>> How do I prioritise queries in MySQL? Whose are most important
>> anyway? Yours? Mine?
>
> The restriction of MySQL is self made, if someone needs to cook and extra
> wrapper for this (poor) database to support this, it has to be done. The
> one that are most important are the interactive ones. Maybe by spatial
> size for example.

We already restrict spatial size and (effectively) deprioritise the
main sources of bulk queries (tiles at home and the bulk uploader).

>> > In a way putting some thing [1] in front of this *poor* 100 million row
>> > database because in the current way the system will never be able to scale?
>>
>> How is a proxy supposed to help? Does it cache responses or something?
>
> You could plug in extra hardware without downtime and without coding an
> extra OSM specific way. Since this same layer *should* support query
> translation you could even think about spatial migration of data. Tricky,
> maybe, but if it is done right in the real programming language with one
> letter I assume no problems.

But first we would have to do significant work on our code to make it
work on a data set that had been split in multiple pieces.

Implementing an entire proxy solution just to avoid a few minutes
downtime while switching to a new database server is bonkers - there
would be far more downtime while the data was copied anyway.

When there is need to use multiple physical database servers, and
assuming we can engineer a way to split the data (which I think will
be very hard) we can worry about whether something like the proxy
would be useful.

>> > Although that every query walks trough 100 million rows, *sounds* a bit
>> > absurt.
>>
>> No, but they frequently read a substantial percentage due to the
>> index being one dimensional and the data two dimensional. Combined
>> with table level locking you have a recipe for problems.
>
> And this is the reason why some individuals suggested PostGIS in
> january 2007 on OSM-Talk? (Read this before I posted my first message, it
> was about spatial indexing of this system.)

and nine months on we're still waiting for some numbers...

Let me make this plain. I would *love* to be using Postgres instead
of MySQL. My experience with MySQL so far has been horrible and my
experience with Postgres much better. That said I might quickly change
my mind if I had to deal with Postgres on a data set as big as ours.

That doesn't mean I'm in favour of R-Tree indexes, let along using
the full PostGIS stuff. I was certainly in favour of R-Tree indexes
when I first came across OSM because I had had good experience with
them in the past on much smaller datasets.

I can equally see the logic of the counter argument about total index
size the amount of disk I/O and reduced chance of caching that goes
with that. So I want to see numbers to see how much R-Tree indexes
or PostGIS would help.

I would love to use Postgres mainly because I find it not to be
completely braindead in other ways like MySQL is - none of this
nonsense where everything in the manual is clarified by a list of
database engines that the feature does/doesn't work on. The ability
to change a table without rebuilding it completely. The list goes
on basically.

> Catching this all with a proxy (for now) and maybe add some stupid extra
> hardware could get you maybe 3 months to test a better solution.

Caching is out of the question I suspect, and I'm not even clear 
that the proxy thing can do that anyway.

I'm looking at improving the indexing, which I think can probably
buy us a fair bit of time. I actually dumped current_nodes last
night so that I can play with a few things. I will probably also
dump gps_points (which is what is causing many of the current
problems) and play with that.

Tom

-- 
Tom Hughes (tom at compton.nu)
http://www.compton.nu/