[OSM-dev] Database Schema (was Re: [OSM-talk] Oh dear - planet has duplicate id's)

Sat Jun 23 18:25:39 BST 2007

In message <2fc2c5f10706230946p1c3b9a7dk359f7af8a79d2dce at mail.gmail.com>
          "Martijn van Oosterhout" <kleptog at gmail.com> wrote:

> On 6/23/07, Tom Hughes <tom at compton.nu> wrote:
> >     - Split the current index on lat+lon into two separate indexes
> >       for each axis - as it stands the lon component of the index can
> >       only be used when exactly one latitude has been selected (more
> >       or less impossible as it is floating point) so is never used. With
> >       separate indexes the database will be able to decide which is most
> >       efficient for each query and the keys will be smaller so there
> >       will be less I/O for whichever index it uses.
> 
> Are you sure? Firstly, a logically btree index on lat+lon can
> certainly be used for queries like:
> 
> lat < a and lat > b and lon < c and lon > d
> 
> and reading the mysql it docs tells me that a multi-column index would
> be used in this case, since it uses a b-tree compatable operators and
> references both columns. Have you actually examined the plan to check
> that it really isn't using it? Because if it isn't it really needs to
> be documented somewhere as a gotcha as it's rather unexpected.

A BTree index can't be used to do that, at least not sensibly. How
would you decide which ranges of keys to read? You can't because you
can't enumerate the latitudes which are in the range - especially
when it is a floating point field.

In general you can only do a range match like that on the final
component of a key - leading components need to be subject to an
equality match (either directly or because you are somehow able
to enumerate the values which fit a criteria on that component).

Imagine that I have 16 points (1,1) to (4,4) and that I do a query
looking for lat between 2 and 3 and lon between 2 and 3. My index
will look like this (keys I want to match marked with *):

  1+1
  1+2
  1+3
  1+4
  2+1
  2+2 *
  2+3 *
  2+4
  3+1
  3+2 *
  3+3 *
  3+4
  4+1
  4+2
  4+3
  4+4

Now ideally I want to do a keyed read on 2+2 then next to 2+3, then
do a keyed read on 3+2 and next to 3+3. But I don't actually know
what values will match in the first column before I start (and there
might be a very large number in a real world case).

The alternative (and what will happen now) is that I start at 2+2 and
next to 3+3, but that is only marginally better than dropping the second
column and effectively starting at 2+1 and nexting to 3+4 as I am only
saving a little bit on the first and last latitudes.

Tom

-- 
Tom Hughes (tom at compton.nu)
http://www.compton.nu/