[OSM-dev] Temporary tables for the map query?

Mon May 26 17:37:29 BST 2008

2008/5/26 Alex Wilson <alex_wilson at pobox.com>:

> Thanks for the reply and giving me some more background. In fact I wasn't
> planning to use the temporary table queries I included within rails - but to
> assemble the data for streaming for the C++ implementation I'm working on.
> Which, using the aforementioned SQL, shouldn't need to have a memory
> footprint above a few kilobytes. But I guess this is at the expense of
> relying on MySQL to behave sensibly when working on queries that return
> large datasets (i.e. to assemble the data lazily as it's needed by the
> caller - rather than to assemble it all at once in memory).

I'm not sure how well that works when accessed directly from C code
but I think it can be persuaded to behave sensibly. I know that the
ruby code always reads all the results into memory but that is a ruby
issue I think.

> On your point about MySQL's rubbish optimiser: is it genuinely so bad that
> it's better in this case to send a huge number of ids back into the db for
> subsequent queries rather than build that id set up locally in the db? I can
> believe it in general for more complicated queries - but it strikes me
> (perhaps naiviely) that the example I sent is sufficiently trivial that
> MySQL should be able to handle it if it can do any form of optimisation on
> joins whatsover. Perhaps I should set up some test-cases to resolve either
> way...

Well the main issue with your suggested code is probably the use of
"SELECT DISTINCT" which tends to be a performance problem in my
experiences, although I've never really understood why it should be as
bad as it always seems to be.

The other big issue is that it doesn't actually work as it doesn't
return any nodes which are in the area but which are not part of a
way.

What I was saying about joins is that in theory you should be able to
find the ways that you want with something like this:

SELECT *
FROM current_ways
WHERE id IN (
  SELECT cwn.id
  FROM current_way_nodes cwn INNER JOIN current_nodes cn ON cwn.node_id = cn.id
  WHERE ( cn.tile BETWEEN 2062281666 AND 2062281667 OR
          cn.tile BETWEEN 2062281670 AND 2062281679 OR
          cn.tile BETWEEN 2062281682 AND 2062281683 OR
          cn.tile BETWEEN 2062281686 AND 2062281727 OR
          cn.tile BETWEEN 2062282370 AND 2062282371 OR
          cn.tile BETWEEN 2062282374 AND 2062282383 OR
          cn.tile BETWEEN 2062282386 AND 2062282387 OR
          cn.tile BETWEEN 2062282390 AND 2062282431 OR
          cn.tile BETWEEN 3493937472 AND 3493937479 OR
          cn.tile BETWEEN 3493937488 AND 3493937495 OR
          cn.tile BETWEEN 3493938176 AND 3493938183 OR
          cn.tile BETWEEN 3493938192 AND 3493938199 ) AND
        cn.latitude BETWEEN 517478993 AND 517880945 AND
        cn.longitude BETWEEN -349531 AND 108480
);

But MySQL decides to table scan current_ways and process the subquery
for each one instead of running the subquery to get a list of way ids
and then using the current_ways index to get the details.

You can recode that as a join instead, but you really need to use DISTINCT then:

SELECT  cw.*
FROM current_ways cw
INNER JOIN current_way_nodes cwn ON cw.id = cwn.id
INNER JOIN current_nodes cn ON cwn.node_id = cn.id
WHERE ( cn.tile BETWEEN 2062281666 AND 2062281667 OR
        cn.tile BETWEEN 2062281670 AND 2062281679 OR
        cn.tile BETWEEN 2062281682 AND 2062281683 OR
        cn.tile BETWEEN 2062281686 AND 2062281727 OR
        cn.tile BETWEEN 2062282370 AND 2062282371 OR
        cn.tile BETWEEN 2062282374 AND 2062282383 OR
        cn.tile BETWEEN 2062282386 AND 2062282387 OR
        cn.tile BETWEEN 2062282390 AND 2062282431 OR
        cn.tile BETWEEN 3493937472 AND 3493937479 OR
        cn.tile BETWEEN 3493937488 AND 3493937495 OR
        cn.tile BETWEEN 3493938176 AND 3493938183 OR
        cn.tile BETWEEN 3493938192 AND 3493938199 ) AND
      cn.latitude BETWEEN 517478993 AND 517880945 AND
      cn.longitude BETWEEN -349531 AND 108480;

That does at least have a sensible execution plan, except for the need
to create a temporary table for the sort operation to allow the
DISTINCT to be processed. That always seems to be very slow for some
reason.

Testing on the live server, that query takes 0.11 seconds and returns
6322 rows without the distinct, and 0.13 seconds returning 896 rows
with the distinct. So maybe the overhead is OK in fact - it may be
that when I've looked at it in the past the result set that it has
been trying to distinct has been much larger or something.

Tom

-- 
Tom Hughes (tom at compton.nu)
http://www.compton.nu/