[OSM-dev] API/XAPI caching proxy server

Brett Henderson brett at bretth.com
Fri Dec 17 22:37:26 GMT 2010


Hi Stefan,

On Fri, Dec 17, 2010 at 11:45 AM, Stefan Keller <sfkeller at gmail.com> wrote:

> Hi Brett
>
> Thanks very much for your detailed instructions.
>
> > In my experience the biggest limitation in performance is disk seeking,
> > rather than the amount of data returned.
>
> If that's the bottleneck (or the amount of data returned before
> processing), then pl/pgsql or pl/python could help, since stored
> procedures are close to the data.
>

Unfortunately I don't think that will help here.  In-database code can help
if you have large numbers of queries, or need to sift through large amounts
of returned data, but neither of those applies here.  A fixed number of
queries is executed per bounding box retrieval (I forget exactly, but fewer
than 10 I think ...).  The early queries build up results in temp tables,
and finally the contents of those temp tables are retrieved.  Invoking the
queries from outside the database server shouldn't add much overhead,
because the query issuance time is trivial compared to the time the database
then spends processing them, and only essential data is returned to Osmosis
itself.
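
To make that concrete, the shape of such a bounding box retrieval is roughly
as follows (a sketch only: the table and column names here are illustrative
stand-ins rather than the exact Osmosis schema or queries):

    -- Sketch only: names are illustrative, not the exact Osmosis queries.

    -- 1. Gather the ids of all nodes inside the bounding box.
    CREATE TEMPORARY TABLE bbox_nodes AS
      SELECT id FROM nodes
      WHERE geom && ST_MakeEnvelope(7.0, 47.0, 7.1, 47.1, 4326);

    -- 2. Gather the ways that reference any of those nodes.
    CREATE TEMPORARY TABLE bbox_ways AS
      SELECT DISTINCT wn.way_id AS id
      FROM way_nodes wn
      JOIN bbox_nodes bn ON wn.node_id = bn.id;

    -- 3. Finally, return the contents of the temp tables to Osmosis.
    SELECT n.* FROM nodes n JOIN bbox_nodes bn ON bn.id = n.id;
    SELECT w.* FROM ways  w JOIN bbox_ways  bw ON bw.id = w.id;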


>
> If topology is an issue then perhaps the (future) topology data type
> could help (http://trac.osgeo.org/postgis/wiki/UsersWikiPostgisTopology
> ). But I would first test the performance of relational (and hstore)
> structures.
>

I have no idea about this one :-)


>
> Finally, this really sounds like a Postgres optimization task, when
> you speak of several days for full planet indexing.
>

Keep in mind that there are over 800 million nodes now, and something like
70 million ways.  Building indexes on those rows, and then completely
re-laying out the data so it is grouped geographically, is always going to
be time consuming.  I did spend a bit of time tweaking the PostgreSQL tuning
parameters, to the point where further tweaks made little difference.  I'm
sure further improvements are possible, but without improving the disk
subsystem underneath I think the gains will not be that great.
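
For anyone wanting to experiment, the indexing and clustering steps amount
to plain PostgreSQL DDL along these lines (again a sketch only; the real
Osmosis scripts use their own index and column names):

    -- Sketch only: index and column names are illustrative.

    -- Build the spatial indexes after the bulk load; loading into
    -- already-indexed tables would be far slower.
    CREATE INDEX idx_nodes_geom ON nodes USING gist (geom);
    CREATE INDEX idx_ways_bbox  ON ways  USING gist (bbox);

    -- Physically rewrite each table in index order so that rows which are
    -- close geographically end up close together on disk, which is what
    -- reduces seeking for bounding box queries.
    CLUSTER nodes USING idx_nodes_geom;
    CLUSTER ways  USING idx_ways_bbox;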

More than happy to be proven wrong though ;-)


> And finally optimization: I think this begins with the whole db
> architecture: I heard about a typical architecture, where there is a
> master and a slave disk: The master gets updated by the diffs, the
> slave is being replicated (postgres 9.0 can do that now!) and indexed.
> => Seems to be a case for our Postgres/PostGIS gurus :->
>

Sure, once you get into database-level replication there are lots of things
you can do to improve performance.  A single-master, multi-slave layout can
provide near-linear speed-ups for read-only queries.  Of course it requires
lots of hardware, and adds administrative overhead too ...
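
In case it's useful, the PostgreSQL 9.0 streaming replication setup you
mention comes down to a handful of settings, roughly as below (the host
addresses and replication user are just examples):

    # Master, postgresql.conf
    wal_level = hot_standby
    max_wal_senders = 3
    wal_keep_segments = 32

    # Master, pg_hba.conf: allow the slave to connect for replication
    host  replication  repuser  192.168.0.2/32  md5

    # Slave, postgresql.conf
    hot_standby = on

    # Slave, recovery.conf
    standby_mode = 'on'
    primary_conninfo = 'host=192.168.0.1 port=5432 user=repuser'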

For reference, I measured the following times when setting up a full planet.
Raw Data Import: 18 hours, 34 minutes
Index Creation: 22 hours, 32 minutes
Clustering: 39 hours, 40 minutes

Cheers,
Brett