[osmosis-dev] Changes to Osmosis Pgsql Schema
dlc at halibut.com
Sat Aug 7 13:10:14 BST 2010
Brett, et al:
Please consider leaving the current "simple" schema intact (with
the addition of the index clustering) and instead create a third
schema for the HSTORE version.
Losing the current schema would break at least two of my use cases:
1: I use postGIS as back-end storage for GIS client software, and
use views and tables derived from queries against the OSM simple
schema to create various layers of OSM data in the GIS client,
e.g. roads, footpaths/trails/bike routes, hydrography, landuse,
structures (buildings), etc. This is such an obvious technique
that I'd be very surprised not to find a lot of others doing the
2: I am beginning a project to parallelize OSM data processing
with Hadoop, and the postgreSQL copy-format output is perfect
for loading into HDFS. (If this goes well, I'd want to discuss
ideas for adapting Osmosis to talk to Hadoop, eventually.)
I also think that the simple schema lends itself better to
research uses, especially when the input is a country/state
extract or smaller.
That said, I can also see cases where it would be useful to have
the option to create a "simple++" schema, with the separate
tags table *and* the HSTORE columns as well.
The CLUSTER operation is indeed a big performance booster. Also,
creating the optional ways.bbox column is at least two orders of
magnitude faster when performed against an indexed ways.linestring
column rather than as-shipped, which uses the nodes geometry before
that has been indexed.
On Sat, Aug 07, 2010 at 09:00:07PM +1000, Brett Henderson wrote:
> Hi All,
> I'm currently working on some changes to the Osmosis "simple" schema which
> may be of interest to others. I'd be interested to hear if anybody has any
> major issues with this, or any better suggestions.
> The current schema performs poorly, largely due to the data for typical
> queries being spread across the disk. It is well indexed, but retrieving
> large numbers of rows requires huge numbers of disk seeks. Performance
> would be better if data was physically grouped according to geospatial
> location. I am planning several changes to address this:
> - CLUSTER the nodes table by the geom column index, and ways column by
> the (optional) linestring column index. I've already tested this out for
> bbox style queries and it makes queries on these tables significantly
> quicker. It takes a long time to perform the CLUSTER operation, but
> subsequent queries are then improved.
> - Move the tags tables into hstore tags columns on the nodes, ways and
> relations tables. This will avoid the need to join to external tables, and
> will allow the tags data to also be clustered geospatially by the geospatial
> indexes. For entities with large numbers of tags or large tags the data may
> be stored externally (
> http://www.postgresql.org/docs/8.4/interactive/storage-toast.html), but
> this should be the exception and most tags should fit inline in the table.
> - Create a nodes column on the ways table. This will contain an array
> which holds only the ids of nodes that make up the way. For typical
> bounding box style queries this will allow "completeWays" style
> functionality to be performed more efficiently without having to join to
> large numbers of rows in the way_nodes table. For bbox style queries in
> some use cases it will also be possible to create synthetic node entities
> (without tag or user info) for missing nodes lying outside the bounding box
> which will further improve performance.
> So far I've written a migration script for moving tags data into hstore
> columns, and I've figured out how to get Java and JDBC playing nicely with
> hstore columns. The next step is to update existing tasks to use these
> columns. As part of this change I will also change the way the bounding box
> queries work so that they store more data in the temporary tables to avoid
> having to join back to the main data tables. Again, this will significantly
> reduce disk seeking.
> I'll move onto the addition of a way.nodes column after I've finished the
> tags changes.
> I'm not sure when I'll find time to finish all of this, but it's the main
> thing I'm working on.
> osmosis-dev mailing list
> osmosis-dev at openstreetmap.org
More information about the osmosis-dev