[OSM-dev] Complete history of OSM data - questions and discussion

Lars Francke lars.francke at gmail.com
Thu Nov 12 16:28:21 GMT 2009


> just remember that new code = new bugs ;-)

hehe yeah that's true but I assure you: It would probably be worse if
I tried to do it in C or as an Osmosis task. And as I said before: No
harm in me trying. It will be your decision whether it's up to your
standards. And I learned a great deal from Osmosis and planet.c.

I am partly done with my Java version. There are a few
questions/problems/remarks:
- As of now the XML is not indented. I use Woodstox[1] for XML output
and that doesn't have an option to "pretty print" the output. It is
not a problem for me but if it is requested I can use StaxMate or
something else to properly indent the XML
- I changed the order in which the versions of a single element are
dumped. The API delivers them in descending order, I dump them in
ascending order. I think that is the better/easier way if one wants to
do processing on the data
- I added a "visible" attribute to every element
- I changed it so that the common attributes (id, version, timestamp,
changeset, user, uid, visible) are output in the same order on every
element
- Changesets: num_changes from the database isn't dumped in planet.c.
It is queried from the database but not used anywhere. The value _can_
be recalculated, but that isn't easy without the standard db schema and
it can't easily be derived from the XML stream alone, so I could just
dump it too (a rough sketch of the recalculation is below the list). I
haven't had a look at whether the API sets this field correctly at all?!
- Is there a dump of the database available from just prior to the
switch from API 0.4 to 0.5? I could try to use that to merge the
history of the segments to the ways (as briefly discussed by Frederik)
- Any information on the size (in rows) of the tables would be nice
(for testing purposes)
- What is the default_statistics_target for the columns/tables in
question? Are there any other options set that would affect the query
planner? I've seen the query planner make wildly inappropriate
decisions, so I'll try to check that the statements I use will work
(see the EXPLAIN snippet further down). I used the same technique as
planet.c and only adapted the queries to the versions and history tables.
- Do I have to take precautions with regard to database/machine/disk
load? I could do something like the cost-based delay the Auto-Vacuum
daemon uses[2] or monitor the load (a tiny throttle sketch is below
the list).
- I'm using the same technique as planet.c with regard to the output of
the data (just streaming it to standard output, as in the output sketch
below the list); I assume that this is okay? Are there any other things
I'll have to change compared to the way planet.c works?
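
To make the attribute-order and stdout points concrete, this is roughly
what the element output looks like. It is only a sketch: the real code
pulls the values from the result set, Woodstox is simply picked up
through the standard StAX factory if it is on the classpath, and the
node values here are made up.

import java.io.OutputStream;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class HistoryWriterSketch {

    public static void main(String[] args) throws XMLStreamException {
        OutputStream out = System.out;
        // Uses Woodstox if it is on the classpath, otherwise the JDK's
        // built-in StAX implementation. Output is not indented.
        XMLStreamWriter writer =
            XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");

        writer.writeStartDocument("UTF-8", "1.0");
        writer.writeStartElement("osm");
        writer.writeAttribute("version", "0.6");

        // One historical version of a node; the common attributes are
        // always written in the same order on every element.
        writer.writeStartElement("node");
        writer.writeAttribute("id", "123");
        writer.writeAttribute("version", "2");
        writer.writeAttribute("timestamp", "2009-01-01T00:00:00Z");
        writer.writeAttribute("changeset", "456");
        writer.writeAttribute("user", "someone");
        writer.writeAttribute("uid", "789");
        writer.writeAttribute("visible", "true");
        writer.writeAttribute("lat", "51.5");
        writer.writeAttribute("lon", "-0.1");
        writer.writeEndElement(); // node

        writer.writeEndElement(); // osm
        writer.writeEndDocument();
        writer.flush();
        writer.close();
    }
}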
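
And this is the kind of recalculation I meant for num_changes - it works
per changeset, which is exactly why it is awkward while streaming. The
table and column names (nodes/ways/relations, each with a changeset_id
column) are my reading of the apidb schema used by Osmosis, so treat
them as an assumption.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class NumChangesSketch {

    // Counts the rows a single changeset touched in the three history tables.
    private static final String SQL =
        "SELECT (SELECT count(*) FROM nodes     WHERE changeset_id = ?)"
      + "     + (SELECT count(*) FROM ways      WHERE changeset_id = ?)"
      + "     + (SELECT count(*) FROM relations WHERE changeset_id = ?)"
      + "       AS num_changes";

    public static void main(String[] args) throws Exception {
        // args[0]: JDBC URL, e.g. "jdbc:postgresql://localhost/osm?user=..."
        // args[1]: id of the changeset to check
        long changesetId = Long.parseLong(args[1]);
        Connection conn = DriverManager.getConnection(args[0]);
        try {
            PreparedStatement stmt = conn.prepareStatement(SQL);
            stmt.setLong(1, changesetId);
            stmt.setLong(2, changesetId);
            stmt.setLong(3, changesetId);
            ResultSet rs = stmt.executeQuery();
            rs.next();
            System.out.println("num_changes = " + rs.getLong("num_changes"));
            rs.close();
            stmt.close();
        } finally {
            conn.close();
        }
    }
}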
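
Regarding the load question, this is the kind of throttling I had in
mind (in the spirit of the cost-based vacuum delay): pause briefly after
every batch of rows. Whether it is needed at all, and with which
numbers, is exactly what I'm asking - new Throttle(10000, 20) would
sleep 20 ms after every 10000 rows, for example.

/**
 * Pauses the dump briefly after every batch of rows so it never hogs
 * the database/disk for long stretches at a time.
 */
public class Throttle {

    private final int rowsPerBatch;
    private final long pauseMillis;
    private long count = 0;

    public Throttle(int rowsPerBatch, long pauseMillis) {
        this.rowsPerBatch = rowsPerBatch;
        this.pauseMillis = pauseMillis;
    }

    /** Call once per processed row. */
    public void tick() throws InterruptedException {
        if (++count % rowsPerBatch == 0) {
            Thread.sleep(pauseMillis);
        }
    }
}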

As far as I can see (judging by the apidb script from Osmosis) all the
necessary indexes are in place but I'll go over it again. I still have
a lot of testing to do.
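
Part of that testing will simply be printing the plans PostgreSQL picks
for my statements, along these lines. The query is only an example;
node_id and version are again my reading of the apidb schema, but it is
the access pattern the dump uses (all versions of one element, in
ascending order).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExplainCheck {

    public static void main(String[] args) throws Exception {
        // args[0]: JDBC URL of the database to test against
        Connection conn = DriverManager.getConnection(args[0]);
        try {
            Statement stmt = conn.createStatement();
            // Prints the plan so I can see whether the expected index is used.
            ResultSet rs = stmt.executeQuery(
                "EXPLAIN SELECT * FROM nodes WHERE node_id = 123"
              + " ORDER BY version ASC");
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
            rs.close();
            stmt.close();
        } finally {
            conn.close();
        }
    }
}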

Thanks for your help!

Cheers,
Lars

[1]: http://woodstox.codehaus.org/
[2]: http://www.postgresql.org/docs/8.3/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-VACUUM-COST



