[Tile-serving] [openstreetmap/osm2pgsql] Changing the way osm2pgsql handles multiple input files (#1167)

Mon May 11 18:38:06 UTC 2020

Osm2pgsql can handle any number of input files. The current code simply reads them one after the other. I don't think this is very useful in practice, and I propose to change it.

## Use case 1: Create mode

Importing several OSM files into a database can be useful, for instance when you want to import two extracts with the data from two different countries.

Unfortunately the current code only allows you to do this if there is no overlap between the two files. If there is any object in both files, osm2pgsql will try to insert it twice into the database and will fail.

To make this work you need a tool like Osmium or Osmosis to merge the input files first.

## Use case 2: Reading several change files in append mode

The other case where this could be useful is reading multiple change files and appending all of their changes to a database, for example to update a database that is several days old with the change files from all of those days. Unfortunately this doesn't work well, because the replication diffs you can download from planet.osm.org can contain multiple versions of the same object. Processing them will either not work at all or lead to extra work, because the database is updated more often than needed.

Currently you need a tool like Osmium or Osmosis to first merge all the change files and simplify them.

## Proposed change

To make this feature more useful I propose to change the functionality. Instead of reading the input files sequentially, we can read them all at once. While reading, every version of an object except the newest is discarded, and the newest is processed exactly once. So it no longer matters whether the input files overlap or contain multiple changes to the same object. Both use cases above will just work.
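As a rough illustration of the idea (not osm2pgsql's actual code), the merge could be sketched in Python as below. It assumes every input is already sorted by type (nodes, ways, relations), then id, then version. The names `OsmObject`, `sort_key` and `merge_newest` are hypothetical and exist only for this sketch.

```python
import heapq
from itertools import groupby
from typing import Iterable, Iterator, NamedTuple


class OsmObject(NamedTuple):
    """Hypothetical stand-in for an OSM object; real objects carry tags etc."""
    kind: str      # "node", "way" or "relation"
    id: int
    version: int


# Assumed sort order: all nodes, then all ways, then all relations.
TYPE_ORDER = {"node": 0, "way": 1, "relation": 2}


def sort_key(obj: OsmObject):
    return (TYPE_ORDER[obj.kind], obj.id, obj.version)


def merge_newest(*inputs: Iterable[OsmObject]) -> Iterator[OsmObject]:
    """Merge sorted inputs lazily and yield only the newest version of each object."""
    # heapq.merge reads all inputs in parallel and requires each to be sorted.
    merged = heapq.merge(*inputs, key=sort_key)
    # All versions of one object are now adjacent, oldest first.
    for _, versions in groupby(merged, key=lambda o: (o.kind, o.id)):
        newest = None
        for newest in versions:
            pass  # walk to the last (newest) version, discarding the rest
        yield newest


if __name__ == "__main__":
    file_a = [OsmObject("node", 1, 1), OsmObject("node", 2, 1), OsmObject("way", 1, 1)]
    file_b = [OsmObject("node", 2, 1), OsmObject("node", 2, 2), OsmObject("way", 1, 3)]
    for obj in merge_newest(file_a, file_b):
        print(obj.kind, obj.id, obj.version)
```

Because the merge is lazy, only one pending object per input file has to be held in memory at a time, which is why the approach depends on sorted input.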

The change has one drawback: The input files must all be sorted. Otherwise we can't merge the input files on the fly. This isn't a huge problem though. Most OSM data and OSM change files that normal users will encounter are sorted. Those who have unsorted files due to some special circumstances can still use tools such as Osmium or Osmosis to sort their files first.

So we'd be removing some burden from the "normal" user while adding a little burden to users with special cases. I think this is a good tradeoff.

The change has additional benefits:

* Checking that the data coming in is sorted is cheap and easy and we can give the user a good error message.
* If we can rely on sorted data, some data structures inside osm2pgsql might become simpler and faster.
* Processing multiple and/or unsorted files is currently not tested well and it isn't at all clear whether it actually works reliably.
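To show why the sortedness check in the first point is cheap, here is a minimal Python sketch, not taken from osm2pgsql. It compares each object only with its predecessor, so it needs constant memory. The names `check_sorted` and `UnsortedInputError` are hypothetical, objects are modelled as `(type, id, version)` tuples, and ids are compared numerically, which is an assumption: how negative ids should sort is still an open question.

```python
from typing import Iterable, Iterator, Tuple

# Assumed sort order: all nodes, then all ways, then all relations.
TYPE_ORDER = {"node": 0, "way": 1, "relation": 2}

OsmRef = Tuple[str, int, int]  # (type, id, version), hypothetical representation


class UnsortedInputError(ValueError):
    """Raised when an input file violates the expected object order."""


def check_sorted(objects: Iterable[OsmRef], filename: str) -> Iterator[OsmRef]:
    """Pass objects through unchanged, raising as soon as one is out of order."""
    previous = None
    previous_key = None
    for obj in objects:
        kind, obj_id, version = obj
        key = (TYPE_ORDER[kind], obj_id, version)
        # Only the previous object is needed for the comparison.
        if previous_key is not None and key < previous_key:
            raise UnsortedInputError(
                f"{filename}: {kind} {obj_id} v{version} appears after "
                f"{previous[0]} {previous[1]} v{previous[2]}; "
                "input must be sorted by type, id and version"
            )
        previous, previous_key = obj, key
        yield obj
```

Wrapping each input stream in such a check would let the error message name the offending file and object.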

Note that this is related to #1097, because we'll have to define exactly how objects with negative ids are sorted.

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/openstreetmap/osm2pgsql/issues/1167