[OSM-dev] osm2pgsql for 64-bit IDs

Tue May 24 09:52:19 BST 2011

On Mon, May 23, 2011 at 04:18:05AM -0500, Scott Crosby wrote:
> On Sat, May 21, 2011 at 9:52 AM, Jochen Topf <jochen at remote.org> wrote:
> >
> > If we use unsigned ints we have some more time. Problematic would only be
> > a few cases where negative IDs are currently used (like in JOSM for data
> > thats not yet uploaded to the server). But it seems wasteful to me, to go
> > to 64bit a year or so earlier than needed to accommodate this case.
> 
> The 64 bit transition is unavoidable. I think this would double the
> effort, because we'd all have to go through our software twice, once
> to fix signedness bugs, and a second time to go to 64 bits. In
> addition, the Java stack couldn't transition to unsigned ints anyways,
> as Java lacks unsigned types. An unsigned int transition would be a
> 64-bit transition.

First: It has always been clear that sooner or later we will need the 64bit
space for OSM IDs.

The file formats used for exchanging OSM data already allow them. For XML
there is really no limit on the size of the ID and for PBF the IDs are
defined as sint64. So we are fine here.
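
To illustrate why sint64 copes with both negative and large IDs: Protocol Buffers stores sint64 values with ZigZag encoding, which maps signed integers onto unsigned ones so that small magnitudes stay small on the wire. A minimal sketch follows; the function names are mine and not taken from any PBF library.

```python
MASK_64 = 2**64 - 1

def zigzag_encode(n):
    """Map a signed 64bit integer to an unsigned one (protobuf sint64 style)."""
    if not -(2**63) <= n < 2**63:
        raise ValueError("value does not fit in signed 64 bits")
    # Shift the magnitude left by one and fold the sign into the lowest bit.
    return ((n << 1) ^ (n >> 63)) & MASK_64

def zigzag_decode(z):
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)

# A negative placeholder ID, and an ID beyond the 32bit range
for osm_id in (-1, 1, 5_000_000_000):
    encoded = zigzag_encode(osm_id)
    assert zigzag_decode(encoded) == osm_id
    print(osm_id, "->", encoded)
```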

But in practice in their software people have often used 32bit IDs instead,
because a) currently they are enough and b) they are often more efficient in
space and/or time.

I think it is up to the implementor of each piece of software to decide what
internal representation he uses for IDs. Implementors just have to be aware of
all the issues involved.

One problem with 64bit IDs is simply that they need twice as much space. If you
store a billion node IDs that might be the difference between needing 4GB of
RAM or 8GB. So I think it is worth trying to live with 32bit IDs as long
as possible. Hardware is getting cheaper. So preserving 32bit IDs for a year
longer might mean investments can be postponed and/or we can actually do things
we could not do otherwise, because there is no money for more hardware.
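
The arithmetic behind the 4GB versus 8GB figure, as a small sketch. It counts only the raw IDs packed back to back; any real data structure (hash table, index, database row) adds overhead on top, which is not modelled here.

```python
import struct

NODE_COUNT = 1_000_000_000  # "a billion node IDs", as in the paragraph above

def id_table_bytes(count, fmt):
    """Bytes needed for `count` IDs stored back to back in the given struct format."""
    return count * struct.calcsize(fmt)

bytes_32 = id_table_bytes(NODE_COUNT, "<i")  # 4 bytes per ID
bytes_64 = id_table_bytes(NODE_COUNT, "<q")  # 8 bytes per ID

print(f"32bit IDs: {bytes_32 / 10**9:.0f} GB")
print(f"64bit IDs: {bytes_64 / 10**9:.0f} GB")
```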

The negative IDs throw a bit of a wrench in this whole thing. I can think of
only one way to solve this: Define a set of, say, 10,000 IDs for the use cases
where negative IDs are currently used. The implementation on the API side would
be trivial: Manually advance the counter in Postgres that gives out IDs past
that range, and have the API check for that ID range to make sure nobody can
write IDs in it. Changing all the software that currently uses negative IDs would be a bit more
difficult. This would give us that extra bit for the price of a few thousand
extra bits. And it would be rather ugly. I can't say I really like that idea.
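
A rough sketch of what that reserved-range idea could look like. Everything named here is hypothetical: the start of the block, the function names, and the allocator class are made up for illustration, and nothing like this exists in the API.

```python
UINT32_MAX = 2**32 - 1

# Hypothetical block of IDs set aside for objects not yet uploaded
RESERVED_START = 3_000_000_000
RESERVED_COUNT = 10_000
RESERVED_END = RESERVED_START + RESERVED_COUNT  # exclusive

def is_reserved(osm_id):
    return RESERVED_START <= osm_id < RESERVED_END

def check_server_id(osm_id):
    """API-side check: refuse to write any ID inside the reserved block."""
    if not 0 < osm_id <= UINT32_MAX:
        raise ValueError(f"ID {osm_id} is outside the unsigned 32bit range")
    if is_reserved(osm_id):
        raise ValueError(f"ID {osm_id} is reserved for local placeholders")
    return osm_id

class LocalIdAllocator:
    """Editor-side replacement for negative IDs: hand out reserved IDs instead."""

    def __init__(self):
        self._next = RESERVED_START

    def allocate(self):
        if self._next >= RESERVED_END:
            raise RuntimeError("reserved block exhausted")
        osm_id = self._next
        self._next += 1
        return osm_id
```

Note the hard limit this introduces: an editor could not hold more than 10,000 unsaved objects at once, which negative IDs do not restrict today.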

So we are probably stuck with the negative IDs. But I could well imagine
people writing software that does not work with negative IDs so that they
can still work with 32bit IDs a while longer.
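
For example, software that stores IDs as unsigned 32bit integers gets one extra bit of headroom, but has to reject negative IDs outright. A minimal sketch using Python's struct module (the function name is made up for this example):

```python
import struct

INT32_MAX = 2**31 - 1   # 2147483647, limit with signed 32bit IDs
UINT32_MAX = 2**32 - 1  # 4294967295, limit with unsigned 32bit IDs

def pack_id_u32(osm_id):
    """Pack an ID into 4 bytes as an unsigned 32bit integer, or raise ValueError."""
    try:
        return struct.pack("<I", osm_id)
    except struct.error as exc:
        raise ValueError(f"ID {osm_id} does not fit in unsigned 32 bits") from exc

# An ID beyond the signed limit still fits in 4 bytes
packed = pack_id_u32(3_000_000_000)

# A negative placeholder ID (as used for not yet uploaded data) is rejected
try:
    pack_id_u32(-1)
except ValueError as err:
    print(err)
```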

And while we are on that subject: There is another problem here. Most of the
usual GIS software uses 32bit IDs. When using QGIS with Postgres, for instance,
it would not accept a 64bit Postgres ID column. (This might have been fixed in
the meantime, I haven't checked for a while.) I have talked about this on
several occasions to the people who work on these projects and they all said
they'd work on it. But in the meantime there is an awful lot of software
around that can't handle this case.

Oh, yes, and shapefiles only allow 32bit IDs.

Jochen
-- 
Jochen Topf  jochen at remote.org  http://www.remote.org/jochen/  +49-721-388298