[OSM-dev] osm2pgsql, .o5m import interface

Sun Jul 3 01:18:00 BST 2011

Hi Jon,

thanks for your suggestions and ideas!

> Do you have any speed benchmarks you can add to the o5m wiki page?
> Perhaps the timing from an "osm2pgsql -O null <file>" could be added to
> the existing table listing the file sizes?

OK, I added speed information for the most important formats. This is very hardware-dependent, of course. For these tests I used osmconvert with the output module disabled. Unfortunately, I cannot provide results for internally uncompressed PBF files; my guess is that they would take between 40 and 50 seconds.

I would not expect huge speed advantages for osm2pgsql because the database access is much slower than the file parsing.

> It would be nice if users had the option to read compressed o5m files
> directly as they do with .osm files. I can see that the benefit is
> smaller when you are using the o5m format. No one would be forced to use
> it.

Yes, this really would be nice. But where to start? Include lzop and 7zip and gzip? This would be very convenient indeed. The next step would be to offer special options for multi-threaded decompression (7za), etc.
I had these plans too, but decided to drop them in the end. You will never match the power and variety that external tools offer. The Osmosis manual, for example, suggests using external decompression programs rather than Osmosis-internal functions because they are faster.

> - Add "o5m" to the --input-reader help text

Yes, of course.

> - It would be helpful to indicate the type (o5c vs o5h) in the header
> instead of just relying on the file name otherwise this won't work with
> stdin.

This is right - and a problem. The file format is identical for .o5m and .o5c, so if stdin is used, this information gets lost. I will extend the format definition accordingly.

> - Validating the "o5m2" header would be useful to prevent non-o5m files
> being processed by mistake. 

Yes!
On top of this, it would be safer to read the first bytes before deciding which format the input file has. Is there such a mechanism already available in osm2pgsql?
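
A minimal sketch of such a check, to make the idea concrete. It assumes only that the signature "o5m2" mentioned above appears somewhere in the first few bytes of the input; the exact offset is not stated in this thread, so the sketch scans a small prefix instead of testing one fixed position. The function name looks_like_o5m is hypothetical and not part of osm2pgsql.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of osm2pgsql.
 * Returns 1 if the 4-byte signature "o5m2" occurs anywhere in the
 * first `len` bytes of `buf`, otherwise 0. Callers are expected to
 * pass only a short prefix of the input (for example 16 bytes).
 * ASSUMPTION: the signature sits near the start of the file; the
 * exact offset is not taken from the format definition. */
static int looks_like_o5m(const unsigned char *buf, size_t len)
{
    static const char sig[] = "o5m2";
    const size_t sig_len = sizeof sig - 1;   /* exclude the trailing NUL */
    size_t i;

    if (len < sig_len)
        return 0;
    for (i = 0; i + sig_len <= len; i++) {
        if (memcmp(buf + i, sig, sig_len) == 0)
            return 1;
    }
    return 0;
}
```

With a regular file the prefix can be read and the file rewound afterwards. With stdin that is not possible, so the bytes already read would have to be kept in a buffer and handed on to whichever parser is chosen.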

> - The format appears endian-specific. If you choose not to make it
> endian-neutral it would be good for the endianness to be recorded in the
> header and checked to prevent mistakes if files are moved between
> systems.

As far as I understand, the format is not endian-specific, because the storage unit is the byte.
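
To show why a byte-oriented format does not depend on host byte order, here is a small sketch of a decoder that assembles an integer one byte at a time. The layout used (7 data bits per byte, least significant group first, high bit set when more bytes follow) is an assumption chosen for illustration; this message does not spell out the o5m number encoding. The name decode_varint is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode an unsigned integer stored 7 bits per byte, least significant
 * group first; a set high bit means "more bytes follow".
 * ASSUMPTION: this layout is illustrative, not quoted from the o5m
 * format definition.
 * Returns the number of bytes consumed, or 0 if the input ends before
 * the number is complete or the number does not fit into 64 bits. */
static size_t decode_varint(const unsigned char *p, size_t len, uint64_t *out)
{
    uint64_t value = 0;
    unsigned shift = 0;
    size_t i;

    for (i = 0; i < len && shift < 64; i++, shift += 7) {
        /* Each byte is shifted into place arithmetically, so the result
         * is the same on little-endian and big-endian machines. */
        value |= (uint64_t)(p[i] & 0x7f) << shift;
        if ((p[i] & 0x80) == 0) {
            *out = value;
            return i + 1;
        }
    }
    return 0;
}
```

Endianness only becomes an issue when a multi-byte value is copied to or from memory as a whole, for example by casting a buffer pointer to an int pointer. The code above never does that.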

> - The best practice is to use enumerated types instead of the hard coded
> hex numbers for the protocol fields. Ideally all the protocol
> definitions should be in a header file.

You're right. However, since there are only a few dataset ids in the o5m definition, it's not fatal.
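
For illustration, the suggestion could look roughly like the sketch below. The identifier names are made up, and the numeric values are assumptions that should be checked against the o5m format definition before use; the point is only that the parser then switches on names instead of bare hex numbers.

```c
/* ASSUMPTION: names and values are illustrative and must be verified
 * against the o5m format definition. In a real tree this enum would
 * live in a header such as parse-o5m.h. */
enum o5m_dataset_id {
    O5M_DATASET_NODE     = 0x10,
    O5M_DATASET_WAY      = 0x11,
    O5M_DATASET_RELATION = 0x12,
    O5M_DATASET_HEADER   = 0xe0,
    O5M_DATASET_RESET    = 0xff
};

/* Hypothetical helper: returns 1 if the dataset id denotes an OSM
 * object (node, way or relation), 0 for anything else. */
static int o5m_is_osm_object(int id)
{
    switch (id) {
    case O5M_DATASET_NODE:
    case O5M_DATASET_WAY:
    case O5M_DATASET_RELATION:
        return 1;
    default:
        return 0;
    }
}
```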

> - The best practice for macros is to wrap them in a "do {...} while(0)".
> This avoids problems with trailing ;'s and nested if/else's. An example
> using your PERR macro would be:
> #define PERR(f) do { \
>   fprintf(stderr,"osm2pgsql Error: " f "\n"); \
> } while (0)

Nice idea for the future. I like it. Have you checked whether the fake loop construct is optimized out by the compiler?
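
A small sketch of the case the wrapper protects against: a macro that expands to more than one statement, used in an if/else without braces. The counter error_count and the function check_header are hypothetical and exist only to make the example observable; the macro here also takes a plain string literal, as in the quoted example.

```c
#include <stdio.h>

static int error_count = 0;   /* hypothetical, for demonstration only */

/* Two statements wrapped in do { ... } while (0). The expansion is a
 * single statement that still needs the caller's trailing semicolon. */
#define PERR(msg) do { \
    fprintf(stderr, "osm2pgsql Error: " msg "\n"); \
    error_count++; \
} while (0)

/* With a bare { ... } block instead of the wrapper, the semicolon
 * after PERR(...) would end the if statement and the else below
 * would be a syntax error. */
static int check_header(int header_ok)
{
    if (!header_ok)
        PERR("bad header");
    else
        return 1;
    return 0;
}
```

Regarding the optimization question: the loop condition is the constant 0, so the body runs exactly once and an optimizing compiler has no loop left to generate. This can be confirmed for a given compiler by inspecting the assembler output, for example with gcc -S.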

Markus

-------- Original Message --------
> Date: Sat, 02 Jul 2011 23:07:46 +0100
> From: Jon Burgess <jburgess777 at gmail.com>
> To: marqqs at gmx.eu
> CC: dev at openstreetmap.org
> Subject: Re: [OSM-dev] osm2pgsql, .o5m import interface

> On Sat, 2011-07-02 at 21:58 +0200, marqqs at gmx.eu wrote:
> > Hi Jon,
> > 
> > no, reading .o5m.gz (or .o5c.gz) is not supported at present.
> > You usually don't do zlib compression with .o5m files; users will rather
> > use lzop if processing speed is important, or 7zip if a minimal file size
> > is required.
> > 
> > However, the .o5m file format is usually chosen because of its speed,
> > and this advantage would get lost if you compressed the data. Therefore you
> > may expect input files to be uncompressed (an uncompressed .o5m file has
> > nearly the same size as a conventional .osm.bz2 file).
> 
> Do you have any speed benchmarks you can add to the o5m wiki page?
> Perhaps the timing from an "osm2pgsql -O null <file>" could be added to
> the existing table listing the file sizes?
> 
> It would be nice if users had the option to read compressed o5m files
> directly as they do with .osm files. I can see that the benefit is
> smaller when you are using the o5m format. No one would be forced to use
> it. 
> 
> 
> > > My main concern would be whether the changes introduce
> > > new external dependencies.
> > 
> > Don't worry, there should be no additional dependencies.
> > 
> > > Can we see the code? Maybe attach it to a ticket in trac.
> > 
> > I just uploaded the necessary files to
> > http://m.m.i24.cc/o5m_osm2pgsql_20110702.zip
> > 
> > parse-o5m.c (new)
> > parse-o5m.h (new)
> > osm2pgsql.c (a few minor changes)
> > Makefile.am (added parse-o5m.c and parse-o5m.h)
> > 
> > Please consider the source as experimental. The import of a small test
> > region (100x100 km) worked fine, but there still might be some bugs...
> 
> I had a quick look through the code but I have not run it at all. My
> comments are related to the file format and the parsing code:
> 
> - Add "o5m" to the --input-reader help text
> - It would be helpful to indicate the type (o5c vs o5h) in the header
> instead of just relying on the file name otherwise this won't work with
> stdin.
> - Validating the "o5m2" header would be useful to prevent non-o5m files
> being processed by mistake. 
> - The format appears endian-specific. If you choose not to make it
> endian-neutral it would be good for the endianness to be recorded in the
> header and checked to prevent mistakes if files are moved between
> systems.
> - The best practice is to use enumerated types instead of the hard coded
> hex numbers for the protocol fields. Ideally all the protocol
> definitions should be in a header file.
> - The best practice for macros is to wrap them in a "do {...} while(0)".
> This avoids problems with trailing ;'s and nested if/else's. An example
> using your PERR macro would be:
> 
> #define PERR(f) do { \
>   fprintf(stderr,"osm2pgsql Error: " f "\n"); \
> } while (0)
> 
> 
>   Jon
> 
>