[OSM-dev] osm2pgsql, .o5m import interface

Sat Jul 2 23:07:46 BST 2011

On Sat, 2011-07-02 at 21:58 +0200, marqqs at gmx.eu wrote:
> Hi Jon,
> 
> no, reading .o5m.gz (resp. .o5c.gz) is not supported at present.
> You usually don't do zlib compression with .o5m files; users will rather use lzop if processing speed is important, or 7zip if a minimal file size is required.
> 
> However, the .o5m file format is usually chosen because of its speed, and this advantage would get lost if you compressed the data. Therefore you may expect input files to be uncompressed (an uncompressed .o5m file has nearly the same size as a conventional .osm.bz2 file).

Do you have any speed benchmarks you can add to the o5m wiki page?
Perhaps the timing from an "osm2pgsql -O null <file>" could be added to
the existing table listing the file sizes?

It would be nice if users had the option to read compressed o5m files
directly as they do with .osm files. I can see that the benefit is
smaller when you are using the o5m format. No one would be forced to use
it. 

> > My main concern would be whether the changes introduce
> > new external dependencies.
> 
> Don't worry, there should be no additional dependencies.
> 
> > Can we see the code? Maybe attach it to a ticket in trac.
> 
> I just uploaded the necessary files to http://m.m.i24.cc/o5m_osm2pgsql_20110702.zip
> 
> parse-o5m.c (new)
> parse-o5m.h (new)
> osm2pgsql.c (a few minor changes)
> Makefile.am (added parse-o5m.c and parse-o5m.h)
> 
> Please consider the source as experimental. The import of a small test region (100x100 km) worked fine, but there still might be some bugs...

I had a quick look through the code but I have not run it at all. My
comments are related to the file format and the parsing code:

- Add "o5m" to the --input-reader help text
- It would be helpful to indicate the type (o5m vs o5c) in the header
instead of relying only on the file name; otherwise this won't work
with stdin.
- Validating the "o5m2" header would be useful to prevent non-o5m files
being processed by mistake. 
- The format appears endian-specific. If you choose not to make it
endian-neutral it would be good for the endianness to be recorded in the
header and checked to prevent mistakes if files are moved between
systems.
- Best practice is to use enumerated types instead of hard-coded hex
numbers for the protocol fields. Ideally all the protocol definitions
should be in a header file.
- The best practice for macros is to wrap them in a "do {...} while(0)".
This avoids problems with trailing ;'s and nested if/else's. An example
using your PERR macro would be:

#define PERR(f) do { \
  fprintf(stderr,"osm2pgsql Error: " f "\n"); \
} while (0)
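
To make the points above more concrete, here is a rough, untested
sketch of how they might fit together. All names (o5m_dataset_type,
o5m_check_signature, report_kind and so on) are made up for
illustration, the numeric dataset values are assumptions that need
checking against the format description, and "o5c2" is only an assumed
signature for change files:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Protocol constants as an enum instead of bare hex numbers.
 * ASSUMPTION: the values are placeholders for illustration only. */
enum o5m_dataset_type {
    O5M_DATASET_NODE     = 0x10,
    O5M_DATASET_WAY      = 0x11,
    O5M_DATASET_RELATION = 0x12
};

/* Map a dataset type byte to a readable name for error messages. */
static const char *o5m_dataset_name(int type)
{
    switch (type) {
    case O5M_DATASET_NODE:     return "node";
    case O5M_DATASET_WAY:      return "way";
    case O5M_DATASET_RELATION: return "relation";
    default:                   return "unknown";
    }
}

/* File kind derived from the header rather than from the file name,
 * so that input arriving on stdin can be classified as well. */
enum o5m_file_kind {
    O5M_KIND_INVALID = 0,
    O5M_KIND_DATA,      /* signature "o5m2" */
    O5M_KIND_CHANGE     /* ASSUMPTION: hypothetical signature "o5c2" */
};

/* Check the 4-byte signature at buf; reject anything else early. */
static enum o5m_file_kind o5m_check_signature(const unsigned char *buf,
                                              size_t len)
{
    assert(buf != NULL);
    if (len < 4)
        return O5M_KIND_INVALID;
    if (memcmp(buf, "o5m2", 4) == 0)
        return O5M_KIND_DATA;
    if (memcmp(buf, "o5c2", 4) == 0)
        return O5M_KIND_CHANGE;
    return O5M_KIND_INVALID;
}

#define PERR(f) do { \
    fprintf(stderr, "osm2pgsql Error: " f "\n"); \
} while (0)

/* The do/while wrapper lets PERR(...); sit in front of an else.
 * With a plain { ... } body the trailing ';' would end the if
 * statement and the else would no longer compile. */
static int report_kind(enum o5m_file_kind kind)
{
    if (kind == O5M_KIND_INVALID)
        PERR("input does not look like an o5m file");
    else
        return 0;
    return -1;
}
```

A caller would read the first bytes of the input, pass them to
o5m_check_signature() and give up if report_kind() returns -1, before
any parsing starts.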

  Jon