[OSM-dev] Planet Dump Timings

Sebastian Spaeth Sebastian at SSpaeth.de
Sun Sep 9 11:18:42 BST 2007


On Sun, Sep 09, 2007 at 01:47:27AM +1000, Brett Henderson wrote:

> STEP 1 - Get the current planet dumper working locally.
> STEP 2 - Produce a baseline osm file and baseline timing measurements using 
> existing dumper.
> STEP 3 - Produce an osmosis osm file.
> STEP 4 - Perform a comparison of both files to ensure they contain the same 
> data.

> So far all have been caused by differences in handling of non-latin 
> characters, and some rounding differences in double values.
>
> The rounding differences so far have always been when a 5 is rounded one 
> way or the other so I don't think this is a concern.

Given that the 7th decimal place of a degree is roughly 1 cm at the equator, I am not very concerned about this. It's interesting, though, that osmosis doesn't seem to round the same way here. Do you use sprintf to output your values?
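As a rough check of the 1 cm figure, here is a back-of-the-envelope calculation. The equatorial circumference used is my own assumption (about 40,075 km), not a number from this thread:

```python
# Rough size of one unit in the 7th decimal place of a degree at the equator.
# Assumed value: equatorial circumference of about 40,075,017 metres.
EQUATOR_CIRCUMFERENCE_M = 40_075_017

metres_per_degree = EQUATOR_CIRCUMFERENCE_M / 360
cm_per_1e7_degree = metres_per_degree * 1e-7 * 100

print(round(cm_per_1e7_degree, 2))  # prints 1.11
```

So a rounding difference in the last digit moves a node by about a centimetre at most.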

> The handling of 
> non-latin characters concerns me but I have no idea if it's a serious 
> problem or not.

> "Kronprinsesse Märthas allé" is written by planet.rb (viewed by 
> less).
> "Kronprinsesse MÃÆärthas allÃÆé" is written by osmosis 
> (viewed by less).
> "Kronprinsesse Märthas allé" is displayed by MySQL Query 
> Browser.
> "Kronprinsesse Märthas allé" is displayed when viewed from the API in 
> firefox.

> Do you know much about unicode? Is there a way I can verify which of these 
> outputs is correct (if not both)?

OK, I have to admit that my charset knowledge is rather minimal, so I don't know how serious this is. However, the fact that planet.rb and osmosis produce different output doesn't look good. The API should hand out all strings in UTF-8, as far as I know.
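One way to at least check whether a chunk of output is well-formed UTF-8 is a strict decode. This is only a sketch: the function name is made up for illustration, and the sample bytes are my own, not taken from either dump:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(b"M\xc3\xa4rthas"))  # UTF-8 bytes for "Märthas": True
print(is_valid_utf8(b"M\xe4rthas"))      # Latin-1 bytes for "Märthas": False
```

Note that this only tells you the bytes are well-formed. Text that was encoded twice is still valid UTF-8, so passing this check does not prove the characters are the right ones.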

It seems that Java uses a modified version of UTF-8 for some of its streams: http://en.wikipedia.org/wiki/UTF-8#Java. Looking at your example (loaded in Firefox), I see:
Node 78270
"Kronprinsesse M=C3=C2=A4rthas all=C3=C2=A9" is written by planet.rb (viewed by less).
"Kronprinsesse M=C3=C6=C3=C2=A4rthas all=C3=C6=C3=C2=A9" is written by osmosis (viewed by less).
"Kronprinsesse M=C3=83=C2=A4rthas all=C3=83=C2=A9" is displayed by MySQL Query Browser.
"Kronprinsesse M=E4rthas all=E9" is displayed when viewed from the API in firefox.

For reference, the UTF-8/16/32 encodings of the first non-ASCII letter, ä (U+00E4), are listed at http://www.fileformat.info/info/unicode/char/00e4/index.htm.
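To see how some of the byte sequences above could arise, here is a small sketch. It is only a guess at the mechanism (UTF-8 bytes misread as Latin-1 and then encoded again), not a diagnosis of what osmosis or planet.rb actually do:

```python
text = "ä"  # U+00E4

latin1 = text.encode("latin-1")  # one byte: E4
utf8 = text.encode("utf-8")      # two bytes: C3 A4

# Misread the UTF-8 bytes as Latin-1, then encode the result as UTF-8 again.
double = utf8.decode("latin-1").encode("utf-8")

print(latin1.hex(), utf8.hex(), double.hex(), sep=" | ")  # prints e4 | c3a4 | c383c2a4
```

The single byte E4 matches what Firefox shows from the API, and C3 83 C2 A4 matches the MySQL Query Browser line, which would be consistent with that string having been encoded to UTF-8 twice somewhere along the way.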

Somebody with more experience should know what to do about this. As I said, I have little experience with charsets.

spaetz