[OSM-dev] planet.osm - fix

Tue Aug 15 11:05:37 BST 2006

* @ 14/08/06 08:47:58 PM MStrecke at gmx.de wrote:
> Jonas Svensson wrote:
> 
> > Good to see a new dump. Unfortunately there are about the same
> > number of UTF-8 errors in this as in the July dump. Well, assuming
> > my UTF8sanitizer is correct.
> 
> Judging from an earlier dump, there are various character sets (e.g. latin-1)
> used in the planet.osm file, which leads to UTF-8 errors.
> 
> According to Wikipedia:
> 
>    http://en.wikipedia.org/wiki/Utf-8
> 
> UTF-8 uses 1 to 4 octets to encode each Unicode character. Valid ranges are:
> 
> (all numbers in hex, x = (0..F))
> 
> 1 octet:
>   00 - 7F    (= ASCII char)
> 
> 2 octets:
>   C2 to DF, followed by (8x to Bx)    (C0 and C1 only give overlong forms)
> 
> 3 octets:
>   Ex, followed by 2 * (8x to Bx)
> 
> 4 octets:
>   F0 to F4, followed by 3 * (8x to Bx)    (F5 to F7 would exceed U+10FFFF)
> 
> The Wikipedia article then explains how to calculate the original
> Unicode number.
> 
> Which means, for example:
> If you find the octet 9F in the XML file and it is not preceded by a lead
> octet (or by another continuation octet of the same sequence), it is not
> valid UTF-8.
> 
> I'm just writing a short program to identify the offending elements.
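
For anyone who wants to try the same check, here is a minimal sketch in
Python of a scanner built on the byte ranges quoted above. It is not the
program Michael mentions and not part of planet.rb; the function names are
made up for illustration. It uses the stricter RFC 3629 lead-octet ranges
(C2 to DF and F0 to F4), and it only looks at the lead octet and the number
of continuation octets, so a few finer cases (for example surrogates
encoded as ED A0..BF) would slip through.

```python
def utf8_sequence_length(lead):
    """Return the expected sequence length for a lead octet, or 0 if invalid."""
    if lead <= 0x7F:
        return 1  # plain ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # stray continuation octet (80-BF) or an octet never valid as a lead


def find_invalid_utf8(data):
    """Return the byte offsets at which no well-formed sequence starts."""
    bad = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        tail = data[i + 1:i + n]
        # A sequence is accepted only if all n - 1 continuation octets are present
        # and each lies in the range 80-BF.
        if n and len(tail) == n - 1 and all(0x80 <= b <= 0xBF for b in tail):
            i += n
        else:
            bad.append(i)
            i += 1
    return bad


if __name__ == "__main__":
    sample = b"caf\xe9 <tag k='name' v='\x9f'/>"
    print(find_invalid_utf8(sample))
```

The offsets it reports can then be mapped back to the enclosing XML
element. A full planet dump is large, so in practice the file would have
to be read in chunks rather than in one go.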

Any patches to planet.rb gratefully accepted.

> In this context... why are different charsets used in the mysql database?

The tables were created at different times with different versions of
MySQL. Unintentional, in other words.
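
Given that mix, one possible way to clean up a dump after the fact is to
decode each line as UTF-8 and fall back to latin-1 when that fails. The
sketch below is in Python and only illustrates the idea; it is not what
planet.rb does, the function names are hypothetical, and latin-1 as the
fallback is an assumption based on the charset mentioned above.

```python
def decode_with_fallback(raw, fallback="latin-1"):
    """Decode raw bytes as UTF-8; if that fails, decode with the fallback charset."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 was written as latin-1.
        return raw.decode(fallback)


def sanitize_lines(lines):
    """Re-encode each line of bytes as valid UTF-8."""
    return [decode_with_fallback(line).encode("utf-8") for line in lines]
```

Note that latin-1 accepts every byte value, so a wrong guess never raises
an error; it just yields the wrong characters. A line that mixes both
encodings would also be decoded entirely as latin-1.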

have fun,

SteveC steve at asklater.com http://www.asklater.com/steve/