[OSM-dev] planet.osm - fix

Michael Strecke MStrecke at gmx.de
Mon Aug 14 20:47:58 BST 2006


Jonas Svensson wrote:

> Good to see a new dump. Unfortunatly there are about the same
> number of UTF-8 errors in this as in the july-dump. Well assuming
> my UTF8sanitizer is correct.

Judging from on earlier dump, there are various codesets (e.g. latin-1)
used in the planet.osm file, which lead to UTF-8 errors.

According to Wikipedia:

   http://en.wikipedia.org/wiki/Utf-8

UTF-8 uses 1 to 4 octets to encode the Unicode character. Valid ranges are:

(all numbers in hex, x = (0..F))

1 octet:
  00 - 7F    (= ASCII char)

2 octets:
  Cx or Dx, followed by (8x to Bx)

3 octets:
  Ex, followed by 2 * (8x to Bx)

4 octets:
  F0 to F7, followed by 3 * (8x to Bx)

The Wikipedia article then explains how to calculate the original
Unicode number.

Which means, for example:
If you find the character 9F in the XML file which has not been prefixed
by (C0 to F7), it is not a valid UTF-8 character.

I'm just writing a short program to identify the offending elements.

In this context... why are different charsets used in the mysql database?

(from http://svn.openstreetmap.org/sql/mysql-schema.sql)


CREATE TABLE `area_tags` (
  `id` bigint(64) NOT NULL default '0',
  `k` varchar(255) default NULL,
  `v` varchar(255) default NULL,
  `version` bigint(20) NOT NULL default '0',
  `sequence_id` int(11) NOT NULL auto_increment,
  PRIMARY KEY  (`id`,`version`,`sequence_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

CREATE TABLE `current_way_tags` (
  `id` bigint(64) default NULL,
  `k` varchar(255) default NULL,
  `v` varchar(255) default NULL,
  KEY `current_way_tags_id_idx` (`id`),
  FULLTEXT KEY `current_way_tags_v_idx` (`v`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;






More information about the dev mailing list