[OSM-dev] planet.osm - fix
Michael Strecke
MStrecke at gmx.de
Mon Aug 14 20:47:58 BST 2006
Jonas Svensson wrote:
> Good to see a new dump. Unfortunatly there are about the same
> number of UTF-8 errors in this as in the july-dump. Well assuming
> my UTF8sanitizer is correct.
Judging from on earlier dump, there are various codesets (e.g. latin-1)
used in the planet.osm file, which lead to UTF-8 errors.
According to Wikipedia:
http://en.wikipedia.org/wiki/Utf-8
UTF-8 uses 1 to 4 octets to encode the Unicode character. Valid ranges are:
(all numbers in hex, x = (0..F))
1 octet:
00 - 7F (= ASCII char)
2 octets:
Cx or Dx, followed by (8x to Bx)
3 octets:
Ex, followed by 2 * (8x to Bx)
4 octets:
F0 to F7, followed by 3 * (8x to Bx)
The Wikipedia article then explains how to calculate the original
Unicode number.
Which means, for example:
If you find the character 9F in the XML file which has not been prefixed
by (C0 to F7), it is not a valid UTF-8 character.
I'm just writing a short program to identify the offending elements.
In this context... why are different charsets used in the mysql database?
(from http://svn.openstreetmap.org/sql/mysql-schema.sql)
CREATE TABLE `area_tags` (
`id` bigint(64) NOT NULL default '0',
`k` varchar(255) default NULL,
`v` varchar(255) default NULL,
`version` bigint(20) NOT NULL default '0',
`sequence_id` int(11) NOT NULL auto_increment,
PRIMARY KEY (`id`,`version`,`sequence_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
CREATE TABLE `current_way_tags` (
`id` bigint(64) default NULL,
`k` varchar(255) default NULL,
`v` varchar(255) default NULL,
KEY `current_way_tags_id_idx` (`id`),
FULLTEXT KEY `current_way_tags_v_idx` (`v`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
More information about the dev
mailing list