[OSM-dev] UTF8 problem with last night's daily .osc

Sat Aug 30 11:51:13 BST 2008

On Sat, Aug 30, 2008 at 11:51 AM, Frederik Ramm <frederik at remote.org> wrote:
> I know it is not Osmosis' fault because Osmosis just uses standard XML
> parsing but I'm a bit unhappy about the gigantic waste of processing
> power incurred by all this parsing stuff - it seems anyone touching the
> XML file with a "proper library" has to actually decipher the UTF8
> sequences, whereas if I just want to use Osmosis to split out a section
> of the data or apply a patch, I would be perfectly happy if it would
> process whatever data is there *without* trying to make sense of it more
> than it absolutely has to. (I don't think there can ever be a double or
> single quote anywhere in bytes 2-n of a n-byte UTF-8 sequence, can there?)

UTF-8 parsing is trivial. Read the first byte; a lookup table tells you
the length of the sequence. Copy the remaining bytes of that sequence.
And no, a double quote can't appear anywhere inside a multi-byte
sequence, which is why the parser complained. Otherwise we would just
have seen mysterious data corruption.
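
To illustrate the lookup-table approach, here is a minimal Python sketch.
It is not what Osmosis or any XML library actually does; the names
SEQ_LEN and split_sequences are made up for this example, and the lead
byte ranges are the ones from RFC 3629. It also checks that every
follow-on byte has the form 10xxxxxx, which is why a stray double quote
(0x22) right after a lead byte is reported instead of being swallowed.

```python
def _build_table():
    """Map each possible first byte to the total length of its sequence."""
    table = [0] * 256                 # 0 marks bytes that cannot start a sequence
    for b in range(0x00, 0x80):
        table[b] = 1                  # plain ASCII
    for b in range(0xC2, 0xE0):
        table[b] = 2
    for b in range(0xE0, 0xF0):
        table[b] = 3
    for b in range(0xF0, 0xF5):
        table[b] = 4
    return table


SEQ_LEN = _build_table()              # hypothetical name, for illustration only


def split_sequences(data):
    """Split a bytes object into its individual UTF-8 sequences.

    Raises ValueError on a byte that cannot start a sequence, on a
    sequence cut short by the end of the data, or on a follow-on byte
    that is not of the form 10xxxxxx.
    """
    out = []
    i = 0
    while i < len(data):
        n = SEQ_LEN[data[i]]
        if n == 0:
            raise ValueError(
                "byte 0x%02x at offset %d cannot start a sequence" % (data[i], i))
        seq = data[i:i + n]
        if len(seq) != n or any((b & 0xC0) != 0x80 for b in seq[1:]):
            raise ValueError("truncated or malformed sequence at offset %d" % i)
        out.append(seq)
        i += n
    return out
```

Feeding it a value that was chopped in the middle of a character and
then followed by the closing quote of an XML attribute, for example
b'caf\xc3"', raises ValueError at the 0xC3 byte, which is the same kind
of complaint a proper parser makes.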

Sure, your regex parsing wouldn't have failed on this particular test,
but the moment you want to actually load the result into JOSM,
osm2pgsql or the coastline checker, it will barf anyway, so you may as
well get it right the first time.

> I have verified that the regular API suffers from the same problem.
> Upload something with a UTF-8 character at position 255 and you get
> into trouble. So actually the API needs to do something as well, either
> truncate the data properly, or at least return an error to the client if
> you try to set a value of more than 255 bytes.

Can you set MySQL to strict mode so it throws an error instead of
silently truncating?

> If one wanted to silently truncate, then I believe the rule is: You may
> cut a string between neighbouring characters "a" and "b" if "b" is in
> the range 0..127 (0x00..0x7f - regular one-byte character) or 192..253
> (0xc0..0xfd - first byte of UTF-8 sequence). Anything outside these
> ranges may be a follow-on byte of a UTF-8 sequence.

That's about right.
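
A minimal Python sketch of that truncation rule, assuming the input is
already valid UTF-8 and the limit is counted in bytes. The function name
truncate_utf8 and the default of 255 are chosen for this example only.
Instead of listing the bytes where a cut is allowed, it tests for the
follow-on pattern 10xxxxxx (0x80..0xbf). That differs from the ranges
quoted above only for 0xfe and 0xff, which never occur in valid UTF-8.

```python
def truncate_utf8(data, limit=255):
    """Cut UTF-8 encoded bytes to at most `limit` bytes without splitting
    a multi-byte sequence. Assumes `data` is valid UTF-8."""
    if len(data) <= limit:
        return data
    cut = limit
    # data[cut] is byte "b", the first byte that would be dropped. While it
    # is a follow-on byte, the cut would land inside a sequence, so back up.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]
```

For a value of 254 ASCII bytes followed by a two-byte character, this
returns the 254 ASCII bytes rather than leaving a dangling lead byte at
position 255.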

Have a nice day,
-- 
Martijn van Oosterhout <kleptog at gmail.com> http://svana.org/kleptog/