[OSM-dev] UTF8 problem with last night's daily .osc

Sat Aug 30 16:39:24 BST 2008

On Sat, Aug 30, 2008 at 2:51 AM, Frederik Ramm <frederik at remote.org> wrote:

> Hi,
>
> Richard Fairhurst wrote:
> > Well, the relevant bit of the migration is
> [...]
>
> I looked that up myself but didn't go the extra length to find out that
> "string" actually means 255 characters only... I always assumed that we
> allowed longer values. There goes the freedom of stuffing your sales
> brochure into a tag value ;-)
>
> (Nodes currently do allow longer values but that's because they are
> still all lumped together and not broken out into their own node_tags
> table, right?)
>
> > So I guess the solution is either for Osmosis to conform to Postel's
> > Law;
>
> I know it is not Osmosis' fault because Osmosis just uses standard XML
> parsing but I'm a bit unhappy about the gigantic waste of processing
> power incurred by all this parsing stuff - it seems anyone touching the
> XML file with a "proper library" has to actually decipher the UTF8
> sequences, whereas if I just want to use Osmosis to split out a section
> of the data or apply a patch, I would be perfectly happy if it would
> process whatever data is there *without* trying to make sense of it more
> than it absolutely has to. (I don't think there can ever be a double or
> single quote anywhere in bytes 2-n of a n-byte UTF-8 sequence, can there?)
>
> I guess I'll have to resort to dirty regex parsing if I want that.
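
The question about quotes inside multi-byte sequences can be checked by brute force. The following is a minimal Python sketch, not anything taken from Osmosis; the function name `trailing_bytes` is made up for illustration. It encodes every code point and collects the bytes found at positions 2..n of each sequence.

```python
QUOTES = {0x22, 0x27}  # byte values of '"' and "'"


def trailing_bytes():
    """Return every byte value seen at positions 2..n of a UTF-8 sequence."""
    seen = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        # Drop the first byte of the sequence, keep the follow-on bytes.
        seen.update(chr(cp).encode("utf-8")[1:])
    return seen


seen = trailing_bytes()
print(hex(min(seen)), hex(max(seen)))  # range of follow-on byte values
print(seen & QUOTES)                   # quote bytes among them, if any
```

All follow-on bytes fall in 0x80..0xbf, so a byte-level scan for quote characters should not be fooled by the inside of a multi-byte sequence, provided the input is valid UTF-8 in the first place.
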
>
> > or for
> > Potlatch/amf_controller, which don't currently have any limit on key/
> > value lengths (well, 64k :) ), to preprocess keys/values by
> > truncating at the nearest UTF-8 boundary before 255 bytes.
>
> I have verified that the regular API suffers from the same problem.
> Upload something with a UTF-8 character at position 255 and you get
> into trouble. So actually the API needs to do something as well, either
> truncate the data properly, or at least return an error to the client if
> you try to set a value of more than 255 bytes.
>
> I think it could be done in the add_tag_keyval method of the
> way/relation controllers. I am actually in favour of returning an error
> instead of silently truncating data because I think we should only
> return "ok" if we have stored all data that was sent.
>
> Returning an error should be no more difficult than throwing an
> exception there if v.length exceeds 255, only thing I am not sure about
> is whether Ruby will try to be smart and return the length not in bytes
> but in characters...?
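
To illustrate why the bytes-versus-characters distinction matters for such a check, here is a small sketch in Python rather than Ruby. The names `check_tag_value`, `TagTooLongError` and `MAX_TAG_BYTES` are hypothetical and are not part of the API code. In Python, `len()` on a string counts characters, so the byte length has to be taken from the encoded form. In Ruby 1.9 and later the same distinction exists between `String#length` and `String#bytesize`.

```python
MAX_TAG_BYTES = 255  # the limit discussed in this thread


class TagTooLongError(ValueError):
    """Raised instead of silently truncating an over-long value."""


def check_tag_value(value, limit=MAX_TAG_BYTES):
    """Return value unchanged, or raise if its UTF-8 form exceeds limit bytes."""
    size = len(value.encode("utf-8"))
    if size > limit:
        raise TagTooLongError(
            "tag value is %d bytes, limit is %d" % (size, limit)
        )
    return value


value = "\u00e9" * 200              # 200 characters, 2 bytes each in UTF-8
print(len(value))                   # length in characters
print(len(value.encode("utf-8")))   # length in bytes
```

A check based on the character count would accept this value even though it needs 400 bytes of storage.
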
>
> If one wanted to silently truncate, then I believe the rule is: You may
> cut a string between neighbouring characters "a" and "b" if "b" is in
> the range 0..127 (0x00..0x7f - regular one-byte character) or 192..253
> (0xc0..0xfd - first byte of UTF-8 sequence). Anything outside these
> ranges may be a follow-on byte of a UTF-8 sequence.
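
The truncation rule above could be sketched as follows. This is an illustration in Python under the assumption that the input is already valid UTF-8; `truncate_utf8` and `is_safe_cut` are hypothetical names, not existing functions in the API or in Potlatch.

```python
def is_safe_cut(b):
    """True if a string may be cut immediately before byte value b."""
    # 0x00..0x7f: one-byte character; 0xc0..0xfd: first byte of a sequence.
    return b <= 0x7F or 0xC0 <= b <= 0xFD


def truncate_utf8(data, limit=255):
    """Cut data to at most limit bytes without splitting a UTF-8 sequence."""
    if len(data) <= limit:
        return data
    cut = limit
    # data[cut] is the first byte that would be dropped ("b" in the rule).
    # Step back until the cut falls in front of a safe byte.
    while cut > 0 and not is_safe_cut(data[cut]):
        cut -= 1
    return data[:cut]


# 254 ASCII bytes, then a two-byte character straddling the 255-byte mark.
raw = ("a" * 254 + "\u00e9" + "z").encode("utf-8")
short = truncate_utf8(raw)
print(len(short))
print(short.decode("utf-8") == "a" * 254)
```

A plain `raw[:255]` would keep the first byte of the two-byte character and drop the second, which is the kind of broken sequence that trips up a strict XML parser.
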
>
> Bye
> Frederik
>

If I recall correctly, the database column is not actually set for UTF-8
(but is double-encoded to return actual UTF-8 to the client...). Wouldn't it
be a better long-term fix to change the database to UTF-8 (or whatever)?
Then presumably MySQL wouldn't allow invalid sequences to be stored. It would
still be a good idea to raise an error if the value was too long, though.
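
For anyone unsure what double-encoding looks like in practice, here is a rough Python sketch. It assumes the column treats its contents as Latin-1, which is a guess for the sake of the example and not something confirmed in this thread; the function names are made up.

```python
def double_encode(text):
    """Encode as UTF-8, misread the bytes as Latin-1, then encode again."""
    utf8 = text.encode("utf-8")
    return utf8.decode("latin-1").encode("utf-8")


def undo_double_encode(data):
    """Reverse double_encode to get the original text back."""
    return data.decode("utf-8").encode("latin-1").decode("utf-8")


stored = double_encode("\u00e9")        # a single e-acute
print(len("\u00e9".encode("utf-8")))    # bytes in plain UTF-8
print(len(stored))                      # bytes after double-encoding
print(undo_double_encode(stored) == "\u00e9")
```

In this setup every byte of the original UTF-8 counts as one character to the database, so a 255-character limit behaves like a 255-byte limit and can cut through the middle of a sequence.
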

Karl