[OSM-dev] UTF8 problem with last night's daily .osc
Frederik Ramm
frederik at remote.org
Sat Aug 30 10:51:32 BST 2008
Hi,
Richard Fairhurst wrote:
> Well, the relevant bit of the migration is
[...]
I looked that up myself but didn't go the extra length to find out that
"string" actually means 255 characters only... I always assumed that we
allowed longer values. There goes the freedom of stuffing your sales
brochure into a tag value ;-)
(Nodes currently do allow longer values but that's because they are
still all lumped together and not broken out into their own node_tags
table, right?)
> So I guess the solution is either for Osmosis to conform to Postel's
> Law;
I know it is not Osmosis' fault because Osmosis just uses standard XML
parsing but I'm a bit unhappy about the gigantic waste of processing
power incurred by all this parsing stuff - it seems anyone touching the
XML file with a "proper library" has to actually decipher the UTF8
sequences, whereas if I just want to use Osmosis to split out a section
of the data or apply a patch, I would be perfectly happy if it would
process whatever data is there *without* trying to make sense of it more
than it absolutely has to. (I don't think there can ever be a double or
single quote anywhere in bytes 2-n of a n-byte UTF-8 sequence, can there?)
I guess I'll have to resort to dirty regex parsing if I want that.
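
As a rough check of the question in parentheses above: in UTF-8, bytes 2..n of a multi-byte sequence all have the bit pattern 10xxxxxx (0x80..0xBF), so they cannot equal the ASCII quote bytes 0x22 or 0x27. The sketch below only illustrates that property; the helper names `continuation_byte?` and `trailing_bytes` are made up for this example and are not part of Osmosis or the API.

```ruby
# A UTF-8 continuation byte has the bit pattern 10xxxxxx, i.e. 0x80..0xBF.
def continuation_byte?(byte)
  (byte & 0xC0) == 0x80
end

# Collect bytes 2..n of every multi-byte character in a string.
def trailing_bytes(str)
  str.each_char.flat_map { |ch| ch.bytes.drop(1) }
end

# Sample text with 2-byte and 3-byte characters (written as escapes).
sample = "Stra\u00DFe \u201Cquoted\u201D \u65E5\u672C\u8A9E"
quotes = ['"'.ord, "'".ord] # 0x22 and 0x27

# Every trailing byte is a continuation byte ...
puts trailing_bytes(sample).all? { |b| continuation_byte?(b) }
# ... and neither ASCII quote falls in the continuation range.
puts quotes.none? { |q| continuation_byte?(q) }
```

So a byte-level scan for quote characters should not be fooled by the inside of a well-formed multi-byte sequence, although it says nothing about malformed input.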
> or for
> Potlatch/amf_controller, which don't currently have any limit on key/
> value lengths (well, 64k :) ), to preprocess keys/values by
> truncating at the nearest UTF-8 boundary before 255 bytes.
I have verified that the regular API suffers from the same problem.
Upload something with a multi-byte UTF-8 character at position 255 and you get
into trouble. So actually the API needs to do something as well, either
truncate the data properly, or at least return an error to the client if
you try to set a value of more than 255 bytes.
I think it could be done in the add_tag_keyval method of the
way/relation controllers. I am actually in favour of returning an error
instead of silently truncating data because I think we should only
return "ok" if we have stored all data that was sent.
Returning an error should be no more difficult than throwing an
exception there if v.length exceeds 255, only thing I am not sure about
is whether Ruby will try to be smart and return the length not in bytes
but in characters...?
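
On the bytes-versus-characters question, a minimal sketch, assuming the limit is meant in bytes as discussed in this thread: in current Ruby versions `String#length` counts characters while `String#bytesize` counts bytes (older 1.8-series interpreters counted bytes in `length`), so a check written against `bytesize` avoids the ambiguity. The names `MAX_TAG_BYTES`, `TagTooLongError` and `check_tag_value!` are hypothetical and do not come from the API code.

```ruby
# Hypothetical limit and error class, for illustration only.
MAX_TAG_BYTES = 255

class TagTooLongError < StandardError; end

# Raise instead of truncating if the value does not fit in MAX_TAG_BYTES bytes.
def check_tag_value!(v)
  if v.bytesize > MAX_TAG_BYTES
    raise TagTooLongError,
          "tag value is #{v.bytesize} bytes, limit is #{MAX_TAG_BYTES}"
  end
  v
end

value = "\u00E9" * 200  # 200 characters, 2 bytes each in UTF-8
puts value.length       # counted in characters
puts value.bytesize     # counted in bytes

begin
  check_tag_value!(value)
rescue TagTooLongError => e
  puts "rejected: #{e.message}"
end
```

Whether the real column limit is enforced in bytes or in characters depends on the database setup, so the constant and the comparison would need adjusting accordingly.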
If one wanted to silently truncate, then I believe the rule is: You may
cut a string between neighbouring bytes "a" and "b" if "b" is in
the range 0..127 (0x00..0x7f - regular one-byte character) or 192..253
(0xc0..0xfd - first byte of a UTF-8 sequence). Anything outside these
ranges may be a follow-on byte of a UTF-8 sequence.
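
The truncation rule above can be sketched as follows. Instead of listing the allowed values of "b", this version tests for the follow-on pattern 10xxxxxx (0x80..0xBF) and steps the cut point back while "b" matches it, which amounts to the same rule for well-formed UTF-8. The function name `truncate_utf8` is an assumption for illustration, not existing API code.

```ruby
# Return at most max_bytes bytes of str without splitting a UTF-8 sequence.
def truncate_utf8(str, max_bytes)
  return str.dup if str.bytesize <= max_bytes

  cut = max_bytes
  # str.getbyte(cut) is "b", the first byte that would be dropped.
  # While it is a follow-on byte, the cut would land inside a sequence,
  # so move the cut point back by one byte.
  cut -= 1 while cut > 0 && (str.getbyte(cut) & 0xC0) == 0x80
  str.byteslice(0, cut)
end

# 254 ASCII bytes, then a 2-byte character straddling position 255.
value = ("a" * 254) + "\u00E9" + "b"
short = truncate_utf8(value, 255)
puts short.bytesize
puts short.valid_encoding?
```

Here the two-byte character does not fit completely, so the result is one byte shorter than the limit rather than ending in half a character.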
Bye
Frederik
--
Frederik Ramm ## eMail frederik at remote.org ## N49°00'09" E008°23'33"