[OSM-dev] UTF8 problem with last night's daily .osc

Sat Aug 30 10:51:32 BST 2008

Hi,

Richard Fairhurst wrote:
> Well, the relevant bit of the migration is
[...]

I looked that up myself but didn't go the extra mile to find out that 
"string" actually means 255 characters only... I always assumed that we 
allowed longer values. There goes the freedom of stuffing your sales 
brochure into a tag value ;-)

(Nodes currently do allow longer values but that's because they are 
still all lumped together and not broken out into their own node_tags 
table, right?)

> So I guess the solution is either for Osmosis to conform to Postel's  
> Law; 

I know it is not Osmosis' fault because Osmosis just uses standard XML 
parsing but I'm a bit unhappy about the gigantic waste of processing 
power incurred by all this parsing stuff - it seems anyone touching the 
XML file with a "proper library" has to actually decipher the UTF8 
sequences, whereas if I just want to use Osmosis to split out a section 
of the data or apply a patch, I would be perfectly happy if it would 
process whatever data is there *without* trying to make sense of it more 
than it absolutely has to. (I don't think there can ever be a double or 
single quote anywhere in bytes 2-n of an n-byte UTF-8 sequence, can there?)
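
For illustration, a minimal Python sketch of that last point (it is not 
part of Osmosis, and the helper name follow_on_bytes is made up here). 
In UTF-8 every byte after the first one of a multi-byte sequence has the 
bit pattern 10xxxxxx, i.e. lies in 0x80..0xBF, so it should never be 
equal to a double quote (0x22) or a single quote (0x27):

```python
# Follow-on bytes of a UTF-8 sequence are always in 0x80..0xBF,
# so none of them can collide with '"' (0x22) or "'" (0x27).
QUOTES = {0x22, 0x27}

def follow_on_bytes(text):
    """Yield bytes 2..n of every multi-byte UTF-8 sequence in text."""
    for ch in text:
        encoded = ch.encode("utf-8")
        # encoded[0] is the lead byte; the rest are follow-on bytes.
        for b in encoded[1:]:
            yield b

# Sample with 2-, 3- and 4-byte sequences (sharp s, euro sign, emoji, e-acute).
sample = "Stra\u00dfe \u20ac \U0001F600 caf\u00e9"
trailing = list(follow_on_bytes(sample))

print(all(0x80 <= b <= 0xBF for b in trailing))
print(QUOTES & set(trailing))
```

This only demonstrates the property on a small sample; the general 
guarantee comes from the UTF-8 encoding rules themselves.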

I guess I'll have to resort to dirty regex parsing if I want that.

> or for  
> Potlatch/amf_controller, which don't currently have any limit on key/ 
> value lengths (well, 64k :) ), to preprocess keys/values by  
> truncating at the nearest UTF-8 boundary before 255 bytes.  

I have verified that the regular API suffers from the same problem. 
Upload something with a UTF-8 character at position 255 and you get 
into trouble. So actually the API needs to do something as well, either 
truncate the data properly, or at least return an error to the client if 
you try to set a value of more than 255 bytes.

I think it could be done in the add_tag_keyval method of the 
way/relation controllers. I am actually in favour of returning an error 
instead of silently truncating data because I think we should only 
return "ok" if we have stored all data that was sent.

Returning an error should be no more difficult than throwing an 
exception there if v.length exceeds 255; the only thing I am not sure 
about is whether Ruby will try to be smart and return the length not in 
bytes but in characters...?
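
The difference between the two kinds of length can be sketched as 
follows. This is Python rather than Ruby, so it says nothing about what 
v.length does in the API code; check_tag_value and MAX_TAG_BYTES are 
hypothetical names used only for this example. In Python, len() of a 
string counts characters, so the byte count has to be taken from the 
encoded form:

```python
MAX_TAG_BYTES = 255  # the limit discussed in this thread

def check_tag_value(value):
    """Raise ValueError if value needs more than MAX_TAG_BYTES bytes in UTF-8."""
    size = len(value.encode("utf-8"))  # length in bytes, not characters
    if size > MAX_TAG_BYTES:
        raise ValueError(
            "tag value is %d bytes, limit is %d" % (size, MAX_TAG_BYTES)
        )
    return value

value = "\u00e9" * 200  # 200 characters, each 2 bytes in UTF-8

print(len(value))                  # character count
print(len(value.encode("utf-8")))  # byte count
```

A check on the character count would accept this value even though it 
would not fit into 255 bytes, which is why the byte count is the one to 
test.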

If one wanted to silently truncate, then I believe the rule is: You may 
cut a string between neighbouring characters "a" and "b" if "b" is in 
the range 0..127 (0x00..0x7f - regular one-byte character) or 192..253 
(0xc0..0xfd - first byte of UTF-8 sequence). Anything outside these 
ranges may be a follow-on byte of a UTF-8 sequence.
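
A minimal Python sketch of such a truncation, assuming the input is 
valid UTF-8 held as raw bytes (truncate_utf8 is a hypothetical helper, 
not existing API code). Instead of listing the allowed values of "b", it 
tests for the complement: as long as the first dropped byte is a 
follow-on byte (0x80..0xBF), the cut is moved to the left.

```python
def truncate_utf8(data, limit=255):
    """Cut bytes to at most `limit` bytes without splitting a UTF-8 sequence."""
    if len(data) <= limit:
        return data
    cut = limit
    # data[cut] is byte "b", the first byte that would be dropped.
    # While it is a follow-on byte, the cut would land inside a
    # sequence, so step back towards that sequence's lead byte.
    while cut > 0 and 0x80 <= data[cut] <= 0xBF:
        cut -= 1
    return data[:cut]

# 254 ASCII bytes followed by a 2-byte character straddling the limit.
raw = ("a" * 254 + "\u00e9").encode("utf-8")
short = truncate_utf8(raw)

print(len(raw), len(short))
print(short.decode("utf-8") == "a" * 254)
```

The result can be shorter than the limit by up to three bytes, but it 
still decodes cleanly.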

Bye
Frederik

-- 
Frederik Ramm  ##  eMail frederik at remote.org  ##  N49°00'09" E008°23'33"