<div dir="ltr">On Sat, Aug 30, 2008 at 2:51 AM, Frederik Ramm <span dir="ltr"><<a href="mailto:frederik@remote.org">frederik@remote.org</a>></span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Hi,<br>
<div class="Ih2E3d"><br>
Richard Fairhurst wrote:<br>
> Well, the relevant bit of the migration is<br>
</div>[...]<br>
<br>
I looked that up myself but didn't go the extra length to find out that<br>
"string" actually means 255 characters only... I always assumed that we<br>
allowed longer values. There goes the freedom of stuffing your sales<br>
brochure into a tag value ;-)<br>
<br>
(Nodes currently do allow longer values but that's because they are<br>
still all lumped together and not broken out into their own node_tags<br>
table, right?)<br>
<div class="Ih2E3d"><br>
> So I guess the solution is either for Osmosis to conform to Postel's<br>
> Law;<br>
<br>
</div>I know it is not Osmosis' fault because Osmosis just uses standard XML<br>
parsing but I'm a bit unhappy about the gigantic waste of processing<br>
power incurred by all this parsing stuff - it seems anyone touching the<br>
XML file with a "proper library" has to actually decipher the UTF8<br>
sequences, whereas if I just want to use Osmosis to split out a section<br>
of the data or apply a patch, I would be perfectly happy if it would<br>
process whatever data is there *without* trying to make sense of it more<br>
than it absolutely has to. (I don't think there can ever be a double or<br>
single quote anywhere in bytes 2-n of a n-byte UTF-8 sequence, can there?)<br>
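<br>
The answer to that parenthetical can be checked by brute force: every byte after the first in a multi-byte UTF-8 sequence has the bit pattern 10xxxxxx (0x80..0xBF), so it can never equal the ASCII quotes 0x22 or 0x27. A minimal sketch in Python (not code from this thread; the function name is made up for illustration) that encodes every code point and collects the trailing bytes:<br>

```python
# Collect every byte value that appears in positions 2..n of any
# UTF-8 sequence and confirm none of them is an ASCII quote.
QUOTES = {0x22, 0x27}  # double quote and single quote


def continuation_byte_values():
    """Return the set of byte values seen after the first byte of a sequence."""
    seen = set()
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates have no UTF-8 encoding
        seen.update(chr(cp).encode("utf-8")[1:])
    return seen


values = continuation_byte_values()
print(hex(min(values)), hex(max(values)))  # range of follow-on bytes
print(QUOTES & values)                     # quotes found among them
```

If that holds, a byte-level scan for quote characters cannot be fooled by the inside of a multi-byte sequence, provided the input is valid UTF-8.<br>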
<br>
I guess I'll have to resort to dirty regex parsing if I want that.<br>
<div class="Ih2E3d"><br>
> or for<br>
> Potlatch/amf_controller, which don't currently have any limit on key/<br>
> value lengths (well, 64k :) ), to preprocess keys/values by<br>
> truncating at the nearest UTF-8 boundary before 255 bytes.<br>
<br>
</div>I have verified that the regular API suffers from the same problem.<br>
Upload something with a UTF-8 character at position 255 and you get<br>
into trouble. So actually the API needs to do something as well, either<br>
truncate the data properly, or at least return an error to the client if<br>
you try to set a value of more than 255 bytes.<br>
<br>
I think it could be done in the add_tag_keyval method of the<br>
way/relation controllers. I am actually in favour of returning an error<br>
instead of silently truncating data because I think we should only<br>
return "ok" if we have stored all data that was sent.<br>
<br>
Returning an error should be no more difficult than throwing an<br>
exception there if v.length exceeds 255; the only thing I am not sure about<br>
is whether Ruby will try to be smart and return the length not in bytes<br>
but in characters...?<br>
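<br>
Whether a length call counts bytes or characters depends on the language and version (in Ruby 1.9 and later, String#length counts characters and String#bytesize counts bytes). The sketch below shows the distinction in Python rather than in the actual Rails controller; check_tag_value and TagTooLongError are hypothetical names, and only the 255-byte limit comes from the discussion above:<br>

```python
MAX_TAG_BYTES = 255  # limit discussed in this thread


class TagTooLongError(ValueError):
    """Hypothetical error raised instead of silently truncating."""


def check_tag_value(value: str) -> str:
    """Return value unchanged, or raise if its UTF-8 form exceeds the limit."""
    byte_length = len(value.encode("utf-8"))  # bytes, not characters
    if byte_length > MAX_TAG_BYTES:
        raise TagTooLongError(
            f"tag value is {byte_length} bytes, limit is {MAX_TAG_BYTES}"
        )
    return value


sample = "é" * 200                    # 200 characters, two bytes each
print(len(sample))                    # character count
print(len(sample.encode("utf-8")))    # byte count
```

A check on the character count alone would accept the sample value even though it needs 400 bytes of storage, which is why the comparison is made on the encoded form.<br>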
<br>
If one wanted to silently truncate, then I believe the rule is: You may<br>
cut a string between neighbouring characters "a" and "b" if "b" is in<br>
the range 0..127 (0x00..0x7f - regular one-byte character) or 192..253<br>
(0xc0..0xfd - first byte of UTF-8 sequence). Anything outside these<br>
ranges may be a follow-on byte of a UTF-8 sequence.<br>
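<br>
The cutting rule above can be sketched as follows. This is an illustration in Python under the assumption that the input is already valid UTF-8; truncate_utf8 and is_sequence_start are hypothetical names, not functions from the API code:<br>

```python
def is_sequence_start(byte: int) -> bool:
    """True if byte may begin a character: 0x00..0x7F or 0xC0..0xFD."""
    return byte <= 0x7F or 0xC0 <= byte <= 0xFD


def truncate_utf8(data: bytes, limit: int = 255) -> bytes:
    """Cut data to at most limit bytes without splitting a UTF-8 sequence."""
    if len(data) <= limit:
        return data
    cut = limit
    # data[cut] is the byte "b" that would directly follow the cut.
    # Move the cut left until that byte starts a new character.
    while cut > 0 and not is_sequence_start(data[cut]):
        cut -= 1
    return data[:cut]


raw = ("é" * 128).encode("utf-8")   # 256 bytes, one over the limit
short = truncate_utf8(raw)
print(len(short), short.decode("utf-8") == "é" * 127)
```

The result may be shorter than the limit by up to a few bytes, since the cut only ever moves backwards to the previous character boundary.<br>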
<br>
Bye<br>
Frederik<br>
</blockquote></div><br>If I recall correctly, the database column is not actually set to UTF-8 (but the data is double-encoded so that actual UTF-8 is returned to the client...). Wouldn't a better long-term fix be to change the database to UTF-8 (or whatever), so that MySQL would presumably refuse to store invalid sequences? It would still be a good idea to raise an error if the value is too long, though.<br>
<br>Karl<br></div>