<div dir="ltr">On Sat, Aug 30, 2008 at 2:51 AM, Frederik Ramm <span dir="ltr">&lt;<a href="mailto:frederik@remote.org">frederik@remote.org</a>&gt;</span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

Hi,<br>

<div class="Ih2E3d"><br>

Richard Fairhurst wrote:<br>

> Well, the relevant bit of the migration is<br>

</div>[...]<br>

<br>

I looked that up myself but didn't go the extra length to find out that<br>

"string" actually means 255 characters only... I always assumed that we<br>

allowed longer values. There goes the freedom of stuffing your sales<br>

brochure into a tag value ;-)<br>

<br>

(Nodes currently do allow longer values but that's because they are<br>

still all lumped together and not broken out into their own node_tags<br>

table, right?)<br>

<div class="Ih2E3d"><br>

> So I guess the solution is either for Osmosis to conform to Postel's<br>

> Law;<br>

<br>

</div>I know it is not Osmosis' fault because Osmosis just uses standard XML<br>

parsing but I'm a bit unhappy about the gigantic waste of processing<br>

power incurred by all this parsing stuff - it seems anyone touching the<br>

XML file with a "proper library" has to actually decipher the UTF-8<br>

sequences, whereas if I just want to use Osmosis to split out a section<br>

of the data or apply a patch, I would be perfectly happy if it would<br>

process whatever data is there *without* trying to make sense of it more<br>

than it absolutely has to. (I don't think there can ever be a double or<br>

single quote anywhere in bytes 2-n of an n-byte UTF-8 sequence, can there?)<br>

<br>

I guess I'll have to resort to dirty regex parsing if I want that.<br>
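<br>
On the question above about bytes 2-n: in well-formed UTF-8 every follow-on byte lies in 0x80..0xBF, so none of them can equal a double quote (0x22) or a single quote (0x27). A minimal Ruby sketch that checks this on a sample string; the helper name follow_on_bytes and the sample text are made up for illustration:<br>

```ruby
# Hypothetical helper: returns bytes 2..n of every character in str,
# i.e. everything except the first byte of each UTF-8 sequence.
def follow_on_bytes(str)
  str.each_char.flat_map { |ch| ch.bytes.drop(1) }
end

# Sample with 1-, 2-, 3- and 4-byte characters, plus literal quotes.
sample = "say \"caf\u00e9\" for 5 \u20ac \u{1F600}"

# True when every follow-on byte is in the continuation range 0x80..0xBF.
all_continuation = follow_on_bytes(sample).all? { |b| (0x80..0xBF).cover?(b) }
puts all_continuation
```

This only holds for valid UTF-8 input; a byte-level scan for quotes on arbitrary or mis-encoded data gets no such guarantee.<br>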

<div class="Ih2E3d"><br>

> or for<br>

> Potlatch/amf_controller, which don't currently have any limit on key/<br>

> value lengths (well, 64k :) ), to preprocess keys/values by<br>

> truncating at the nearest UTF-8 boundary before 255 bytes.<br>

<br>

</div>I have verified that the regular API suffers from the same problem.<br>

Upload something with a UTF-8 character at position 255 and you get<br>

into trouble. So actually the API needs to do something as well, either<br>

truncate the data properly, or at least return an error to the client if<br>

you try to set a value of more than 255 bytes.<br>

<br>

I think it could be done in the add_tag_keyval method of the<br>

way/relation controllers. I am actually in favour of returning an error<br>

instead of silently truncating data because I think we should only<br>

return "ok" if we have stored all data that was sent.<br>

<br>

Returning an error should be no more difficult than throwing an<br>

exception there if v.length exceeds 255; the only thing I am not sure about<br>

is whether Ruby will try to be smart and return the length not in bytes<br>

but in characters...?<br>

<br>
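On the bytes-versus-characters doubt: in Ruby 1.8 String#length counts bytes, while from Ruby 1.9 onwards it counts characters and String#bytesize gives the byte count. A rough sketch of a byte-based check follows; MAX_TAG_BYTES, TagTooLongError and check_tag_length are hypothetical names, not taken from the API code, and the 255-byte limit is the one assumed in this thread:<br>

```ruby
# Hypothetical limit and names, for illustration only.
MAX_TAG_BYTES = 255

class TagTooLongError < StandardError; end

# Returns value unchanged, or raises if it needs more than MAX_TAG_BYTES bytes.
def check_tag_length(value)
  bytes = value.bytesize # byte count, independent of how String#length counts
  if bytes > MAX_TAG_BYTES
    raise TagTooLongError, "tag value is #{bytes} bytes, limit is #{MAX_TAG_BYTES}"
  end
  value
end

# 200 two-byte characters: fine by character count, too long by byte count.
v = "\u00e9" * 200
puts "#{v.length} characters, #{v.bytesize} bytes"
```

Whether the limit should be counted in bytes or in characters depends on how the column is defined, so the constant and the choice of bytesize are assumptions here.<br>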

If one wanted to silently truncate, then I believe the rule is: You may<br>

cut a string between neighbouring characters "a" and "b" if "b" is in<br>

the range 0..127 (0x00..0x7f - regular one-byte character) or 192..253<br>

(0xc0..0xfd - first byte of UTF-8 sequence). Anything outside these<br>

ranges may be a follow-on byte of a UTF-8 sequence.<br>

<br>
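The cutting rule above can be sketched in Ruby as follows. For valid UTF-8, "b is in 0..127 or 192..253" is the same as "b is not a follow-on byte (0x80..0xBF)", which is the test used here; truncate_utf8 is a hypothetical helper name and the default of 255 bytes is the limit assumed in this thread:<br>

```ruby
# Hypothetical helper: cut str to at most max_bytes bytes without
# splitting a UTF-8 sequence. Assumes str is valid UTF-8.
def truncate_utf8(str, max_bytes = 255)
  return str if str.bytesize <= max_bytes

  cut = max_bytes
  # The byte at index cut is the first byte that would be dropped ("b").
  # Back up while it is a follow-on byte (bit pattern 10xxxxxx).
  cut -= 1 while cut > 0 && (str.getbyte(cut) & 0xC0) == 0x80
  str.byteslice(0, cut)
end

# 254 ASCII bytes followed by a two-byte character: the cut falls
# in the middle of the last character, so it is dropped entirely.
long = ("a" * 254) + "\u00e9"
short = truncate_utf8(long)
puts "#{short.bytesize} bytes, valid: #{short.valid_encoding?}"
```

The result may be a few bytes shorter than the limit, since a character that does not fit completely is left out.<br>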

Bye<br>

Frederik<br>

</blockquote></div><br>If I recall correctly, the database column is not actually set to UTF-8 (the data is double-encoded so that actual UTF-8 is returned to the client...). Wouldn't a better long-term fix be to change the database to UTF-8 (or whatever), so that MySQL presumably wouldn't allow invalid sequences to be stored? It would still be a good idea to raise an error if the value was too long, though.<br>

<br>Karl<br></div>