[OSM-dev] broken utf8 in minute changeset 200907140650

Richard Fairhurst richard at systemed.net
Tue Jul 14 16:19:14 BST 2009


Ævar Arnfjörð Bjarmason wrote:
> * Potlatch will enter whatever raw binary string the user 
> supplies into the database that the main API would reject 
> as an invalid request, hence the corrupt data

Sort of.

From a client point of view, the bug you filed is that Linux Flash Player
has long been broken beyond belief and doesn't permit non-ASCII characters
to be entered into a textfield.
(See http://bugs.adobe.com/jira/browse/FP-40 .)

This morning's breakage is actually a different issue AFAICT. Potlatch (the SWF client)
has long used an ActionScript method, textField.restrict, to prevent control
characters (0x00-0x1F) being input into textfields. Unfortunately the latest
version of Ming (the open-source Flash compiler used to compile Potlatch),
0.4.2, appears to be broken and will not compile textField.restrict
correctly - it randomly uppercases character input (letters D to U, IIRC),
which is a whole heap of no good for entering tags. (See
http://bugs.libming.org/show_bug.cgi?id=88 .)

Consequently when I needed to commit a new revision of Potlatch at SOTM, and
only had a laptop with 0.4.2 installed, this check was temporarily removed.
It'll be back in this evening now I'm back with a machine with Ming 0.3 on
it.
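
For anyone wanting the same guard outside the SWF, here is a minimal sketch in
Python of what the check does: drop every character in the 0x00-0x1F range
before the text goes anywhere near the database. The function name
strip_control_chars is hypothetical, and this is not the Potlatch or Rails
code, just an illustration of the filter described above.

```python
def strip_control_chars(text: str) -> str:
    """Drop every character in the range U+0000..U+001F.

    Hypothetical helper: mirrors the range quoted above (0x00-0x1F),
    so tabs and newlines are removed as well.
    """
    return "".join(ch for ch in text if ord(ch) > 0x1F)


if __name__ == "__main__":
    # Non-ASCII text passes through untouched; only control characters go.
    print(strip_control_chars("Laugavegur\x03 \u00c6gisgata\n"))
```

Note that this keeps all non-ASCII characters, so it would not by itself catch
text that is already mis-encoded.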

As I mentioned to you the other day, it would be really useful if some
Linux-using OSMers could expand the reports at
http://trac.openstreetmap.org/ticket/1936 so we can find exactly _how_ FP
for Linux is breaking encoding, and fix it either in Potlatch or at the API.
From the two examples you give, for two-byte UTF-8, it appears to be adding
0x03 before the first byte and 0x83 0xC2 after it. But we need to work out
whether this is a universal pattern for all two-byte UTF-8 sequences, and
what happens with longer sequences. This should be fairly trivial for
someone with the Rails port installed on a Linux machine, I'd hope.
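
To make the guess above testable, here is a rough Python sketch that builds
the byte string the hypothesis predicts for a given two-byte character, so it
can be compared against what actually lands in the database. The helper names
are hypothetical, and the sketch assumes the second byte of the original
sequence follows unchanged after the inserted 0x83 0xC2, which the two
examples suggest but which is not confirmed.

```python
def hypothesized_corruption(ch: str) -> bytes:
    """Bytes the hypothesis predicts for one two-byte UTF-8 character.

    Assumption (unconfirmed): 0x03 is added before the first byte,
    0x83 0xC2 after it, and the second byte then follows unchanged.
    """
    raw = ch.encode("utf-8")
    if len(raw) != 2:
        raise ValueError("hypothesis only covers two-byte UTF-8 sequences")
    return bytes([0x03, raw[0], 0x83, 0xC2, raw[1]])


def matches_hypothesis(ch: str, observed: bytes) -> bool:
    """True if the observed bytes are exactly what the hypothesis predicts."""
    return observed == hypothesized_corruption(ch)


if __name__ == "__main__":
    # Print the predicted bytes in hex for a few two-byte characters.
    for ch in "\u00e9\u00f0\u00c6":
        print(ch, hypothesized_corruption(ch).hex(" "))
```

Anyone reporting on the ticket could paste the observed bytes into
matches_hypothesis for a handful of characters; a single False would show the
pattern is not universal.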

> And as has been pointed out there's an ambiguity as to what 
> sequences of bytes can be written to the database whether that 
> be full UTF-8 or some XML subset of it.

Indeed.
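
To illustrate the gap between the two, a small Python sketch: one check for
"is this well-formed UTF-8" and one for "is every character allowed by the
XML 1.0 Char production". The function names are hypothetical, and which of
the two the database should enforce is exactly the open question, so treat
this as a sketch and not a proposal for the API.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def is_xml10_char(ch: str) -> bool:
    """True if the character is allowed by the XML 1.0 Char production:
    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    """
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml_safe(data: bytes) -> bool:
    """True if the bytes are valid UTF-8 and contain only XML 1.0 characters."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return all(is_xml10_char(ch) for ch in text)


if __name__ == "__main__":
    sample = b"\x03\xc3\x83\xc2\xa9"
    print(is_valid_utf8(sample), is_xml_safe(sample))
```

A byte string containing 0x03 is perfectly good UTF-8 but cannot appear in an
XML 1.0 document, so the two checks disagree on it.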

cheers
Richard




