[OSM-dev] strange Osmosis/XML/...? problem yesterday

Andy Allan gravitystorm at gmail.com
Fri Aug 14 15:16:44 BST 2009


On Fri, Aug 14, 2009 at 12:54 PM, Frederik Ramm<frederik at remote.org> wrote:
> Hi,
>
> Frederik Ramm wrote:
>> The result file should have been something like 400 bytes. This sounds
>> trivial but in the original case where the .osc contained a large number
>> of these characters, I suddenly had 2 MB of data in one tag.
>
> I forgot to mention: I'm posting this here on dev and not on the osmosis
> list because it seems that other (at least Java) programs are also
> affected; someone fixed the node later with a commit comment of "JOSM
> says string too long" or so...

The code points for these Gothic characters are fine. See the
following (awesome) site:

http://decodeunicode.org/en/gothic

A rough transliteration is HEJSPANOA. However, they lie outside the
Basic Multilingual Plane (BMP) and can't be represented by a single
16-bit integer. Java's char is a 16-bit unit (originally UCS-2), so any
code that assumes one char is one character goes horribly wrong.
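
To make that concrete, here is a minimal sketch of how a code point
above U+FFFF looks from inside Java. The class and method names are made
up for illustration (they are not from Osmosis or JOSM); the calls
themselves are standard java.lang methods available since J2SE 5.0.

```java
// Hypothetical helper class, for illustration only.
final class SupplementaryDemo {
    // Build a String holding the single code point U+10337.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Number of 16-bit char values (UTF-16 code units) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Hex dump of the individual char values, e.g. "D800 DF37".
    static String hexUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }
}
```

For fromCodePoint(0x10337), codeUnits reports 2 while codePoints
reports 1, and hexUnits shows the surrogate pair D800 DF37. Anything
that counts or slices by char will see two "characters" there.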

IANAJavaProgrammer, but there's lots of very relevant stuff on
http://en.wikipedia.org/wiki/UTF-16 with the following choice quotes:

"UCS-2 (2-byte Universal Character Set) is an obsolete character
encoding which is a predecessor to UTF-16. The UCS-2 encoding form is
identical to that of UTF-16, except that it does not support surrogate
pairs and therefore can only encode characters in the BMP range U+0000
through U+FFFF."

NB: U+10337 is outside that range

"Java used UCS-2 initially, and added UTF-16 supplementary character
support in J2SE 5.0. Note that several widely-used String methods can
still create and return unpaired surrogates; e.g. any code written
assuming that substring is always safe, or that charAt returns a
unicode character, may give rise to bugs[4][5]."
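
The substring hazard in that quote can be sketched like this. I have
not looked at what Osmosis or JOSM actually do, so treat this purely as
an assumption about the kind of bug involved; the class and method names
are invented for the example.

```java
// Hypothetical helper class, for illustration only.
final class SurrogateSafety {
    // Naive truncation by char count; may cut a surrogate pair in half.
    static String naiveTruncate(String s, int maxChars) {
        return s.length() <= maxChars ? s : s.substring(0, maxChars);
    }

    // Truncation that backs off by one char if the cut would land
    // between a high surrogate and its low surrogate.
    static String safeTruncate(String s, int maxChars) {
        if (s.length() <= maxChars) {
            return s;
        }
        int end = maxChars;
        if (end > 0
                && Character.isHighSurrogate(s.charAt(end - 1))
                && Character.isLowSurrogate(s.charAt(end))) {
            end--;
        }
        return s.substring(0, end);
    }

    // True if the string contains a surrogate without its partner.
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length()
                        && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a well-formed pair
                } else {
                    return true;
                }
            } else if (Character.isLowSurrogate(c)) {
                return true;
            }
        }
        return false;
    }
}
```

With the input "a" followed by U+10337 (three chars in total),
naiveTruncate(s, 2) leaves a dangling high surrogate, which
hasUnpairedSurrogate then flags, while safeTruncate(s, 2) returns
just "a". A check like hasUnpairedSurrogate is also the sort of thing
a unit test could assert on.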

I'm guessing more unit tests need writing ;-)

Cheers,
Andy

In the meantime of course, let's BAN JOSM!!!!!!11!11 :-)



