<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

  <title></title>

</head>

<body bgcolor="#ffffff" text="#000000">

Andy Allan wrote:

<blockquote

 cite="mid:c4193f8c0908140831w11d9597bi3f63ad0b686baae5@mail.gmail.com"

 type="cite">

  <pre wrap="">On Fri, Aug 14, 2009 at 3:16 PM, Andy Allan<a class="moz-txt-link-rfc2396E" href="mailto:gravitystorm@gmail.com">&lt;gravitystorm@gmail.com&gt;</a> wrote:

  </pre>

  <blockquote type="cite">

    <pre wrap="">On Fri, Aug 14, 2009 at 12:54 PM, Frederik Ramm<a class="moz-txt-link-rfc2396E" href="mailto:frederik@remote.org">&lt;frederik@remote.org&gt;</a> wrote:

    </pre>

    <blockquote type="cite">

      <pre wrap="">Hi,

Frederik Ramm wrote:

      </pre>

      <blockquote type="cite">

        <pre wrap="">The result file should have been something like 400 bytes. This sounds

trivial but in the original case where the .osc contained a large number

of these characters, I suddenly had 2 MB of data in one tag.

        </pre>

      </blockquote>

      <pre wrap="">I forgot to mention: I'm posting this here on dev and not on the osmosis

list because it seems that other (at least Java) programs are also

affected; someone fixed the node later with a commit comment of "JOSM

says string too long" or so...

      </pre>

    </blockquote>

    <pre wrap="">The code points for these gothic characters are fine. See the

following (awesome) site:

<a class="moz-txt-link-freetext" href="http://decodeunicode.org/en/gothic">http://decodeunicode.org/en/gothic</a>

A rough transliteration is HEJSPANOA. However, they lie outside the

Basic Multilingual Plane (BMP) and can't be represented by a single

16-bit integer. Java stores strings internally as 16-bit UTF-16 code

units, so each of these characters takes a surrogate pair, and that

seems to be where everything is going horribly wrong.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

Installing an SMP-aware font shows what JOSM is doing more easily than

reading Unicode code-points.

<a class="moz-txt-link-freetext" href="http://code2000.net/code2001.htm">http://code2000.net/code2001.htm</a>

I'll keep my (horrid) transliterations going here for the sake of everyone else.

v31 - HEJSPANOA

v32 - HHEHEJHEJSHEJSPHEJSPAHEJSPANHEJSPANOHEJSPANOA

i.e. the first letter, the first two letters, the first three letters etc.

I can see how you can quickly end up with a 2MB tag using this encoding scheme!

Cheers,

Andy

  </pre>

</blockquote>

Thanks for all this.  These Unicode problems are the bane of my

existence :-)  Any help is much appreciated.<br>

<br>

I've run some experiments using the Unicode character U+10330 (0x10330):

I create a test file containing it and then copy the file through

osmosis.  A tag containing a single instance of that character copies

okay, but a tag containing several of them comes out with extra copies

of the character.<br>

<br>

If I create a tag with a single 0x10330 character it gets copied

correctly.<br>

If I surround the character with normal Latin characters it copies

correctly.<br>

If I put two 0x10330 characters in the tag, three get written to the output.<br>

If I put three 0x10330 characters in the tag, six get written to the output.<br>

If I separate the 0x10330 characters with Latin characters so that they

are no longer consecutive, they still get duplicated.<br>
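<br>
The counts above (1 in, 1 out; 2 in, 3 out; 3 in, 6 out) are what you would get if every prefix of the value were written out in turn, which is also the pattern in Andy's v32 transliteration.  Here is a minimal Java sketch of that arithmetic.  It only models the observed output for consecutive characters; it is not the parser's code, and the class and method names are made up for illustration.<br>
<pre wrap="">
```java
class PrefixAccumulation {
    // Concatenate every prefix of the input, stepping one code point
    // (not one char) at a time so surrogate pairs stay intact.
    static String accumulatePrefixes(String input) {
        StringBuilder out = new StringBuilder();
        int end = 0;
        while (end != input.length()) {
            end = input.offsetByCodePoints(end, 1);
            out.append(input, 0, end);
        }
        return out.toString();
    }

    // Number of Unicode code points, counting a surrogate pair once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String ahsa = "\uD800\uDF30"; // U+10330 as a UTF-16 surrogate pair
        String input = "";
        for (int n = 1; n != 4; n++) {
            input += ahsa;
            System.out.println(n + " in, "
                    + codePoints(accumulatePrefixes(input)) + " out");
        }
    }
}
```
</pre>
For n consecutive characters this model gives n * (n + 1) / 2 characters of output, so the size grows quadratically with the length of the tag.<br>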

<br>

I've run this under a debugger and it seems that the data gets

duplicated during input, not output.  My ElementWriter class may have

some issues with surrogate pairs, but it appears that it isn't the

source of this problem.<br>

<br>

I've opened the file directly using a UTF-8 input stream under a

debugger and the characters are read in correctly there as well.<br>
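<br>
For anyone wanting to repeat that check without a debugger, something like the following should do.  It is a minimal sketch that assumes the input is UTF-8; the class and method names are made up, and it reads from whatever stream you hand it rather than from a real osmosis file.<br>
<pre wrap="">
```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class Utf8ReadCheck {
    // Decode the whole stream as UTF-8 into a String.
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Code points decoded from the stream; a surrogate pair counts once.
    static int countCodePoints(InputStream in) {
        String text = readAll(in);
        return text.codePointCount(0, text.length());
    }
}
```
</pre>
Three U+10330 characters are 12 bytes in UTF-8 (F0 90 8C B0 each) and should come back as 3 code points held in 6 Java chars.<br>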

<br>

I've tried using the osmosis --fast-read-xml-0.6 task and the problem

goes away.  This alternative XML reading task uses the Woodstox StAX

XML parser.<br>

<br>

So to summarise, it seems that the standard Java XML parser (based on

Apache Xerces, I believe) is somehow introducing surrogate pair

duplication when multiple surrogate pairs are involved.  I don't know

if this is a bug or a problem in how we're using it.  I'm always

hesitant to assume bugs in the Java runtime but it seems like there

might be one here.<br>
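<br>
To narrow this down outside osmosis, a stripped-down probe along these lines might help.  It is a sketch, not osmosis code: it uses the JDK's SAX API (SAXParserFactory and DefaultHandler), and the class name and the choice to look only at the v attribute of the first tag element are assumptions made for illustration.<br>
<pre wrap="">
```java
import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxTagValueProbe {
    // Return the v attribute of the first tag element,
    // exactly as the SAX parser reports it to the handler.
    static String firstTagValue(InputStream in) {
        final String[] value = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (value[0] == null) {
                    if ("tag".equals(qName)) {
                        value[0] = atts.getValue("v");
                    }
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return value[0];
    }

    // Number of Unicode code points, counting a surrogate pair once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }
}
```
</pre>
Feeding it files with one, two and three 0x10330 characters in a tag value and printing codePoints(firstTagValue(...)) would show whether the extra copies are already present in what the parser hands to the handler.<br>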

<br>

Options:<br>

1. Try to find the source of the problem in the Java XML parser or our

use of it.<br>

2. Switch over to the Woodstox StAX XML parser which isn't exhibiting

the problem.<br>

<br>

Given that Woodstox StAX parsing gives roughly a 20% performance

improvement, it might be a good time to implement option 2.<br>
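<br>
Reading a tag value through the standard StAX API (javax.xml.stream) looks roughly like this.  It is a sketch only: which StAX implementation XMLInputFactory.newInstance() returns depends on what is on the classpath, so Woodstox should only be used if its jar is present.  The class name is made up, and this is not how the --fast-read-xml-0.6 task is actually written.<br>
<pre wrap="">
```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxTagValueProbe {
    // Return the v attribute of the first tag element, or null if none.
    static String firstTagValue(InputStream in) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(in);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        if ("tag".equals(reader.getLocalName())) {
                            // A null namespace means the namespace is not checked.
                            return reader.getAttributeValue(null, "v");
                        }
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```
</pre>
Because it pulls events instead of receiving callbacks, the loop can stop as soon as it has the value it wants.<br>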

<br>

Brett<br>

<br>

</body>

</html>