<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

Brett Henderson wrote:

<blockquote cite="mid:4A87BBE1.2080103@bretth.com" type="cite">


Thanks for all this.  These unicode problems are the bane of my

existence :-)  Any help is much appreciated.<br>

  <br>

I've run some experiments.  I've been using the unicode character

0x10330 and experimenting with creating a test file then copying it via

osmosis.  It seems that if I create a tag containing a single instance

of that character I can copy it okay.  But when I create a tag with

multiple 0x10330 characters it starts to get duplicated.<br>

</blockquote>

If anybody wishes to repeat my tests, I used the following Java

snippet.<br>

<br>

        int unicodeInput;<br>

        StringBuilder builder;<br>

        <br>

        unicodeInput = 0x10330;<br>

        <br>

        builder = new StringBuilder();<br>

        //builder.append("prefix");<br>

        for (int i = 0; i &lt; 3; i++) {<br>

            builder.append("x");<br>

            builder.appendCodePoint(unicodeInput);<br>

            builder.append("x");<br>

        }<br>

        //builder.append("suffix");<br>

        <br>

        XmlWriter xmlWriter = new XmlWriter(new File("bh-test.osm"),

CompressionMethod.None);<br>

        Node node;<br>

        node = new Node(1, 2, new Date(), OsmUser.NONE, 3, 4, 5);<br>

        node.getTags().add(new Tag("test", builder.toString()));<br>

        xmlWriter.process(new NodeContainer(node));<br>

        xmlWriter.complete();<br>

        xmlWriter.release();<br>

        <br>

        FileInputStream iStream = new FileInputStream("bh-test.osm");<br>

        InputStreamReader reader = new InputStreamReader(iStream,

"UTF-8");<br>

        BufferedReader bufferedReader = new BufferedReader(reader);<br>

        for (String line = bufferedReader.readLine(); line != null;

line = bufferedReader.readLine()) {<br>

            System.out.println(line);<br>

        }<br>

        bufferedReader.close();<br>

<br>

<br>

The snippet first builds a string containing a Unicode character that

requires a surrogate pair when represented in UTF-16.  It then creates

an XmlWriter (the Osmosis class implementing the --write-xml/--wx task)

and writes out a very basic OSM file called bh-test.osm containing a

single node with one tag whose value is the previously created

string.  Finally it reads the file back and prints it to stdout.<br>
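<br>

To make the surrogate pair point concrete, here is a minimal sketch using only the standard library. The class name SurrogatePairDemo and the helper buildTestValue are made up for illustration and are not part of Osmosis. It shows that 0x10330 takes two UTF-16 code units, so the test value has more chars than code points.<br>

```java
class SurrogatePairDemo {
    // Hypothetical helper: builds "x" + codePoint + "x" the given number of
    // times, mirroring the loop in the test snippet.
    static String buildTestValue(int codePoint, int repeats) {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i != repeats; i++) {
            builder.append("x");
            builder.appendCodePoint(codePoint);
            builder.append("x");
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        int codePoint = 0x10330;
        String value = buildTestValue(codePoint, 3);

        // true: the code point lies outside the Basic Multilingual Plane.
        System.out.println(Character.isSupplementaryCodePoint(codePoint));
        // 2: it needs a high and a low surrogate in UTF-16.
        System.out.println(Character.charCount(codePoint));
        // 12 UTF-16 code units, but only 9 code points.
        System.out.println(value.length());
        System.out.println(value.codePointCount(0, value.length()));
    }
}
```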

<br>

The console will probably show '?' characters in place of the test

character (at least on Windows), but under a debugger you can verify

that the characters are being read back correctly as UTF-16.<br>
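<br>

If a debugger is not handy, one alternative is to print the UTF-16 code units as hex, which does not depend on the console encoding. This is only a sketch. The class CodeUnitDump and its toHex method are hypothetical names, not something from Osmosis.<br>

```java
class CodeUnitDump {
    // Formats each UTF-16 code unit of the text as four hex digits,
    // separated by single spaces.
    static String toHex(String text) {
        StringBuilder out = new StringBuilder();
        for (char unit : text.toCharArray()) {
            if (out.length() != 0) {
                out.append(' ');
            }
            out.append(String.format("%04X", (int) unit));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String value = new StringBuilder("x")
                .appendCodePoint(0x10330)
                .append("x")
                .toString();
        // prints: 0078 D800 DF30 0078
        System.out.println(toHex(value));
    }
}
```

Calling toHex(line) on each line read back from bh-test.osm should show one D800 DF30 pair per test character if the file was read correctly.<br>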

<br>

I then ran the file through the normal Osmosis application to copy it

into a new file.  This triggered the bug.<br>

osmosis --rx bh-test.osm --wx bh-test-out.osm<br>
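<br>

One rough way to confirm the duplication is to count occurrences of the test character in the input and output files. The sketch below assumes both files are UTF-8 and that the character is written literally rather than as a numeric character reference, which I have not verified for every writer. CodePointCounter and its methods are hypothetical names.<br>

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class CodePointCounter {
    // Counts how many times the given code point occurs in the text.
    static long count(String text, int codePoint) {
        return text.codePoints().filter(c -> c == codePoint).count();
    }

    // Reads the whole file as UTF-8 (an assumption) and counts occurrences.
    static long countInFile(String fileName, int codePoint) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(fileName));
        return count(new String(bytes, StandardCharsets.UTF_8), codePoint);
    }

    public static void main(String[] args) throws IOException {
        int codePoint = 0x10330;
        // The two counts should match if the copy preserved the tag value.
        System.out.println("input:  " + countInFile("bh-test.osm", codePoint));
        System.out.println("output: " + countInFile("bh-test-out.osm", codePoint));
    }
}
```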

<br>

I was able to avoid the bug by reading the file with the Woodstox StAX XML parser instead.<br>

osmosis --fast-read-xml-0.6 bh-test.osm --wx bh-test-out.osm<br>

<br>

</body>

</html>