[OSM-dev] strange Osmosis/XML/...? problem yesterday

Brett Henderson brett at bretth.com
Sun Aug 16 13:27:32 BST 2009


Brett Henderson wrote:
> Thanks for all this.  These unicode problems are the bane of my 
> existence :-)  Any help is much appreciated.
>
> I've run some experiments.  I've been using the unicode character 
> 0x10330 and experimenting with creating a test file then copying it 
> via osmosis.  It seems that if I create a tag containing a single 
> instance of that character I can copy it okay.  But when I create a 
> tag with multiple 0x10330 characters it starts to get duplicated.

If anybody wishes to repeat my tests, I used the following Java code 
snippet.

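        // Needs java.io.File, java.io.FileInputStream,
        // java.io.InputStreamReader, java.io.BufferedReader and
        // java.util.Date, plus the Osmosis XmlWriter, CompressionMethod,
        // Node, Tag, OsmUser and NodeContainer classes.
        int unicodeInput;
        StringBuilder builder;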
       
        unicodeInput = 0x10330;
       
        builder = new StringBuilder();
        //builder.append("prefix");
        for (int i = 0; i < 3; i++) {
            builder.append("x");
            builder.appendCodePoint(unicodeInput);
            builder.append("x");
        }
        //builder.append("suffix");
       
        XmlWriter xmlWriter = new XmlWriter(
                new File("bh-test.osm"), CompressionMethod.None);
        Node node;
        node = new Node(1, 2, new Date(), OsmUser.NONE, 3, 4, 5);
        node.getTags().add(new Tag("test", builder.toString()));
        xmlWriter.process(new NodeContainer(node));
        xmlWriter.complete();
        xmlWriter.release();
       
        FileInputStream iStream = new FileInputStream("bh-test.osm");
        InputStreamReader reader = new InputStreamReader(iStream, "UTF-8");
        BufferedReader bufferedReader = new BufferedReader(reader);
        for (String line = bufferedReader.readLine(); line != null;
                line = bufferedReader.readLine()) {
            System.out.println(line);
        }
        bufferedReader.close();


It first creates a string containing three instances of a Unicode 
character (0x10330) that requires a surrogate pair when represented in 
UTF-16.  It then creates an XmlWriter (the Osmosis class implementing 
the --write-xml/--wx task) and writes out a very basic osm file called 
bh-test.osm containing a single node with one tag whose value is the 
previously created string.  Finally it reads the file back and prints 
it to stdout.
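
For anyone unfamiliar with surrogate pairs, here is a minimal standalone 
sketch (plain JDK, no Osmosis) of what the test string looks like in 
UTF-16.  The class name SurrogateDemo and the helper buildTestValue are 
made up for illustration; the loop mirrors the one in the snippet above.

```java
public class SurrogateDemo {
    /** Builds "x", the code point, "x", three times, as in the test snippet. */
    static String buildTestValue(int codePoint) {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < 3; i++) {
            builder.append("x");
            builder.appendCodePoint(codePoint);
            builder.append("x");
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        int codePoint = 0x10330;

        // Code points above 0xFFFF need two UTF-16 code units.
        char[] units = Character.toChars(codePoint);
        System.out.println(Character.isSupplementaryCodePoint(codePoint)); // true
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]);  // D800 DF30

        // String.length() counts UTF-16 code units, not characters.
        String value = buildTestValue(codePoint);
        System.out.println(value.length());                          // 12
        System.out.println(value.codePointCount(0, value.length())); // 9
    }
}
```

The difference between the two counts (12 code units versus 9 code 
points) is the kind of thing that code handling one char at a time can 
get wrong.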

Stdout will probably show '?' characters (at least on Windows), 
presumably because the console encoding cannot represent the character, 
but under a debugger you can verify that the characters are being read 
in correctly as UTF-16 surrogate pairs.
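
If a debugger is not handy, one alternative is to print the code points 
in hex rather than the characters themselves, which sidesteps the 
console encoding.  This is only a sketch using plain JDK calls; the 
class name CodePointDump and the method describe are hypothetical and 
not part of Osmosis.

```java
public class CodePointDump {
    /** Returns the code points of text as space-separated U+XXXX values. */
    static String describe(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", codePoint));
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "x", U+10330 written as its surrogate pair, "x"
        System.out.println(describe("x\uD800\uDF30x")); // U+0078 U+10330 U+0078
    }
}
```

Calling describe(line) in place of the plain println in the read-back 
loop would show whether each surrogate pair survived the write.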

I then ran the file through the normal Osmosis application to copy it 
into a new file.  This triggered the bug.

        osmosis --rx bh-test.osm --wx bh-test-out.osm

I was able to avoid the bug by using the Woodstox StAX XML parser.

        osmosis --fast-read-xml-0.6 bh-test.osm --wx bh-test-out.osm
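
To compare the input and output files without eyeballing them, a rough 
check is to count how often 0x10330 appears in each file; the test tag 
should contain it exactly three times.  The sketch below assumes the 
files are UTF-8 and small enough to read into memory.  The class name 
DuplicateCheck and the method countCodePoint are made up for 
illustration.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DuplicateCheck {
    /** Counts how many times the target code point occurs in text. */
    static int countCodePoint(String text, int target) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (codePoint == target) {
                count++;
            }
            i += Character.charCount(codePoint);
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java DuplicateCheck bh-test.osm bh-test-out.osm
        for (String fileName : args) {
            byte[] bytes = Files.readAllBytes(Paths.get(fileName));
            String content = new String(bytes, StandardCharsets.UTF_8);
            System.out.println(fileName + ": "
                    + countCodePoint(content, 0x10330));
        }
    }
}
```

A count above three in the output file would indicate the duplication 
described above.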
