<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
Brett Henderson wrote:
<blockquote cite="mid:4A87BBE1.2080103@bretth.com" type="cite">
Thanks for all this. These Unicode problems are the bane of my
existence :-) Any help is much appreciated.<br>
<br>
I've run some experiments. I've been using the Unicode character
U+10330 (0x10330) and experimenting with creating a test file then
copying it via osmosis. It seems that if I create a tag containing a
single instance of that character I can copy it okay. But when I create
a tag with multiple 0x10330 characters it starts to get duplicated.<br>
</blockquote>
If anybody wishes to repeat my tests, here is the Java snippet I
used.<br>
<br>
// U+10330 lies outside the Basic Multilingual Plane, so UTF-16 stores it as a surrogate pair.<br>
int unicodeInput = 0x10330;<br>
<br>
// Build the tag value: "x", the character, "x", repeated three times.<br>
StringBuilder builder = new StringBuilder();<br>
//builder.append("prefix");<br>
for (int i = 0; i &lt; 3; i++) {<br>
&nbsp;&nbsp;&nbsp;&nbsp;builder.append("x");<br>
&nbsp;&nbsp;&nbsp;&nbsp;builder.appendCodePoint(unicodeInput);<br>
&nbsp;&nbsp;&nbsp;&nbsp;builder.append("x");<br>
}<br>
//builder.append("suffix");<br>
<br>
// Write a single node carrying that tag to bh-test.osm.<br>
XmlWriter xmlWriter = new XmlWriter(new File("bh-test.osm"),
CompressionMethod.None);<br>
Node node = new Node(1, 2, new Date(), OsmUser.NONE, 3, 4, 5);<br>
node.getTags().add(new Tag("test", builder.toString()));<br>
xmlWriter.process(new NodeContainer(node));<br>
xmlWriter.complete();<br>
xmlWriter.release();<br>
<br>
// Read the file back as UTF-8 and echo each line to stdout.<br>
FileInputStream iStream = new FileInputStream("bh-test.osm");<br>
InputStreamReader reader = new InputStreamReader(iStream,
"UTF-8");<br>
BufferedReader bufferedReader = new BufferedReader(reader);<br>
for (String line = bufferedReader.readLine(); line != null;
line = bufferedReader.readLine()) {<br>
&nbsp;&nbsp;&nbsp;&nbsp;System.out.println(line);<br>
}<br>
// Closing the outermost reader also closes the underlying stream.<br>
bufferedReader.close();<br>
<br>
<br>
The snippet first builds a string containing a Unicode character that
requires a surrogate pair when represented in UTF-16. It then creates
an XmlWriter (the Osmosis class implementing the --write-xml/--wx task)
and writes out a very basic OSM file called bh-test.osm, containing a
single node with one tag whose value is the string just built.
Finally it reads the file back and prints it to stdout.<br>
<br>
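For anyone unfamiliar with surrogate pairs, a minimal standalone sketch
along these lines shows what 0x10330 looks like inside a Java string.
The class name SurrogatePairDemo and the buildValue helper are made up
for illustration and are not part of Osmosis; only standard JDK calls
are used.<br>
<pre>
```java
final class SurrogatePairDemo {
    /** Builds "x", the code point, "x", repeated the given number of times. */
    static String buildValue(int codePoint, int repeats) {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i != repeats; i++) {
            builder.append("x");
            builder.appendCodePoint(codePoint);
            builder.append("x");
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        int codePoint = 0x10330;

        // A supplementary code point is split into two UTF-16 code units.
        char[] units = Character.toChars(codePoint);
        System.out.println("UTF-16 units: " + units.length);
        System.out.println("high surrogate: " + Integer.toHexString(units[0]));
        System.out.println("low surrogate:  " + Integer.toHexString(units[1]));

        // length() counts code units, codePointCount() counts characters.
        String value = buildValue(codePoint, 3);
        System.out.println("length():      " + value.length());
        System.out.println("code points:   " + value.codePointCount(0, value.length()));
    }
}
```
</pre>
With three repetitions the string holds 9 code points but 12 UTF-16
code units, so any code that copies the value one char at a time has
to keep both halves of each pair together.<br>
<br>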
Stdout will probably show '?' characters (at least on Windows),
presumably because the console encoding cannot represent the character,
but under a debugger you can verify that the characters are being read
in correctly as UTF-16.<br>
<br>
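If a debugger isn't handy, one alternative is to print each code point
in U+XXXX form, which is plain ASCII and so survives any console
encoding. This is only a sketch: the CodePointDump class and its
describe method are hypothetical names, not something Osmosis
provides.<br>
<pre>
```java
final class CodePointDump {
    /** Renders every code point of the text as U+XXXX, separated by spaces. */
    static String describe(String text) {
        StringBuilder out = new StringBuilder();
        int index = 0;
        while (index != text.length()) {
            int codePoint = text.codePointAt(index);
            if (out.length() != 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", codePoint));
            // Step over one or two chars depending on the code point.
            index += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = new StringBuilder("x")
                .appendCodePoint(0x10330)
                .append("x")
                .toString();
        System.out.println(describe(sample)); // U+0078 U+10330 U+0078
    }
}
```
</pre>
Calling describe(line) in place of printing the raw line in the
read-back loop would show whether the pair came through as a single
U+10330.<br>
<br>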
I then ran the file through the normal osmosis app to copy it into a
new file. This triggered the bug.<br>
osmosis --rx bh-test.osm --wx bh-test-out.osm<br>
<br>
I was able to avoid the bug by using the Woodstox StAX XML parser.<br>
osmosis --fast-read-xml-0.6 bh-test.osm --wx bh-test-out.osm<br>
<br>
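To compare the two files without eyeballing them, a rough check is to
count how often the character occurs in each. The sketch below assumes
Java 8 or later and assumes the writer emits the character as raw UTF-8
rather than as a numeric character reference (I haven't confirmed
which); CodePointCount and its methods are hypothetical names.<br>
<pre>
```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

final class CodePointCount {
    /** Counts how many times the given code point occurs in the text. */
    static long count(String text, int codePoint) {
        return text.codePoints().filter(cp -> cp == codePoint).count();
    }

    /** Reads the whole file as UTF-8 and counts the code point in it. */
    static long countInFile(String fileName, int codePoint) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(fileName));
        return count(new String(bytes, StandardCharsets.UTF_8), codePoint);
    }

    public static void main(String[] args) throws IOException {
        // File names are the ones used in the commands above.
        System.out.println("input:  " + countInFile("bh-test.osm", 0x10330));
        System.out.println("output: " + countInFile("bh-test-out.osm", 0x10330));
    }
}
```
</pre>
Under those assumptions the input file should report 3, and a larger
number for the output file would point at the duplication.<br>
<br>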
</body>
</html>