[OSM-dev] strange Osmosis/XML/...? problem yesterday
Brett Henderson
brett at bretth.com
Sun Aug 16 13:27:32 BST 2009
Brett Henderson wrote:
> Thanks for all this. These unicode problems are the bane of my
> existence :-) Any help is much appreciated.
>
> I've run some experiments. I've been using the unicode character
> 0x10330 and experimenting with creating a test file then copying it
> via osmosis. It seems that if I create a tag containing a single
> instance of that character I can copy it okay. But when I create a
> tag with multiple 0x10330 characters it starts to get duplicated.
If anybody wishes to repeat my tests, I used the following code snippet
in Java.
    // Build a tag value containing three copies of U+10330, each
    // wrapped in "x".
    int unicodeInput = 0x10330;
    StringBuilder builder = new StringBuilder();
    //builder.append("prefix");
    for (int i = 0; i < 3; i++) {
        builder.append("x");
        builder.appendCodePoint(unicodeInput);
        builder.append("x");
    }
    //builder.append("suffix");

    // Write a single node carrying that tag to bh-test.osm.
    XmlWriter xmlWriter = new XmlWriter(new File("bh-test.osm"),
            CompressionMethod.None);
    Node node = new Node(1, 2, new Date(), OsmUser.NONE, 3, 4, 5);
    node.getTags().add(new Tag("test", builder.toString()));
    xmlWriter.process(new NodeContainer(node));
    xmlWriter.complete();
    xmlWriter.release();

    // Read the file back as UTF-8 and print it line by line.
    FileInputStream iStream = new FileInputStream("bh-test.osm");
    InputStreamReader reader = new InputStreamReader(iStream, "UTF-8");
    BufferedReader bufferedReader = new BufferedReader(reader);
    for (String line = bufferedReader.readLine(); line != null;
            line = bufferedReader.readLine()) {
        System.out.println(line);
    }
    bufferedReader.close();
It first creates a string containing a Unicode character (U+10330) that
requires a surrogate pair when represented in UTF-16. It then creates an
XmlWriter (the Osmosis class implementing the --write-xml/--wx task) and
writes out a very basic osm file called bh-test.osm containing a single
node whose only tag has the previously created string as its value.
Finally it reads the file back and prints it to stdout.
The console will probably show '?' in place of these characters (at
least on Windows), but under a debugger you can verify that they are
being read in correctly as UTF-16 surrogate pairs.
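To make the surrogate pair point concrete, here is a small standalone
sketch using only the standard Java library. The class and method names
are mine, purely for illustration, and are not part of Osmosis. It shows
that U+10330 occupies two UTF-16 code units (two Java chars) while still
counting as one code point.

```java
public class SurrogatePairDemo {
    // Returns the UTF-16 code units of a code point as space-separated hex.
    public static String utf16Units(int codePoint) {
        StringBuilder out = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("%04X", (int) unit));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        int codePoint = 0x10330;
        String s = new StringBuilder().appendCodePoint(codePoint).toString();

        // True for any code point above U+FFFF.
        System.out.println(Character.isSupplementaryCodePoint(codePoint));
        // Number of Java chars (UTF-16 code units) in the string.
        System.out.println(s.length());
        // Number of Unicode code points in the string.
        System.out.println(s.codePointCount(0, s.length()));
        // The high and low surrogate values.
        System.out.println(utf16Units(codePoint));
    }
}
```

For U+10330 the two code units are D800 and DF30, so any code that
handles the string one char at a time sees two values where there is
really one character.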
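If a debugger isn't handy, one way around the console encoding is to
print non-ASCII code points in U+XXXX form instead of printing them raw.
This is only a sketch with a hypothetical helper of my own (it is not in
Osmosis); you could call dump(line) in place of printing each line
directly.

```java
public class CodePointDump {
    // Replaces every code point outside printable ASCII with <U+XXXX>,
    // so the result is safe to print on any console.
    public static String dump(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (cp < 0x20 || cp > 0x7E) {
                out.append(String.format("<U+%04X>", cp));
            } else {
                out.append((char) cp);
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String value = "x" + new String(Character.toChars(0x10330)) + "x";
        System.out.println(dump(value));
    }
}
```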
I then ran the file through the normal osmosis app to copy it into a new
file. This triggered the bug.
osmosis --rx bh-test.osm --wx bh-test-out.osm
I was able to avoid the bug by using the Woodstox StAX XML parser.
osmosis --fast-read-xml-0.6 bh-test.osm --wx bh-test-out.osm
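To compare the input and output files without eyeballing them, something
like the following could count how many times U+10330 appears in each.
It is a rough sketch, not part of Osmosis: the class name is made up,
and it assumes the files are UTF-8 and that the character is written
literally rather than as a numeric character reference. With the test
file above, the input should contain three occurrences, so a larger
count in the output file indicates the duplication.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CodePointCount {
    // Counts occurrences of one code point in a string.
    public static long count(String text, int target) {
        return text.codePoints().filter(cp -> cp == target).count();
    }

    // Usage: java CodePointCount bh-test.osm bh-test-out.osm
    public static void main(String[] args) throws IOException {
        for (String name : args) {
            byte[] bytes = Files.readAllBytes(Paths.get(name));
            String text = new String(bytes, StandardCharsets.UTF_8);
            System.out.println(name + ": " + count(text, 0x10330));
        }
    }
}
```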