[OSM-dev] control characters in planet.osm
Brett Henderson
brett at bretth.com
Fri Jul 13 04:08:26 BST 2007
Richard Fairhurst wrote:
> Frederik Ramm wrote:
>
>
>> Q1: Both ways were created by Potlatch alpha. While having a control
>> character in a tag value is not technically invalid, I do not think
>> that these were inserted on purpose. Maybe there is something about
>> the Potlatch UI that makes people erroneously insert these ^Ses?
>>
>
> Hm. I don't actually know how you'd do that! But evidently you can.
>
> Should I just add something to the Potlatch API that strips out any
> 0-31 characters? (Presumably a Ruby regex should be able to do that
> fairly easily...)
>
> cheers
> Richard
>
I've done some more investigation. It's not a bug in the XML parser,
which is a good result because OSM code is infinitely more open to
patches than the JDK :-)
*** The Proof ;-)
It's not just Java that will have a problem with this 0x13 character;
Microsoft parsers also refuse to parse it:
http://support.microsoft.com/kb/325694
This link shows the valid characters for XML. 0x13 isn't one of them.
http://www.xml.com/axml/target.html#charsets
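As a quick way to check a file against that character list, something like the following might do. It is only a sketch: the function name count_bad_xml_chars is made up for this example, and it assumes the input is UTF-8 (or plain ASCII), where bytes below 0x20 never occur inside a multi-byte character, so counting raw bytes is safe.

```shell
# Hypothetical helper (not part of any OSM tool): count the bytes on stdin
# that XML 1.0 forbids below 0x20, i.e. everything except tab (0x09),
# line feed (0x0A) and carriage return (0x0D).
count_bad_xml_chars() {
    # tr -cd deletes every byte NOT in the set, so only forbidden bytes
    # remain; wc -c then counts them. Octal 023 is the 0x13 in question.
    LC_ALL=C tr -cd '\000-\010\013\014\016-\037' | wc -c
}
```

Running `count_bad_xml_chars < planet-input.osm` should print 0 for a file with no forbidden control characters.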
This link describes how different quote characters should be represented
using Unicode. In HTML a left single quote (‘) can be written as
"&#8216;"; numeric references like that should work in XML too.
http://www.pemberley.com/janeinfo/latin1.html#unicode
*** Potlatch Bug
It seems like this character should never have been entered in Potlatch
anyway. I haven't tried it yet, but what does Ctrl-S do if you're
entering data in Potlatch? Ctrl-S is exactly the 0x13 control character,
so perhaps people are trying to save by pressing that key combination?
Either a modification to the Potlatch API or a modification to Potlatch
itself to prevent this character being entered sounds appropriate. I
wouldn't remove all low characters, though, because some are acceptable
XML (carriage return 0x0D, line feed 0x0A and tab 0x09).
*** Workaround
The dodgy character can be removed from a planet file with the following
command:
sed 's/\x13//g' planet-input.osm > planet-output.osm
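If more than one kind of control character has crept in, a broader filter might be worth having. The following is a sketch rather than a tested tool: strip_bad_xml_chars is a name invented here, and it assumes the planet file is UTF-8 so that deleting bytes below 0x20 cannot damage a multi-byte character. It keeps tab, line feed and carriage return, which XML allows.

```shell
# Hypothetical helper: remove every byte below 0x20 that XML 1.0 forbids,
# leaving tab (011), line feed (012) and carriage return (015) untouched.
strip_bad_xml_chars() {
    # Octal ranges are used because they are understood by POSIX tr,
    # whereas \x13-style hex escapes are not portable.
    LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}

# Example, using the same file names as the sed command above:
# strip_bad_xml_chars < planet-input.osm > planet-output.osm
```

Unlike the sed command this does not depend on GNU-specific escapes, at the cost of being blunter: it drops all of the forbidden low characters, not just 0x13.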