[OSM-dev] control characters in planet.osm

Brett Henderson brett at bretth.com
Fri Jul 13 04:08:26 BST 2007


Richard Fairhurst wrote:
> Frederik Ramm wrote:
>
>   
>> Q1: Both ways were created by Potlatch alpha. While having a control
>> character in a tag value is not technically invalid, I do not think
>> that these were inserted on purpose. Maybe there is something about
>> the Potlatch UI that makes people erroneously insert these ^Ses?
>>     
>
> Hm. I don't actually know how you'd do that! But evidently you can.
>
> Should I just add something to the Potlatch API that strips out any  
> 0-31 characters? (Presumably a Ruby regex should be able to do that  
> fairly easily...)
>
> cheers
> Richard
>   
I've done some more investigation.  It's not a bug in the XML parser, 
which is a good result because OSM code is infinitely more open to 
patches than the JDK :-)

*** The Proof ;-)
It's not just Java that has a problem with this 0x13 character; 
Microsoft parsers also refuse to parse it:
http://support.microsoft.com/kb/325694
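
The same rejection can be reproduced outside Java and the Microsoft 
parsers.  A minimal sketch using the expat-based parser in Python's 
standard library; the tag element and the parses() helper are made up 
for illustration and are not taken from a real planet file:

```python
import xml.etree.ElementTree as ET


def parses(xml_text: str) -> bool:
    """Return True if the text is accepted as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


clean = '<tag k="name" v="High Street"/>'
dirty = '<tag k="name" v="High\x13 Street"/>'  # contains a raw 0x13 (Ctrl-S)

print(parses(clean))  # True
print(parses(dirty))  # False
```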

This link shows the valid characters for XML.  0x13 isn't one of them.
http://www.xml.com/axml/target.html#charsets
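
As a rough illustration of that character range, here is a sketch of 
the Char production from the XML 1.0 spec.  The function name 
is_xml_char is my own and does not come from any OSM code:

```python
def is_xml_char(ch: str) -> bool:
    """Return True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x09, 0x0A, 0x0D)        # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


print(is_xml_char("\x13"))  # False: Ctrl-S is not a valid XML character
print(is_xml_char("\t"))    # True: tab is one of the allowed low characters
```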

This link describes how different quote characters should be represented 
using Unicode.  In HTML a left single quotation mark (U+2018) is "‘"; 
I'm not sure if that applies to XML.
http://www.pemberley.com/janeinfo/latin1.html#unicode

*** Potlatch Bug
It seems like this character should never have been entered in Potlatch 
anyway.  I haven't tried it yet, but what does Ctrl-S do if you're 
entering data in Potlatch?  Perhaps people are trying to save by 
pressing the Ctrl-S key combination, which would produce exactly this 
0x13 character?

Either a modification to the Potlatch API, or a modification to Potlatch 
itself to prevent this character being entered, sounds appropriate.  I'm 
not sure you should remove all 0-31 characters, though, because some are 
valid XML: tab 0x09, line feed 0x0A and carriage return 0x0D are the 
only characters below 0x20 that XML 1.0 allows.
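
A sketch of what such a filter could look like.  It is written in Python 
rather than the Ruby that the Potlatch API uses, and the function name 
is hypothetical.  It drops every character below 0x20 except tab, line 
feed and carriage return:

```python
import re

# Characters below 0x20 that XML 1.0 does not allow: everything except
# tab (0x09), line feed (0x0A) and carriage return (0x0D).
_INVALID_LOW = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_invalid_low_chars(value: str) -> str:
    """Remove XML-invalid control characters from a tag value."""
    return _INVALID_LOW.sub("", value)


print(repr(strip_invalid_low_chars("High\x13 Street\tnorth\n")))
# 'High Street\tnorth\n'
```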

*** Workaround
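The dodgy character can be removed from a planet file with the following 
command.  The g flag is needed so that every occurrence on a line is 
removed, not just the first, and the \x13 escape relies on GNU sed:
sed 's/\x13//g' planet-input.osm > planet-output.osm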
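
Where GNU sed isn't available, the same clean-up can be sketched in 
Python.  The strip_byte helper below is hypothetical; it works on bytes 
in fixed-size chunks so a large planet file never has to fit in memory. 
Removing a single byte is safe across chunk boundaries, and 0x13 never 
occurs inside a multi-byte UTF-8 sequence:

```python
import io


def strip_byte(src, dst, bad=b"\x13", chunk_size=1 << 20):
    """Copy src to dst in chunks, dropping every occurrence of the byte `bad`.

    src and dst are binary file objects. Returns the number of bytes removed.
    """
    removed = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        removed += chunk.count(bad)
        dst.write(chunk.replace(bad, b""))
    return removed


# Small in-memory demonstration with a made-up tag line.
src = io.BytesIO(b'<tag k="name" v="High\x13 Street\x13"/>\n')
dst = io.BytesIO()
removed = strip_byte(src, dst)
print(removed, dst.getvalue())
```

For a real file, src and dst would be opened with open(path, "rb") and 
open(path, "wb") respectively.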





