[OSM-dev] control characters in planet.osm

David Earl david at frankieandshadow.com
Thu Jul 12 13:01:13 BST 2007


On 12/07/2007 12:40, Frederik Ramm wrote:
> Hi,
> 
>     I'd probably never have noticed since I usually process the  
> planet file with regular expressions instead of a proper XML parser.  
> I used "osmosis" last week, and its XML parser refused to process way  
> 4845936 (highway=secondary, name=Queen ^Street) because the value for
> "name" contained an un-escaped control character (hex 0x13, ASCII 19, here
> depicted as ^S). (It took some time to find out that the ^S was at  
> the root of the problem, and it was Brett Henderson who found it.)
> 
> The problem is fixed in this week's planet file, however another ^S  
> has appeared in way 4827686 (highway=motorway, ref=A30, nat_ref=^SA30).
> 
> That problem is also fixed in the database.
> 
> Q1: Both ways were created by Potlatch alpha. While having a control  
> character in a tag value is not technically invalid, I do not think  
> that these were inserted on purpose. Maybe there is something about  
> the Potlatch UI that makes people erroneously insert these ^Ses?
> 
> Q2: Is it valid XML to have an un-escaped ^S somewhere in the  
> attribute CDATA? If yes, then the XML parser used by osmosis should  
> be repaired. If no, then the XML exporter writing the planet file  
> should be repaired.
> 
> Bye
> Frederik
> 

According to the XML spec:

<quote>
In the content of elements, character data is any string of characters 
which does not contain the start-delimiter of any markup and does not 
include the CDATA-section-close delimiter, "]]>". In a CDATA section, 
character data is any string of characters not including the 
CDATA-section-close delimiter, "]]>".
</quote>

The only predefined escapes are for < & " ' and >

So I think your parser is wrong.
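
For anyone who wants to test this against a real parser rather than the
quoted paragraph alone, here is a minimal sketch in Python. It assumes the
standard library's xml.etree.ElementTree (backed by expat), and the sample
element is a hypothetical reconstruction of the way's name tag, not a copy
from the planet file. One caveat worth checking: the Char production in
section 2.2 of the XML 1.0 spec allows only #x9, #xA and #xD below #x20,
which would make ^S (0x13) illegal anywhere in a document, escaped or not,
so a strict parser may be within its rights to reject it.

```python
import re
import xml.etree.ElementTree as ET

# Code points excluded by the XML 1.0 Char production (spec section 2.2):
# every C0 control except tab, LF and CR, plus U+FFFE and U+FFFF.
_ILLEGAL_XML10 = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def parses(xml_text: str) -> bool:
    """Return True if the standard-library parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def strip_illegal(text: str) -> str:
    """Drop characters that XML 1.0 does not allow in a document."""
    return _ILLEGAL_XML10.sub("", text)


# Hypothetical sample: "\x13" is the ^S (DC3) from the report above.
sample = '<way id="4845936"><tag k="name" v="Queen \x13Street"/></way>'

print(parses(sample))                 # False: expat rejects the raw 0x13
print(parses(strip_illegal(sample)))  # True once the control char is gone
```

Stripping is only one option for an exporter; replacing the character or
rejecting the edit at upload time would be equally plausible choices.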

David



