[OSM-dev] control characters in planet.osm
David Earl
david at frankieandshadow.com
Thu Jul 12 13:01:13 BST 2007
On 12/07/2007 12:40, Frederik Ramm wrote:
> Hi,
>
> I'd probably never have noticed since I usually process the
> planet file with regular expressions instead of a proper XML parser.
> I used "osmosis" last week, and its XML parser refused to process way
> 4845936 (highway=secondary, name=Queen ^Street) because the value for
> "name" contained an un-escaped control character (hex 0x13, ASCII 19, here
> depicted as ^S). (It took some time to find out that the ^S was at
> the root of the problem, and it was Brett Henderson who found it.)
>
> The problem is fixed in this week's planet file, however another ^S
> has appeared in way 4827686 (highway=motorway, ref=A30, nat_ref=^SA30).
>
> That problem is also fixed in the database.
>
> Q1: Both ways were created by Potlatch alpha. While having a control
> character in a tag value is not technically invalid, I do not think
> that these were inserted on purpose. Maybe there is something about
> the Potlatch UI that makes people erroneously insert these ^Ses?
>
> Q2: Is it valid XML to have an un-escaped ^S somewhere in the
> attribute CDATA? If yes, then the XML parser used by osmosis should
> be repaired. If no, then the XML exporter writing the planet file
> should be repaired.
>
> Bye
> Frederik
>
According to the XML spec:
<quote>
In the content of elements, character data is any string of characters
which does not contain the start-delimiter of any markup and does not
include the CDATA-section-close delimiter, "]]>". In a CDATA section,
character data is any string of characters not including the
CDATA-section-close delimiter, "]]>".
</quote>
The only predefined escapes are for < > & " and '
So I think your parser is wrong.
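For anyone wanting to find these characters before a parser trips over them, here is a minimal sketch in Python. It assumes the set of legal characters is the one given by the Char production in section 2.2 of the XML 1.0 spec (tab, LF, CR, and 0x20 upward), which is worth checking against the paragraph quoted above. The function name find_control_chars is made up for illustration.

```python
import re

# Characters below 0x20 that the XML 1.0 "Char" production (section 2.2)
# does not list: everything except tab (0x09), LF (0x0A) and CR (0x0D).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_control_chars(text):
    """Return (offset, hex code) for each such control character in text."""
    return [(m.start(), hex(ord(m.group()))) for m in CONTROL_CHARS.finditer(text)]


# Tag values as depicted in the thread, with ^S written as \x13.
print(find_control_chars("Queen \x13treet"))
print(find_control_chars("\x13A30"))
print(find_control_chars("A30"))
```

Run over each tag value (or each line of the planet file), it reports the offset and code of every hit, so the ^S in the street name as depicted in Frederik's mail shows up at offset 6 with code 0x13.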
David