[OSM-dev] Escaping special characters when writing tags in OSM files with osm-subset.pl & write.pm

Jon Burgess jburgess at uklinux.net
Fri Nov 10 09:55:40 GMT 2006

On Fri, 2006-11-10 at 08:12 +0100, Joerg Ostertag (OSM Munich/Germany) wrote:
> ...
> > I found and fixed another UTF8 issue with these scripts which
> > happened because the file handle was not set to utf8 mode. They now
> > happily extract large amounts of planet.osm without seeing any UTF8
> > issues (provided the input file is valid UTF8).
> >
> > These changes have been added to SVN.
> This was for writing; can we do the same for reading?
> That could mean we no longer need to call UTF8Sanitize before
> reading, which would save 300MB/planet.osm on my hdd :-)

No, I think the problem is that planet.osm itself contains invalid UTF8. I
believe the XML::Parser routines already do the correct character
conversion when reading (probably based on the encoding given in the
<?xml ...?> declaration, which defaults to UTF8).
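
One way to check whether the input really is the problem is to look for
the first line that is not valid UTF8. The following is only a minimal
sketch using the core Encode module; first_invalid_utf8_line is a
hypothetical helper and is not part of the OSM scripts.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return the number of the first line that is not
# valid UTF-8, or undef if every line decodes cleanly.
sub first_invalid_utf8_line {
    my ($fh) = @_;
    binmode $fh;    # read raw bytes, no decoding layer
    while (my $line = <$fh>) {
        my $copy = $line;    # decode() with a CHECK value may modify its input
        my $ok = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
        return $. unless $ok;
    }
    return undef;
}
```

Splitting on newlines before decoding is safe here because the byte 0x0A
never appears inside a multi-byte UTF8 sequence.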

> > I've not exposed tag2osm from Writer.pm. I'm thinking that the better
> > long term answer is for the writer code to be enhanced to support
> > on-the-fly data output as is done by osm-subset.pl then all the XML
> > writing can be moved over into Writer.pm. This could be used to reduce
> > the memory usage of my simplify.pl code too.
> If I understand you correctly, you want to split the writer into something
> like write_header, write_node, write_segment, write_way, write_end?

Yes, that sounds right. If each routine wrote only a single item, then
there could also be a layer immediately above which would write out a
sequence of items from a hash. The existing write_osm_file() could then
be re-implemented to call each of these functions, preserving support
for the existing API.
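
A rough sketch of what such a split might look like. Only the names
write_header, write_node, write_segment, write_way, write_end and
write_osm_file() come from this thread. Everything else is an assumption
made for illustration: the helpers xml_escape, tags_xml and write_items,
the layout of the data hash, the element and attribute layout, and the
version number. The real Writer.pm interface may well differ.

```perl
use strict;
use warnings;

# Escape XML special characters so tag keys and values are safe in attributes.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;    # must run first so later entities are not re-escaped
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Render a tag hash as <tag .../> children, sorted by key for stable output.
sub tags_xml {
    my ($tags) = @_;
    my $xml = '';
    for my $k (sort keys %$tags) {
        $xml .= sprintf qq{    <tag k="%s" v="%s"/>\n},
            xml_escape($k), xml_escape($tags->{$k});
    }
    return $xml;
}

sub write_header {
    my ($fh) = @_;
    # The version attribute is an assumed value.
    print {$fh} qq{<?xml version="1.0" encoding="UTF-8"?>\n<osm version="0.3">\n};
}

sub write_node {
    my ($fh, $n) = @_;
    printf {$fh} qq{  <node id="%s" lat="%s" lon="%s">\n%s  </node>\n},
        $n->{id}, $n->{lat}, $n->{lon}, tags_xml($n->{tags} || {});
}

sub write_segment {
    my ($fh, $s) = @_;
    printf {$fh} qq{  <segment id="%s" from="%s" to="%s">\n%s  </segment>\n},
        $s->{id}, $s->{from}, $s->{to}, tags_xml($s->{tags} || {});
}

sub write_way {
    my ($fh, $w) = @_;
    my $segs = '';
    $segs .= qq{    <seg id="$_"/>\n} for @{ $w->{segs} || [] };
    printf {$fh} qq{  <way id="%s">\n%s%s  </way>\n},
        $w->{id}, $segs, tags_xml($w->{tags} || {});
}

sub write_end {
    my ($fh) = @_;
    print {$fh} "</osm>\n";
}

# The layer above: write every item of an id-keyed hash with one item writer.
sub write_items {
    my ($fh, $writer, $items) = @_;
    $writer->($fh, $items->{$_}) for sort { $a <=> $b } keys %$items;
}

# The whole-file call rebuilt on top of the single-item routines.
# The ($filename, $data) signature is assumed, not taken from Writer.pm.
sub write_osm_file {
    my ($filename, $data) = @_;
    open(my $fh, '>:encoding(UTF-8)', $filename)
        or die "Cannot write $filename: $!";
    write_header($fh);
    write_items($fh, \&write_node,    $data->{nodes}    || {});
    write_items($fh, \&write_segment, $data->{segments} || {});
    write_items($fh, \&write_way,     $data->{ways}     || {});
    write_end($fh);
    close($fh) or die "Cannot close $filename: $!";
}
```

A streaming script such as osm-subset.pl could call write_header once,
then write_node and friends as items arrive, and write_end at the end,
without ever holding the full data set in memory.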

