[OSM-dev] planet.osm - fix

David Sheldon dave-osm at earth.li
Tue Aug 15 17:23:56 BST 2006


On Tue, Aug 15, 2006 at 05:44:48PM +0200, Lars Aronsson wrote:
> David Sheldon wrote:
> 
> > XML specifies UTF-8 as the default if no encoding is specified. 
> 
> Correct.
> 
> > I vote for ASCII, and anything that cannot be represented in ASCII being
> > represented by entities in the &#..; form.
> 
> This is a requirement that differs from the XML standard.  If your 
> proposal is accepted, then I cannot use just any 
> standard-conformant XML library that correctly outputs UTF-8

I was thinking that we should strive to generate ASCII with &#..;
entities ourselves, particularly for planet.osm. That way we are as
nice as possible in what we generate.

As for accepting XML uploads, I'm all for accepting any valid XML,
though we will have to define a minimum set of character encodings that
we accept. The XML standard requires that at least UTF-8 and UTF-16 be
supported. Should we leave it at that?

> I must write my own encoding routines that only output ASCII with 
> &#..; entities.

Some XML libraries let you set the character encoding of the output
XML. If you set that to ASCII, they should correctly escape non-ASCII
characters as &#..; entities.
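
As a rough illustration of the library route, here is a minimal sketch
using Python's standard xml.etree.ElementTree. The element and its
attributes are made up for the example and are not taken from a real
planet.osm. Asking the serializer for "us-ascii" output should turn
anything outside ASCII into a numeric &#..; reference.

```python
import xml.etree.ElementTree as ET

# Hypothetical OSM-style tag; "\u00f6" is o-umlaut (code point 246).
tag = ET.Element("tag", {"k": "name", "v": "Malm\u00f6"})

# Request ASCII output; characters that do not fit in ASCII are
# written as numeric character references by the serializer.
ascii_xml = ET.tostring(tag, encoding="us-ascii")

# Any conformant parser reads the references back as the original text.
round_trip = ET.fromstring(ascii_xml)

print(ascii_xml)
print(round_trip.get("v") == "Malm\u00f6")
```

Other libraries have their own way of choosing the output encoding, so
treat this as one example rather than a general recipe.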

Alternatively you could pipe the UTF-8 XML through a filter like 
 http://www.earth.li/~dave/techie/projects/utf8conv.c.txt
to encode all the characters outside the ASCII range. 
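
To show the idea of such a filter, here is a minimal Python sketch. It
is not a port of the C program linked above, and the function name
utf8_to_ascii_xml is made up for the example. It decodes the input as
UTF-8 and re-encodes it as ASCII, using Python's built-in
"xmlcharrefreplace" error handler to write every non-ASCII character as
a &#..; reference.

```python
def utf8_to_ascii_xml(data: bytes) -> bytes:
    """Re-encode UTF-8 XML bytes as pure ASCII.

    Characters outside the ASCII range become decimal numeric
    character references (for example U+00F6 becomes &#246;).
    """
    text = data.decode("utf-8")
    return text.encode("ascii", errors="xmlcharrefreplace")


# Illustrative input only, not real planet.osm data.
sample = '<tag k="name" v="Malm\u00f6"/>'.encode("utf-8")
converted = utf8_to_ascii_xml(sample)
print(converted)
```

This works purely on the text level, so it is only safe when the
non-ASCII characters sit in attribute values or element content. XML
does not recognise character references inside element or attribute
names, comments or CDATA sections.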

> I vote for standard XML.

I'm happy with that, but it seems that a lot of people here don't know
what that means.

David



