[OSM-dev] broken utf8 in minute changeset 200907140650

Ævar Arnfjörð Bjarmason avarab at gmail.com
Tue Jul 14 15:10:52 BST 2009


On Tue, Jul 14, 2009 at 8:42 AM, Jon Burgess<jburgess777 at googlemail.com> wrote:
> I noticed that the diff parsing on the tile server stopped this morning.
> This changeset seems to be the cause:
>
> $ gzip -dc 200907140650-200907140651.osc.gz | xmllint -noout -
> -:36: parser error : invalid character in attribute value
>      <tag k="name" v="▒Meycauayan City Northbound Entry Point"/>
>                       ^
> -:36: parser error : attributes construct error
>      <tag k="name" v="▒Meycauayan City Northbound Entry Point"/>
>                       ^
>
> http://www.openstreetmap.org/browse/node/410383150
> http://www.openstreetmap.org/browse/node/441527354

I filed a bug for this a while ago:

http://trac.openstreetmap.org/ticket/1936

The problem is that:

* Potlatch writes whatever raw byte string the user supplies into the
database, including strings that the main API would reject as an
invalid request, hence the corrupt data
* Other tools that read from the database don't properly escape this
data once it's in the DB

And, as has been pointed out, there's an ambiguity as to which
sequences of bytes may be written to the database: full UTF-8, or
only the subset of it that XML allows.
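
To make that ambiguity concrete, here is a minimal Python sketch of
the two checks involved. It is an illustration only, not code from
Potlatch or the API: the function name classify_tag_value is
hypothetical, and the character ranges are the "Char" production from
the XML 1.0 specification. I have not checked which byte actually
ended up in the changeset above, so the sample inputs are made up.

```python
import re

# Anything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML10_DISALLOWED = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def classify_tag_value(raw: bytes) -> str:
    """Classify a raw byte string (hypothetical helper).

    Returns 'not-utf8' if the bytes do not decode as UTF-8,
    'utf8-but-not-xml' if they decode but contain a character that
    XML 1.0 forbids, and 'ok' otherwise.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf8"
    if _XML10_DISALLOWED.search(text):
        return "utf8-but-not-xml"
    return "ok"


if __name__ == "__main__":
    samples = [
        b"Meycauayan City Northbound Entry Point",      # plain ASCII
        b"\x01Meycauayan City Northbound Entry Point",  # control character
        b"\xffMeycauayan City Northbound Entry Point",  # not UTF-8 at all
    ]
    for sample in samples:
        print(classify_tag_value(sample), sample)
```

A value in the middle category is where the ambiguity bites: it is
perfectly good UTF-8, so a "must be UTF-8" rule would accept it, but
an XML parser such as xmllint will still refuse a diff that contains
it.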



