[OSM-dev] broken utf8 in minute changeset 200907140650
Ævar Arnfjörð Bjarmason
avarab at gmail.com
Tue Jul 14 15:10:52 BST 2009
On Tue, Jul 14, 2009 at 8:42 AM, Jon Burgess <jburgess777 at googlemail.com> wrote:
> I noticed that the diff parsing on the tile server stopped this morning.
> This changeset seems to be the cause:
>
> $ gzip -dc 200907140650-200907140651.osc.gz | xmllint -noout -
> -:36: parser error : invalid character in attribute value
> <tag k="name" v="▒Meycauayan City Northbound Entry Point"/>
> ^
> -:36: parser error : attributes construct error
> <tag k="name" v="▒Meycauayan City Northbound Entry Point"/>
> ^
>
> http://www.openstreetmap.org/browse/node/410383150
> http://www.openstreetmap.org/browse/node/441527354
I filed a bug for this a while ago:
http://trac.openstreetmap.org/ticket/1936
There are two problems:
* Potlatch writes whatever raw byte string the user supplies straight
into the database, including strings that the main API would reject
as an invalid request. That is where the corrupt data comes from.
* Other tools that read from the database don't escape or filter this
data properly once it's in the DB.
And, as has been pointed out, it's ambiguous which byte sequences may
be written to the database: any valid UTF-8, or only the subset of it
that XML allows.
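To make the difference between the two concrete, here is a rough Python
sketch. It is not code from Potlatch or the API, and the function names
are made up for illustration. It assumes "the XML subset" means the Char
production of XML 1.0, which excludes most control characters even
though they are perfectly valid UTF-8.

```python
import re

# Anything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML_ILLEGAL = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_valid_utf8(raw: bytes) -> bool:
    """Check 1: do the bytes decode as UTF-8 at all?"""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def is_xml_safe(text: str) -> bool:
    """Check 2: can every character appear in an XML 1.0 document?"""
    return _XML_ILLEGAL.search(text) is None


def strip_xml_illegal(text: str) -> str:
    """One possible clean-up on output: drop characters XML 1.0 forbids."""
    return _XML_ILLEGAL.sub("", text)
```

A byte string such as b"\x01abc" passes the first check but fails the
second, so a writer that only enforces valid UTF-8 can still store
values that break an XML parser downstream. Whether to reject such
values on input or strip them on output is a separate decision that
this sketch doesn't settle.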