[OSM-dev] way 27483626 UTF-8 truncation

Brett Henderson brett at bretth.com
Sat Oct 4 01:15:54 BST 2008

Florian Lohoff wrote:
> On Fri, Oct 03, 2008 at 01:36:31PM +0100, Matt Amos wrote:
>> Subject: [OSM-dev] way 27483626 UTF-8 truncation
>> i just noticed that the hourly change file
>> 2008100310-2008100311.osc.gz has an invalid UTF-8 string in the note
>> tag for way 27483626 (
>> http://www.openstreetmap.org/browse/way/27483626/history ). i have
>> truncated it to the nearest word, so this email is just to give
>> forewarning that hourly or daily diff imports today might have a bit
>> of trouble.
>> its the same problem as discussed here
>> http://lists.openstreetmap.org/pipermail/dev/2008-August/011525.html
> Another 2 change files contain utf-8 bugs and osmosis refuses to process
> them:
> 200810031022-200810031023.osc
> 200810031023-200810031024.osc
Any idea which nodes or ways are broken in these?

This isn't an osmosis bug.  The database now has incorrect/corrupted tag 
data in the history tables that needs to be corrected.  Following the URL:


results in random results from the API.

If we can identify the broken records we can ask TomH nicely to fix 
them.  I can then move osmosis backwards in time to re-generate the 
affected time period.  I don't know how this broken data gets created in 
the first place.  There was some discussion about this the last time it 
happened; I'll have to try to dig up the emails.

It's not simple to fix osmosis to prevent this occurring.  Osmosis is 
reading doubly encoded data from the database and removing the double 
encoding as it writes to the xml file.  It's a hack and there is no 
simple way of verifying the data before it gets written to the file.  I 
have a local process running at home verifying the output which has 
detected the problem, but I was asleep at the time it occurred :-)
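
As a rough illustration of why this is hard to catch, here is a minimal 
Python sketch (not osmosis code).  It assumes the double encoding is UTF-8 
bytes misread as Latin-1 characters, and that a stored value gets cut at a 
character boundary somewhere along the way.  Both are assumptions for the 
sake of the example, since the real cause is still unknown.  The function 
names and the sample value are hypothetical.

```python
def remove_double_encoding(stored: str) -> bytes:
    """Undo the assumed double encoding.

    Assumption: each character in the stored string stands for one byte of
    the original UTF-8 data (i.e. the bytes were misread as Latin-1).
    """
    return stored.encode("latin-1")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical tag value containing one two-byte UTF-8 character.
original = "Stra\u00dfe"

# What the database would hold under the double-encoding assumption:
# 7 characters, because the two bytes of the sharp s became two characters.
stored = original.encode("utf-8").decode("latin-1")

# Intact value: removing the double encoding gives back valid UTF-8.
intact = remove_double_encoding(stored)

# Value cut after 5 characters: the cut lands in the middle of what used to
# be a single multi-byte character, so the output is no longer valid UTF-8.
broken = remove_double_encoding(stored[:5])

print(is_valid_utf8(intact))
print(is_valid_utf8(broken))
```

A check like is_valid_utf8 can only run on the bytes after the double 
encoding has been removed, which is roughly what my verification process at 
home does on the finished output.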
