[OSM-dev] Odd data in daily diffs (potlatch related?)

Mon Mar 31 11:26:48 BST 2008

Osmosis uses the built-in Java SAX parser, which reads data directly 
from an InputStream, so there is no simple place to check and sanitise 
the data before it is processed.  It may be possible to write my 
own FilterInputStream that sits between the underlying data stream 
and the SAX parser and "fixes" things as it encounters them, but it isn't 
a two minute hack.
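
For illustration only, a rough sketch of what such a filter could look 
like.  The class name is made up, nothing like this exists in osmosis 
today, and it assumes the stream is UTF-8 (where every byte below 0x80 
is a complete ASCII character, so filtering byte by byte cannot corrupt 
a multi-byte sequence).  It replaces control bytes below 32, other than 
tab, LF and CR, with '?':

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical filter (not part of osmosis): replaces control bytes that
 * XML 1.0 forbids with '?'. Assumes the wrapped stream is UTF-8 encoded.
 */
final class XmlControlCharFilterInputStream extends FilterInputStream {

    XmlControlCharFilterInputStream(InputStream in) {
        super(in);
    }

    /** True for values 0..31 except tab (9), LF (10) and CR (13). */
    static boolean isIllegalControl(int b) {
        return b >= 0 && b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0D;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();            // -1 at end of stream passes through
        return isIllegalControl(b) ? '?' : b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // When n is -1 (end of stream) the loop body never runs.
        for (int i = off; i < off + n; i++) {
            // Bytes >= 0x80 are negative as Java bytes and are left alone.
            if (isIllegalControl(buf[i])) {
                buf[i] = '?';
            }
        }
        return n;
    }
}
```

The wrapped stream would then be handed to the SAX parser in place of 
the raw one.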

I'm hesitant to add too many hacks like this for something which 
shouldn't be in there in the first place.  I've gone through enough pain 
working around the db encoding issues :-)  I especially hate the thought 
of silently fixing data, because that removes any incentive to fix it at the 
source.  It would be "nicer" if the API rejected (i.e. did not silently 
fix) data like this during upload, which would make people fix any tools 
introducing these problems.  I see that is already being discussed in 
other emails, so hopefully this problem is already going away ...
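
To make "reject rather than fix" concrete, here is a minimal sketch of 
a strict check, written in Java purely for illustration (this is not 
the API's actual code, and the class and method names are invented). 
It tests each code point against the Char production of the XML 1.0 
spec and throws instead of repairing:

```java
/** Hypothetical validator: rejects strings that XML 1.0 cannot carry. */
final class XmlCharValidator {

    /** Char production from XML 1.0: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Throws on the first illegal code point; never modifies the value. */
    static void requireLegalXml(String value) {
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            if (!isLegalXmlChar(cp)) {
                throw new IllegalArgumentException(
                        String.format("illegal XML character U+%04X at index %d", cp, i));
            }
            i += Character.charCount(cp);   // step over surrogate pairs correctly
        }
    }
}
```

A tag key or value that fails the check would be refused outright, so 
the editor that produced it gets an error instead of a quiet repair.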

Anyway, yell if anybody violently disagrees and thinks there is a bug 
that needs fixing in osmosis.

Jon Burgess wrote:
> On Sat, 2008-03-29 at 12:41 +0100, Frederik Ramm wrote:
>   
>> Hi,
>>
>>     
>>> In the file daily-20080326-20080327.osc.bz2 there is this relation:
>>>
>>>     <relation id="8571" timestamp="2008-03-26T22:05:03Z" user="wiesel111">
>>>       <tag k="ESCESC" v=""/>
>>>       <tag k="created_by" v="Potlatch 0.8"/>
>>>       <tag k="type" v=""/>
>>>     </relation>
>>>
>>> Those are real escapes "\x1b". Fetching via the API doesn't have them,
>>> the osmosis XML parser is barfing on them. Looks like some mismatch
>>> between the output and input of osmosis here.
>>>       
>> Seems to be two problems in one, first: how did the key get in there
>> in the first place, second: why does it not get exported in a way that
>> Osmosis can read.
>>
>> I was hoping to fix the diff by simply running "recode" on it and
>> instructing it to ignore invalid characters, however I was surprised
>> to see that recode converted the file from UTF8 to UTF16 without
>> complaint (and back again to give an identical file). - Would running
>> one of the many existing "UTF8 sanitizers" have resolved the problem?
>>     
>
> Character 27 is valid UTF-8, but is not valid as content within an XML
> document: http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char
>
> More details and some Java code which might be useful for Osmosis:
> http://cse-mjmcl.cse.bris.ac.uk/blog/2007/02/14/1171465494443.html
>
>
> I dumped the same data myself with the planet dump tools and it produces
> the same invalid output. I have added a line into the planet dump code
> to replace this with a ?. 
>
> Now that I have found the links above I should perhaps add an even
> stricter test to drop everything < 32 except for 9, 10 & 13.
>
> 	Jon