[OSM-dev] Odd data in daily diffs (potlatch related?)
brett at bretth.com
Mon Mar 31 11:26:48 BST 2008
Osmosis uses the built-in Java SAX parser, which reads data directly
from an InputStream, so there is no simple place to check and sanitise
data before it is processed. It may be possible to write my own
FilterInputStream that sits between the underlying data stream and the
SAX parser and "fixes" things as it encounters them, but it isn't a
two-minute hack.
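For illustration only, a minimal sketch of such a filter might look like
the following. The class name ControlByteFilterInputStream is made up
here and is not part of Osmosis. It assumes the input is UTF-8, where
every byte below 0x80 is a standalone ASCII character, so control bytes
can be dropped without damaging multi-byte sequences. It keeps tab, LF
and CR and drops every other byte below 0x20. Note that it drops them
silently; a real version would probably want to at least count or log
what it removes.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical filter, not part of Osmosis: drops bytes below 0x20 other
 * than tab (9), line feed (10) and carriage return (13).
 * Assumes a UTF-8 stream, where bytes below 0x80 never occur inside a
 * multi-byte sequence.
 */
class ControlByteFilterInputStream extends FilterInputStream {

    ControlByteFilterInputStream(InputStream in) {
        super(in);
    }

    private static boolean isAllowed(int b) {
        return b >= 0x20 || b == 0x09 || b == 0x0A || b == 0x0D;
    }

    @Override
    public int read() throws IOException {
        int b;
        do {
            b = super.read();
        } while (b != -1 && !isAllowed(b));
        return b; // -1 at end of stream, otherwise an allowed byte
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = super.read(buf, off, len);
            if (n <= 0) {
                return n; // end of stream
            }
            // Compact the allowed bytes towards the start of the chunk.
            int kept = 0;
            for (int i = 0; i < n; i++) {
                byte b = buf[off + i];
                if (isAllowed(b & 0xFF)) {
                    buf[off + kept++] = b;
                }
            }
            if (kept > 0) {
                return kept;
            }
            // The whole chunk was filtered out; read again rather than
            // returning 0, which callers could misinterpret.
        }
    }
}
```

It would be wired in by wrapping the stream before handing it to the
parser, e.g. parser.parse(new ControlByteFilterInputStream(in), handler)
with a javax.xml.parsers.SAXParser.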
I'm hesitant to add too many hacks like this for something which
shouldn't be in there in the first place. I've gone through enough pain
working around the db encoding issues :-) I especially hate the thought
of silently fixing data, since that removes any incentive to fix it at
the source. It would be "nicer" if the API rejected (i.e. did not
silently fix) data like this during upload; this would make people fix
any tools introducing these problems. I see that is already being
discussed in other emails, so hopefully this problem is already going
away ...
Anyway, yell if anybody violently disagrees and thinks there is a bug
that needs fixing in osmosis.
Jon Burgess wrote:
> On Sat, 2008-03-29 at 12:41 +0100, Frederik Ramm wrote:
>>> In the file daily-20080326-20080327.osc.bz2 there is this relation:
>>> <relation id="8571" timestamp="2008-03-26T22:05:03Z" user="wiesel111">
>>> <tag k="ESCESC" v=""/>
>>> <tag k="created_by" v="Potlatch 0.8"/>
>>> <tag k="type" v=""/>
>>> Those are real escapes "\x1d". Fetching via the API doesn't have them,
>>> the osmosis XML parser is barfing on them. Looks like some mismatch
>>> between the output and input of osmosis here.
>> Seems to be two problems in one, first: how did the key get in there
>> in the first place, second: why does it not get exported in a way that
>> Osmosis can read.
>> I was hoping to fix the diff by simply running "recode" on it and
>> instructing it to ignore invalid characters, however I was surprised
>> to see that recode converted the file from UTF8 to UTF16 without
>> complaint (and back again to give an identical file). - Would running
>> one of the many existing "UTF8 sanitizers" have resolved the problem?
> Character 27 is valid UTF-8, but is not valid as content within an XML
> document: http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char
> More details and some Java code which might be useful for Osmosis:
> I dumped the same data myself with the planet dump tools and it produces
> the same invalid output. I have added a line into the planet dump code
> to replace this with a ?.
> Now that I have found the links above I should perhaps add an even
> stricter test to drop everything < 32 except for 9, 10 & 13.
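As a rough sketch of the stricter test mentioned above, the Char
production from the XML 1.0 recommendation can be written as a simple
predicate on Unicode code points. The class and method names here
(XmlChars, isValidXmlChar, replaceInvalid) are made up for illustration
and are not taken from Osmosis or the planet dump tools. Replacing
invalid characters with "?" mirrors the planet dump change described
above.

```java
/** Hypothetical helper, not part of Osmosis or the planet dump code. */
final class XmlChars {

    private XmlChars() {
    }

    /**
     * XML 1.0 Char production:
     * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
     */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every code point that is not a valid XML 1.0 Char with '?'. */
    static String replaceInvalid(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.appendCodePoint(isValidXmlChar(cp) ? cp : '?');
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Working on code points rather than bytes also catches the invalid
ranges above 0x7F (unpaired surrogates, U+FFFE and U+FFFF), which a
simple "< 32" byte test would miss.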