[OSM-dev] Odd data in daily diffs (potlatch related?)
jburgess777 at googlemail.com
Sat Mar 29 13:20:03 GMT 2008
On Sat, 2008-03-29 at 12:41 +0100, Frederik Ramm wrote:
> > In the file daily-20080326-20080327.osc.bz2 there is this relation:
> > <relation id="8571" timestamp="2008-03-26T22:05:03Z" user="wiesel111">
> > <tag k="ESCESC" v=""/>
> > <tag k="created_by" v="Potlatch 0.8"/>
> > <tag k="type" v=""/>
> > </relation>
> > Those are real escapes "\x1b". Fetching via the API doesn't have them,
> > the osmosis XML parser is barfing on them. Looks like some mismatch
> > between the output and input of osmosis here.
> Seems to be two problems in one, first: how did the key get in there
> in the first place, second: why does it not get exported in a way that
> Osmosis can read.
> I was hoping to fix the diff by simply running "recode" on it and
> instructing it to ignore invalid characters, however I was surprised
> to see that recode converted the file from UTF8 to UTF16 without
> complaint (and back again to give an identical file). - Would running
> one of the many existing "UTF8 sanitizers" have resolved the problem?
Character 27 is valid UTF-8, but is not valid as content within an XML
document, so a UTF-8 sanitizer would not have caught it.
More details and some Java code which might be useful for Osmosis:
I dumped the same data myself with the planet dump tools and it produces
the same invalid output. I have added a line into the planet dump code
to replace this with a ?.
Now that I have found the links above I should perhaps add an even
stricter test to drop everything < 32 except for 9, 10 & 13 (tab, line
feed and carriage return).
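
For illustration, here is a minimal Java sketch of such a filter. It is not
the planet dump code (which is not shown in this thread) and not anything
taken from Osmosis; the class and method names are hypothetical. It tests
each code point against the Char production of XML 1.0, which allows 9, 10
and 13 below 32 and also excludes surrogates and U+FFFE/U+FFFF, so it is
slightly stricter than the "< 32" rule described above. One method replaces
invalid characters with a placeholder such as "?", the other drops them.

```java
// Hypothetical helper, for illustration only.
public final class XmlCharFilter {
    private XmlCharFilter() {}

    /** True if the code point is allowed by the XML 1.0 Char production. */
    public static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replace every code point that XML 1.0 forbids with the given char. */
    public static String replaceInvalid(String s, char replacement) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(replacement);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    /** Drop every code point that XML 1.0 forbids. */
    public static String dropInvalid(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The tag key from relation 8571: two ESC characters (decimal 27).
        String key = "\u001b\u001b";
        System.out.println(replaceInvalid(key, '?'));   // prints "??"
        System.out.println(dropInvalid("a\u001bb"));    // prints "ab"
    }
}
```

Replacing keeps the length of the value and leaves a visible trace that
something was wrong; dropping gives cleaner output but can turn a key into
an empty string, as it would for the relation above.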