[OSM-dev] invalid UTF8 (was RE: Optimal free compression algorithm for OSM XML data)
david at frankieandshadow.com
Wed May 16 15:22:43 BST 2007
David Earl wrote:
> One thing that would also speed up processing planet files is to correct
> UTF-8 so the need for the utf8 sanitizer goes away. There's only a small
> number of errors. Of course, this requires changes to be checked and
> rejected so others don't get reintroduced - maybe this happens already,
> I don't know. A small change to sanitize would turn it into a checker.
> I will volunteer to locate and correct all the problem entries if someone
> else would put in utf8 validation on input.
From today's planet file I have located all 12 elements with UTF8 errors
(some had more than one) and, I believe, fixed them. 9 were airports with
incorrect accents in the names or is_in's, all done by the same person. Two
were non-UTF8 German ß's in ...straße and one was several supposed u-umlauts
in a note applied to way 4431539(*).
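For anyone who wants to repeat the search, here is a minimal sketch of a checker in Python. It is not the existing utf8 sanitizer, and the function name find_invalid_utf8_lines is made up for illustration. It assumes the planet file is read in binary mode, one line at a time, which is safe because the newline byte never occurs inside a multi-byte UTF-8 sequence.

```python
def find_invalid_utf8_lines(lines):
    """Yield (line_number, byte_offset, raw_line) for each line that is not valid UTF-8.

    `lines` is any iterable of bytes objects, e.g. a file opened with mode "rb".
    Hypothetical helper for illustration, not part of any OSM tool.
    """
    for number, raw in enumerate(lines, start=1):
        try:
            raw.decode("utf-8")  # strict decoding raises on the first bad byte
        except UnicodeDecodeError as err:
            yield number, err.start, raw


# Two sample lines: the first encodes the sharp s as UTF-8 (C3 9F),
# the second uses the single Latin-1 byte DF, which is not valid UTF-8 here.
sample = [
    b'<tag k="name" v="Stra\xc3\x9fe"/>\n',
    b'<tag k="name" v="Stra\xdfe"/>\n',
]
problems = list(find_invalid_utf8_lines(sample))
for number, offset, raw in problems:
    print(f"line {number}, byte {offset}: {raw!r}")
```

On a real planet file the same function could be fed an open file object, e.g. `with open(path, "rb") as f: ...`, where path is whatever local file you have.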
So how can we stop more coming in? There must be somewhere where the XML is
received that it can reject the request if it doesn't contain a valid XML
file (it's already rejected based on structure, so why not also on invalid UTF8)?
I will check next week's planet as well.
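To make the suggestion concrete, here is a rough sketch of what a check on input might look like, written in Python for illustration only. I don't know how the server actually handles uploads, so the function name check_upload and its return convention are assumptions, as is the requirement that every upload be UTF-8.

```python
import xml.etree.ElementTree as ET


def check_upload(body: bytes):
    """Return (True, None) if body is valid UTF-8 and well-formed XML,
    otherwise (False, reason). Hypothetical helper, not the real server code.
    """
    # Step 1: strict UTF-8 validation of the raw bytes.
    try:
        body.decode("utf-8", errors="strict")
    except UnicodeDecodeError as err:
        return False, f"invalid UTF-8 at byte {err.start}"
    # Step 2: the structural check that is described as already happening.
    try:
        ET.fromstring(body)
    except ET.ParseError as err:
        return False, f"malformed XML: {err}"
    return True, None


good = '<osm><node id="1"><tag k="name" v="Stra\u00dfe"/></node></osm>'.encode("utf-8")
# Same document with the sharp s replaced by the lone Latin-1 byte DF.
bad = good.replace("\u00df".encode("utf-8"), b"\xdf")

print(check_upload(good))
print(check_upload(bad))
```

Doing the UTF-8 check first means the rejection message can point at the offending byte, which should make it easier for the uploader to find and fix the tag.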
(*) This last way is a curious way because it has no segments. It is
described as cycleway Schollbrunn and the note (now) says "Schollbrunn -
Forsthaus Sylvan - Altenbuch - Sandacker - Kartause Grünau - Schneidmühle -
Nickelsmühle - Schollbrunn: 17 km". Anyone recognise it? It's pretty
pointless on its own, but also hard to delete graphically.