[OSM-talk] A large part of London Missing?

Tue Dec 9 13:06:21 GMT 2008

On Mon, Dec 8, 2008 at 5:30 PM, Ed Loach <ed at loach.me.uk> wrote:
> Can anyone work out how to find out what possible UTF8 errors there
> might be? Or would you be able to tell some other way if this were
> the problem?

You should try using decode("UTF-8") instead of decode("utf8") in your
checking routine. UTF-8 and utf8 are not equivalent under Encode: the
former is stricter and rejects data that the latter lets through. See
http://perldoc.perl.org/Encode.html#UTF-8-vs.-utf8-vs.-UTF8

  encode("utf8",  "\x{FFFF_FFFF}", 1); # okay
  encode("UTF-8", "\x{FFFF_FFFF}", 1); # croaks
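
The same applies on the decode side. As a rough sketch of what a
strict check could look like, assuming the data is available as raw
byte strings (the helper name is_strict_utf8 is made up for
illustration; the third argument is Encode's CHECK parameter, where
FB_CROAK is the named form of the 1 used above):

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns 1 if $bytes decodes as strict UTF-8, else 0.
sub is_strict_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC stops decode() from modifying $bytes in place.
    my $ok = eval {
        Encode::decode("UTF-8", $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}
```

Wrapping the call in eval keeps a single bad string from killing the
whole run, so the check can carry on through the rest of the data.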

It would also be useful to log these errors. They suggest invalid
byte sequences in the OSM dataset, and those are best fixed at their
source.
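
One possible way to do the logging, as a sketch only: it assumes the
data can be read line by line from a filehandle opened in raw mode,
and the function name log_utf8_errors is invented for this example.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical logger: reads raw lines from $in and prints one message
# to $log for each line that is not strict UTF-8.
# Returns the number of bad lines found.
sub log_utf8_errors {
    my ($in, $log) = @_;
    my $bad = 0;
    while (my $line = <$in>) {
        my $ok = eval {
            Encode::decode("UTF-8", $line,
                Encode::FB_CROAK() | Encode::LEAVE_SRC());
            1;
        };
        next if $ok;
        $bad++;
        # $@ holds the message Encode croaked with; $. is the input line number.
        my $err = $@ || "unknown decode error\n";
        print {$log} "line $.: $err";
    }
    return $bad;
}
```

Called as log_utf8_errors($in, \*STDERR), with $in opened using the
'<:raw' layer, this would report each offending line number along
with Encode's own error message, which names the bytes it could not
map.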