[OSM-dev] Planet Dump Timings

Martijn van Oosterhout kleptog at gmail.com
Mon Sep 10 07:59:18 BST 2007


On 9/10/07, Brett Henderson <brett at bretth.com> wrote:
> How do you know if something is valid UTF-8?

Well, the official definition is here:
http://en.wikipedia.org/wiki/UTF-8
But if you're a layman just checking whether some text looks right, here are some tips:
Bytes 0x00-0x7F are plain ASCII
Everything else is a lead byte 0xC0-0xFF followed by one or more continuation bytes 0x80-0xBF

Just knowing that was enough to rule out the examples in the original
email. If you want to be tricky then 0xC0-0xDF is followed by one
byte, 0xE0-0xEF by two, 0xF0-0xF7 by three...
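
For what it's worth, here is a rough sketch of that byte-pattern check in
Python. The function names are made up for illustration and this is not
the checker from SVN. It assumes the usual lead-byte ranges (0xC0-0xDF
takes one continuation byte, 0xE0-0xEF two, 0xF0-0xF7 three) and, like
the tips above, it does not catch every invalid sequence (overlong
encodings slip through). For a strict answer it is probably easier to
let Python's own decoder decide, as in the second function.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Rough check: only verifies the lead/continuation byte pattern."""
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        if lead <= 0x7F:
            trailing = 0          # plain ASCII
        elif 0xC0 <= lead <= 0xDF:
            trailing = 1
        elif 0xE0 <= lead <= 0xEF:
            trailing = 2
        elif 0xF0 <= lead <= 0xF7:
            trailing = 3
        else:
            # Stray continuation byte (0x80-0xBF) or a byte in 0xF8-0xFF.
            return False
        chunk = data[i + 1:i + 1 + trailing]
        # Every expected continuation byte must be present and in 0x80-0xBF.
        if len(chunk) != trailing or any(not 0x80 <= b <= 0xBF for b in chunk):
            return False
        i += 1 + trailing
    return True


def is_strict_utf8(data: bytes) -> bool:
    """Strict check: defer to Python's built-in UTF-8 decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A Latin-1 encoded "café" (b"caf\xe9") fails both checks, because 0xE9
announces two continuation bytes that never arrive.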

If you're looking for a program, I see this one in SVN:
applications/utils/planet.osm/python/utf8osmchecker.py
which finds bad UTF-8 in the input and displays it.

Have a nice day,
-- 
Martijn van Oosterhout <kleptog at gmail.com> http://svana.org/kleptog/



