[OSM-talk] address interpolation
David Earl
david at frankieandshadow.com
Fri Oct 2 10:14:19 BST 2009
> The problem is as follows:
>
> You see an interpolation "25a to 25c". How do you know that this means
> "25a, 25b, 25c"? You know by removing the number and then starting with
> the "a" go through code points adding one until you reach "c". Easy.
> This will work for all alphabets that are laid out in alphabetical
> order in Unicode, and they probably all are (but that's an assumption on
> my part :-)
Ouch. Unicode code point order has no meaning in the real world, and only
really works for English (and not even then properly for subtle cases, like
ligatures, not that these would ever be used in this kind of address).
You need to know the lexical ordering, which means you need to know the
language. Sometimes you can guess from the character, and two characters
make it easier than one, but the problem doesn't go away with two - the
"null" variant isn't central to this problem.
There's also a cultural assumption about how you might do this in other
countries. I've no idea how Chinese addresses are formulated normally -
whether they even use digits, and if those digits are the Arabic
numerals - let alone what these exceptional cases might be. But IF you
know it is Chinese and IF the scheme fits, with digits + Chinese
character, then the null case still works (Chinese characters still have
a lexical ordering, I believe it has to do with the number of strokes,
but any relationship to Unicode order is purely coincidental).
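To make the mismatch between code point order and lexical order concrete, here is a small sketch in Python. The helper name codepoint_range is made up for illustration, and Swedish is used only because its alphabet ends z, å, ä, ö while the code points are ä = U+00E4, å = U+00E5, ö = U+00F6. It is not taken from any OSM tool.

```python
def codepoint_range(first: str, last: str) -> list[str]:
    """Naive interpolation: walk the Unicode code points from first to last."""
    return [chr(cp) for cp in range(ord(first), ord(last) + 1)]


# Plain ASCII letters happen to be laid out alphabetically, so this works.
print(codepoint_range("a", "c"))  # ['a', 'b', 'c']

# The last three letters of the Swedish alphabet, in Swedish lexical order.
SWEDISH_TAIL = ["\u00e5", "\u00e4", "\u00f6"]  # å, ä, ö

# Sorting by code point swaps the first two letters.
print(sorted(SWEDISH_TAIL))  # ['ä', 'å', 'ö']

# Walking code points from å to ö skips ä entirely and picks up
# unrelated letters such as æ and ç along the way.
walked = codepoint_range("\u00e5", "\u00f6")
print(len(walked), "\u00e4" in walked)  # 18 False
```

So even within Latin-based alphabets, "add one to the code point" only gives the right answer for unaccented a to z.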
So I'm coming round to the view that alphabetic should explicitly
mean only
n nA nB ... nZ
where you can start and end at any point in the sequence, and not even
try to deal with other characters from other alphabets (not even other
Latin ones). Any other sequence from other cultures needs its own
interpolation style or additional qualifying tag to identify it, just as
we'd tag an email with the encoding.
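A minimal sketch of that restricted scheme in Python, under some assumptions of mine: the function name interpolate_alphabetic is hypothetical, both ends must share the same number, the bare number counts as the first element of the sequence, and lower case is accepted as long as it is used consistently. Anything outside digits plus one unaccented A to Z letter is rejected rather than guessed at.

```python
import re
import string

# Digits followed by at most one unaccented Latin letter, e.g. "25" or "25b".
_HOUSE = re.compile(r"([0-9]+)([A-Za-z]?)")


def interpolate_alphabetic(start: str, end: str) -> list[str]:
    """Expand an interpolation over the fixed sequence n, nA, nB, ... nZ."""
    first, last = _HOUSE.fullmatch(start), _HOUSE.fullmatch(end)
    if first is None or last is None:
        raise ValueError("only digits plus at most one A-Z letter are supported")
    if first.group(1) != last.group(1):
        raise ValueError("both ends must share the same number")

    number = first.group(1)
    suffixes = first.group(2) + last.group(2)
    if suffixes != suffixes.upper() and suffixes != suffixes.lower():
        raise ValueError("mixed-case suffixes are ambiguous")

    # Follow the case used in the input; default to upper case.
    letters = string.ascii_lowercase if suffixes.islower() else string.ascii_uppercase
    sequence = [""] + list(letters)  # "" is the bare number, the null variant

    i, j = sequence.index(first.group(2)), sequence.index(last.group(2))
    if i > j:
        raise ValueError("start must not come after end")
    return [number + suffix for suffix in sequence[i : j + 1]]


print(interpolate_alphabetic("25a", "25c"))  # ['25a', '25b', '25c']
print(interpolate_alphabetic("25", "25C"))   # ['25', '25A', '25B', '25C']
```

Because the sequence is an explicit list rather than a walk over code points, an input such as 25å raises an error instead of producing a wrong range, which leaves room for a separate interpolation style per alphabet.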
David