[OSM-talk] street names on several languages

Mon Oct 1 11:56:50 BST 2007

On 01/10/2007, D Tucny <d at tucny.com> wrote:
> On 01/10/2007, Dave Stubbs <osm.list at randomjunk.co.uk> wrote:
>
> > On 01/10/2007, Tapio Sokura <oh2kku at iki.fi> wrote:
> > > Dave Stubbs wrote:
> > > > For an example go see the node for London
> > > > ( http://www.openstreetmap.org/api/0.4/node/107775)
> > > > It has:
> > > > name=London   (the default English)
> > > > name:fr=Londres   (French)
> > > > name:cy=Llundain   (Welsh)
> > >
> > > Some might even say that a complete version of that tagging would also
> > > include "name:en=London", because nothing is explicitly defining what
> > > language the default name is in. At least not until we figure out how to
> > > tag defaults based on an area. You can always guess, but computers are
> > > quite bad at guessing things.
> >
> > Yeah, missing out the :en works fine as long as you only have one
> > language preference, ie: give me the English, and then fall back to
> > the default. But it does break somewhat if you have a list of
> > languages, ie: give me Welsh, then English, then French, before
> > falling back to the default.
> >
> > Maybe a language:name=en tag -- just to avoid the redundancy of adding
> > the name twice, with all the update inconsistencies that could result
> > in. Or just keep cut'n'pasting.
> >
> > >
> > > About the two letter language codes, should we think about moving on to
> > > ISO 639-2 or -3 three letter language codes? Because the 639-1 list
> > > doesn't cover the languages spoken in the world that well, especially
> > > closely related and smaller ones.
> > >
> >
> > If you stick enough of them in the database, by the time anyone
> > develops useful applications for these tags they might feel the need
> > to add support ;-)
> >
> > For those languages where there is a 2-letter code, there isn't much
> > harm in using them -- they don't actually prevent the 3-letter codes
> > being used.
>
> Even the 3-letter codes don't cover enough...
>
> Staying with the example of London...
>
> English: London
> Simplified Chinese: 伦敦
> Traditional Chinese: 倫敦
> Mandarin pinyin romanisation: Lun dun
> Mandarin pinyin romanisation with tones: Lún dūn
> Mandarin pinyin romanisation with numeric tones: Lun2 dun1
> Cantonese Yale romanisation: Leun deun
> Cantonese Yale romanisation with tones: Lèuhn dēun
> Cantonese Yale romanisation with numeric tones: Leun4 deun1
> Cantonese Jyutping romanisation: Leon deon
> Cantonese Jyutping romanisation with numeric tones: Leon4 deon1
>
>
> Those last 10 only have zh as a 2 character language code and zho/chi as the
> 3 character options...
>
> And these all have value...
>

When this first came up I suggested using the W3C language tags:
http://www.w3.org/International/articles/language-tags/

these basically use language-script-region, suppressing the parts that
aren't needed.
(the language part can be a 2- or 3-letter code, optionally followed by
a 3-letter 639-3 style code, as in zh-cmn)
so:
London: en
London: en-GB (obviously not much point here...)
伦敦: zh-Hans
倫敦: zh-Hant
Lún dūn: zh-cmn-Latn  (zh-cmn, or zho-cmn is mandarin)
Lèuhn dēun: zh-yue-Latn (zh-yue, or zho-yue is cantonese)
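
As a rough illustration of the language-script-region idea, here is a
minimal Python sketch that splits such a tag into its parts purely by
the shape of each subtag. The function name split_language_tag is made
up for this example, it does no validation against any registry, and it
ignores variants, extensions and private-use subtags, so treat it as a
sketch rather than an implementation of the spec.

```python
def split_language_tag(tag):
    """Split a tag such as 'zh-cmn-Latn' into parts, judging by subtag shape only.

    Hypothetical helper: nothing is checked against the ISO lists or the
    IANA registry, and anything after the region is ignored.
    """
    subtags = tag.split("-")
    parts = {
        "language": subtags[0].lower(),
        "extlang": None,   # e.g. 'cmn' in 'zh-cmn'
        "script": None,    # e.g. 'Hans', 'Latn'
        "region": None,    # e.g. 'GB'
    }
    for sub in subtags[1:]:
        nothing_later = parts["script"] is None and parts["region"] is None
        if len(sub) == 3 and sub.isalpha() and nothing_later and parts["extlang"] is None:
            parts["extlang"] = sub.lower()
        elif len(sub) == 4 and sub.isalpha() and nothing_later:
            parts["script"] = sub.title()
        elif parts["region"] is None and (
            (len(sub) == 2 and sub.isalpha()) or (len(sub) == 3 and sub.isdigit())
        ):
            parts["region"] = sub.upper()
        else:
            break  # variant, extension or private use: not handled here
    return parts


for example in ("en", "en-GB", "zh-Hans", "zh-cmn-Latn"):
    print(example, split_language_tag(example))
```

Note that nothing in the split tag says which romanisation (pinyin, Yale,
Jyutping) was used.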

there is still some oddness there, and you can't easily distinguish
between the different processes of romanisation... and the IANA
registry thing doesn't quite mesh with that idea either... and then
you have to figure out how much of the ISO codes to pay attention to...

... then using them becomes a little tricky as you have to basically
implement the RFC unless someone has written a reference library by
now.
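
For what it's worth, a simple consumer may not need the whole RFC. Below
is a minimal Python sketch of picking a name from a list of preferred
language tags, as in the Welsh, then English, then French case discussed
earlier in the thread. It assumes names are stored under name:<tag> keys
(name:zh-Hans here is a hypothetical key, not something in the database),
and the pick_name function is invented for this example. Each preference
is tried as given, then shortened one subtag at a time, before falling
back to the plain name tag.

```python
def pick_name(tags, preferences):
    """Return the best name for an ordered list of preferred language tags.

    Hypothetical helper: `tags` is a plain dict of OSM-style key/value
    pairs. Matching is case-insensitive, and each preference is truncated
    from the right ('zh-Hans-CN' -> 'zh-Hans' -> 'zh') until a key matches.
    """
    by_lang = {}
    for key, value in tags.items():
        if key.startswith("name:"):
            by_lang[key[len("name:"):].lower()] = value

    for wanted in preferences:
        candidate = wanted.lower()
        while candidate:
            if candidate in by_lang:
                return by_lang[candidate]
            # Drop the last subtag; yields '' once only one subtag is left.
            candidate = candidate.rpartition("-")[0]

    return tags.get("name")  # untagged default, language unknown


london = {"name": "London", "name:fr": "Londres", "name:cy": "Llundain"}

print(pick_name(london, ["cy", "en", "fr"]))  # Llundain
print(pick_name(london, ["en", "fr"]))        # Londres
print(pick_name(london, ["de"]))              # London
```

The second call shows the problem with leaving out name:en: English is
asked for first, but since the default name has no language attached,
the French name wins before the default is ever considered.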