[OSM-talk] Looking for "primary language" map

moltonel at gmail.com moltonel at gmail.com
Fri Apr 14 13:26:44 UTC 2017



On 11 April 2017 08:26:14 IST, Rory McCann <rory at technomancy.org> wrote:
>You could try to run the "name" tag through a language detection 
>algorithm and see what comes out. I think Google released one a few 
>years ago: cf. https://github.com/Mimino666/langdetect
>
>Ethnologue has some. But I think it would cost a lot to licence.
>https://www.ethnologue.com/ and is probably much more precise than you
>need.

KDE's Sonnet is another library that springs to mind.


Another approach that might be interesting is to look at nearby objects in OSM. Look for objects with a clearly identifiable language (i.e. the name tag has the same value as exactly one of the object's name:xx tags). If 90% of those identify as 'English', for example, then other objects in the same area whose language is unidentified are probably English too.
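
A rough Python sketch of that idea, assuming each object's tags are already available as a plain dict. The function names and the way neighbours are passed in are made up for illustration; only the "exactly one matching name:xx" rule and the 90% figure come from the description above.

```python
from collections import Counter


def clear_language(tags):
    """Return the language code if `name` equals exactly one name:xx value, else None."""
    name = tags.get("name")
    if not name:
        return None
    # Assumption: any key starting with "name:" is treated as a language variant.
    # Real data also has keys such as name:etymology, which rarely equal `name`.
    matches = [
        key.split(":", 1)[1]
        for key, value in tags.items()
        if key.startswith("name:") and value == name
    ]
    return matches[0] if len(matches) == 1 else None


def guess_from_neighbours(neighbour_tags, threshold=0.9):
    """Return the dominant clearly-tagged language among neighbours, or None."""
    votes = Counter(
        lang for lang in map(clear_language, neighbour_tags) if lang is not None
    )
    if not votes:
        return None
    lang, count = votes.most_common(1)[0]
    # Only trust the result if one language clearly dominates the area.
    return lang if count / sum(votes.values()) >= threshold else None
```

Objects whose name matches several name:xx tags, or none, are simply ignored when voting, so they cannot skew the result.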

To get decent performance, split the world into tiles and work out the dominant clearly-tagged language for each tile. Use that preprocessed data as your language-guessing "shapefile".
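
The preprocessing step could look something like the Python below. It is only a sketch: the fixed-size lat/lon grid, the 1-degree default and all the names are my assumptions, and the input is taken to be (lat, lon, language) triples for objects whose language has already been clearly identified.

```python
import math
from collections import Counter, defaultdict


def tile_of(lat, lon, size_deg=1.0):
    """Map a coordinate to a grid cell; the square lat/lon grid is an assumption."""
    return (math.floor(lat / size_deg), math.floor(lon / size_deg))


def build_tile_index(samples, size_deg=1.0):
    """Build {tile: dominant language} from (lat, lon, lang) triples."""
    votes = defaultdict(Counter)
    for lat, lon, lang in samples:
        votes[tile_of(lat, lon, size_deg)][lang] += 1
    return {tile: counts.most_common(1)[0][0] for tile, counts in votes.items()}


def guess_language(index, lat, lon, size_deg=1.0):
    """Look up the precomputed language for a point; None if the tile has no data."""
    return index.get(tile_of(lat, lon, size_deg))
```

Lookups are then a single dict access per object. Smaller tiles follow language borders more closely but leave more tiles with no clearly-tagged objects at all.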
-- 
Vdp
Sent from a phone.


