[Talk-in] Automating OSM translation into Indic languages
safincrazy at gmail.com
Sat Apr 4 18:24:16 UTC 2015
Long email warning.
I've been thinking a little bit about automating the translation of maps
into multiple Indic languages ever since I saw the Kannada map at geoBLR in
I started some work on it today, and I have lots of interesting things to
report. Right now I am mostly transliterating as opposed to translating but
if a dictionary of common words/tags can be compiled, upgrading the script
to translate instead of transliterating should be doable.
Here's the algorithm I followed:
1. Get the nodes within a bounding box from OSM using the Python wrapper
for Overpass - overpy
<http://python-overpy.readthedocs.org/en/latest/example.html> - This
returns a collection of nodes and associated ID, tags, lat, lon and other
attributes. This can also be repeated for ways by using the corresponding
2. Filter nodes that have tags
3. From the result of the filter, identify nodes with Indic language
tags - e.g. ["name:kn"]
4. Transliterate the string value for tag["name:kn"] to another language
- I used Tamil - and store it within tag["name:ta"] - I used the Indic
transliterator <http://silpa.org.in/Transliteration> APIs from SILPA
<http://transliteration.readthedocs.org/en/latest/> for this
5. Create a new changeset and upload the result (the node with
tag["name:ta"]) to OSM using osmapi
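Steps 2 to 5 could be sketched roughly as below. This is only an illustration under a few assumptions, not the script itself: nodes are plain dicts with "id" and "tags" keys (built, for instance, from overpy's node.id and node.tags), `transliterate` is any callable that maps a string to the target script (a wrapper around the SILPA transliterator, say), and `api` is an osmapi.OsmApi instance or anything with the same NodeGet, ChangesetCreate, NodeUpdate and ChangesetClose methods. The function names are made up for this example. Reading the node back through the API before updating also yields the current version number that the update call needs.

```python
def nodes_to_transliterate(nodes, source_key="name:kn", target_key="name:ta"):
    """Steps 2-3: keep nodes that carry the source tag but lack the target tag.

    `nodes` is assumed to be a list of dicts like {"id": 1, "tags": {...}}.
    """
    return [
        node for node in nodes
        if source_key in node["tags"] and target_key not in node["tags"]
    ]


def add_transliterated_tag(api, node_id, transliterate,
                           source_key="name:kn", target_key="name:ta"):
    """Steps 4-5: transliterate one tag and upload the node in a new changeset.

    `api` is assumed to behave like osmapi.OsmApi; `transliterate` is a
    hypothetical callable taking the source string and returning the
    target-script string.
    """
    # NodeGet returns a dict with id, lat, lon, tag and the current version.
    node = api.NodeGet(node_id)
    tags = dict(node["tag"])
    tags[target_key] = transliterate(tags[source_key])

    api.ChangesetCreate(
        {"comment": "Add %s transliterated from %s" % (target_key, source_key)}
    )
    try:
        # The version sent here must match the version on the server.
        return api.NodeUpdate({
            "id": node["id"],
            "lat": node["lat"],
            "lon": node["lon"],
            "tag": tags,
            "version": node["version"],
        })
    finally:
        # Close the changeset whether or not the update succeeded.
        api.ChangesetClose()
```

Passing the API object and the transliteration function in as arguments keeps the logic easy to try out with stand-ins before pointing it at the live server.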
I did it only for one node:
- *Indic to Indic transliterations - ✓* The Indic transliterator APIs
seem to convert quite effortlessly from one Indic language to another.
Right now, support is available for Hindi, Tamil, Punjabi, Gujarati,
Malayalam, Oriya, Bengali and Kannada. So, if a Kannada tag exists in OSM,
the same text can be transliterated into multiple Indic languages using the
naive algorithm I described above.
- *English to Indic transliterations - X*: Though the Indic
Transliterator works for English To Indic transliterations as well, it is
not very useful. This is because only English words that are in the CMU
dictionary are capable of being transliterated - which means that we can't
transliterate "Raajaajeenagar", even if we had a custom tag for
transliteration on OSM. On emailing the developer
<http://thottingal.in/blog/about/> of the transliterator about extending
the capabilities of English transliteration, I was told that extending the
dictionary by adding words is one option. I am not sure how feasible this
is, or whether it would work better than translating to one Indic language
and transliterating+translating to the rest.
- *Translations of English Words - X* - Right now, I am only able to
transliterate words, but if a list of common words (I am guessing all the
OSM tags, and other common words) could be compiled and translated into
all the Indic languages, the translation process could be automated quite
easily. This would require the algorithm to have 2 additional steps:
1. From an Indic tag (i.e., an already translated tag), we would have to
identify portions that are in the translations list, and leave them out of
the transliteration process.
2. For the word(s) identified in step 1, we must find a translation
in the translations list for the language we are translating into. This
must then be suffixed or prefixed with the transliterated portion. I am
guessing suffix will be the norm, while prefixes might occasionally be
- *Tracking node version numbers - X* - Right now, I am unable to read
the version attribute of a node using the overpy API, so I entered the
version number manually. Not sure if I am missing something. This is just a
"need-to-figure-out" issue more than anything, but it is very important for
automatically uploading a node to the server: if there's a mismatch
between the version number being passed to the API and the version number
on the server, the API rejects the update.
- *Which Indic language to begin transliterating from* - Issues might
arise if a language like Tamil, where the letter for ka, kha, ga, gha etc.
is the same, is used as the source for transliterating to Hindi. But if we
start from a language like Kannada or Hindi, this issue can probably be
avoided.
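The two extra translation steps described above could look something like the sketch below. It is only an illustration: the translations list is a tiny made-up dict with a single entry (the Kannada and Tamil words for "road"), `transliterate` is a hypothetical callable taking a word and a target language code, and the function name is invented for this example. It also assumes the name can be split on spaces and that word order stays the same in the target language, so a suffix in the source stays a suffix in the result.

```python
# Hypothetical translations list: source word -> {target language: translation}.
# Only one illustrative entry; a real list would cover common OSM words.
TRANSLATIONS = {
    "ರಸ್ತೆ": {"ta": "சாலை"},  # "road" in Kannada -> Tamil
}


def convert_name(name, target_lang, transliterate, translations=TRANSLATIONS):
    """Translate known words and transliterate everything else.

    `transliterate(word, target_lang)` is assumed to return the word written
    in the target script. Word order is kept as it is in the source name.
    """
    converted = []
    for word in name.split():
        entry = translations.get(word, {})
        if target_lang in entry:
            # Step 2: the word is in the translations list, so use the
            # translation and skip transliteration for it (step 1).
            converted.append(entry[target_lang])
        else:
            converted.append(transliterate(word, target_lang))
    return " ".join(converted)
```

For a two-word name whose second word is ರಸ್ತೆ, only the first word would be handed to the transliterator, and சாலை would be appended after it.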
The script is on GitHub
Feel free to fork it, use it, work on it, edit it and suggest changes,
different languages, other possibilities, alternatives etc. Pull requests
are very welcome. :)
This is my first time writing code in Python, so advice on improving code
would be very welcome. Also, let me know if I'm missing something else,
obvious or subtle.