[Talk-in] Automating OSM translation into Indic languages

I Chengappa imchengappa at gmail.com
Sat Apr 4 20:21:17 UTC 2015


Hi Aruna


There's a lot of interesting material in your post. But first, can you
clarify whether you are seeking to add the transliterations to the OSM
database or to add them when rendering? It is worth keeping in mind this
'guideline' - [
http://wiki.openstreetmap.org/wiki/Names#Avoid_transliteration], which I
take to mean 'don't add a transliteration if a machine can transliterate
on the fly for you'. (I've been adding transliterations myself in ISO 15919,
despite this.)

 Regards, indigomc (I. M. Chengappa)

On 4 April 2015 at 19:24, Aruna S <safincrazy at gmail.com> wrote:

> Hello!
>
> Long email warning.
>
> I've been thinking a little bit about automating the translation of maps
> into multiple Indic languages ever since I saw the Kannada map at geoBLR in
> March.
>
> I started some work on it today, and I have lots of interesting things to
> report. Right now I am mostly transliterating as opposed to translating but
> if a dictionary of common words/tags can be compiled, upgrading the script
> to translate instead of transliterating should be doable.
>
> Here's the algorithm I followed:
>
>    1. Get the nodes within a bounding box from OSM using the python
>    wrapper for Overpass - overpy
>    <http://python-overpy.readthedocs.org/en/latest/example.html> - This
>    returns a collection of nodes and associated ID, tags, lat, lon and other
>    attributes. This can also be repeated for ways by using the corresponding
>    overpy query.
>    2. Filter nodes that have tags.
>    3. From the result of the filter, identify nodes with Indic language
>    tags, e.g. ["name:kn"].
>    4. Transliterate the string value of tag["name:kn"] into another
>    language (I used Tamil) and store it in tag["name:ta"]. I used the Indic
>    transliterator <http://silpa.org.in/Transliteration> APIs from SILPA
>    <http://transliteration.readthedocs.org/en/latest/> for this.
>    5. Create a new changeset and upload the result (the node with
>    tag["name:ta"]) to OSM using osmapi
>    <http://osmapi.divshot.io/#OsmApi.OsmApi.NodeUpdate>
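
A minimal sketch of steps 2 to 4 above, in Python since that is what the script is written in. It assumes the nodes have already been fetched and flattened into plain dicts with "id" and "tags" keys, and that `transliterate` is whatever callable wraps the transliteration engine. Those two assumptions, the function name `plan_transliterations`, and the rule of skipping nodes that already carry the target tag are inventions of this sketch, not part of the script described above. Fetching from Overpass and uploading the changeset are left out.

```python
from typing import Callable, Dict, Iterable, List


def plan_transliterations(
    nodes: Iterable[Dict],
    source_lang: str,
    target_lang: str,
    transliterate: Callable[[str, str], str],
) -> List[Dict]:
    """Return the nodes that would gain a name:<target_lang> tag.

    Each node is assumed to look like {"id": 123, "tags": {"name:kn": "..."}}.
    `transliterate(text, target_lang)` is a hypothetical stand-in for the
    real transliteration call.
    """
    source_key = f"name:{source_lang}"
    target_key = f"name:{target_lang}"
    updates = []
    for node in nodes:
        tags = node.get("tags") or {}
        # Steps 2 and 3: keep only nodes that carry the source-language name.
        if source_key not in tags:
            continue
        # Assumption of this sketch: never overwrite a name that already exists.
        if target_key in tags:
            continue
        # Step 4: copy the tags and add the transliterated name.
        new_tags = dict(tags)
        new_tags[target_key] = transliterate(tags[source_key], target_lang)
        updates.append({"id": node["id"], "tags": new_tags})
    return updates
```

The returned list is only a plan: nothing is uploaded, so it can be reviewed by hand before step 5 is attempted.
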
>
> I did it only for one node:
> https://www.openstreetmap.org/edit?node=1118255762#map=19/12.99451/77.55430
>
>
> *Advantages*
>
>    - *Indic to Indic transliterations - ✓*: The Indic transliterator APIs
>    seem to convert quite effortlessly from one Indic language to another.
>    Right now, support is available for Hindi, Tamil, Punjabi, Gujarati,
>    Malayalam, Oriya, Bengali and Kannada. So, if a Kannada tag exists in OSM,
>    the same text can be transliterated into multiple Indic languages using the
>    naive algorithm I described above.
>
> *Limitations*
>
>    - *English to Indic transliterations - X*: Though the Indic
>    Transliterator works for English to Indic transliterations as well, it is
>    not very useful. This is because only English words that are in the CMU
>    dictionary can be transliterated, which means that we can't
>    transliterate "Raajaajeenagar", even if we had a custom tag for
>    transliteration on OSM. When I emailed the developer
>    <http://thottingal.in/blog/about/> of the transliterator about
>    extending the capabilities of English transliteration, I was told that
>    extending the dictionary by adding more words is one option. I am not
>    sure how feasible this is, or whether it is better than
>    translating to one Indic language and transliterating+translating to the
>    rest.
>    - *Translations of English Words - X*: Right now, I am only able to
>    transliterate words, but if a list of common words (I am guessing all the
>    OSM tags, and other common words) could be compiled and translated into
>    all the Indic languages, the translation process could be automated quite
>    easily. This would require the algorithm to have 2 additional steps:
>
>
>       1. From an Indic tag (i.e., an already translated tag), we would have to
>       identify portions that are in the translations list, and leave them out of
>       the transliteration process.
>       2. For the word(s) identified in step 1, we must find a translation
>       in the translations list for the language we are translating into. This
>       must then be suffixed or prefixed to the transliterated portion. I am
>       guessing suffix will be the norm, while prefixes might occasionally be
>       necessary.
>
>
>    - *Tracking node version numbers - X*: Right now, I am unable to
>    read the version attribute of a node using the overpy API, so I entered
>    the version number manually. Not sure if I am missing something. This is
>    just a "need-to-figure-out" issue more than anything. Tracking the version
>    is very important for automatically uploading a node to the server, because
>    if there's a mismatch between the version number passed to the API and
>    the version number on the server, the upload will fail.
>    - *Which Indic language to begin transliterating from*: Issues might
>    arise if a language like Tamil, where the letter for ka, kha, ga, gha etc.
>    is the same, is used as the source for transliterating to, say, Hindi. But
>    if we start from a language like Kannada or Hindi, this issue can probably
>    be avoided easily.
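
The script-choice issue in the last item can be seen from the Unicode code charts alone. The sketch below is not how the SILPA transliterator works. It only relies on the Indic Unicode blocks sharing a parallel layout, and checks which of ka, kha, ga and gha have a letter of their own in each script. The helper name `assigned_letters` and the offset approach are illustrative assumptions.

```python
import unicodedata

# Start of each script's Unicode block. The Indic blocks share a parallel
# layout, so a given consonant sits at the same offset in each block.
BLOCK_START = {"Devanagari": 0x0900, "Tamil": 0x0B80, "Kannada": 0x0C80}

# Offsets of the consonants KA, KHA, GA and GHA within a block.
VELAR_OFFSETS = {"ka": 0x15, "kha": 0x16, "ga": 0x17, "gha": 0x18}


def assigned_letters(script):
    """Return the sounds in VELAR_OFFSETS that have their own letter in `script`."""
    found = []
    for sound, offset in VELAR_OFFSETS.items():
        char = chr(BLOCK_START[script] + offset)
        # unicodedata.name returns the default (None) for unassigned code points.
        if unicodedata.name(char, None) is not None:
            found.append(sound)
    return found


for script in BLOCK_START:
    print(script, assigned_letters(script))
```

Starting from Tamil, one letter has to stand in for all four sounds, so the original consonant cannot be recovered from the text alone. Starting from Kannada or Devanagari, the distinction is present in the source and is simply merged on the way to Tamil.
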
>
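
The two extra steps proposed for translating common words could look roughly like this. The function name `convert_name`, the shape of the translations list (a plain dict from source word to target word) and the word-by-word splitting on spaces are all assumptions of this sketch. `transliterate` stands in for the real transliteration call.

```python
from typing import Callable, Dict


def convert_name(
    name: str,
    translations: Dict[str, str],
    transliterate: Callable[[str], str],
) -> str:
    """Translate words found in `translations`; transliterate everything else."""
    converted = []
    for word in name.split():
        if word in translations:
            # Steps 1 and 2: a known common word is kept out of the
            # transliteration and replaced by its entry in the list.
            converted.append(translations[word])
        else:
            converted.append(transliterate(word))
    # Word order is preserved, so a translated suffix stays a suffix.
    return " ".join(converted)
```

Because word order is kept as-is, a target language that needs the translated word as a prefix would need extra handling, and multi-word entries in the translations list are not matched.
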
> The script is on GitHub
> <https://github.com/anura28/Automate-Translations-OSM/blob/master/automateIndicTranslation.py>.
> Feel free to fork it, use it, work on it, edit it and suggest changes,
> different languages, other possibilities, alternatives etc. Pull requests
> very welcome. :)
>
> This is my first time writing code in Python, so advice on improving code
> would be very welcome. Also, let me know if I'm missing something else,
> obvious or subtle.
>
> Thanks!
>
> Warmly,
>
> Aruna
>
> _______________________________________________
> Talk-in mailing list
> Talk-in at openstreetmap.org
> https://lists.openstreetmap.org/listinfo/talk-in
>
>