[OSM-dev] [Geocoding] Nominatim and/or a fault-tolerant geocoder

Darafei "Komяpa" Praliaskouski me at komzpa.net
Tue Nov 29 12:27:13 UTC 2016


Have a look at libpostal for parsing addresses:
https://github.com/openvenues/libpostal
There's also a Postgres extension for it: https://github.com/pramsey/pgsql-postal

On Tue, 29 Nov 2016 at 15:24, Tom <nominatim at tscholz.net> wrote:

> Hi Sarah and Dmitry,
>
> thanks for your responses! I will definitely look into the libpostal
> project later on, as well as some of the geocoders Dmitry suggested.
>
> But right now I'm running some tests with pg_trgm. And Sarah, so far I
> cannot confirm your comment:
>
> "Trigrams only work with misspellings of a letter or two, they fail
> completely when trying to match up abbreviations."
>
> To me the opposite seems true, as the following example shows. Take this
> address, both as I want to look it up and as OSM has it stored and
> spelled:
>
>           (asked address)  (OSM address)
> street:   Верещагина ул    улица Верещагина
> town:     Ханская ст-ца    Ханская
> city:     Майкоп г         городской округ Майкоп
> region:   Адыгея Респ      Адыгея
>
> The standard Nominatim query is basically this (shown here for the town):
>
> select word_id, word_token, word
> from word
> where word_token = make_standard_name('Ханская ст-ца')
>
>
> …and does not return anything.
>
> Now I enabled the extension (CREATE EXTENSION pg_trgm;) and created an
> index (CREATE INDEX word_token_trgm_idx ON word USING GIST (word_token
> gist_trgm_ops);) and modified the select slightly:
>
>
> select word_id, word_token, word,
> gettokenstring(transliteration('Верещагина ул')) as asked,
> similarity(word_token, gettokenstring(transliteration('Верещагина ул')))
> as sml
> from word
> where word_token % make_standard_name('Верещагина ул')
> order by sml desc
> limit 20
>
>
> …and this is the result (I hope the formatting gets through…):
>
>
> word_id word_token               word                  asked                 sml
> 19098   " ul virishchaghina"     "улица Верещагина"    " virishchaghina ul " 1.0
> 19099   "ul virishchaghina"      ""                    " virishchaghina ul " 1.0
> 19100   "virishchaghina"         ""                    " virishchaghina ul " 0.833333
> 1525904 " virishchaghina"        "Верещагина"          " virishchaghina ul " 0.833333
> 115343  "ul virishchaghino"      ""                    " virishchaghina ul " 0.8
> 115342  " ul virishchaghino"     "улица Верещагино"    " virishchaghina ul " 0.8
> 568775  " n virishchaghina"      "На Верещагина"       " virishchaghina ul " 0.75
> 568776  "n virishchaghina"       ""                    " virishchaghina ul " 0.75
> 1256480 " pl virishchaghina"     "площадь Верещагина"  " virishchaghina ul " 0.714286
> 1256481 "pl virishchaghina"      ""                    " virishchaghina ul " 0.714286
> 351652  " virishchaghin"         "Верещагин"           " virishchaghina ul " 0.684211
> 351653  "virishchaghin"          ""                    " virishchaghina ul " 0.684211
> 217731  " virishchaghinskaia ul" "Верещагинская улица" " virishchaghina ul " 0.666667
> 217732  "virishchaghinskaia ul"  ""                    " virishchaghina ul " 0.666667
> 115344  "virishchaghino"         ""                    " virishchaghina ul " 0.65
> 824366  " v v virishchaghin"     "В.В.Верещагин"       " virishchaghina ul " 0.65
> 824367  "v v virishchaghin"      ""                    " virishchaghina ul " 0.65
> 855756  " virishchaghino"        "Верещагино"          " virishchaghina ul " 0.65
> 721916  "ur virishchaghino"      ""                    " virishchaghina ul " 0.636364
> 721915  " ur virishchaghino"     "ур. Верещагино"      " virishchaghina ul " 0.636364
>
> So the first two results, with a similarity of 1.0 (100%), are exactly the
> street I asked for!
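
The similarity values in such a result can be checked outside the database. What follows is a minimal Python sketch of how pg_trgm's similarity() is documented to behave: each word is lowercased, padded with two leading spaces and one trailing space, and cut into three-character windows; the score is the number of shared trigrams divided by the number of distinct trigrams across both strings. The function names are illustrative only and are not part of pg_trgm or Nominatim, and the word splitting is an approximation of what the extension does.

```python
import re


def trigrams(text: str) -> set[str]:
    """Collect the set of trigrams for a string, roughly as pg_trgm does."""
    grams: set[str] = set()
    # Assumption: words are runs of letters/digits; everything else separates.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by distinct trigrams in either string."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


asked = " virishchaghina ul "
for token in (" ul virishchaghina", " virishchaghina", " pl virishchaghina"):
    print(f"{token!r}: {similarity(asked, token):.6f}")
```

Because the trigrams are collected per word, word order does not affect the score. That would explain why "virishchaghina ul" and "ul virishchaghina" come out at 1.0, while dropping or swapping a short word such as "ul" only removes a few trigrams from the overlap.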
>
> The same happens with the town ("Ханская ст-ца" <-> "Ханская") and with
> the region ("Адыгея Респ" <-> "Адыгея"). Of course the similarity is not
> always 1, but this doesn't matter as long as the best match is still my
> address. Furthermore, the score tells me how certain the answer is, so I
> can act on that information.
>
> What Sarah mentions might apply to the city ("Майкоп г" <-> "городской
> округ Майкоп"), where the correct answer only appears as the 23rd result,
> with a similarity of 40%, behind the "best" (but wrong) match at 70%.
>
> Maybe libpostal could help here, or maybe either the OSM data or the name
> I asked for is wrong. In any case this would be acceptable given the huge
> difference in spelling. It could even be worked around with a clever
> combination of region, city, town and street.
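
One way to read "a combination of region, city, town and street" is to score each candidate on all fields at once, so that a weak match on one field (here the city) is outweighed by good matches on the others. The Python sketch below is only an assumption about how that might look, not something Nominatim does: it averages a per-field trigram similarity, works on the raw names rather than on Nominatim's transliterated tokens, and the decoy candidate is made up purely for illustration.

```python
import re


def trigrams(text: str) -> set[str]:
    """Set of trigrams per word, padded roughly the way pg_trgm pads words."""
    grams: set[str] = set()
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by distinct trigrams in either string."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta and tb else 0.0


def combined_score(asked: dict[str, str], candidate: dict[str, str]) -> float:
    """Hypothetical ranking: mean similarity over the fields both sides have."""
    fields = [name for name in asked if name in candidate]
    if not fields:
        return 0.0
    return sum(similarity(asked[f], candidate[f]) for f in fields) / len(fields)


asked = {"street": "Верещагина ул", "town": "Ханская ст-ца",
         "city": "Майкоп г", "region": "Адыгея Респ"}
osm = {"street": "улица Верещагина", "town": "Ханская",
       "city": "городской округ Майкоп", "region": "Адыгея"}
# Made-up decoy sharing city and region but not street or town.
decoy = {"street": "улица Ленина", "town": "Гавердовский",
         "city": "городской округ Майкоп", "region": "Адыгея"}

ranked = sorted([("osm", osm), ("decoy", decoy)],
                key=lambda item: combined_score(asked, item[1]), reverse=True)
print([name for name, _ in ranked])
```

A plain mean is the simplest choice; weighting the fields differently (for example trusting the street more than the city) would be an equally plausible variation.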
>
> So, in conclusion, pg_trgm looks really promising to me! And the query
> doesn't change much. Sure, Nominatim would have to handle the similarity
> score in the response, but that doesn't seem like a big deal, does it?
>
> Kind regards,
>
> Tom