<div dir="ltr">Have a look at libpostal for parsing addresses: <a href="https://github.com/openvenues/libpostal">https://github.com/openvenues/libpostal</a> <br>There's also a Postgres extension for it: <a href="https://github.com/pramsey/pgsql-postal">https://github.com/pramsey/pgsql-postal</a> </div><br><div class="gmail_quote"><div dir="ltr">Tue, 29 Nov 2016 at 15:24, Tom &lt;<a href="mailto:nominatim@tscholz.net">nominatim@tscholz.net</a>&gt;:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word" class="gmail_msg"><div class="gmail_msg">Hi Sarah and Dmitry,</div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">thanks for your responses! I will definitely look into the libpostal project later on, as well as some of the geocoders Dmitry suggested.</div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">Right now, though, I’m running some tests with pg_trgm. And Sarah, so far I cannot confirm your comment</div><div class="gmail_msg"><br class="gmail_msg"></div><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px" class="gmail_msg"></blockquote></div><div style="word-wrap:break-word" class="gmail_msg"><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px" class="gmail_msg"><div class="gmail_msg">"Trigrams only work with misspellings of a letter or two, they fail</div></blockquote></div><div style="word-wrap:break-word" class="gmail_msg"><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px" class="gmail_msg"><div class="gmail_msg">completely when trying to match up abbreviations."</div></blockquote><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">To me the opposite seems true, as you can see in the following examples. 
Take the following address, shown first as I want to look it up and then as OSM stores and spells it.</div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg"><table class="gmail_msg">
<tr><th></th><th>asked address</th><th>OSM address</th></tr>
<tr><td>street:</td><td>Верещагина ул</td><td>улица Верещагина</td></tr>
<tr><td>town:</td><td>Ханская ст-ца</td><td>Ханская</td></tr>
<tr><td>city:</td><td>Майкоп г</td><td>городской округ Майкоп</td></tr>
<tr><td>region:</td><td>Адыгея Респ</td><td>Адыгея</td></tr>
</table></div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">The standard Nominatim query is basically this (shown here with the town name):</div><div class="gmail_msg"><br class="gmail_msg"></div><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px" class="gmail_msg"><div class="gmail_msg"><div class="gmail_msg">select word_id, word_token, word</div></div><div class="gmail_msg"><div 
class="gmail_msg">from word</div></div><div class="gmail_msg"><div class="gmail_msg">where word_token = make_standard_name('Ханская ст-ца')</div></div></blockquote><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">…and it does not return anything.</div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">I then enabled the extension (CREATE EXTENSION pg_trgm;), created an index (CREATE INDEX word_token_trgm_idx ON word USING GIST (word_token gist_trgm_ops);) and modified the select slightly:</div><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px" class="gmail_msg"><div class="gmail_msg"><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">select word_id, word_token, word, gettokenstring(transliteration('Верещагина ул')) as asked, </div><div class="gmail_msg"><span class="m_1182281381617075210Apple-tab-span gmail_msg" style="white-space:pre-wrap">    </span>similarity(word_token, gettokenstring(transliteration('Верещагина ул'))) as sml</div><div class="gmail_msg">from word</div><div class="gmail_msg">where word_token % make_standard_name('Верещагина ул')</div><div class="gmail_msg">order by sml desc</div><div class="gmail_msg">limit 20</div></div></blockquote><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">…and this is the result (I hope the formatting gets through…):</div><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px" class="gmail_msg"><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg"><table class="gmail_msg">
<tr><th>word_id</th><th>word_token</th><th>word</th><th>asked</th><th>sml</th></tr>
<tr><td>19098</td><td>" ul virishchaghina"</td><td>"улица Верещагина"</td><td>" virishchaghina ul "</td><td>1.0</td></tr>
<tr><td>19099</td><td>"ul virishchaghina"</td><td>""</td><td>" virishchaghina ul "</td><td>1.0</td></tr>
<tr><td>19100</td><td>"virishchaghina"</td><td>""</td><td>" virishchaghina ul "</td><td>0.833333</td></tr>
<tr><td>1525904</td><td>" virishchaghina"</td><td>"Верещагина"</td><td>" virishchaghina ul "</td><td>0.833333</td></tr>
<tr><td>115343</td><td>"ul virishchaghino"</td><td>""</td><td>" virishchaghina ul "</td><td>0.8</td></tr>
<tr><td>115342</td><td>" ul virishchaghino"</td><td>"улица Верещагино"</td><td>" virishchaghina ul "</td><td>0.8</td></tr>
<tr><td>568775</td><td>" n virishchaghina"</td><td>"На Верещагина"</td><td>" virishchaghina ul "</td><td>0.75</td></tr>
<tr><td>568776</td><td>"n virishchaghina"</td><td>""</td><td>" virishchaghina ul "</td><td>0.75</td></tr>
<tr><td>1256480</td><td>" pl virishchaghina"</td><td>"площадь Верещагина"</td><td>" virishchaghina ul "</td><td>0.714286</td></tr>
<tr><td>1256481</td><td>"pl virishchaghina"</td><td>""</td><td>" virishchaghina ul "</td><td>0.714286</td></tr>
<tr><td>351652</td><td>" virishchaghin"</td><td>"Верещагин"</td><td>" virishchaghina ul "</td><td>0.684211</td></tr>
<tr><td>351653</td><td>"virishchaghin"</td><td>""</td><td>" virishchaghina ul "</td><td>0.684211</td></tr>
<tr><td>217731</td><td>" virishchaghinskaia ul"</td><td>"Верещагинская улица"</td><td>" virishchaghina ul "</td><td>0.666667</td></tr>
<tr><td>217732</td><td>"virishchaghinskaia ul"</td><td>""</td><td>" virishchaghina ul "</td><td>0.666667</td></tr>
<tr><td>115344</td><td>"virishchaghino"</td><td>""</td><td>" virishchaghina ul "</td><td>0.65</td></tr>
<tr><td>824366</td><td>" v v virishchaghin"</td><td>"В.В.Верещагин"</td><td>" virishchaghina ul "</td><td>0.65</td></tr>
<tr><td>824367</td><td>"v v virishchaghin"</td><td>""</td><td>" virishchaghina ul "</td><td>0.65</td></tr>
<tr><td>855756</td><td>" virishchaghino"</td><td>"Верещагино"</td><td>" virishchaghina ul "</td><td>0.65</td></tr>
<tr><td>721916</td><td>"ur virishchaghino"</td><td>""</td><td>" virishchaghina ul "</td><td>0.636364</td></tr>
<tr><td>721915</td><td>" ur virishchaghino"</td><td>"ур. Верещагино"</td><td>" virishchaghina ul "</td><td>0.636364</td></tr>
</table><div class="gmail_msg"><br class="gmail_msg"></div></div></blockquote><div class="gmail_msg">So the first two results, with a similarity of 1 (= 100%), are exactly the street I asked for!</div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">The same happens with the town ("Ханская ст-ца" &lt;-&gt; "Ханская") and with the region ("Адыгея Респ" &lt;-&gt; "Адыгея"). Of course the similarity is not always 1, but that doesn’t matter as long as the best match is still my address. 
The score also tells me how certain the answer is, so I can act on that information.</div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">What Sarah mentions might apply to the city ("Майкоп г" &lt;-&gt; "городской округ Майкоп"), where the correct answer appears only as the 23rd result with a similarity of 40%, behind the "best" (but wrong) match at 70%.</div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">Maybe libpostal could help here, or perhaps either the OSM data or the name I asked for is wrong. In any case this would be acceptable given the huge difference in spelling. It could even be remedied with a clever combination of region, city, town and street.</div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">So, in conclusion, pg_trgm looks really promising to me! And the query doesn’t change much. Sure, Nominatim would have to deal with the similarity in the response, but that doesn’t seem like a big deal, does it?</div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">Kind regards,</div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">Tom</div><div class="gmail_msg"><br class="gmail_msg"></div></div>_______________________________________________<br class="gmail_msg">
dev mailing list<br class="gmail_msg">
<a href="mailto:dev@openstreetmap.org" class="gmail_msg" target="_blank">dev@openstreetmap.org</a><br class="gmail_msg">
<a href="https://lists.openstreetmap.org/listinfo/dev" rel="noreferrer" class="gmail_msg" target="_blank">https://lists.openstreetmap.org/listinfo/dev</a><br class="gmail_msg">
</blockquote></div>
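<div dir="ltr"><br>A note on where the sml values in the quoted result come from. The pg_trgm documentation describes each word as being lower-cased and padded with two leading spaces and one trailing space before its trigrams are extracted; similarity is then the number of shared trigrams divided by the number of distinct trigrams across both strings. The sketch below re-implements that description in plain Python purely as an illustration. The function names trigrams and similarity are made up for this example, word splitting is simplified to runs of ASCII letters and digits (enough for the transliterated tokens above), and it is not the code PostgreSQL actually runs.</div>

```python
import re


def trigrams(text):
    """Return the set of trigrams of text, padding each word as pg_trgm documents."""
    grams = set()
    # Simplifying assumption: a word is a run of ASCII letters and digits.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 if neither has any)."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta.union(tb)
    if not union:
        return 0.0
    return len(ta.intersection(tb)) / len(union)


asked = " virishchaghina ul "
tokens = (" ul virishchaghina", "virishchaghina", "ul virishchaghino", " pl virishchaghina")
for token in tokens:
    # Prints 1.0, 0.833333, 0.8 and 0.714286, matching the quoted table.
    print(repr(token), round(similarity(token, asked), 6))
```

<div dir="ltr">Because the trigram set is built word by word, word order does not affect the score, which would explain why "ul virishchaghina" and "virishchaghina ul" come out at exactly 1.0. It also suggests why the city case does worse: "г" and "городской округ" share very few trigrams, so the long form dilutes the match.</div>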