[OSM-dev] [Geocoding] Nominatim and/or a fault-tolerant geocoder

Tom nominatim at tscholz.net
Tue Nov 29 23:04:51 UTC 2016


Hi,

thanks for the clarification!

In the meantime I have also read quite a bit about the libpostal project, and it sounds really cool.

> The interesting question
> is how well that works when the search terms in Nominatim have
> not been normalized with the same algorithm.


I arrived at the very same question. So, do I understand correctly that the geocoding process would basically be:

1. Preparation:
*.osm/*.pbf —> osm2pgsql —> address normalization with libpostal into a separate „OSM-Address-Table“

2. Geocoding:
AddressToGeocode.csv —> libpostal —> simple lookup in „OSM-Address-Table“
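
A minimal sketch of these two steps, under loud assumptions: `normalize_address` is a hypothetical stand-in for libpostal (it only drops a hand-picked set of affixes and is not the libpostal API), and the "OSM-Address-Table" is reduced to an in-memory dictionary holding the two names from the example in this thread.

```python
# Hypothetical stand-in for libpostal: drops a few known affixes.
# The real library does far more; this set is an assumption for illustration.
AFFIXES = {"ул", "улица", "ст-ца", "станица"}


def normalize_address(text):
    """Lowercase the text and remove tokens that are known affixes."""
    tokens = [t for t in text.lower().split() if t not in AFFIXES]
    return " ".join(tokens)


def build_address_table(osm_names):
    """Step 1 (preparation): map each normalized name to the OSM spelling."""
    return {normalize_address(name): name for name in osm_names}


def geocode(query, table):
    """Step 2 (geocoding): normalize the query, then do a plain lookup."""
    return table.get(normalize_address(query))


table = build_address_table(["улица Верещагина", "Ханская"])
print(geocode("Верещагина ул", table))  # улица Верещагина
print(geocode("Ханская ст-ца", table))  # Ханская
```

The lookup only succeeds because both sides pass through the same normalizer, which is exactly the open question quoted above.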

I’ll do some tests on that…

Regards,
Tom



On 29.11.2016 at 23:39, Sarah Hoffmann <lonvia at denofr.de> wrote:

Hi,


On Tue, Nov 29, 2016 at 12:51:11PM +0100, Tom wrote:
> But right now I’m doing some tests with pg_trgm. And Sarah, I cannot confirm so far your comment
> 
> "Trigrams only work with misspellings of a letter or two, they fail
> completely when trying to match up abbreviations."
> 
> To me the opposite seems true, as the following examples show. Take this address, once as I want to search for it and once as OSM stores and spells it.
> 
>          (asked address)    (OSM address)
> street:  Верещагина ул      улица Верещагина
> town:    Ханская ст-ца      Ханская
> city:    Майкоп г           городской округ Майкоп
> region:  Адыгея Респ        Адыгея
> 
> The Nominatim standard query is basically this (here for the town):
> 
> select word_id, word_token, word
> from word
> where word_token = make_standard_name('Ханская ст-ца')
> 
> …and does not return anything.

Nominatim's query matching is actually a bit more complex. For each
place name in the database it saves the full name as well as the
partial terms (space-separated words) that make up the term. For
example, for 'улица Верещагина' it will have the full term
'улица Верещагина' and the partials 'улица' and 'Верещагина'.
Further 'улица' is abbreviated to 'ул', so that it will match
against the full word and the abbreviation later.

When searching, Nominatim does a similar thing and matches first
against full words and then against partial terms. So, while
you won't find 'Верещагина ул' in the word table, Nominatim
will still match it correctly because it finds 'Верещагина' and
'ул' and a database entry (in search_name) that contains both
words as partials.
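
The full-term and partial-term matching described above can be sketched roughly as follows. This is a toy model, not Nominatim's actual code: the function names, the dictionaries standing in for the word and search_name tables, and the single abbreviation rule are all assumptions made for illustration.

```python
from collections import defaultdict

# Only the abbreviation mentioned in the text; real rule sets are larger.
ABBREVIATIONS = {"улица": "ул"}


def normalize(token):
    """Lowercase a token and map a full word to its abbreviation."""
    token = token.lower()
    return ABBREVIATIONS.get(token, token)


def build_index(names):
    """Store each place under its full term and under every partial term."""
    full = defaultdict(set)
    partial = defaultdict(set)
    for place_id, name in names.items():
        tokens = [normalize(t) for t in name.split()]
        full[" ".join(tokens)].add(place_id)
        for token in tokens:
            partial[token].add(place_id)
    return full, partial


def search(query, full, partial):
    """Try the full term first, then require every partial to match."""
    tokens = [normalize(t) for t in query.split()]
    hits = full.get(" ".join(tokens))
    if hits:
        return set(hits)
    result = None
    for token in tokens:
        ids = partial.get(token, set())
        result = set(ids) if result is None else result & ids
    return result or set()


full, partial = build_index({1: "улица Верещагина", 2: "Ханская"})
print(search("Верещагина ул", full, partial))  # {1}
print(search("Ханская ст-ца", full, partial))  # set()
```

Because no term is ever dropped, the unknown partial 'ст-ца' empties the result for the second query.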

The real problems start with 'Ханская ст-ца'. Nominatim only
has 'Ханская' as a name but no partial for 'ст-ца' or 'станица'.
And as the search algorithm never drops terms from the search
query(*), it won't return any result. It's true that trigram
search can still return a result. The problem is that the
similarity is already lower than many false positives where
the spelling is similar. Similarity is simply not a good
indicator to distinguish between superfluous words and spelling
differences.

(*) Not completely true. It may drop house numbers, but only those.
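
To illustrate the trigram argument, here is a small sketch that approximates how pg_trgm computes similarity (lowercase, split on non-word characters, pad each word with two leading and one trailing space, then shared trigrams divided by all trigrams). It is a simplified reimplementation rather than the extension itself, and 'Ханскоя' is a made-up misspelling used only for comparison.

```python
import re


def trigrams(text):
    """Collect the padded trigrams of every word in the text."""
    grams = set()
    for word in re.findall(r"\w+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by the union of both trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Superfluous suffix versus the name stored in OSM.
print(round(similarity("Ханская ст-ца", "Ханская"), 2))  # 0.57
# Hypothetical one-letter misspelling of the same name.
print(round(similarity("Ханскоя", "Ханская"), 2))  # 0.45
```

Both kinds of difference end up with middling scores, so a similarity threshold alone cannot tell an extra word from a spelling variation.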

That's where libpostal comes in. It is supposed to normalize
your address to something that is compatible with the names used
in OpenStreetMap. That includes removing the odd prefix or suffix
(like ст-ца), normalizing numbers etc. The interesting question
is how well that works when the search terms in Nominatim have
not been normalized with the same algorithm.

Sarah



