[Geocoding] geocoding misspellings

Michal Palenik michal.palenik at iz.sk
Thu May 30 15:02:57 UTC 2013


On Wed, May 29, 2013 at 07:55:22AM -0400, Stewart C. Russell wrote:
> On 13-05-29 04:32 AM, Michal Palenik wrote:
> > 
> > what would be the easiest option to connect misspelled names to their
> > properly spelled counterparts?
> 
> How are your programming skills? The classic way of doing this is using
> an approximate string match (or "fuzzy match") using the Levenshtein or

thanks for the proper keyword. using the postgresql fuzzystrmatch module
http://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html
with a condition like
... where levenshtein_less_equal(name1, name2, 1) <= 1 ...
was successful, kind of...
(for the slovak language the other 2 did not make sense)

(i still had to decide whether "Marin" is a misspelling of the city
"Martin" or the name of some village 2000+ km away)
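
to illustrate the ambiguity, here is a minimal python sketch of the same
"distance <= 1" filter. it is not the postgresql implementation, and the
place list is made-up sample data.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance: insert, delete, substitute all cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def candidates(query: str, names: list[str], max_dist: int = 1) -> list[str]:
    """Return every name within max_dist edits of the query (case-insensitive)."""
    q = query.lower()
    return [n for n in names if levenshtein(q, n.lower()) <= max_dist]


# hypothetical sample data, not taken from any real gazetteer
places = ["Martin", "Marin", "Žilina"]
print(candidates("Marin", places))  # ['Martin', 'Marin']
```

both "Martin" and "Marin" pass the filter, so the edit distance alone
cannot pick one; some extra signal (e.g. how far away each place is) is
still needed.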

the distance measurement would improve a lot if the layout of the
keyboard were considered (on a qwerty keyboard, an r->t substitution is
more probable than a->p)
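
a rough python sketch of what such a keyboard-aware distance could look
like. everything here is an assumption for illustration: the us qwerty
rows (a slovak qwertz layout differs), the crude "neighbouring key" test
and the 0.5 weight are all arbitrary choices, not something fuzzystrmatch
offers.

```python
# Rows of a US QWERTY layout (assumed; other layouts would need other rows).
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}

NEAR_COST = 0.5  # arbitrary illustrative weight for neighbouring keys


def sub_cost(x: str, y: str) -> float:
    """Substitution cost: 0 if equal, NEAR_COST for neighbouring keys, else 1."""
    if x == y:
        return 0.0
    if x in KEY_POS and y in KEY_POS:
        (r1, c1), (r2, c2) = KEY_POS[x], KEY_POS[y]
        # crude neighbourhood test that ignores the stagger between rows
        if abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1:
            return NEAR_COST
    return 1.0


def keyboard_levenshtein(a: str, b: str) -> float:
    """Levenshtein distance with keyboard-weighted substitutions."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        cur = [float(i)]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1.0,                   # delete ca
                cur[j - 1] + 1.0,                # insert cb
                prev[j - 1] + sub_cost(ca, cb),  # weighted substitution
            ))
        prev = cur
    return prev[-1]


print(keyboard_levenshtein("martin", "marrin"))  # 0.5 (t and r are neighbours)
print(keyboard_levenshtein("martin", "marpin"))  # 1.0 (t and p are far apart)
```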

it would also improve if swapping two neighbouring letters were
penalized less (e.g.
levenshtein('extralongword', 'extralognword')
vs
levenshtein('extralongword', 'extralohjword')
both return 2, even though the first one is, in human terms, the more
likely typo)
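
for comparison, a small python sketch of the "optimal string alignment"
variant of damerau-levenshtein, which counts a swap of two neighbouring
letters as a single edit. this is only an illustration of the idea, not
code from postgresql.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # transposition of two neighbouring characters counts as one edit
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("extralongword", "extralognword"))  # 1
print(osa_distance("extralongword", "extralohjword"))  # 2
```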


i've tried to read the source code
http://doxygen.postgresql.org/levenshtein_8c.html#a3887230c68a3fee3cb0cc496614468eb
but my C skills are definitely not at that level...



anyway, using levenshtein/fuzzy matching in nominatim would probably be
a big performance hit.

michal


> Damerau-Levenshtein methods. There are modules to do this for many
> scripting languages (like Text::Fuzzy in Perl). There is also the
> command line tool 'agrep' which does the same thing.
> 
> I'd recommend you manually check the results. I know it's slow, but
> there's no way to get this perfectly right automatically.
> 
> cheers,
>  Stewart
> 
> 
> _______________________________________________
> Geocoding mailing list
> Geocoding at openstreetmap.org
> http://lists.openstreetmap.org/listinfo/geocoding

-- 
michal palenik
institut zamestnanosti
www.iz.sk


