[Geocoding] geocoding misspellings
Michal Palenik
michal.palenik at iz.sk
Thu May 30 15:02:57 UTC 2013
On Wed, May 29, 2013 at 07:55:22AM -0400, Stewart C. Russell wrote:
> On 13-05-29 04:32 AM, Michal Palenik wrote:
> >
> > what would be the easiest option to connect misspelled names to their
> > properly spelled counterparts?
>
> How are your programming skills? The classic way of doing this is using
> an approximate string match (or "fuzzy match") using the Levenshtein or
thanks for the proper keyword
http://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html
... where levenshtein_less_equal(name1, name2, 1) <= 1 ...
(for the Slovak language those other 2 did not make sense)
was successful, kind of...
(i still had to decide whether "Marin" is a misspelling of the city "Martin" or
some village 2000+ km away)
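
One possible way to make the "Marin" vs "Martin" decision less manual is to mix the edit distance with the geographic distance from a reference point. The following is only a sketch under stated assumptions: the scoring rule, the `km_per_edit` scale of 500 km, the reference point (roughly Bratislava) and all coordinates are illustrative values that were not part of the original message, and the far-away village uses placeholder coordinates.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def score(edit_distance, km_away, km_per_edit=500.0):
    """Hypothetical combined score: every km_per_edit km counts like one edit."""
    return edit_distance + km_away / km_per_edit


def rank_candidates(candidates, ref_lat, ref_lon):
    """Sort (name, edit_distance, lat, lon) tuples, best candidate first."""
    def key(candidate):
        _name, edits, lat, lon = candidate
        return score(edits, haversine_km(ref_lat, ref_lon, lat, lon))
    return sorted(candidates, key=key)


# Assumed reference point (roughly Bratislava); approximate coordinates.
REF_LAT, REF_LON = 48.15, 17.11

candidates = [
    # exact spelling match, but far away (placeholder coordinates)
    ("Marin (far-away village)", 0, 42.4, -8.7),
    # one edit away, but close to the reference point (approximate coordinates)
    ("Martin", 1, 49.07, 18.92),
]

ranked = rank_candidates(candidates, REF_LAT, REF_LON)
print(ranked[0][0])
```

With these assumed values the nearby city ranks first even though the far-away village is an exact string match. The scale is arbitrary and would need tuning against real data, and manual checking of the results is still advisable.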
the distance measurement would improve a lot if the layout of the keyboard
were considered (on a qwerty keyboard, an r->t change is more probable
than a->p)
and if swapping two adjacent letters were punished less (eg
levenshtein('extralongword', 'extralognword')
vs
levenshtein('extralongword', 'extralohjword')
both return 2 even though the first one is, in human terms, more likely)
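
Both wishes can be sketched outside PostgreSQL. The code below is a minimal illustration, not the fuzzystrmatch implementation: it computes the optimal string alignment distance (a restricted form of Damerau-Levenshtein, where swapping two adjacent letters costs one edit) and charges less for substituting a neighbouring key. The adjacency map is a simplifying assumption that only treats keys next to each other in the same qwerty row as neighbours, and the costs `near_cost=0.5` and `swap_cost=1.0` are arbitrary illustrative values.

```python
# Simplified qwerty model (assumption): only same-row neighbours count as close.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_neighbours(rows):
    """Map each key to the set of keys directly left or right of it."""
    neighbours = {}
    for row in rows:
        for i, ch in enumerate(row):
            adjacent = set()
            if i > 0:
                adjacent.add(row[i - 1])
            if i < len(row) - 1:
                adjacent.add(row[i + 1])
            neighbours[ch] = adjacent
    return neighbours


NEIGHBOURS = build_neighbours(QWERTY_ROWS)


def substitution_cost(a, b, near_cost=0.5):
    """0 for equal letters, near_cost for neighbouring keys, 1 otherwise."""
    if a == b:
        return 0.0
    if b in NEIGHBOURS.get(a, ()):
        return near_cost
    return 1.0


def weighted_osa(s, t, swap_cost=1.0, near_cost=0.5):
    """Optimal string alignment distance with keyboard-aware substitutions."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = float(i)  # delete all of s[:i]
    for j in range(cols):
        d[0][j] = float(j)  # insert all of t[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + 1.0,  # deletion
                d[i][j - 1] + 1.0,  # insertion
                d[i - 1][j - 1]
                + substitution_cost(s[i - 1], t[j - 1], near_cost),
            )
            # transposition of two adjacent letters
            if (i > 1 and j > 1
                    and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + swap_cost)
    return d[rows - 1][cols - 1]


print(weighted_osa("extralongword", "extralognword"))  # 1.0 (one swap)
print(weighted_osa("extralongword", "extralohjword"))  # 2.0 (two substitutions)
print(weighted_osa("marin", "matin"))                  # 0.5 (r and t are neighbours)
```

Under these assumptions the swapped letters score better than the two unrelated substitutions, which matches the intuition described above. A fuller keyboard model would also need neighbours across rows and the diacritics of a Slovak layout.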
i've tried to read the source code
http://doxygen.postgresql.org/levenshtein_8c.html#a3887230c68a3fee3cb0cc496614468eb
but my C skills are definitely not at that level...
anyway, using levenshtein/fuzzystrmatch in nominatim would probably be
a big performance hit.
michal
> Damerau-Levenshtein methods. There are modules to do this for many
> scripting languages (like Text::Fuzzy in Perl). There is also the
> command line tool 'agrep' which does the same thing.
>
> I'd recommend you manually check the results. I know it's slow, but
> there's no way to get this perfectly right automatically.
>
> cheers,
> Stewart
>
>
> _______________________________________________
> Geocoding mailing list
> Geocoding at openstreetmap.org
> http://lists.openstreetmap.org/listinfo/geocoding
--
michal palenik
institut zamestnanosti
www.iz.sk