[OSM-dev] Retrieving Wikipedia Entries Automatically

Zhijie Shen zjshen14 at gmail.com
Sat Feb 26 14:28:20 GMT 2011

Hi developers,

As you may remember, I exchanged emails with you two days ago to discuss the
wiki tag of OpenStreetMap. I now have a quick solution, a Wikipedia entry
crawler, that retrieves more Wikipedia entries automatically. I am eager to
share it with you and hope it will be useful. The single Java class file can
be downloaded here: <http://www.comp.nus.edu.sg/%7Ez-shen/WikiEntryCrawler.java>

The crawler implements the *Sink* interface of Osmosis, leveraging its OSM
XML parsing functionality. It extracts the name of each entity (e.g.,
*node*, *way*) from the *name* tag (so entities without a name are
skipped), uses it as the query to search for candidate Wikipedia entries
through the Wikipedia API, and then judges which of the returned entries is
the true one for that entity. To do this, the crawler measures the string
similarity between the entity name and the Wikipedia entry title using the
Levenshtein distance. Moreover, since many of the Wikipedia entries that
entities may link to carry geo-coordinates, the crawler also uses this
information to select the true entry: it calls the Wikipedia API again to
retrieve the entry content, extracts the geo-coordinates if they exist, and
computes the distance between them and the coordinates of the entity. It
then combines these two metrics into a score for each candidate entry and
chooses the first entry whose score is above a pre-defined threshold
(assuming that the search functionality of the Wikipedia API ranks the
returned results appropriately).
During the Wikipedia entry crawling, the OSM XML file is parsed twice:
first, to retrieve candidate entries for each named entity; second, to
record the coordinates of the entities to be checked, especially for
*ways*, whose coordinates cannot be obtained in the first pass.
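
To make the scoring step above more concrete, here is a minimal Java sketch.
It is not taken from WikiEntryCrawler.java. The class name CandidateScorer,
the equal 0.5/0.5 weighting, the 50 km distance cutoff, the fallback to name
similarity alone when an entry has no coordinates, and any threshold value
are assumptions made purely for illustration.

```java
import java.util.List;

class CandidateScorer {
    static final double EARTH_RADIUS_KM = 6371.0;
    // Assumption: entries farther away than this get no proximity credit.
    static final double MAX_DISTANCE_KM = 50.0;

    /** A candidate Wikipedia entry; lat/lon are null if it has no coordinates. */
    static final class Candidate {
        final String title;
        final Double lat;
        final Double lon;
        Candidate(String title, Double lat, Double lon) {
            this.title = title;
            this.lat = lat;
            this.lon = lon;
        }
    }

    /** Classic dynamic-programming Levenshtein distance using two rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Edit distance normalised to [0, 1]; 1 means identical (ignoring case). */
    static double nameSimilarity(String a, String b) {
        String x = a.toLowerCase();
        String y = b.toLowerCase();
        int longest = Math.max(x.length(), y.length());
        return longest == 0 ? 1.0 : 1.0 - (double) levenshtein(x, y) / longest;
    }

    /** Great-circle distance in kilometres (haversine formula). */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double h = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
    }

    /** Assumed combination: equal-weight average of name and proximity scores. */
    static double score(String entityName, double lat, double lon, Candidate c) {
        double name = nameSimilarity(entityName, c.title);
        if (c.lat == null || c.lon == null) {
            return name; // assumption: no coordinates, so use the name alone
        }
        double km = haversineKm(lat, lon, c.lat, c.lon);
        double proximity = Math.max(0.0, 1.0 - km / MAX_DISTANCE_KM);
        return 0.5 * name + 0.5 * proximity;
    }

    /** Returns the first candidate, in API ranking order, above the threshold. */
    static Candidate pickFirstAbove(String entityName, double lat, double lon,
                                    List<Candidate> ranked, double threshold) {
        for (Candidate c : ranked) {
            if (score(entityName, lat, lon, c) > threshold) {
                return c;
            }
        }
        return null; // no candidate was convincing enough
    }
}
```

Taking the first candidate above the threshold, rather than the one with the
highest score, mirrors the assumption stated above that the search results
already arrive in a sensible order.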

I've also written a wiki page to introduce this:
http://wiki.openstreetmap.org/wiki/User:Zhijie_Shen. Please have a look. I
would appreciate any comments.


Zhijie Shen
School of Computing
National University of Singapore