[Openstreetmap] Linking in Wikipedia
Lars Aronsson
lars at aronsson.se
Tue Jul 19 16:54:42 BST 2005
Nick Whitelegg wrote:
> Is it possible to obtain the "source code" of a Wikipedia page? Some
> wikipedia index pages have "view source" but not other pages. If so, then
> presumably what one could do is request the "source" of a page from
> Wikipedia then use the source to format it into your own site.
Every Wikipedia page has an "edit" tab that brings up the source
wikitext in an HTML form textarea, with a save button underneath.
The exception is pages that are write-protected, where the tab
instead reads "view source" and there is no save button. But the
wikitext source in the textarea is the same.
So all you need to do is to use "wget" to fetch
http://en.wikipedia.org/w/index.php?title=Berlin&action=edit
and use some regexp to extract everything between <textarea>
and </textarea>.
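For example, a rough Python sketch of that fetch-and-extract step
(the regexp is only illustrative, the User-Agent string is just
something to identify the script, and the textarea content comes
back HTML-escaped, so it needs unescaping):

    import re
    import html
    import urllib.request

    # Fetch the edit form for one article; the wikitext sits inside
    # the <textarea> of that form.
    url = "http://en.wikipedia.org/w/index.php?title=Berlin&action=edit"
    req = urllib.request.Request(
        url, headers={"User-Agent": "wikitext-fetch-sketch"})
    with urllib.request.urlopen(req) as response:
        page = response.read().decode("utf-8")

    # Grab everything between the opening and closing textarea tags.
    match = re.search(r"<textarea[^>]*>(.*?)</textarea>", page, re.DOTALL)
    if match:
        # The form content is HTML-escaped; unescape to get raw wikitext.
        wikitext = html.unescape(match.group(1))
        print(wikitext[:200])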
But this is impractical for large-scale metadata extraction, such as
harvesting all "Personendaten" for all 250,000 articles in the
German Wikipedia. Instead you can download the entire database and
import it into your own MySQL instance. You can get both the
current (cur) and archived previous versions (old) of every
article in every language. But beware that this is a lot of data,
many gigabytes.
The architecture is described starting at
http://meta.wikimedia.org/wiki/MediaWiki_architecture
and the database download is available at
http://download.wikimedia.org/
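If you only want one template's metadata, a crude pass over the
dump itself (before or instead of a MySQL import) might look like
this in Python. The file name is an assumed local copy of one of
the cur-table dumps, and the scan simply treats the SQL dump as
plain text rather than parsing the INSERT statements properly:

    import gzip
    import re

    # Count articles that carry the Personendaten template by scanning
    # the gzipped cur-table dump line by line.
    dump_path = "20050623_cur_table.sql.gz"   # assumed local copy
    pattern = re.compile(r"\{\{Personendaten")

    count = 0
    with gzip.open(dump_path, "rt", encoding="utf-8",
                   errors="replace") as dump:
        for line in dump:   # the dump is a series of very long INSERTs
            count += len(pattern.findall(line))

    print(count, "occurrences of {{Personendaten found")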
> What would be good is something along the lines of:
>
> - User visits http://www.free-map.org.uk/
> - User clicks on a place name, e.g. Fernhurst
> - A request is made to Wikipedia for the Wikipedia article on Fernhurst
> - Wikipedia sends back the Fernhurst article in XML which can be processed
> by the client.
Would this assume that Wikipedia has an article on Fernhurst? (It
does not.) For many place names, Wikipedia has a "disambiguation
page" that branches off into the specific articles, such as
http://en.wikipedia.org/wiki/San_Jose , in which case you would
want to access http://en.wikipedia.org/wiki/San_Jose%2C_California
If you download the current articles of the English Wikipedia
(http://download.wikimedia.org/wikipedia/en/20050623_cur_table.sql.gz
size 1.0 gigabyte) and dig through to find geo coordinates, you
could extract a list like this:
Lat.          Long.         Article name
-----------   ------------  -----------------------
37°18'15" N   121°52'22" W  San Jose, California
52°31′ N      13°24′ E      Berlin
Since the London article doesn't contain coordinates, it would be
missing from this list. You would not find Fernhurst, since there
is no Wikipedia article on this place. And you would find no geo
coordinates in the disambiguation page "San Jose", which is fine.
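A Python sketch of such an extraction pass, using an illustrative
regexp for the degree-minute-second style shown in the table above
(real articles use several notations, so a production script would
need more patterns than this one):

    import re

    # One coordinate style seen in article text: degrees, optional
    # minutes and seconds, then a hemisphere letter.
    COORD = re.compile(
        r"(\d{1,3})°(?:(\d{1,2})[′'])?(?:(\d{1,2})[″\"])?\s*([NS])"
        r"[,\s]+"
        r"(\d{1,3})°(?:(\d{1,2})[′'])?(?:(\d{1,2})[″\"])?\s*([EW])"
    )

    def to_decimal(deg, minute, sec, hemisphere):
        value = int(deg) + int(minute or 0) / 60 + int(sec or 0) / 3600
        return -value if hemisphere in ("S", "W") else value

    def extract_coords(title, wikitext):
        m = COORD.search(wikitext)
        if not m:
            return None
        lat = to_decimal(m.group(1), m.group(2), m.group(3), m.group(4))
        lon = to_decimal(m.group(5), m.group(6), m.group(7), m.group(8))
        return title, lat, lon

    # The Berlin row of the table above:
    print(extract_coords("Berlin", "... 52°31′ N 13°24′ E ..."))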
For the article http://en.wikipedia.org/wiki/Limehouse
there is no lat-long coordinate, but there is an OS Grid Reference,
which your script could pick up and convert to something useful.
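As a sketch of the first half of that conversion in Python: the
grid letters and digits unpack into metres east and north of the
National Grid origin. The grid reference below is made up for
illustration, and the further datum shift from OSGB36 easting and
northing to latitude and longitude is left out:

    def gridref_to_easting_northing(gridref):
        """Turn an OS grid reference like 'TQ 365 810' into metres
        east and north of the National Grid origin."""
        ref = gridref.replace(" ", "").upper()
        letters = "ABCDEFGHJKLMNOPQRSTUVWXYZ"  # 'I' is not used
        l1, l2 = letters.index(ref[0]), letters.index(ref[1])
        # The letter pair picks one 100 km square of the grid.
        e100km = ((l1 - 2) % 5) * 5 + (l2 % 5)
        n100km = 19 - (l1 // 5) * 5 - (l2 // 5)
        digits = ref[2:]
        half = len(digits) // 2
        easting = e100km * 100000 + int(digits[:half].ljust(5, "0"))
        northing = n100km * 100000 + int(digits[half:].ljust(5, "0"))
        return easting, northing

    # A made-up reference in the TQ (inner London) square:
    print(gridref_to_easting_northing("TQ 365 810"))   # (536500, 181000)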
Now you can fit your free-map with links at these coordinates.
Still missing is the XML export. You might do without this by
simply opening the plain HTML page from Wikipedia in a new window
or browser tab. That would remove the need for postprocessing.
We could still discuss this XML export as a feature request, but
it doesn't really stop you from doing the rest of the work.
Suppose there are geo coordinates in the Wikipedia articles on
Europe, Great Britain, England, London, City of London, Tower
Hamlets, and Limehouse. At which zoom level would you show the
individual townships and where would you show the overall London
or England link instead? How do you tell? Which extra fields
would you need in the coordinate list above? How should that
information best be written into the wikitext source?
--
Lars Aronsson (lars at aronsson.se)
Aronsson Datateknik - http://aronsson.se