[OSM-talk-be] Importing Villo! API data
CedB12
cednospam at gmail.com
Sun Nov 5 12:33:28 UTC 2017
Hi Glenn,
I will respond to some of your points because they are relevant to my
contributions in this thread. At the end of this email I also comment on
a survey I made today of six stations in order to evaluate the quality
of the API data.
As far as I can tell from my survey, the station names returned by the
Villo! API in the "name" field are exactly what shows up at the
stations' locations. (On the other hand the website only shows the
"address" field, which contains a name that often matches the "name"
field, but not always.) The station names are not printed on the
infrastructure: they only show up on the dynamic displays. (Only the
reference number is physically printed on the station, along with
"bonus" if it is a bonus station.)
The full official name of (most) stations, as reported by the "name" API
field, follows the format of Yves' example: "076 - PLACE VAN MEENEN/VAN
MEENENPLEIN". Of course, in OSM we want to split that into two (or
three) components: ref and name (or ref, name:fr and name:nl). Note
however that this cannot be straightforwardly automated, unlike with the
Antwerp Velo API data. There are multiple reasons for this.
First of all, names are in all caps and (partially) stripped of
accents, and turning that into properly capitalized names with no
missing accents is nontrivial. Second, many station names are
misspelled or don't follow the standard OSM practice of expanding
abbreviations (e.g. Place St Jean -> Place Saint-Jean). Third, there is
the problem of
bilingual names: Dutch names are sometimes missing while a STIB/MIVB
station nearby (or some street, or some building) has the exact same
French name and an available Dutch translation. Moreover, in a couple
of instances it is not so easy to split the French and Dutch names, for
example "255 - SACRE-COEUR DE/HEILIGE HART VAN GANSHOREN". Finally,
names are cut off at 50 characters, and we probably don't want to store
a truncated name as-is even if that is the official name. For example
"257 - PL MARGHERITE D'AUTRICHE / MARGARETHA VAN OO".
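To illustrate why the split cannot be fully automated, here is a minimal Python sketch of the naive approach: take the leading number as ref and split the rest on "/". The function name split_station_name and the needs_review flag are hypothetical, not part of any existing tool; the 50-character limit is the one mentioned above.

```python
import re

# Hypothetical helper, for illustration only.
NAME_RE = re.compile(r"^(?P<ref>\d+)\s*-\s*(?P<rest>.+)$")
MAX_LEN = 50  # API names appear to be cut off at this length


def split_station_name(raw):
    """Naively split 'NNN - FRENCH/DUTCH' into ref, fr and nl parts."""
    raw = raw.strip()
    match = NAME_RE.match(raw)
    if match is None:
        return None
    rest = match.group("rest").strip()
    parts = [p.strip() for p in rest.split("/")]
    two_parts = len(parts) == 2
    return {
        "ref": match.group("ref"),
        "fr": parts[0] if two_parts else rest,
        "nl": parts[1] if two_parts else None,
        # Flag names without exactly one "/" and names that hit the limit.
        "needs_review": (not two_parts) or len(raw) >= MAX_LEN,
    }


simple = split_station_name("076 - PLACE VAN MEENEN/VAN MEENENPLEIN")
tricky = split_station_name("255 - SACRE-COEUR DE/HEILIGE HART VAN GANSHOREN")
cut_off = split_station_name("257 - PL MARGHERITE D'AUTRICHE / MARGARETHA VAN OO")
```

The first name splits cleanly. The second yields "SACRE-COEUR DE" as the French name without being flagged, because "GANSHOREN" is shared between both languages, and the third is flagged only because it reaches the length limit. Cases like these still need a manual pass.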
When I saw all those issues I decided to go through the list of station
names and clean them up myself. I did a first pass using a dictionary I
built from OSM street names to translate all-caps words to
properly-capitalized words with accents. Then I went through the list by
hand to fix conversion mistakes and misspellings, and to provide Dutch
translations where they were missing. The results are in my GitHub
repository (see my previous message in this thread), and that is what I
propose we use in name, name:fr and name:nl tags.
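The dictionary-based first pass could look roughly like the Python sketch below. It is not the actual script from the repository: the function names and the tiny list of OSM street names are assumptions made for illustration, and words missing from the dictionary are simply capitalized and reported for the manual pass.

```python
import unicodedata


def strip_accents(text):
    """Remove combining accents, e.g. 'Chaussée' -> 'Chaussee'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def build_lookup(osm_names):
    """Map each upper-cased, accent-stripped word to its OSM spelling."""
    lookup = {}
    for name in osm_names:
        for word in name.split():
            # First spelling seen wins; conflicts are left to manual review.
            lookup.setdefault(strip_accents(word).upper(), word)
    return lookup


def recapitalize(raw, lookup):
    """Convert an all-caps name; also return the words not in the lookup."""
    words, unknown = [], []
    for word in raw.split():
        key = strip_accents(word).upper()
        if key in lookup:
            words.append(lookup[key])
        else:
            words.append(word.capitalize())
            unknown.append(word)
    return " ".join(words), unknown


# Hypothetical sample of OSM street names.
lookup = build_lookup(["Chaussée de Wavre", "Place Van Meenen"])
known = recapitalize("CHAUSSEE DE WAVRE", lookup)
partial = recapitalize("PLACE DELTA", lookup)
```

Here known is ("Chaussée de Wavre", []) and partial is ("Place Delta", ["DELTA"]); the second element lists the words that should be checked by hand.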
I don't know how we can do QA on name tags given the quality of the
source data, but at the very least we can store the official name (in
all caps, maybe with the station number stripped off) in the
official_name tag. That way we can easily compare that field against the
API in the event that it changes. Sometimes the Villo! operators change
the name to include a notice that the station is closed for works, but
this can be filtered out, either by removing all text in parentheses or
ignoring name discrepancies on stations which are marked as "closed"
(which is another field in the API).
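A minimal Python sketch of that comparison, combining both filters. It assumes that the closed state is exposed as a "status" field with the value "CLOSED", and the notice text in the example is made up; neither has been checked against the live API.

```python
import re


def normalize_official_name(api_name):
    """Drop parenthesised notices and the leading 'NNN - ' prefix."""
    without_notice = re.sub(r"\([^)]*\)", "", api_name)
    without_ref = re.sub(r"^\s*\d+\s*-\s*", "", without_notice)
    return " ".join(without_ref.split())  # collapse leftover whitespace


def name_changed(osm_official_name, station):
    """True if the API name differs from the official_name stored in OSM."""
    # Assumption: closed stations carry status == "CLOSED" in the API.
    if station.get("status") == "CLOSED":
        return False
    return normalize_official_name(station["name"]) != osm_official_name


stored = "PLACE VAN MEENEN/VAN MEENENPLEIN"
with_notice = {"name": "076 - PLACE VAN MEENEN/VAN MEENENPLEIN (TRAVAUX)",
               "status": "OPEN"}
renamed = {"name": "076 - PLACE MORICHAR/MORICHARPLEIN", "status": "OPEN"}
closed = {"name": "076 - FERMEE", "status": "CLOSED"}
```

With these inputs, name_changed(stored, with_notice) and name_changed(stored, closed) are both False, while name_changed(stored, renamed) is True.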
Given that the API names are the same as the names displayed
on-location, we can reliably use them for armchair mapping, so I
wouldn't say the API "just sucks and we shouldn't use it". The API also
reports station capacity and the possibility of card payment, which is
also useful.
--------
I did a quick survey of six stations in Auderghem to compare the API
data to reality. Three stations had wrong coordinates (wrong street
block). I suppose they must have been correct at some point in the past,
but the stations have been moved since. However in two out of three
wrongly-located stations, the API "address" field pointed at the correct
house numbers. The third station was not in front of a house so the
"address" field only pointed out the street name.
I checked the "banking", "bonus" and "bike_stands" fields, which all
matched reality, as well as the sum of "available_bike_stands" and
"available_bikes". Note that sometimes this sum is not equal to
"bike_stands". I checked one of those stations (311 - Delta), where
bike_stands is 22 but available stands+bikes is 21. This is explained by
the fact that one of the stands is out of service, as indicated by a red
light on the stand. Strangely, last time I checked, one station in the
API (003 - Porte de Flandre / Vlaamsepoort) reported a sum of available
bikes and stands that was four higher than "bike_stands", which makes no
sense unless the station was upgraded without updating the API field
"bike_stands". I did not survey that station.
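The consistency check described above can be sketched in Python as follows. The three field names are the ones from the API; the function names and the split between bikes and free stands in the example are made up, since only the totals (22 and 21 for station 311) come from the survey.

```python
def stand_discrepancy(station):
    """Available stands plus available bikes, minus the static bike_stands."""
    in_service = station["available_bike_stands"] + station["available_bikes"]
    return in_service - station["bike_stands"]


def classify(station):
    diff = stand_discrepancy(station)
    if diff < 0:
        return "some stands out of service"
    if diff > 0:
        return "bike_stands probably outdated"
    return "consistent"


# 311 - Delta: 22 stands in total, 21 of them accounted for.
delta = {"bike_stands": 22, "available_bike_stands": 10, "available_bikes": 11}
delta_diff = stand_discrepancy(delta)
delta_class = classify(delta)
```

For this example delta_diff is -1, which matches the single out-of-service stand seen on location. A positive difference, as reported for station 003, would instead hint at an outdated "bike_stands" value.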
As far as I could tell, the data reported on the interactive displays on
the stations matches the API data exactly (including the wrong
locations).
In conclusion, I think the "name" API field is perfectly OK to use after
cleanup. The "banking" and "bonus" fields matched reality in the six
stations surveyed. The "bike_stands" field seems to be static data, unlike
"available_bike_stands" and "available_bikes" which are dynamic. The
static data matched in my six surveyed stations, but it may be outdated
in some instances (though only one station in the API shows signs of
this). Meanwhile the dynamic data only counts stands that are in service
rather than the actual number of physical stands. Therefore I think
importing "bike_stands" data is also OK, as we only risk importing a few
outdated counts (perhaps even only one) instead of plainly incorrect
ones. On the other hand, location data is clearly problematic and should
only be used to guide mappers to the approximate location of the
stations. Perhaps we can still import yet-unmapped stations with a
"note" or "fixme" tag indicating that the location should be surveyed?
Surely this is better than not mapping the station at all, and there are
about a hundred missing stations in OSM.
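As a rough illustration of that proposal, the tags for a yet-unmapped station could be assembled like this in Python. The function name and the fixme wording are hypothetical, and the choice of amenity=bicycle_rental and capacity is an assumption about the tagging scheme rather than something agreed in this thread.

```python
def tags_for_unmapped_station(ref, name, bike_stands):
    """Build OSM tags for a station imported with an unverified location."""
    return {
        "amenity": "bicycle_rental",  # assumed tagging, not decided here
        "ref": ref,
        "name": name,
        "capacity": str(bike_stands),  # OSM tag values are strings
        "fixme": "Location taken from the Villo! API; please survey",
    }


example_tags = tags_for_unmapped_station("311", "Delta", 22)
```

A mapper who later surveys the station would correct the position and remove the fixme tag.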
Best,
Cédric