[OSM-dev] Data source for robot

Tue Oct 12 15:23:57 BST 2010

Serge Wroclawski <emacsen at gmail.com> writes:

> On Tue, Oct 12, 2010 at 4:36 AM, Peter Budny <peterb at gatech.edu> wrote:
>
>> If route relations are not required, then what are
>> http://wiki.openstreetmap.org/wiki/Relation:route#Road_Routes for?
>
> Not required and "don't exist" aren't quite the same things.
>
> One major issue with relations in general is very little software
> knows how to handle them, and that's especially true for things like
> routing software, but that's not at the core of my concerns, which
> I'll elaborate on later in the mail.

I think this is somewhat orthogonal.  I'm pre-supposing that the
relations ought to be created.  If we decide no, there's probably
something similar I can do.

Keep this as an assumption and let's move on for a bit.

> *snip*
>> They /are/ required, because roads may be discontiguous in various ways:
>> a road may change names (e.g. Main Street North becomes Main Street
>> South, but to a driver or pedestrian, both are just one continuous Main
>> Street), or even be physically discontiguous (some state and even US
>> Highways do this).
>
> I'm a little confused by this example.
>
> "Main Street North becomes Main Street" - how would you handle this?
> What specifically would you do? Add a relation? What tags would you
> add, or remove, from the individual ways?

I'm not sure what answer you're looking for.  How would you tag these
roads so it's unquestionable that they are part of the same road?

As a better example, how about US-41, which looks like it has several
hundred names along its length?
http://www.openstreetmap.org/browse/relation/444690
Just in my area, it's called Cobb Parkway, Northside Drive, Northside
Parkway, etc.  How would you tag all those ways so it's totally clear
that they all make up US-41?  Road relations exist exactly for this
purpose, but maybe you envision something else like ref=* tags.

>>  Using TIGER data, we can automate the
>> process, but the bot's work will not be perfect; humans will still have
>> to check it and make a few corrections.  Still, if it does 95% of the
>> work for them correctly, this is pretty good IMO.  (After all, TIGER data
>> itself is not even close to 95% correct.)
>
> You've identified several issues in this paragraph, and I'd like to
> flesh them out:
>
> 1) Your data source, TIGER, is by your own admission, not accurate. I
> don't want to get into a discussion about TIGER (that may be best left
> for osm-us), but when you start with a dataset as, let's say
> "controversial" as TIGER, you can expect a lot of concerns from the
> community.

For all its problems, TIGER comes with an abundance of metadata.  In
particular, all roads have 1 or more tiger:base_name=* tags.  This has
the name of the road minus any prefixes or suffixes (e.g., "West Main
Street" would just have "Main").

For US Highways and State Roads, this is also included as a base name,
e.g. "tiger:base_name_2=State Route 400".

Using this, it's not hard to group the pieces of the road together into
a relation (or tag their ref=* appropriately, if that's decided
instead).
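
A rough sketch of that grouping, in Python.  The helper names, the
prefix and abbreviation tables, and the 0.8 similarity cutoff are all
assumptions made up for illustration; the tag keys are the ones named
above.  Because the names are not always spelled the same way, the
sketch folds abbreviations and near-miss spellings into one canonical
form before grouping.

```python
import difflib
import re
from collections import defaultdict

# Hypothetical tables; a real bot would need every spelling TIGER uses.
CANONICAL_PREFIXES = ["State Route", "US Highway"]
ABBREVIATIONS = {"SR": "State Route", "US HWY": "US Highway"}


def normalize_base_name(raw):
    """Return a canonical 'Prefix Number' string, or None if raw is not route-like."""
    match = re.fullmatch(r"\s*(.+?)\s+(\d+)\s*", raw)
    if not match:
        return None  # plain street names such as "Main" carry no route number
    prefix, number = match.group(1), match.group(2)
    key = prefix.upper().replace(".", "")
    if key in ABBREVIATIONS:
        return f"{ABBREVIATIONS[key]} {number}"
    # Fuzzy step: tolerate typos such as "Staet Route".
    by_lower = {name.lower(): name for name in CANONICAL_PREFIXES}
    close = difflib.get_close_matches(prefix.lower(), list(by_lower), n=1, cutoff=0.8)
    return f"{by_lower[close[0]]} {number}" if close else None


def group_ways_by_route(ways, key_prefix="tiger:base_name"):
    """ways maps way id -> tag dict; returns canonical route name -> list of way ids."""
    groups = defaultdict(list)
    for way_id, tags in ways.items():
        routes = {
            normalize_base_name(value)
            for key, value in tags.items()
            if key.startswith(key_prefix)
        }
        for route in sorted(r for r in routes if r):
            groups[route].append(way_id)
    return dict(groups)
```

Under these assumptions, "SR 400" and "Staet Route 400" both end up in
the "State Route 400" group, and a way whose only base name is "Main"
is left out.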

However, TIGER data is inconsistent (it may have "SR" instead of "State
Route") and sometimes wrong (it may have typos like "Staet" instead of
"State").  Some fuzzy matching will help smooth this out.

> 2) You say that humans will have to check it and make corrections.
> What mechanism do you propose to integrate into your mass-edits which
> would integrate human validation? In other words, how do you plan on
> accomplishing the human validation step before modifying the database?

There are problems out there now that are not being fixed.

Example 1: Un-joined ways (due to chunks of a road being part of
multiple imports, usually at county or state boundaries).
Currently, there are hundreds of thousands of these out there, lurking
unknown.  A robot could notice that there is a gap in the road and flag
it as such on OpenStreetBugs.
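
One way such a check might look, as a Python sketch.  Everything here
is an assumption for illustration: the input layout, the function
names, and both distance thresholds.  It looks for way ends that no
other way in the same road shares, pairs each with its nearest such
end on a different way, and keeps only pairs that are each other's
nearest.  Filing the result on OpenStreetBugs is not shown.

```python
import math
from collections import Counter


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(h))


def find_gaps(way_ends, small_gap_m=30.0, max_gap_m=2000.0):
    """way_ends maps way id -> ((node_id, lat, lon), (node_id, lat, lon)).

    Returns (node_a, node_b, distance_m, kind) for each suspected gap.
    """
    use_count = Counter(n[0] for ends in way_ends.values() for n in ends)
    # An end node used by only one way of the road is "dangling".
    dangling = [(way_id, node_id, (lat, lon))
                for way_id, ends in way_ends.items()
                for node_id, lat, lon in ends
                if use_count[node_id] == 1]

    def nearest(i):
        """Index and distance of the closest dangling end on another way."""
        best = None
        for j, other in enumerate(dangling):
            if other[0] == dangling[i][0]:
                continue
            dist = haversine_m(dangling[i][2], other[2])
            if best is None or dist < best[1]:
                best = (j, dist)
        return best

    gaps = []
    for i in range(len(dangling)):
        hit = nearest(i)
        if hit is None:
            continue
        j, dist = hit
        # Report each mutual-nearest pair once, and only if plausibly a gap.
        if i < j and dist <= max_gap_m and nearest(j)[0] == i:
            kind = "small gap" if dist <= small_gap_m else "large discontinuity"
            gaps.append((dangling[i][1], dangling[j][1], dist, kind))
    return gaps
```

The mutual-nearest rule is a design choice meant to keep the true ends
of the road from being paired with an interior gap; a short road with
no gap at all could still be flagged, which is one reason a human
should look at every report.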

Example 2: One-way roads.  TIGER isn't good about indicating the
directionality of a road, and there are a lot of rural areas that
haven't seen any editing yet.  Consequently, there are a lot of
dual-carriageways that are not marked as oneway=yes.  A robot could make
intelligent guesses at whether the road is a dual-carriageway (two
nearly-parallel roads with the same name, and at both ends only a single
way with the same name continues? hard to imagine what that could be
besides a single-carriageway becoming dual and then reverting to single)
and mark the ways as oneway=yes.
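
The guess described above could be sketched like this in Python.  The
Way class, the function names, and the 60 m separation limit are
assumptions for illustration, and "nearly parallel" is approximated by
requiring every interior node of one way to lie close to some node of
the other.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Way:
    way_id: int
    name: str
    nodes: tuple  # ((node_id, lat, lon), ...) in drawing order


def approx_dist_m(p, q):
    """Equirectangular distance in metres; adequate over short distances."""
    mean_lat = math.radians((p[1] + q[1]) / 2)
    dx = math.radians(q[2] - p[2]) * math.cos(mean_lat)
    dy = math.radians(q[1] - p[1])
    return 6371000 * math.hypot(dx, dy)


def end_ids(way):
    """Ids of the first and last node of a way."""
    return {way.nodes[0][0], way.nodes[-1][0]}


def looks_like_dual_carriageway(a, b, all_ways, max_sep_m=60.0):
    # Same name, and both ways run between the same two junction nodes.
    if a.name != b.name or end_ids(a) != end_ids(b) or len(end_ids(a)) != 2:
        return False
    # Every interior node of one way must have a node of the other nearby.
    for src, dst in ((a, b), (b, a)):
        for node in src.nodes[1:-1]:
            if min(approx_dist_m(node, other) for other in dst.nodes) > max_sep_m:
                return False
    # At each junction exactly one other way of the same name continues.
    for junction in end_ids(a):
        continuing = [w for w in all_ways
                      if w.way_id not in (a.way_id, b.way_id)
                      and w.name == a.name
                      and junction in end_ids(w)]
        if len(continuing) != 1:
            return False
    return True
```

This only answers "is it probably a dual carriageway?".  Since
oneway=yes follows the direction in which a way was drawn, the robot
would still have to check each way's node order against the side of
the road traffic drives on before tagging it.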

Is this likely to introduce some errors?  Yes, a few.  I consider it to
/already/ be wrong if it's supposed to be oneway=yes and isn't.  So the
robot would be fixing more errors than it's introducing.

The larger point, though, is that creating relations is really tedious
work.  If the robot can do most of the work, and then leave notes like
"Large discontinuity exists at lat-lon, please check" or "Small gap;
probably a connectivity issue from the TIGER import", then not only will
it have done >90% of the work for the user, but it will also have
identified the problem spots, which currently isn't done... users are
expected to just browse around the map until they see something wrong,
which often isn't apparent (especially with the TIGER road connectivity
problems).

> 3) The road to hell in OSM is paved with bot intentions.
>
> OSM has a long, negative history with bots. We have a very small
> number of good imports, and dozens (if not more) bad imports. Bad
> imports are so commonplace in OSM that within the OSM community, bots
> of any sort are discouraged, but especially any imports, and
> especially (as you appear to be proposing), merging existing data with
> imported data.

All the robot will do is upload new road relations... nothing more.  No
merging relations, no merging ways (at most it might set ref= or oneway=
tags on a small number of ways).  It will check for existing road
relations before creating one for a particular road.  The wiki has quite
a few lists of relations; these can also be used to avoid overlapping
data.
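
The "check before creating" step might look roughly like this in
Python.  The function name and data layout are hypothetical, and the
sketch assumes existing road relations can be recognised by
type=route, route=road plus a network=* and ref=* pair; relations
tagged some other way would need extra handling.

```python
def plan_new_relations(candidates, existing_relations):
    """Split candidate routes into relations to create and routes to skip.

    candidates: dict mapping (network, ref) -> list of member way ids.
    existing_relations: iterable of tag dicts for relations already in OSM.
    """
    already_mapped = {
        (tags.get("network"), tags.get("ref"))
        for tags in existing_relations
        if tags.get("type") == "route" and tags.get("route") == "road"
    }
    to_create, skipped = [], []
    for (network, ref), way_ids in sorted(candidates.items()):
        if (network, ref) in already_mapped:
            # Someone has already done this one; leave it alone.
            skipped.append((network, ref))
            continue
        to_create.append({
            "tags": {"type": "route", "route": "road",
                     "network": network, "ref": ref},
            "members": list(way_ids),
        })
    return to_create, skipped
```

The skipped list is worth keeping rather than discarding, since it
tells a human which routes the robot deliberately did not touch.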

> 4) How well do you know OSM?
>
> Elaborating on my previous point, OSM is a very attractive project and
> very smart folks come to it all the time with a great idea about how a
> bot or an import could be very beneficial. Unfortunately, while these
> people may understand the data, and maybe the representation, unless
> you're familiar with OSM, you don't know the pitfalls that come in and
> cause the most problems.
>
> Here's a small but real example: Let's say your import is chugging
> along, and then it comes across an area where someone's already done
> the work. How will it react? Would it overwrite the contributor's
> work? Would it stop? If it stops, would it know which segments have
> been committed to the DB and which haven't (ie would it be able to
> prevent duplicates?) Would your bot handle tags which users may have
> added to the way, or relation? And so on...
>
> This is why the observation about bots we have is that no one who has
> been with the project < a year should do them. And most people who
> suggest making bots have been with the project < 6 months.

I've been contributing heavily (as heavily as a student with not much
free time can) for just about 1 year.  I've been active on the wiki (but
just joined the mailing lists, as it isn't apparent from the wiki that
this is where all the heavy discussion takes place!)

> 5) Academic Research
>
> I think that it's great that academics are interested in using OSM for
> their research. But at the same time, I've worked in academic
> computing for most of my professional career, surrounded by some of
> the smartest people in their field, at both NIH, and NASA. These are
> the best of the best.
>
> And my view of much of what's produced by academics who write software
> is that it's poo-poo.

I sympathize; as a TA I've looked at too much code to believe that being
a graduate student automatically makes one a good coder.  This is not
my first rodeo, however, and I'd ask that you not dismiss my ability to
contribute out of hand just yet.
-- 
Peter Budny  \
Georgia Tech  \
CS PhD student \