[Talk-us] Using TIGER to find missing road segments in OSM after license change

Thu Mar 29 19:32:14 BST 2012

Hi,

On Thu, Mar 29, 2012 at 7:45 AM, Ian Dees <ian.dees at gmail.com> wrote:

>
>
> On Thu, Mar 29, 2012 at 8:11 AM, Josh Doe <josh at joshdoe.com> wrote:
>
>> On Thu, Mar 29, 2012 at 8:40 AM, Richard Weait <richard at weait.com> wrote:
>> > On Wed, Mar 28, 2012 at 8:48 AM, Ian Dees <ian.dees at gmail.com> wrote:
>> >> On Wed, Mar 28, 2012 at 5:47 AM, Josh Doe <josh at joshdoe.com> wrote:
>> >
>> > [ ... ]
>> >>> I was thinking more about using TIGER 2011
>> >>> to find roads that seem to be missing in the OSM database. My PostGIS
>> >>> skills are nil, but it seems like it should be a fairly trivial query
>> >>> to buffer the
>> >>> OSM ways and find TIGER segments which don't intersect the buffered
>> ways.
>> > [ ... ]
>> >> I started playing with this last night and ended up with the Chicago
>> area
>> >> metro extract from Mike and the Cook County TIGER roads data as layers
>> in
>> >> QGIS. Next up is to play with various queries to find missing roads in
>> OSM.
>> >> I like the idea of buffer and joining as a start and will probably
>> move over
>> >> to PostGIS to do that.
>> >
>> > We used OpenJUMP and the RoadMatcher plugin in the early days of
>> > Canadian imports to generate lists of matching or missing roads.
>>
>> I've actually just converted the conflation JOSM-plugin to use the
>> Java Conflation Suite (JCS), which RoadMatcher is based on. I don't
>> expect to have RoadMatcher-like capabilities in there for quite a
>> while, but I should soon at least be able to find which segments don't
>> have a match in OSM based on string similarity (e.g. Levenshtein) and
>> curve similarity (e.g. Hausdorff, Frechet).
>
>
> After loading Cook County TIGER road features and OSM linear features into
> PostGIS, I ran a simple query to find how well the roads matched:
>
> SELECT a.name, b.fullname, ST_HausdorffDistance(a.geom, b.geom) as dist
>      FROM cook_tiger a, cook_osm b
>      WHERE (a.geom && b.geom) AND ST_HausdorffDistance(a.geom, b.geom) <
> 0.0005
>      LIMIT 50
>
> This returned results that made sense (the names matched in all 50
> results).
>
> I removed the LIMIT clause and let it run before going to work to see how
> many of the TIGER records match existing OSM features.
>
> Next up is building a table of TIGER -> OSM matches and using that to find
> TIGER rows that don't have a corresponding OSM feature.
>
> If anyone has any ideas for speeding this up I'd love to hear it. It took
> well over a couple hours to run one county. There are a lot of counties in
> the US.
>
> Hi,

I ran a similar query on Salt Lake County and found ~7000 matches in under
7 seconds. That would scale to 3400-something counties OK.

The query I used was a little more elaborate and included a levenshtein
string distance comparison where I took name, name_1, name_2 and loc_name
into account:

SELECT
    a.fullname::text as tiger_name,
    b.tags->'name' as osm_name,
    b.tags->'name_1' as osm_name1,
    b.tags->'name_2' as osm_name2,
    b.tags->'loc_name' as osm_locname,
    ST_HausdorffDistance(a.the_geom, b.linestring) as dist,
    LEAST(
        NULLIF(difference(a.fullname::text, b.tags->'name'), 0),
        NULLIF(difference(a.fullname::text, b.tags->'name_1'), 0),
        NULLIF(difference(a.fullname::text, b.tags->'name_2'), 0),
        NULLIF(difference(a.fullname::text, b.tags->'loc_name'),0))
    AS diff,
    LEAST(
        levenshtein(a.fullname::text, b.tags->'name'),
        levenshtein(a.fullname::text, b.tags->'name_1'),
        levenshtein(a.fullname::text, b.tags->'name_2'),
        levenshtein(a.fullname::text, b.tags->'loc_name'))
    AS lev
FROM
    tigerlines a,
    ways b
WHERE
    ST_HausdorffDistance(a.the_geom, b.linestring) < 0.005
AND
    (a.the_geom && b.linestring)
-- THE TIGER NAME FIELD SHOULD BE FILLED
AND
    char_length(a.fullname::text) > 0
-- AT LEAST ONE OF THE NAME FIELDS SHOULD BE FILLED
AND
    COALESCE(
        char_length(b.tags->'name'),
        char_length(b.tags->'name_1'),
        char_length(b.tags->'name_2'),
        char_length(b.tags->'loc_name')) IS NOT NULL
AND
    b.tags?'highway'
AND
    a.roadflg = 'Y'
AND
    LEAST(
        levenshtein(a.fullname::text, b.tags->'name'),
        levenshtein(a.fullname::text, b.tags->'name_1'),
        levenshtein(a.fullname::text, b.tags->'name_2'),
        levenshtein(a.fullname::text, b.tags->'loc_name'))
    < 7
ORDER BY
    lev DESC
;

I calculated the Soundex difference as well as the Levenshtein difference
for informational purposes. A Levenshtein distance of 7 represented the
sweet spot between false positives and false negatives in terms of string
comparison.

The main problem I found with the original query is the Haussdorff distance
parameter, which chosen so small would not catch TIGER ways and OSM ways
that are of very different length. I increased the distance threshold by a
factor 10 but still get about 38000 candidates for missing roads based on
this query (doing a select * from tigerlines where gid not in ....). But
maybe I'm overlooking something else altogether in this. More complex
queries tend to be overwhelming to me. Maybe I'm not using the right TIGER
source file? I picked the all lines one.

Have others made any progress with this?

> _______________________________________________
> Talk-us mailing list
> Talk-us at openstreetmap.org
> http://lists.openstreetmap.org/listinfo/talk-us
>
>

-- 
martijn van exel
geospatial omnivore
1109 1st ave #2
salt lake city, ut 84103
801-550-5815
http://oegeo.wordpress.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openstreetmap.org/pipermail/talk-us/attachments/20120329/01a3eeae/attachment.html>