[Talk-GB] TfL cycle data published - proposed conflation process

Martin - CycleStreets list-osm-talk-gb at cyclestreets.net
Sun Oct 13 13:17:05 UTC 2019



I've been looking at the various tools available, e.g. JOSM Conflation, 
Hootenanny, OpenStreetMap Live Conflation, etc.

Whatever tool is best, a process is needed. May I seek comments on this 
proposal, which would be put to TfL in my report to them:


----


A proposed process for conflation would be:


1. Seek OSM community agreement in principle that the CID data is useful 
for OSM (done).


2. Confirm licensing compatibility (done).


3. Consult on proposed technical translation of data (as per discussion in 
this talk-gb thread, ongoing).
https://bikedata.cyclestreets.net/tflcid/conversion/


4. Write a programming script based on this technical translation, which 
converts the CID data (using the version containing OSM IDs) to .osm 
format. The fundamental aim of this is to get the CID data to be as 
compatible with OSM norms as possible, so that the amount of effort needed 
in the eventual conflation tool will be as low as possible. This converted 
dataset is referred to below as the “External OSM-compatible format 
dataset”. This will require expert programming work undertaken by a 
programmer fully conversant with the OSM data model. Estimate: 10 days. 
(A minimal illustrative sketch of this kind of translation follows below.)
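
As a minimal sketch of this kind of translation, assuming GeoJSON input: 
the CID field names (FEATURE_ID, CLT_SEGREG, CLT_ADVIS), the file name and 
the tag values shown are placeholders purely for illustration, and the 
actual mapping would be the agreed technical translation linked in step 3.

import json

def cid_feature_to_osm_tags(feature):
    """Translate one CID GeoJSON feature into OSM-style tags.
    Field names and mapping rules here are illustrative placeholders only."""
    props = feature.get("properties", {})
    tags = {
        "source": "Transport for London Cycling Infrastructure Database",
        "tfl_id": props.get("FEATURE_ID"),
    }
    # Hypothetical rules standing in for the agreed technical translation
    if props.get("CLT_SEGREG") == "TRUE":
        tags["cycleway"] = "track"
    elif props.get("CLT_ADVIS") == "TRUE":
        tags["cycleway"] = "lane"
    return {key: value for key, value in tags.items() if value}

if __name__ == "__main__":
    with open("cycle_lane_track.json") as f:  # placeholder file name
        collection = json.load(f)
    for feature in collection["features"]:
        print(cid_feature_to_osm_tags(feature))

The real script would also carry the geometry through and emit .osm output 
rather than printing tags, but the overall shape (one translation function 
per asset type, driven by the agreed mapping) would be similar.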


5. ALPHA STAGE: small-scale merge of data into OSM. This stage aims to 
prove that the data is capable of being converted, and to demonstrate to 
the OSM community that it can be undertaken sensitively and accurately. It 
does not seek to produce a tool selection recommendation. This work should 
be undertaken by someone with experience of JOSM and the JOSM Conflation 
plugin. Estimate: 3-5 days.

i. Identify a suitable extract of the CID data covering only an area of 
10-20 smaller streets (see the extraction sketch after this list). This 
should be an outer London area, and avoid main roads, so that in the event 
that problems materialise, the effect on real users of OSM data is low. It 
should include both point-based and line-based assets, giving a good 
overview. It should aim to have a good variety of CID assets rather than 
the same type of asset dominating.

ii. Install the JOSM editor and the JOSM Conflation plugin, which provides 
a toolset for this alpha project. JOSM Conflation is the most sensible 
option, as it is the most widely used conflation tool in the OSM community. 
Although it requires manual inspection, it is workable for an alpha project 
at this smaller scale.

iii. Attempt a merge of the External OSM-compatible format dataset using 
this tool.

iv. Carefully and thoroughly observe the correctness of the data, iterating 
the script output and repeating these alpha steps until correctness is 
achieved.

v. Save the merged import data into the live OSM dataset and request 
community feedback.

vi. Manually fix up any identified problems arising from this feedback so 
that the data is correct, and fix the underlying problem in the script.

vii. At this point, feasibility of conversion has been established, and 
community confidence will be much stronger.
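
As a concrete illustration of step i above, a rough Python sketch of 
cutting out such an alpha-test extract is below, keeping only those CID 
features that fall inside a small bounding box. The file names and the 
bounding box are placeholders, only Point and LineString geometries are 
handled, and this is a sketch of the idea rather than a finished tool.

import json

def within_bbox(geometry, min_lon, min_lat, max_lon, max_lat):
    # Handles Point and LineString only; every vertex must fall in the box
    coords = geometry["coordinates"]
    points = [coords] if geometry["type"] == "Point" else coords
    return all(min_lon <= lon <= max_lon and min_lat <= lat <= max_lat
               for lon, lat in points)

with open("cid_assets.json") as f:              # placeholder input file
    collection = json.load(f)

bbox = (-0.10, 51.58, -0.08, 51.60)             # placeholder alpha area
subset = [feature for feature in collection["features"]
          if within_bbox(feature["geometry"], *bbox)]

with open("cid_alpha_subset.json", "w") as f:
    json.dump({"type": "FeatureCollection", "features": subset}, f)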


6. BETA STAGE: larger-scale merge of data for one area. This stage aims to 
identify the best merging tool for a fuller conversion with a view to 
creating a fully-optimised workflow. Estimate: 4-8 weeks.

i. Identify a suitable extract of the CID data to undertake a pilot 
conversion project. One of the 25 CID data packages would be an ideal size 
for such an evaluation, and each package is likely to contain sufficient 
variety.

ii. Identify the 2-3 most likely merging tools, e.g. JOSM Conflation and 
Hootenanny (see below).

iii. Install each such merging tool and learn and practise its use. The 
time required for this installation and evaluation should not be 
underestimated. These systems involve widely different technologies (even 
requiring different operating systems to be installed using a Virtual 
Machine), so this step could easily take 5 days. Test data will need to be 
prepared, trial runs created, and questions asked on mailing lists, etc.

iv. Identify the pros and cons of each tool and move towards a recommended 
solution, based on trialling with the data and the amount of manual 
fixing-up required.

v. Determine and iterate the workflow required for the tool.

vi. Adapt the now near-final script to perform conversion of this larger 
dataset for the selected tool. It is likely that the bulk of the conversion 
script will be unchanged, but that the final output format (e.g. 
.osm/Shapefile/GeoJSON) would need to be different based on the tool’s 
expectations.

vii. Substantial iteration of the conversion script and/or tool workflow is 
then likely to be required. For instance, merging will involve conflating 
data from a cycle lane in the CID data to the cycle lane present nearby in 
OSM. This scenario is likely to throw up several potential issues. For 
instance, the OSM ID may in fact now have changed; the lane might now be 
represented by multiple separate OSM IDs; there might be multiple cycle 
lanes nearby which need to be disambiguated; etc. Another example would be 
the inconsistent tagging of cycle lane/track-related data in OSM, which is 
acknowledged to be one of the most complex areas of OSM. The script will 
need to be adapted to deal with various edge-cases like these, so that the 
geometries and metadata are matched together correctly and existing OSM 
data that should be retained is not overwritten. (A rough sketch of this 
matching logic follows at the end of this list.)

viii. Inspect the conflated data and determine where manual inspection will 
be unavoidable vs. where fixes can be automated.

ix. Identify whether any upstream improvements could be made to the 
conflation tool being used, with a view to facilitating further automation 
of the workflow and reducing the need for avoidable, repetitive manual 
inspection. Liaise with the tool authors to determine feasibility and the 
likely time requirements for such development work.

x. Iterate the script and workflow to minimise as far as possible the need 
for these manual changes during an inspection stage.

xi. Document a key checklist of conversion types to check.

xii. Carefully and thoroughly observe the correctness of the data, 
iterating the script output and repeating these beta steps until 
correctness is achieved. Undertake manual changes that cannot be automated. 
The time required for this should not be underestimated – there will be 
around 10,000 assets within the data package, and all the various 
combinations of data should be checked.

xiii. Report to the OSM community at this stage, seeking their consent for 
merging in the data.

xiv. Save the merged import data into the live OSM dataset and request 
community feedback.

xv. Manually fix up any identified problems arising from this feedback so 
that the data is correct, and fix the underlying problem in the script.

xvi. At this point, the feasibility of and timescale for conversion have 
been established, and community confidence will be much stronger. The 
script will be in a near-final state for a mass import, and a set of 
instructions for manual inspection will be established. One of the 25 areas 
will be in OSM, and this data will be picked up by routing and cartography 
systems, entering real-world use within days/weeks.

xvii. Relay the findings back to TfL in the form of a short document. This 
will:

a. Confirm what data within the CID has and has not been imported.

b. Include an estimate of the time requirement for the remaining 24 areas, 
based on an extrapolation of applying the finalised script and manual 
procedures.

c. Include a recommendation on whether this activity should be undertaken 
on a paid professional basis or whether crowdsourcing is realistic given 
the time, complexity and data volume.

d. Include any proposals for making improvements to tools and the likely 
cost, which TfL may wish to consider funding.
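
To illustrate the matching difficulties described in step vii above, the 
following rough Python sketch shows the general shape of the logic: match 
each CID line asset to the nearest candidate OSM way within a threshold, 
and send anything ambiguous or unmatched to manual review. The distance 
calculation is deliberately crude, and all names and thresholds here are 
assumptions; a real implementation would use projected geometry and a 
spatial index.

import math

def midpoint(line):
    lons = [point[0] for point in line]
    lats = [point[1] for point in line]
    return sum(lons) / len(lons), sum(lats) / len(lats)

def rough_distance_m(a, b):
    # Equirectangular approximation; adequate for a sketch at street scale
    lat = math.radians((a[1] + b[1]) / 2)
    dx = math.radians(b[0] - a[0]) * math.cos(lat) * 6371000
    dy = math.radians(b[1] - a[1]) * 6371000
    return math.hypot(dx, dy)

def match_asset(cid_line, candidate_ways, threshold_m=15, ambiguity_m=5):
    # candidate_ways: list of (osm_way_id, [(lon, lat), ...]) tuples
    target = midpoint(cid_line)
    scored = sorted((rough_distance_m(target, midpoint(geometry)), way_id)
                    for way_id, geometry in candidate_ways)
    if not scored or scored[0][0] > threshold_m:
        return None, "no match found; send to manual review"
    if len(scored) > 1 and scored[1][0] - scored[0][0] < ambiguity_m:
        return None, "ambiguous; multiple nearby ways, manual review"
    return scored[0][1], "matched"

In practice the candidate ways would be fetched from OSM around each asset 
(e.g. via an Overpass query), and the thresholds tuned during the beta 
iterations above.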


7. FINAL STAGE: full merge. This step involves re-running the finalised 
script/workflow and manual procedures for each of the 24 remaining data 
package areas. Estimate: as defined in the beta report.

i. Run the script to convert the data for the 24 data package areas.

ii. Conduct the workflow for each of the 24 data package areas.

iii. Seek community input as this work proceeds.

iv. Import the data and fix up issues arising from feedback.

v. Report back to the OSM community.

vi. Produce a final report for TfL confirming completion of the activity.




Martin,                     **  CycleStreets - For Cyclists, By Cyclists
Developer, CycleStreets     **  https://www.cyclestreets.net/

