[OSM-dev] Import of Tiger data

Don Smith dcsmith at gmail.com
Wed Mar 28 15:59:36 BST 2007


Regarding the import script, whatever language you prefer is fine. Having
said that, I'm most proficient in Java and would prefer it for speed
of development, but I'm willing to do what's best for project
maintenance.

As for the existing OSM data, I think I'd like end users to decide
which data is correct. For one thing, the street names might not be
an exact match. Second, doing it programmatically, there is no way to
reconcile what is in the data with the real world. If we make the
assumption that user data for each segment will always be more
correct, we could possibly do something, but all names would then have
to be an exact match to proceed programmatically (consider I-70, I70,
Interstate 70).
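
To illustrate the name problem, here is a minimal Python sketch of
normalizing names before comparing them. The function names and the
single interstate rule are assumptions made up for this example; a real
import would need a much fuller table of prefixes and street types.

```python
import re

# Hypothetical rule: "I-70", "I70", "I 70" and "Interstate 70" all
# refer to the same road. Only this one pattern is handled here.
_INTERSTATE = re.compile(r"^\s*(?:interstate|i)[\s\-]*(\d+)\s*$", re.IGNORECASE)


def normalize_name(name: str) -> str:
    """Return a comparison key for a street name."""
    match = _INTERSTATE.match(name)
    if match:
        return f"I-{int(match.group(1))}"
    # Fallback: collapse runs of whitespace and ignore case.
    return " ".join(name.split()).casefold()


def names_match(a: str, b: str) -> bool:
    """True when two names normalize to the same key."""
    return normalize_name(a) == normalize_name(b)
```

Even with this, "Main St" and "Main Street" would not match, which is
part of why I'd rather leave the final call to a person.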

As for the TIGER data, while I'm hopeful, I've also heard that it's
of mixed value. I think it's excellent for prepopulating the US, but the
people I've spoken to who've used it say that it has a few
shortcomings (it lacks one-way street directions, and positions are
not especially precise). If anything, I see TIGER as a jump start for
the US that would still need a lot of work.

Don Smith
On Mar 28, 2007, at 2:52 AM, Nathan Rover wrote:

> Don,
> Good to see that you're on board with the TIGER database first  
> approach. I really think this will be the best way to go about the  
> import. I too would like to do this right even if it takes a little  
> longer.
>
> I'm curious if Thomas has any thoughts on this course of action, or  
> anyone else who is interested in the TIGER import.
>
> Yes, I'm downloading the March 07 data.
>
> Yeah, import scripts will be the first step. Thomas and I had
> talked about using Python; will this be all right for you? It should
> not take too long to make import scripts. I might also write a
> script to unzip the files and build a directory structure for all the
> TIGER text files. It would take a long time to unzip them by hand.
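
A rough Python sketch of such an unzip script follows. The function name
and the "one folder per archive" layout are assumptions for illustration
only, not the actual layout of the Census download.

```python
import zipfile
from pathlib import Path


def unzip_tree(src_dir, dest_dir):
    """Extract every .zip found under src_dir into its own folder under dest_dir."""
    src, dest = Path(src_dir), Path(dest_dir)
    extracted = []
    for archive in sorted(src.rglob("*.zip")):
        # Mirror the archive's relative path, dropping the .zip suffix,
        # so state/county.zip ends up in dest/state/county/.
        target = dest / archive.relative_to(src).with_suffix("")
        target.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(target)
    return extracted
```

Calling unzip_tree("downloads", "tiger_text") would then walk the whole
download directory in one go.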
>
> The server I'm going to use is an IBM x330. It has dual PIII 1.16
> GHz CPUs, dual SCSI U160 36 GB drives, and at least 1 GB of RAM; I
> could put in something like 4 GB (borrowed from other dev servers) for
> a few days while running the import files. It's not the fastest
> server, but these machines are very reliable, and it won't be doing
> anything else, so all the resources can be devoted to the import. They
> are also capable of running at a very high sustained load without any
> thermal issues. I'm going to be running this out of my office, so it
> will only have a DSL connection. I most likely will not be driving (6
> hr each way) to my hosting facility any time soon; if I do, I will put
> it up on the T1 line. If you feel so inclined we could make a
> cluster... I have 5 available dev servers right now, but I have a
> feeling it would take longer to set up the cluster than it would to
> just run the scripts on the one server. ;-)
>
> I think we should also address the issue of "Effort in the US  
> wasted until TIGER import is complete?" that way we could  
> eventually change the subject line to something like "TIGER import."
>
> My first thought is that effort is not necessarily wasted, but you
> might as well wait and see what your area looks like once the data is
> imported. And because of the way we are going to import the data,
> i.e. on a dev machine first, we won't have to worry about writing
> over roads that currently exist in the OSM database. When it comes
> time to make the merger, we can then compare the current roads side by
> side with the TIGER version of each road. We should be able to set up
> some guidelines to determine which version of the road is better,
> then make the necessary changes.
>
> I think one thing we should work out soon is a Key/Value pair for
> identifying whether the way or node is from a user or from TIGER, and
> also which version of TIGER. In a year, we will most likely have an
> updated version of the TIGER files to import, so we will need a way
> to first figure out what changes were made between the two versions
> (this could probably be done with the MySQL TIGER database), and from
> there make a set of new roads and roads that need to be updated. This
> information will then need to be sent to the server. If the existing
> data is marked TIGER, it can be automatically updated; if it was
> last updated by a person, the two roads will need to be analyzed by
> a person to determine which is the more accurate.
>
> I think this TIGER data is going to be an important part of the US OSM
> effort. It would take such a long time to map and update the roads
> in the US that unless there were a few thousand volunteers all over
> the country, there is no way the US would ever get mapped without
> the TIGER data. But even with the TIGER data, I think volunteers
> will still play an important part. I know that after the import is
> done, one thing I'm looking forward to doing is mapping the foot
> paths of my local state parks. Some of the more complex interstate
> ramps and interchanges might also need to be mapped by volunteers.
> I'm not sure how well the TIGER data handles exit and entrance ramps.
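
The update rule described above (auto-update what is still marked TIGER,
hand anything a person has touched to a reviewer) could be sketched
roughly as below. The tag keys "source" and "tiger:version", the version
strings, and the extra "skip" and "insert" outcomes are all assumptions
for illustration; no such Key/Value pair has been agreed on yet.

```python
# Hypothetical tag keys; the actual Key/Value pair is still to be decided.
SOURCE_KEY = "source"
VERSION_KEY = "tiger:version"


def plan_update(existing_tags, new_version):
    """Decide what to do with one road when a newer TIGER release arrives.

    existing_tags is the tag dict of the road already in the database,
    or None if the road is not there yet.
    """
    if existing_tags is None:
        return "insert"   # new road in this release
    if existing_tags.get(SOURCE_KEY) != "tiger":
        return "review"   # last touched by a person: needs a human
    if existing_tags.get(VERSION_KEY) == new_version:
        return "skip"     # already at this release
    return "update"       # untouched TIGER data: safe to overwrite
```

For example, a way tagged {"source": "tiger", "tiger:version": "2006se"}
would come back as "update" when the new release is "2007fe", while a way
with no source tag at all would come back as "review".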
>
> I would also like to hear from the OSM database admins regarding
> whether, after we're done and the data is tested, we would be
> able to make a copy of the database with all the TIGER data, then
> ship it over on DVD and have it imported some night. I will also
> need an empty version of the current database to put on the dev
> computer, and we're going to have to figure out how to handle the
> keys between the two versions of the database.
>
> Thanks, Nathan Rover
>
> Don Smith wrote:
>> I agree with the idea of putting it in SQL first. I think it would
>> produce more reasonable data, as you're correct that dealing with
>> interstates, or even state routes across multiple counties, would
>> be a problem. My only concern is that using a database is always
>> slower than file I/O and memory, especially with large record
>> counts. However, fast and wrong is worse than slow and right.
>>
>> Did you get the latest release of the data (March 3?).
>>
>> I assume you would need something to load the data. Looking at the
>> TIGER data dictionary, this does not seem too bad, as each column has
>> a fixed width, and strings could be trimmed. Is the machine you
>> have in mind for the testing environment somewhat substantial?
>> Also, no objections to Debian.
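
As a sketch of the fixed-width loading idea in Python: slice each line by
column position and trim the padding. The layout below is hypothetical
and is NOT the real TIGER record layout; the actual names and offsets
would have to come from the data dictionary.

```python
# Hypothetical layout: (field name, first column, last column), 1-based
# and inclusive, the way a data dictionary usually lists them.
# These are made-up offsets, not the real TIGER ones.
LAYOUT = [
    ("record_type", 1, 1),
    ("line_id", 2, 11),
    ("name", 12, 41),
    ("feature_type", 42, 45),
]


def parse_record(line, layout=LAYOUT):
    """Slice one fixed-width line into a dict of trimmed strings."""
    return {name: line[start - 1:end].strip() for name, start, end in layout}
```

Each parsed dict could then be turned into one SQL insert, with empty
strings mapped to NULL if that turns out to be useful.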
>>
>> Don Smith
>> On Mar 28, 2007, at 12:26 AM, Nathan Rover wrote:
>>
>>> Don,
>>> Yeah, the idea expressed in the last e-mail would also solve this problem.
>>>
>>> Nathan
>>>
>>> Don Smith wrote:
>>>> TIGER contains roads, railroads, various other transportation
>>>> features, and landmarks (such as churches, schools, parks, and
>>>> cemeteries).
>>>> More info here:
>>>> http://www.census.gov/geo/www/tiger/tiger2006se/tgr2006se.html
>>>>
>>>> I didn't see the importer dealing with the other data, but I
>>>> haven't checked too closely. A new TIGER file was released at
>>>> the beginning of the month, which claimed to correct some of the
>>>> data, though in the second half of the year they're moving to
>>>> shapefiles from their own proprietary text format.
>>>>
>>>> Don Smith
>>>> On Mar 27, 2007, at 11:16 PM, Cory Lueninghoener wrote:
>>>>
>>>>> As someone who has done a fair amount of US mapping over the last
>>>>> couple of months (see the Chicago area), I'm curious: what exactly
>>>>> does the TIGER database hold?  Is it just street "segments" with
>>>>> names and endpoints?  Does it have interstates, house number
>>>>> information, any other street information (size, direction, etc.)
>>>>> or anything else of use?  I definitely look forward to having at
>>>>> least a base for the whole country done within a matter of weeks
>>>>> (months), but assuming we'll still need to tag lots of information
>>>>> and add things like train lines, interstates (?), parks, etc., I'll
>>>>> keep up my manual efforts with plans to port them over when the
>>>>> time comes.
>>>>>
>>>>> On 3/27/07, Don Smith <dcsmith at gmail.com> wrote:
>>>>>> Is there a test machine set up?
>>>>>> I'm still looking at the code. For simplicity's sake, I'd like to
>>>>>> remove the ability to daemonize the code, and just run it from the
>>>>>> command line as a regular process, with a high priority. If the
>>>>>> idea is to do this on a testing db instead of the main db, and then
>>>>>> import a dump of the db, then I believe this makes sense. Another
>>>>>> thing: someone was talking about setting up a copy of OSM to import
>>>>>> into. Instead I'd like to suggest that this data is just loaded
>>>>>> into a blank template until everything works, and then we worry
>>>>>> about merging the US data, either by using something like a temp
>>>>>> tag to differentiate it and merging manually, or by doing something
>>>>>> programmatic after the fact. Whatever the user contributions are, I
>>>>>> would assume they're much smaller, although probably more accurate,
>>>>>> than the TIGER data.
>>>>>>
>>>>>> Don Smith
>>>>>> On Mar 23, 2007, at 1:29 AM, Nathan Rover wrote:
>>>>>>
>>>>>> > I'm thinking we need to set up a server with a mirror of the
>>>>>> > current database, then just run a few counties, then eventually a
>>>>>> > large batch (one state at a time?). Then we can conduct some
>>>>>> > rigorous testing, and if the data looks good and we won't cause
>>>>>> > too much of a problem on the production server, we could then ship
>>>>>> > the data across the pond (either over the net or FedEx some DVDs).
>>>>>> > Then one night perhaps an admin could import the data. I've never
>>>>>> > been a big fan of messing with production servers, and it seems
>>>>>> > too costly both in time and bandwidth to try to run exports from
>>>>>> > TIGER data on a box in central Missouri to the UK at one- or
>>>>>> > three-second intervals, especially when this will require lots of
>>>>>> > testing to make it work correctly.
>>>>>> >
>>>>>> > I have the hardware for the mirror, and I'm working on getting the
>>>>>> > TIGER data. If this is a direction everyone agrees with, then the
>>>>>> > next thing I need is a little guidance on how to set up a server
>>>>>> > that will be a good software mirror of the production one. The
>>>>>> > closer the software setup matches the production machine
>>>>>> > (especially the database system and the APIs), the better our
>>>>>> > testing can be and the less likely there will be a problem down
>>>>>> > the road when we try to integrate the data onto one box.
>>>>>> >
>>>>>> > Can someone send me a copy of the Ruby code?
>>>>>> >
>>>>>> > Nathan Rover
>>>>>> >
>>>>>> > Don Smith wrote:
>>>>>> >> No, I'm currently not familiar with the data model, so I should
>>>>>> >> look into that.
>>>>>> >> As for a 1 sec/insert cycle, if we don't do it on the primary db,
>>>>>> >> it makes no immediate sense, and I'd be interested in timings
>>>>>> >> without it. I have no idea how many inserts are going on, but I
>>>>>> >> would guess from a ballpark on the size of TIGER that you'd be
>>>>>> >> right.
>>>>>> >> Again, I'll look at the Ruby code tonight, and if someone has a
>>>>>> >> schema for OSM that'd be nice. If whoever did the original script
>>>>>> >> could outline their thinking, that would be helpful as well.
>>>>>> >> I am subscribed to the dev list.
>>>>>> >> On Mar 23, 2007, at 12:06 AM, Thomas Lunde wrote:
>>>>>> >>
>>>>>> >>
>>>>>> >>> On 3/22/07, Don Smith <dcsmith at gmail.com> wrote:
>>>>>> >>>
>>>>>> >>>> Thomas,
>>>>>> >>>> do you have a machine set up? I'll look at the code tonight, but
>>>>>> >>>> you seem to have a better grasp of the operations going on. Any
>>>>>> >>>> ideas on what specifically needs to be done?
>>>>>> >>>>
>>>>>> >>>> Right now, the two tasks appear to be aggregation of related
>>>>>> >>>> segments into ways (?) and the 1 sec insert cycle. I would
>>>>>> >>>> suggest instead, if possible, that we load the data into MySQL,
>>>>>> >>>> and then when it's ready, batch import the SQL into the master
>>>>>> >>>> db (does this make sense? Instead of running the script twice,
>>>>>> >>>> run it once, dump the db, and then import the dump?).
>>>>>> >>>>
>>>>>> >>> Don -
>>>>>> >>>
>>>>>> >>> I do have a server that could be used, but it sounds like Nathan
>>>>>> >>> has a better one.  He is downloading (or has downloaded) the
>>>>>> >>> latest TIGER data. Both of us need to have a better
>>>>>> >>> understanding of the current OSM data model than at present.
>>>>>> >>> Pointers to particular documentation would be helpful; otherwise
>>>>>> >>> I'll just look around the site and the code.
>>>>>> >>>
>>>>>> >>> What I think I understand is that the 1 sec insert cycle of the
>>>>>> >>> old Ruby code would still take weeks/months to do an import.  Is
>>>>>> >>> that right?
>>>>>> >>>
>>>>>> >>> If so, it seems that there's got to be a better way. I agree
>>>>>> >>> with you that using separate servers to do a higher speed import
>>>>>> >>> and then to dump the data from DB to DB directly would seem to
>>>>>> >>> be the smarter approach.
>>>>>> >>>
>>>>>> >>> Are you already familiar with the OSM data model and/or with the
>>>>>> >>> old Ruby import code?
>>>>>> >>>
>>>>>> >>> thomas
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> Nathan, Don -- if y'all are subscribed to the Dev list, let me
>>>>>> >>> know and I shan't cc: you directly.
>>>>>> >>>
>>>>>> >>
>>>>>> >>
>>>>>> >> _______________________________________________
>>>>>> >> dev mailing list
>>>>>> >> dev at openstreetmap.org
>>>>>> >> http://lists.openstreetmap.org/cgi-bin/mailman/listinfo/dev
>>>>>
>>>>>
>>>>> --Cory Lueninghoener
>>>>> Perl, C, & Linux Hacker
>>>>> http://www.wirelesscouch.net/~cluening/