[OSM-talk] tile downloader

Paul Houle paul at ontology2.com
Thu Oct 21 15:22:13 BST 2010


  On 10/20/2010 12:13 PM, Martin Koppenhoefer wrote:
> Maybe we could work around this by automatically changing the link for
> the stored tiles? This would also harm "friendly" projects with small
> tile-download-rates though. If it is technically possible to identify
> this application they could also be filtered out.
     I used to work on a website where we were always waging wars 
against webcrawlers.

     It's certainly useful to ban certain user agents,  but it's very 
easy for attackers to change their user agent to look like an ordinary 
web browser.

     We had a system called "robocop" that did a running tail -f of the 
access_log,  kept counts of how many hits we'd gotten from different IP 
addresses in the last hour,  and if somebody was downloading too much,  
we'd drop a deny directive into our .htaccess file and that would be the 
end of them.  I'd even get a text message when this happened.
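
     For anyone who wants to try the same idea,  here is a rough sketch 
of that loop in Python.  It is not the original robocop code.  The 
threshold of 1000 hits,  the Common Log Format assumption (client 
address as the first field) and the names HitCounter,  deny_line,  
follow and run are all made up for illustration,  and the 
"Deny from" line assumes Apache 2.2-style access control being allowed 
in .htaccess.

```python
import re
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600   # "the last hour"
MAX_HITS = 1000         # assumed threshold; the post does not give a number

# Assumes Common Log Format, where the client address is the first field.
IP_FIELD = re.compile(r"^(\S+) ")


class HitCounter:
    """Sliding-window hit counts per IP address (hypothetical helper)."""

    def __init__(self, window=WINDOW_SECONDS, limit=MAX_HITS):
        self.window = window
        self.limit = limit
        self.hits = defaultdict(deque)   # ip -> timestamps inside the window
        self.banned = set()

    def record(self, ip, now):
        """Record one hit; return True the first time ip goes over the limit."""
        q = self.hits[ip]
        q.append(now)
        # Forget hits that have fallen out of the window.
        while q and q[0] <= now - self.window:
            q.popleft()
        if len(q) > self.limit and ip not in self.banned:
            self.banned.add(ip)
            return True
        return False


def deny_line(ip):
    # Apache 2.2-style directive; newer Apache versions use different syntax.
    return f"Deny from {ip}\n"


def follow(path):
    """Yield lines as they are appended to path, roughly like tail -f."""
    with open(path) as f:
        f.seek(0, 2)                     # start at the current end of file
        while True:
            line = f.readline()
            if not line:
                time.sleep(1.0)
                continue
            yield line


def run(log_path, htaccess_path):
    """Watch the access log and append a deny line for each offender."""
    counter = HitCounter()
    for line in follow(log_path):
        m = IP_FIELD.match(line)
        if m and counter.record(m.group(1), time.time()):
            with open(htaccess_path, "a") as out:
                out.write(deny_line(m.group(1)))
```

     Something like run("access_log", ".htaccess") would start it,  with 
both paths being placeholders.  The sketch leaves out the text message 
and never lifts a ban,  so a real version would need some way to expire 
old entries.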

     I sketched out a design for a system called "robocop 2" that would 
do this in a better way and would generally help us manage our traffic 
in real time.  I didn't get the go-ahead to build it.

     Before I had that job,  I had another "job" doing,  uh,  "difficult 
information retrieval."  I had a webcrawler called "Blackbird" that was 
designed for low observability and to understand the structure of a 
website well enough that,  rather than copying the site,  it would copy 
the database behind the site.  With the right configuration,  
Blackbird could have completely subverted the defenses of the site 
mentioned above -- but I wasn't doing that kind of stuff anymore.  I got 
sick of being on mailing lists where I knew somebody was a spy but not 
who...


