@aammd
Created December 3, 2014 21:09
This demonstrates web scraping to generate a list of links on a site in order to download them all. The example uses the Municipality of Burnaby's list of demolition permits.
library("rvest")
library("XML")
links_list <- html("http://www.burnaby.ca/City-Services/Building/Permits-Issued.html") %>%
  html_nodes("#ctl15_nestedList a")
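## Aside: html() was deprecated in rvest 0.3.0 in favour of read_html(). With
## a current, xml2-based rvest the same scrape would look like this (a sketch,
## untested against the live page, so it stays commented out):
## links_list <- read_html("http://www.burnaby.ca/City-Services/Building/Permits-Issued.html") %>%
##   html_nodes("#ctl15_nestedList a")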
## Where do the links lead? That is encoded in the "href" attribute. First, get all the attributes:
links_attr <- sapply(links_list, xmlAttrs)
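## Peek at the first element to see the shape we just got back (the path below
## is an illustrative placeholder, not real output):
links_attr[[1]]
##                      href
## "/some/relative/path.pdf"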
## We have the attributes! Each link just became a single named character
## vector. Each element in the vector gives the attribute's value; the name of
## each element is the attribute name. We must pull out the attribute named "href":
link_address <- sapply(links_attr, function(x) x[["href"]])
## or, more succinctly, if you are into that sort of thing:
link_address <- sapply(links_attr, `[[`, i = "href")
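## (In xml2-based rvest, html_attr() collapses both sapply() steps into a
## single vectorised call; a sketch, assuming rvest >= 0.3, so commented out:)
## link_address <- html_attr(links_list, "href")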
## Final step of cleaning: the "header" links (the items you click to make data
## for a whole month appear) have a value of "#". We need to drop 'em:
demolition_permits <- Filter(function(x) x != "#", link_address)
## Note that there are many ways to do this, e.g. `which`, `subset` and `stringr::str_detect`
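## For instance, the same drop via which():
demolition_permits <- link_address[which(link_address != "#")]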
## All these are relative links, all with the same base URL:
baseurl <- "http://www.burnaby.ca"
demolition_pdfs <- sapply(demolition_permits, function(x) paste0(baseurl, x))
## Now download 'em all!
get_pdf <- function(url){
  ## mode = "wb" keeps the PDF binary intact on Windows
  download.file(url, destfile = basename(url), mode = "wb")
}
sapply(demolition_pdfs, get_pdf)
## This has the advantage of catching weirdo link addresses; see, for example:
demolition_pdfs[[115]]
demolition_pdfs[202]
## These don't match the usual pattern.
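## Since a few links break the pattern, a more defensive loop could wrap the
## download in tryCatch() so one bad URL doesn't stop the rest. A sketch;
## get_pdf_safely is a name invented here:
get_pdf_safely <- function(url){
  tryCatch(
    download.file(url, destfile = basename(url), mode = "wb"),
    error = function(e) message("failed to download: ", url)
  )
}
sapply(demolition_pdfs, get_pdf_safely)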

jennybc commented Dec 3, 2014

Oh cool! I'm not going to run this now. Are you saying their own PDF links don't follow the pattern I teased out and assumed?

aammd commented Dec 3, 2014

I was surprised too! Four have a random dollar sign and exclamation point in them, like this one:

http://www.burnaby.ca/Assets/city+services/building/Permits+Issued/07+-+June+2014/June+13$!2c+2014.pdf

Another doesn't even have a .pdf in the URL! But it definitely leads to a PDF:

http://www.burnaby.ca/AssetFactory.aspx?did=13300
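
If the download step should cope with that, a tiny helper could build a sane local filename whenever the URL doesn't end in .pdf. A sketch; destfile_for is a made-up name:

destfile_for <- function(url){
  f <- basename(url)
  if (!grepl("\\.pdf$", f)) f <- paste0(gsub("[^A-Za-z0-9]", "_", f), ".pdf")
  f
}
destfile_for("http://www.burnaby.ca/AssetFactory.aspx?did=13300")
## [1] "AssetFactory_aspx_did_13300.pdf"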


jennybc commented Dec 3, 2014

That. Is. Just. Not. Right.

And yet nothing surprises me anymore. Nothing.
