@lizconlan
Created February 24, 2011 16:11
Scrapes historic committee debate info into CouchDB
require 'rubygems'
require 'nokogiri'
require 'net/http'
require 'rest_client'
require 'uri'
require 'json'
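# Normalise a member name scraped from the XML: drop bracketed notes and
# asterisks, and turn "Surname, Forename" into "Forename Surname".
# Returns "" for nil, empty or non-member entries.
def format_member_name(input)
  return "" if input.nil?

  # dump anything enclosed in brackets
  input = input.gsub(/\([^\)]*\)/, "")
  # get rid of * character in name
  input = input.gsub("*", "")
  return "" if input.length == 0

  if input.index(",").to_i > 0
    name = input.split(",")
    if name[1].nil? # deal with odd markup
      member_name = name[0].strip
    else
      member_name = "#{name[1].strip} #{name[0].strip}"
    end
  else
    member_name = input
  end

  # skip collective entries; note that /The/ matches anywhere in the name
  case member_name
  when "Hon. Members", /The/
    return ""
  else
    return member_name
  end
end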
# define the order our columns are displayed in the datastore
#ScraperWiki.save_metadata("data_columns", ["href", "part-one", "part-two", "part-three", "bills", "members", "http-response", "content-length"])
starting_url = 'http://www.parliament.uk/business/publications/parliamentary-archives/archives-electronic/parliamentary-debates/historic-standing-committee-debates/'
#html = ScraperWiki.scrape(starting_url)
html = RestClient.get(starting_url).body
doc = Nokogiri::HTML(html)
doc.search('div.inner ul li a').each do |a|
  uri = URI.parse(a['href'])
  # only follow links to files hosted on data.parliament.uk
  next unless uri.host == "data.parliament.uk"

  response = Net::HTTP.get_response(uri)
  members = []
  bills = []
  if response.code == "200"
    doc_xml = Nokogiri::XML(response.body)
    doc_xml.xpath('//member/text()').each do |member|
      member_name = format_member_name(member.content)
      members << member_name unless member_name == ""
    end
    doc_xml.xpath('//bill/text()').each do |bill|
      bills << bill.content
    end
  end

  # split the link text on " - " into up to three parts
  parts = a.inner_html.split(' - ')
  record = {
    'members' => members.uniq,
    'bills' => bills.uniq,
    'http-response' => response.code,
    'content-length' => response.content_length,
    'href' => a['href'],
    'part-one' => parts[0],
    'part-two' => parts[1],
    'part-three' => parts[2]
  }

  # uuidgen output ends with a newline, which must not end up in the URL
  guid = `uuidgen`.strip
  RestClient.put("http://localhost:5984/historic-committees/#{guid}", record.to_json, :content_type => :json)
end
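
The last step of the loop splits each link's text on " - " and stores the record under a freshly generated ID. A minimal sketch of that step in isolation, assuming Ruby 1.9.2 or later: `split_link_text` and `new_doc_url` are hypothetical helper names, not part of the script, the sample link text is made up, and `SecureRandom.uuid` is swapped in for the `uuidgen` shell call.

```ruby
require 'json'
require 'securerandom'

# Hypothetical helper: split link text such as "A - B - C" into the three
# "part-*" fields used by the scraper. Missing parts come back as nil.
def split_link_text(text)
  one, two, three = text.split(' - ')
  { 'part-one' => one, 'part-two' => two, 'part-three' => three }
end

# Hypothetical helper: build the document URL for a new record.
# SecureRandom.uuid is used here instead of shelling out to uuidgen,
# so there is no trailing newline to strip.
def new_doc_url(base = 'http://localhost:5984/historic-committees')
  "#{base}/#{SecureRandom.uuid}"
end

# Made-up link text, only to show the shape of the result.
puts split_link_text('Part A - Part B - Part C').to_json
puts new_doc_url
```

Generating the ID in-process avoids depending on a `uuidgen` binary being on the PATH, at the cost of differing from what the script itself does.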
@lizconlan

Assumes you have a local CouchDB instance and an empty database called historic-committees
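
One way to create that empty database is a single PUT to the database URL, which is how CouchDB's HTTP API creates databases. The sketch below assumes CouchDB is listening on localhost:5984 and does not require authentication (newer installs usually do); `create_db_request` and `create_db` are hypothetical helper names.

```ruby
require 'net/http'
require 'uri'

# Build (but do not send) the PUT request that asks CouchDB to create
# a database with the given name.
def create_db_request(db_name, base = 'http://localhost:5984')
  Net::HTTP::Put.new(URI.parse("#{base}/#{db_name}").request_uri)
end

# Send the request and return the HTTP status code as a string.
# CouchDB is expected to answer 201 when the database is created
# and 412 when it already exists.
def create_db(db_name, base = 'http://localhost:5984')
  uri = URI.parse(base)
  Net::HTTP.start(uri.host, uri.port) do |http|
    http.request(create_db_request(db_name, base)).code
  end
end

# Usage (needs a running CouchDB, so it is left commented out here):
# create_db('historic-committees')
```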
