@SarahTaylorProject
Created April 3, 2017 10:12
A further stage of the movie API testing: it successfully writes director, film, and writer data to CSVs, and keeps (but separates) the unusual writer roles such as novels. Still room for improvement though.
# April 3rd 2017, 7pm
# This version of testing the Open Movie Database API is working well but falls down with large numbers (possibly because the API deliberately throttles our requests after a certain threshold)
# This version incorporates feedback from the meeting, indicating that the unusual writer variations (enclosed in parentheses for some writers, e.g. J.R.R. Tolkien has 'based on the novel by') should be kept, but in a separate field...
# The other vulnerability is in international characters - it is definitely writing them out in a strange and unusable fashion
# BUT this does work: it outputs a list of results without splitting the strings (into "movie_api_result_unprocessed.csv")
# AND it writes a separate file each for writers and directors, with very good handling of the unusual text within these fields (e.g. commas, parentheses)
# A further improvement to consider adding later is a handling for parentheses in the INPUT movie titles, it looks like these might indicate foreign films, and should be separated out for better match rates with the API
require "csv"
require "json"
require "net/http"
puts "START MOVIE TEST ******"
input_movies = CSV.read("movies.csv").map { |row|
[row[0], row[1]]
}.uniq
#movie_api_results = input_movies.first(1000).map { |id, title|
movie_api_results = input_movies.map { |id, title|
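begin
# Fetch with Net::HTTP (already required above) instead of shelling out to curl,
# so quotes, ampersands or other special characters in a title are URL-encoded
# rather than passed to the shell.
uri = URI("http://www.omdbapi.com/")
uri.query = URI.encode_www_form(t: title)
data = JSON.parse(Net::HTTP.get(uri))
rescue StandardError => e
# StandardError covers network and JSON parse failures without swallowing Ctrl-C.
puts "Error getting API results from #{title}: #{e.message}"
data = {}
end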
[id, title, data]
}
# output file 1: api results COMBINED without splitting
CSV.open("movie_api_result_unprocessed.csv", 'w') do |csv|
csv << ["input_movie_id", "input_movie_title", "Movie_Title_Match", "imdbID", "All_Directors", "All_Writers"]
movie_api_results.each do |id, title, movie_api_data|
#puts title
csv << [id, title, movie_api_data["Title"], movie_api_data["imdbID"], movie_api_data["Director"], movie_api_data["Writer"]]
end# of movie_api_results
end# of writing output file 1 csv
# output file 2: api results for WRITERS, split into separate rows
CSV.open("movie_api_result_writers.csv", 'w') do |csv|
csv << ["input_movie_id", "input_movie_title", "Movie_Title_Match", "All_Writers", "Individual_Writer", "Individual_Writer_Role"]
movie_api_results.each do |id, title, movie_api_data|
if ((movie_api_data["Writer"] != nil) and (movie_api_data["Writer"] != 'N/A'))
#puts movie_api_data["Writer"]
writers = movie_api_data["Writer"].split(", ").map { |name| [name.gsub(/ \([^\)]+\)/, ''), name.match(/\([^\)]+\)/)]}
writers.each do |current_writer, current_writer_role|
puts current_writer
csv << [id, title, movie_api_data["Title"], movie_api_data["Writer"], current_writer, current_writer_role]
end
else
#puts "No movie result for #{title}"
#puts movie_api_data.inspect
end# for checking if nil
end# of movie_api_results
end# of writing output file 2 csv for writer results
# output file 3: api results for DIRECTORS, split into separate rows
CSV.open("movie_api_result_directors.csv", 'w') do |csv|
csv << ["input_movie_id", "input_movie_title", "Movie_Title_Match", "All_directors", "Individual_Director", "Individual_Director_Role"]
movie_api_results.each do |id, title, movie_api_data|
if ((movie_api_data["Director"] != nil) and (movie_api_data["Director"] != 'N/A'))
#puts movie_api_data["Director"]
directors = movie_api_data["Director"].split(", ").map { |name| [name.gsub(/ \([^\)]+\)/, ''), name.match(/\([^\)]+\)/)]}
directors.each do |current_director, current_director_role|
puts current_director
csv << [id, title, movie_api_data["Title"], movie_api_data["Director"], current_director, current_director_role]
end
else
#puts "No movie result for #{title}"
#puts movie_api_data.inspect
end# for checking if nil
end# of movie_api_results
end# of writing output file 3 csv for director results
puts "END MOVIE TEST ******"
@SarahTaylorProject

This part separates writer and director names from the descriptions they sometimes carry in parentheses, and puts the description in the second element of the array so that it can be written (separately) to the CSV. That way the users can decide whether to keep them:
directors = movie_api_data["Director"].split(", ").map { |name| [name.gsub(/ \([^\)]+\)/, ''), name.match(/\([^\)]+\)/)]}
Currently the description still has the parentheses around it, e.g. "(screenwriter)". I'm not sure that I hate this, but I do know that I can't change it! This is upsetting, as I have found the regular expressions learning curve to be steep.
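One possible way to drop the parentheses, offered as a sketch rather than a tested change to the script: `String#[]` accepts a regular expression and a capture group number, so it can return only the text inside the brackets. The helper name `split_credit` and the sample credit string are made up for illustration.

```ruby
# Hypothetical helper (not part of the gist): split one credit such as
# "J.R.R. Tolkien (novel)" into a name and a role without parentheses.
def split_credit(credit)
  # Capture group 1 is the text inside the first pair of parentheses,
  # or nil when the credit has no parentheses at all.
  role = credit[/\(([^)]+)\)/, 1]
  # Remove every parenthesised note, plus any space before it, to leave the name.
  name = credit.gsub(/\s*\([^)]*\)/, "").strip
  [name, role]
end

# Illustrative input in the same comma-separated shape as the API's Writer field.
credits = "J.R.R. Tolkien (novel), Fran Walsh (screenplay), Peter Jackson"
credits.split(", ").each do |credit|
  name, role = split_credit(credit)
  puts "#{name} | #{role.inspect}"
end
```

In the script this would replace the `name.match(...)` half of the `map` block, so the role column would hold `screenplay` instead of `(screenplay)`, and stay empty when there is no role.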
A further possible improvement is pulling apart the input movie titles and trying the contents to the left of, and inside of, parentheses when they appear, as these are likely to be foreign films or unnecessary inclusions of film years, which almost certainly preclude getting an API match if left as-is
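A rough sketch of that idea, assuming input titles look something like "Title (Other Title) (1995)". The helper name `title_candidates` and the sample titles are hypothetical, and only the first pair of parentheses is examined.

```ruby
# Hypothetical helper: list the search strings worth trying for one input
# title, in order: the title as given, the text left of the first
# parentheses, then the text inside them.
def title_candidates(title)
  candidates = [title.strip]
  if (m = title.match(/\A(.*?)\s*\(([^)]*)\)/))
    outside = m[1].strip
    inside = m[2].strip
    candidates << outside unless outside.empty?
    # A bare four-digit year such as "(1995)" is not a title, so skip it.
    candidates << inside unless inside.empty? || inside.match?(/\A\d{4}\z/)
  end
  candidates.uniq
end

# Made-up examples of the three cases.
p title_candidates("Toy Story (1995)")
p title_candidates("Some Film (Un Film) (2001)")
p title_candidates("Heat")
```

Each candidate could then be sent to the API in turn, stopping at the first one that returns a match.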
Also, the duplicate film names need to be flagged.
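A minimal sketch of one way to flag them, assuming that "duplicate" means the same title after ignoring case and surrounding whitespace. The helper name `duplicate_flags` and the sample rows are hypothetical; the rows use the same `[id, title]` shape the script reads from movies.csv.

```ruby
# Hypothetical helper: add a true/false duplicate flag to each [id, title] row.
def duplicate_flags(rows)
  # Normalise titles so "Hamlet" and "hamlet " count as the same film name.
  normalise = ->(title) { title.to_s.strip.downcase }
  counts = Hash.new(0)
  rows.each { |_id, title| counts[normalise.call(title)] += 1 }
  rows.map { |id, title| [id, title, counts[normalise.call(title)] > 1] }
end

# Made-up sample rows for illustration.
rows = [["1", "Hamlet"], ["2", "Heat"], ["3", "hamlet "], ["4", "Casino"]]
duplicate_flags(rows).each { |row| p row }
```

The flag could be written as an extra column in the output CSVs so that ambiguous matches are easy to filter.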
