Created
April 3, 2017 10:12
A further stage of the movie API testing. It successfully writes director, film, and writer data to CSVs, and keeps (but separates) the funny writer roles like novels. Still room for improvement, though.
# April 3rd 2017, 7pm
# This version of testing the Open Movie Database API is working well but falls down with large numbers (possibly because the API deliberately throttles our requests after a certain threshold)
# This version incorporates feedback from the meeting, indicating that the unusual writer variations (enclosed in parentheses for some writers, e.g. J.R.R. Tolkien has 'based on the novel by') should be kept, but in a separate field...
# The other vulnerability is in international characters - it is definitely writing them out in a strange and unusable fashion
# BUT this does work: it outputs a list of results without splitting the strings (into "movie_api_result_unprocessed.csv")
# AND it writes a separate file each for writers and directors, with very good handling of the unusual text within these fields (e.g. commas, parentheses)
# A further improvement to consider adding later is a handling for parentheses in the INPUT movie titles, it looks like these might indicate foreign films, and should be separated out for better match rates with the API

require "csv"
require "json"
require "net/http"

puts "START MOVIE TEST ******"

# read [id, title] pairs from the first two columns and drop repeated rows
input_movies = CSV.read("movies.csv").map { |row|
  [row[0], row[1]]
}.uniq

#movie_api_results = input_movies.first(1000).map { |id, title|
movie_api_results = input_movies.map { |id, title|
  begin
    # spaces in the title become "+" for the query string; other characters are passed through as-is
    data = JSON.parse(`curl "http://www.omdbapi.com/?t=#{title.gsub(/\s/,"+")}"`)
  rescue StandardError
    # covers a nil title, a failed curl call, or a response that is not valid JSON
    puts "Error getting API results from #{title}"
    data = {}
  end
  [id, title, data]
}

# output file 1: api results COMBINED without splitting
CSV.open("movie_api_result_unprocessed.csv", 'w') do |csv|
  csv << ["input_movie_id", "input_movie_title", "Movie_Title_Match", "imdbID", "All_Directors", "All_Writers"]
  movie_api_results.each do |id, title, movie_api_data|
    #puts title
    csv << [id, title, movie_api_data["Title"], movie_api_data["imdbID"], movie_api_data["Director"], movie_api_data["Writer"]]
  end # of movie_api_results
end # of writing output file 1 csv

# output file 2: api results for WRITERS, split into separate rows
CSV.open("movie_api_result_writers.csv", 'w') do |csv|
  csv << ["input_movie_id", "input_movie_title", "Movie_Title_Match", "All_Writers", "Individual_Writer", "Individual_Writer_Role"]
  movie_api_results.each do |id, title, movie_api_data|
    if ((movie_api_data["Writer"] != nil) and (movie_api_data["Writer"] != 'N/A'))
      #puts movie_api_data["Writer"]
      # each entry becomes [name without the parenthesised part, the parenthesised part or nil]
      writers = movie_api_data["Writer"].split(", ").map { |name| [name.gsub(/ \([^\)]+\)/, ''), name.match(/\([^\)]+\)/)]}
      writers.each do |current_writer, current_writer_role|
        puts current_writer
        csv << [id, title, movie_api_data["Title"], movie_api_data["Writer"], current_writer, current_writer_role]
      end
    else
      #puts "No movie result for #{title}"
      #puts movie_api_data.inspect
    end # for checking if nil
  end # of movie_api_results
end # of writing output file 2 csv for writer results

# output file 3: api results for DIRECTORS, split into separate rows
CSV.open("movie_api_result_directors.csv", 'w') do |csv|
  csv << ["input_movie_id", "input_movie_title", "Movie_Title_Match", "All_directors", "Individual_Director", "Individual_Director_Role"]
  movie_api_results.each do |id, title, movie_api_data|
    if ((movie_api_data["Director"] != nil) and (movie_api_data["Director"] != 'N/A'))
      #puts movie_api_data["Director"]
      # each entry becomes [name without the parenthesised part, the parenthesised part or nil]
      directors = movie_api_data["Director"].split(", ").map { |name| [name.gsub(/ \([^\)]+\)/, ''), name.match(/\([^\)]+\)/)]}
      directors.each do |current_director, current_director_role|
        puts current_director
        csv << [id, title, movie_api_data["Title"], movie_api_data["Director"], current_director, current_director_role]
      end
    else
      #puts "No movie result for #{title}"
      #puts movie_api_data.inspect
    end # for checking if nil
  end # of movie_api_results
end # of writing output file 3 csv for director results

puts "END MOVIE TEST ******"
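
The script's header notes trouble with international characters and with large batches. One possible contributor is that the title goes into the URL with only its spaces replaced, so accented letters and punctuation reach the API unencoded. The sketch below is one way the request step might look with full percent-encoding. The helper names `omdb_title_uri` and `fetch_movie` are hypothetical, and treating the response body as UTF-8 is an assumption, not something the script confirms.

```ruby
require "json"
require "net/http"
require "uri"

# Hypothetical helper (not in the original script): build the title-search URL
# with every character percent-encoded, including spaces and accented letters.
def omdb_title_uri(title)
  URI("http://www.omdbapi.com/?" + URI.encode_www_form(t: title))
end

# Hypothetical helper: fetch and parse one title, returning {} on any failure.
# Marking the body as UTF-8 is an assumption about what the API sends back.
def fetch_movie(title)
  body = Net::HTTP.get(omdb_title_uri(title))
  JSON.parse(body.force_encoding("UTF-8"))
rescue StandardError
  {}
end
```

Calling `fetch_movie(title)` inside the `input_movies.map` block would stand in for the `curl` line. Whether this also clears up the strange characters in the output files would need checking against real data.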
This part separates writer and director names from any description in parentheses, and puts the description in the second element of the array so that it can be written (separately) to the csv. That way the users can decide whether to keep them:

    directors = movie_api_data["Director"].split(", ").map { |name| [name.gsub(/ \([^\)]+\)/, ''), name.match(/\([^\)]+\)/)]}

Currently the description still has the parentheses around it, e.g. "(screenwriter)". I'm not sure that I hate this, but I do know that I can't change it! This is upsetting, as I have found the regular expressions learning curve to be steep.
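
If the parentheses around the role ever do need to go, a capture group is one way to get at the text inside them. This is only a sketch: `split_name_and_role` is a hypothetical helper, and it assumes each entry has at most one parenthesised description, at the end of the name.

```ruby
# Hypothetical helper: split "Name (role)" into [name, role], with the
# parentheses dropped from the role. Entries without parentheses get a nil role.
def split_name_and_role(entry)
  # (?<name>...) and (?<role>...) are named capture groups;
  # \( and \) match literal parentheses, which stay outside the captured role.
  match = entry.match(/\A(?<name>.*?)\s*\((?<role>[^)]+)\)\s*\z/)
  return [entry.strip, nil] unless match
  [match[:name].strip, match[:role]]
end
```

For example, `split_name_and_role("J.R.R. Tolkien (novel)")` gives the name and `"novel"` as separate strings, so the csv row can be written the same way as before.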
A further possible improvement is pulling apart the input movie titles and trying the text to the left of, and inside of, the parentheses when they appear. These are likely to be foreign films or unnecessary inclusions of film years, which almost certainly preclude getting an API match if left as-is.
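
A rough sketch of that title handling, under a few assumptions: `title_candidates` is a hypothetical helper, a bare four-digit number in parentheses is treated as a year and dropped, and the text outside the parentheses is tried first.

```ruby
# Hypothetical helper: list the search titles worth trying for one input title.
def title_candidates(title)
  # Text with every parenthesised part removed, e.g. "City of God"
  outside = title.gsub(/\s*\([^)]*\)/, "").strip
  # Text found inside each pair of parentheses, e.g. ["Cidade de Deus", "2002"]
  inside = title.scan(/\(([^)]+)\)/).flatten.map(&:strip)
  # Assumption: a bare four-digit number is a release year, not a title
  inside = inside.reject { |text| text.match?(/\A\d{4}\z/) }
  ([outside] + inside).reject(&:empty?).uniq
end
```

Each candidate could then be sent to the API in turn, stopping at the first one that returns a match.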
Also, the duplicate film names need to be flagged.
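
One way the flagging might work is to count each title after light normalisation and append a true/false column. `flag_duplicate_titles` is a hypothetical helper, and ignoring case and surrounding whitespace when comparing titles is an assumption about what should count as a duplicate.

```ruby
# Hypothetical helper: take [id, title] rows and return [id, title, duplicate?] rows.
def flag_duplicate_titles(rows)
  # Assumption: titles that differ only by case or outer whitespace are duplicates
  normalize = ->(title) { title.to_s.strip.downcase }
  counts = Hash.new(0)
  rows.each { |_id, title| counts[normalize.call(title)] += 1 }
  rows.map { |id, title| [id, title, counts[normalize.call(title)] > 1] }
end
```

The rows from `CSV.read("movies.csv")` could be passed through this before the API calls, so that the flag ends up in the output files.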