The TroveHarvester makes it easy to download articles in bulk from Trove's digitised newspapers. Using the --text option you can also save the full-text content of every article.
However, this doesn't work for the Australian Women's Weekly, as its full text is not available through the Trove API. Fortunately, the article text can be downloaded from the web interface.
The one-line script below uses wget, so make sure you have it installed before you go any further. (You can install it with Homebrew if you're using a Mac.)
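A quick way to check for wget before you start (the Homebrew hint is just the usual Mac route; apt, yum and friends do the job on Linux):

```shell
# Report whether wget is on the PATH.
if command -v wget >/dev/null 2>&1; then
    echo "wget is installed"
else
    echo "wget not found -- install it first (e.g. 'brew install wget' on a Mac)"
fi
```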
- Run the TroveHarvester as normal to harvest the article metadata as a CSV file (don't use the --text option)
- From the command line, cd into the directory that contains the results.csv file created by your harvest
Now you have a choice. Although the full text downloaded from the web interface says it's a text file, it's not -- it's an HTML file. If you don't mind having <p>s and <div>s messing up your text, you can just copy and paste this into the command line and hit enter:
```shell
for id in $(cut -d , -f 1 results.csv | sed "1 d"); do wget "http://trove.nla.gov.au/newspaper/rendition/nla.news-article"$id".txt"; done
```
If you want to strip the HTML tags use:
```shell
for id in $(cut -d , -f 1 results.csv | sed "1 d"); do wget -qO- "http://trove.nla.gov.au/newspaper/rendition/nla.news-article"$id".txt" | sed -e 's/<[^>]*>/ /g' > $id".txt"; done
```
This might result in extra spaces, but I'm assuming that won't matter too much.
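That sed substitution is easy to try out on a scrap of markup before pointing it at Trove; this toy snippet is just an illustration, not real Trove output:

```shell
# Replace every HTML tag with a space; the text between the tags survives.
echo '<p>The <b>Weekly</b> arrives.</p>' | sed -e 's/<[^>]*>/ /g'
# prints: " The  Weekly  arrives. " -- with the stray spaces mentioned above
```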
Either way you'll end up with lots of little text files -- one per article.
This is what happens:

- `cut -d , -f 1` gets the first column of the results.csv file, which contains the article ids
- `sed "1 d"` removes the header row
- `for id in...` feeds the list of article ids into a loop
- `wget -qO- "http://trove.nla.gov.au/newspaper/rendition/nla.news-article"$id".txt"` retrieves the article text
- `sed -e 's/<[^>]*>/ /g'` gets rid of the HTML tags
- `> $id".txt"` writes the text to a file named with the article id
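You can satisfy yourself that the cut | sed half of the pipeline extracts the right ids by running it against a toy CSV (the header and column values here are made up; your real results.csv will differ):

```shell
# A two-row stand-in for results.csv.
printf 'article_id,title\n10342501,Some article\n10342502,Another article\n' > sample.csv

# Same combination as in the one-liner: first column, header row dropped.
cut -d , -f 1 sample.csv | sed "1 d"
# prints:
# 10342501
# 10342502

rm sample.csv
```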
Pretty cool, huh? I'm sure there are neater ways of doing this, but I was pleased to find a one-line solution.