For a few years now I’ve been harvesting downloadable text from digitised periodicals in Trove and making it easily available for exploration and research. I’ve just completed the latest harvest – here’s the summary:
- 1,163 digitised periodicals had text available for download
- Text was downloaded from 51,928 individual issues
- Adding up to a total of around 12GB of text
If you want to dive straight in, here’s a list of all the harvested periodicals, with links to download a summary of available issues, as well as all the harvested text (there’s one file per issue). You’ll notice that the list includes a large number of parliamentary papers and government reports as well as published journals.
All of the harvested text is available from a public folder on CloudStor.
The harvesting process involves a few different steps:
- First I generate a list of periodicals available in digital form from Trove. This includes digitised titles, as well as born-digital titles submitted through e-Legal Deposit. For this harvest, the result was a CSV file containing the details of 7,270 titles. See this notebook for details.
- Then I work through this list of titles to find out how many issues of each title are available through Trove. This information isn’t accessible through the API, so I have to do some screen scraping.
- Next I work through the list of issues and try to download the text contents. Most of the born-digital titles don’t have downloadable text.
- Once I’ve downloaded all the text I can from a title, I create a CSV file for it that lists the available issues and notes whether text is available for each. This file is stored with the text on CloudStor.
- Once I’ve checked all the titles, I generate another CSV file that lists the details of all the periodicals that have downloadable text.
The code to harvest and document the downloaded text is available in this notebook. #dhhacks
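To give a rough idea of how the last few steps above fit together, here's a minimal sketch in Python. It is not the code from the notebook: the function names `harvest_title` and `write_summary`, the `fetch_text` downloader, and the file and column names are all assumptions made up for illustration. The actual download from Trove is deliberately left out, so `fetch_text` is just any function that returns an issue's text, or `None` if there's no text to download.

```python
import csv
from pathlib import Path
from typing import Callable, Optional


def harvest_title(
    title_id: str,
    issue_ids: list[str],
    fetch_text: Callable[[str], Optional[str]],
    output_dir: Path,
) -> dict:
    """Save the text of each issue and write a per-title CSV listing the issues."""
    title_dir = output_dir / title_id
    title_dir.mkdir(parents=True, exist_ok=True)
    rows = []
    for issue_id in issue_ids:
        # fetch_text is a hypothetical downloader; None or "" means no text
        text = fetch_text(issue_id)
        has_text = bool(text)
        if has_text:
            # One text file per issue
            (title_dir / f"{issue_id}.txt").write_text(text, encoding="utf-8")
        rows.append({"issue_id": issue_id, "text_available": has_text})

    # The per-title CSV notes whether text is available for each issue
    issues_csv = title_dir / f"{title_id}-issues.csv"
    with issues_csv.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["issue_id", "text_available"])
        writer.writeheader()
        writer.writerows(rows)

    return {
        "title_id": title_id,
        "issues": len(rows),
        "issues_with_text": sum(r["text_available"] for r in rows),
    }


def write_summary(title_summaries: list[dict], csv_path: Path) -> list[dict]:
    """Write a CSV listing only the titles that have at least one issue with text."""
    with_text = [s for s in title_summaries if s["issues_with_text"] > 0]
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["title_id", "issues", "issues_with_text"]
        )
        writer.writeheader()
        writer.writerows(with_text)
    return with_text
```

In this sketch you'd call `harvest_title` once per title, collect the dictionaries it returns, and then pass them all to `write_summary`. Keeping the downloader as a separate function means the bookkeeping can be tried out with a stand-in before pointing it at Trove.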
This is a companion discussion topic for the original entry at https://updates.timsherratt.org/2021/08/06/updated-lots-and.html