We’re in the midst of planning for the HASS Research Data Commons, which will deliver some much-needed investment in digital research infrastructure for the humanities and social sciences. Amongst the funded programs are tools for text analysis as part of the Linguistics Data Commons, and a platform for more advanced research using Trove. I’m hoping that this will be an opportunity to take stock of existing tools and resources, and build flexible pathways for researchers that enable them to collect, move, analyse, preserve, and share data across different platforms and services.
To this end, I thought it might be useful to try and summarise what the GLAM Workbench offers, particularly for Trove researchers. The GLAM Workbench doesn’t really have an institutional home, and is mostly unfunded – it’s my passion project. That means that it’s easy to overlook, particularly when the big grants are being doled out. But I think it has a lot to offer and I’m looking forward to exploring ways it can connect with these new initiatives.
Getting and moving data
There’s lots of fabulous data in Trove and other GLAM collections. In fact, there’s so much data that it can be difficult for researchers to find and collect what’s relevant to their interests. There are many tools in the GLAM Workbench to help researchers assemble their own datasets. For example:
- Get newspaper articles in bulk with the Trove Newspaper and Gazette Harvester – This has been around in some form for more than ten years (it pre-dates the Trove API!). Give it the url of a search in Trove’s newspapers and gazettes and the harvester will save all the metadata in a CSV file, and optionally download the complete articles as OCRd text, images, or PDFs. The amount of data you harvest is really only limited by your patience and disk space. I’ve harvested more than a million articles in the past. The GLAM Workbench includes a web app version of the harvester that runs live in the cloud – just paste in your Trove API key and the search url, and click the button.
- Get Trove newspaper pages as images – If you need a nice, high-resolution version of a newspaper page you can use this web app. If you want to harvest every front page (or some other particular page) here’s an example that gets all the covers of the Australian Women’s Weekly. A pre-harvested collection of the AWW covers is included as a bonus extra.
- Get Trove newspaper articles as images – The Trove web interface makes it difficult to download complete images of articles, but this tool will do the job. There’s a handy web app to grab individual images, but the code from this tool is reused in other places such as the Trove Newspaper Harvester and the Omeka uploader, and could be built-in to your own research workflows.
- Upload Trove newspaper articles to Omeka – Whether you’re creating an online exhibition or building a research database, Omeka can be very useful. This notebook connects Trove’s newspapers to Omeka for easy upload. Your selected articles can come from a search query, a Trove list, a Zotero library, or just a list of article ids. Metadata records are created in Omeka for each article and newspaper, and an image of each article is attached.
- Get OCRd text from digitised periodicals in Trove – They’re often overshadowed by the newspapers, but there are now lots of digitised journals, magazines, and parliamentary papers in Trove. You can get article-level data from the API, but not issue data. This notebook enables researchers to get metadata and OCRd text from every available issue of a periodical. To make researchers’ lives even easier, I regularly harvest all the available OCRd text from digitised periodicals in Trove. The latest harvest downloaded 51,928 issues from 1,163 periodicals – that’s about 10GB of text. You can browse the list of periodicals with OCRd text, or search this database. All the OCRd text is stored in a public repository on CloudStor.
- Get page images from digitised periodicals in Trove – There’s more than text in digitised periodicals, and you might want to download images of pages for visual analysis. This notebook shows you how to get cover images, but could be easily modified to get another page, or a PDF. I used a modified version of this to create a collection of 3,471 full page editorial cartoons from The Bulletin, 1886 to 1952 – all available to download from CloudStor.
- Get OCRd text from digitised books in Trove – Yep, there’s digitised books as well as newspapers and periodicals. You can download OCRd text from an individual book using the Trove web interface, but how do you make a collection of books without all that pointing and clicking? This notebook downloads all the available OCRd text from digitised books in Trove. The latest harvest includes text from 26,762 works. You can explore the results using this database.
- Harvest parliamentary press releases from Trove – Trove includes more than 380,000 press releases, speeches, and interview transcripts issued by Australian federal politicians and saved by the Parliamentary Library. This notebook shows you how to harvest both metadata and fulltext from a search of the parliamentary press releases. For example, here’s a collection of politicians talking about ‘refugees’, and another relating to COVID-19.
- Harvest details of Radio National programs from Trove – Trove creates records for programs broadcast on ABC Radio National; for the major current affairs programs these records are at segment level. Even though they don’t provide full transcripts, this data provides a rich, fine-grained record of Australia’s recent political, social, and economic history. This notebook shows you how to download the Radio National data. If you just want to dive straight in, there’s also a pre-harvested collection containing more than 400,000 records, with separate downloads for some of the main programs.
- Find all the versions of an archived web page in Trove – Many of the tools in the Web Archives section of the GLAM Workbench will work with the Australian Web Archive, which is part of Trove. This notebook shows you how to get data about the number of times a web page has been archived over time.
- Harvesting collections of text from archived web pages in Trove – If you want to explore how the content of a web page changes over time, you can use this notebook to capture the text content of every archived version of a web page.
- Convert a Trove list into a CSV file – While Trove provides a data download option for lists, it leaves out a lot of useful data. This notebook downloads full details of newspaper articles and other works in a list and saves them as CSV files. Like the Trove Newspaper Harvester, it lets you download OCRd text and images from newspaper articles.
- Collecting information about Trove user activity – It’s not just the content of Trove that provides interesting research data, it’s also the way people engage with it. Using the Trove API it’s possible to harvest details of all user created lists and tags. And yes, there’s pre-harvested collections of lists and tags for the impatient.
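To give a feel for what "the number of times a web page has been archived over time" looks like as data, here's a minimal sketch that counts captures per year. It assumes you already have a list of capture timestamps in the 14-digit `YYYYMMDDhhmmss` form commonly used by web archives; the sample values and the function name `captures_per_year` are made up for illustration and aren't taken from the notebook.

```python
from collections import Counter


def captures_per_year(timestamps):
    """Count captures per year from 14-digit timestamps (YYYYMMDDhhmmss)."""
    # Ignore anything that isn't a well-formed 14-digit timestamp
    valid = (ts for ts in timestamps if len(ts) == 14 and ts.isdigit())
    counts = Counter(ts[:4] for ts in valid)
    # Sort by year so the result is ready for charting
    return dict(sorted(counts.items()))


# Hypothetical capture timestamps for a single archived page
sample = [
    "20050312101500",
    "20050918230000",
    "20110704120000",
    "20110704120500",
    "20190101000000",
]

print(captures_per_year(sample))
```

The resulting dictionary of year totals can be passed straight to a charting library or saved as a CSV file.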
While I’m focusing here on Trove, there’s also tools to create datasets from the National Archives of Australia, Digital NZ and Papers Past, the National Museum of Australia and more. And there’s a big list of readily downloadable datasets from Australian GLAM organisations.
Visualisation and analysis
Many of the notebooks listed above include examples that demonstrate ways of exploring and analysing your harvested data. There are also a number of companion notebooks that examine some possibilities in more detail, for example:
- Explore your Trove newspaper harvests
- Load your Trove newspaper harvest in Datasette
- Exploring ABC Radio National metadata
- Analyse public tags added to Trove
But there are also many other notebooks that demonstrate methods for analysing Trove’s content, for example:
- QueryPic – Another tool that’s been around in different forms for a decade, QueryPic visualises searches in Trove’s newspapers. The latest web app couldn’t be simpler, just paste in your API key and a search url and create charts showing the number of matching articles over time. You can combine queries, change time scales, and download the data and visualisations.
- Visualise Trove newspaper searches over time – This is like a deconstructed version of QueryPic that walks you through the process of using Trove’s facets to assemble a dataset of results over time. It provides a lot of detail on the sorts of data available, and the questions we can ask of it.
- Visualise the total number of newspaper articles in Trove by year and state – This notebook uses a modified version of the code above to analyse the construction and context of Trove’s newspaper corpus itself. What are you actually searching? Meet the WWI effect and the copyright cliff of death! This is a great place to start if you want to get people thinking critically about how digital resources are constructed.
- Analyse rates of OCR correction – Some more meta-analysis of the Trove corpus itself, this time focusing on patterns of OCR correction by Trove users.
- Identifying non-English language newspapers in Trove – There are a growing number of non-English language newspapers digitised in Trove. However, if you’re only searching using English keywords, you might never know that they’re there. This notebook analyses a sample of articles from every newspaper in Trove to identify non-English content.
- Beyond the copyright cliff of death – Most of the newspaper articles on Trove were published before 1955, but there are some from the later period. This notebook helps you find out how many, and which newspapers they were published in.
- Map Trove newspaper results by state – This notebook uses the Trove state facet to create a choropleth map that visualises the number of search results per state.
- Map Trove newspaper results by place of publication – This notebook uses the Trove title facet to find the number of results per newspaper, then merges the results with a dataset of geolocated newspapers to map where articles were published.
- Compare two versions of an archived web page – This notebook demonstrates a number of different ways of comparing versions of archived web pages. Just choose a repository, enter a url, and select two dates to see comparisons based on: page metadata, basic statistics such as file size and number of words, numbers of internal and external links, cosine similarity of text, line by line differences in text or code, and screenshots.
- Display changes in the text of an archived web page over time – This web app gathers all the available versions of a web page and then visualises changes in its content between versions – what’s been added, removed, and changed?
- Use screenshots to visualise change in a page over time – Create a series of full page screenshots of a web page over time, then assemble them into a time series.
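Two of the comparisons mentioned in the list above, cosine similarity of text and line by line differences, can be sketched using only the Python standard library. This is an illustration of the general techniques, not the notebook's own code: the function names are hypothetical, the similarity is calculated from raw word counts (other weightings such as TF-IDF are also common), and the two sample texts are invented.

```python
import difflib
import math
import re
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts, based on raw word counts."""
    words_a = Counter(re.findall(r"\w+", text_a.lower()))
    words_b = Counter(re.findall(r"\w+", text_b.lower()))
    # Dot product over the words the two texts share
    dot = sum(words_a[w] * words_b[w] for w in words_a.keys() & words_b.keys())
    norm_a = math.sqrt(sum(c * c for c in words_a.values()))
    norm_b = math.sqrt(sum(c * c for c in words_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has nothing to compare
    return dot / (norm_a * norm_b)


def line_changes(text_a, text_b):
    """Return (removed, added) lines when going from text_a to text_b."""
    diff = list(difflib.ndiff(text_a.splitlines(), text_b.splitlines()))
    removed = [line[2:] for line in diff if line.startswith("- ")]
    added = [line[2:] for line in diff if line.startswith("+ ")]
    return removed, added


# Invented text from two captures of the same page
old = "Welcome to our site\nOpening hours: 9 to 5\nContact us by fax"
new = "Welcome to our site\nOpening hours: 9 to 5\nContact us by email"

print(round(cosine_similarity(old, new), 2))
print(line_changes(old, new))
```

A similarity close to 1 suggests the two versions share most of their vocabulary, while the lists of removed and added lines show exactly where they differ.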
There are also possibilities for using Trove data creatively. For example you can create ‘scissors and paste’ messages from Trove newspaper articles.
Documentation and examples
All the Trove notebooks in the GLAM Workbench help document the possibilities and limits of the Trove API. The examples above can be modified and reworked to suit different research interests. Some notebooks also explore particular aspects of the API, for example:
- Trove API Introduction – Some very basic examples of making requests and understanding results.
- Today’s news yesterday – Uses the date index and the firstpageseq parameter to find articles from exactly 100 years ago that were published on the front page. It then selects one of the articles at random and downloads and displays an image of the front page.
- The use of standard licences and rights statements in Trove image records – Version 2.1 of the Trove API introduced a new rights index that you can use to limit your search results to records that include one of a list of standard licences and rights statements. We can also use this index to build a picture of which rights statements are currently being used, and by who.
- Random items from Trove – Changes to the Trove API meant that techniques you could previously use to select resources at random no longer work. This section documents some alternative ways of retrieving random-ish works and newspaper articles from Trove.
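As a rough sketch of the "Today's news yesterday" idea, the code below works out the date 100 years ago and assembles a set of search parameters that combine a date range with a front page limit. The date arithmetic is standard Python, but the parameter names (`zone`, `l-firstpageseq`, `encoding`), the range syntax in `q`, and the choice of range bounds are assumptions based on my reading of version 2 of the Trove API. Check them against the API documentation before relying on them. The function names are hypothetical and no request is actually sent.

```python
from datetime import date, timedelta


def same_day_years_ago(today, years=100):
    """Return the date `years` before `today`."""
    try:
        return today.replace(year=today.year - years)
    except ValueError:
        # 29 February doesn't exist in the target year, so fall back a day
        return today.replace(year=today.year - years, day=28)


def front_page_params(today, api_key):
    """Build (assumed) Trove search parameters for front page articles."""
    target = same_day_years_ago(today)
    previous = target - timedelta(days=1)
    return {
        "key": api_key,
        "zone": "newspaper",  # assumed name of the newspaper zone
        # Assumed range syntax for the date index; bounds may need adjusting
        "q": f"date:[{previous.isoformat()}T00:00:00Z TO {target.isoformat()}T00:00:00Z]",
        "l-firstpageseq": "1",  # assumed form of the firstpageseq limit
        "encoding": "json",
    }


params = front_page_params(date(2021, 8, 26), "YOUR_API_KEY")
print(params["q"])
```

From here, picking an article at random is a matter of applying `random.choice` to whatever list of results comes back.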
And while it’s not officially part of the GLAM Workbench, I also maintain the Trove API Console which provides lots of examples of the API in action.
In developing the GLAM Workbench I’m very aware that people will arrive with different levels of digital skill, confidence, and experience. That’s why I’ve been putting a lot of thought and effort into ways of providing a range of entry points.
Someone who might not identify as a ‘digital’ researcher can, with a single click, start up QueryPic and start exploring changes over time in Trove’s newspapers. This is possible because the GLAM Workbench is configured to make use of Binder, a service that spins up customised computing environments as needed.
Another researcher might start running the Trove Newspaper Harvester using Binder, but find that they want to run bigger and longer harvests. In that case, the GLAM Workbench offers a one-click installation of the Trove Newspaper Harvester on Reclaim Cloud. Unlike Binder, Reclaim Cloud environments are persistent, so you can run the harvester for as long as you want without the worry of interruptions.
Yet another researcher might want to understand how the Trove API works and the sorts of data that it makes available. By exploring the various notebooks they’ll find useful snippets of code they can try out in their own projects.
The GLAM Workbench connects outwards to make use of a range of other services – the notebooks run in Binder, Reclaim Cloud, and Docker; the code is all openly licensed and publicly available through GitHub and Zenodo; data is hosted on GitHub, CloudStor, and Zenodo; datasets can be explored using Datasette running on Glitch or Google CloudRun. I’m hoping that the new investments in HASS research infrastructure will embed a similar philosophy, connecting up existing services rather than starting from scratch.
This is just an outline of what the GLAM Workbench currently offers researchers wanting to make use of the data available from Trove. It’s all there now, publicly accessible, openly licensed, and ready to use – take it, use it, change it, share it. But there’s much more I’d like to do, both in regard to Trove and to encourage use of GLAM data more generally. I’m also interested in your ideas for new tools, examples, or data sources – what would help your research? You can add a suggestion in GitHub, or post a comment in the GLAM Workbench channel of OzGLAM Help.
This is a companion discussion topic for the original entry at https://updates.timsherratt.org/2021/08/26/glam-workbench-a.html