Human-Machine Collaborative Transcription

Please help us out by filling in this short feedback form: https://forms.gle/EW9ixiHu8MMvGE3z7

Research

Project Aims

This project aims to test the results of a machine transcription effort, using a model trained on data produced by Zooniverse volunteers via the Anti-Slavery Manuscripts (ASM) project, which ran from 2017 to 2020. Please visit the project link for full details on the dataset.

History & Method

This project was inspired by ongoing work to provide the best possible tools for crowdsourced text transcription. The number of transcription projects on the Zooniverse platform has grown exponentially in recent years. Since 2017, we have made a concerted effort to create new and improved tools for text data collection and analysis. The approach tested here combines machine and human classifications, with the aim of making the transcription process more efficient and so making better use of volunteer effort.

Our data science team at the University of Minnesota trained a handwritten text recognition (HTR) model that predicts the position of lines on an image of a page of handwritten text and generates transcriptions for the recognized lines. The Adler-Zooniverse team built new infrastructure for the platform that supports ingestion of machine-generated data in Project Builder projects. We combined these two efforts to create a Zooniverse project that uses the Transcription Task, but with a twist: the 'first pass' annotations and transcriptions have been generated by a machine learning algorithm. We now need volunteers to help us 'correct' the machine!

This project is intended to measure the reliability of our machine-learning model, as well as to help us collect feedback on the user experience for these 'correct-a-machine' workflows.
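
One common way to quantify how far a machine transcription is from a volunteer-corrected version is the character error rate (CER): the edit distance between the two strings divided by the length of the corrected text. The project does not state which metric it uses, so the sketch below is only an illustration under that assumption. The function names and the example strings are hypothetical and are not taken from the project's code or data.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning `a` into `b`."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(machine: str, corrected: str) -> float:
    """CER of a machine transcription, treating the volunteer-corrected
    text as the reference (an assumption made for this sketch)."""
    if not corrected:
        raise ValueError("corrected (reference) text must not be empty")
    return edit_distance(machine, corrected) / len(corrected)


# Hypothetical example line, not taken from the project's data.
machine_line = "Dear Freind"
corrected_line = "Dear Friend"
print(character_error_rate(machine_line, corrected_line))
```

A lower value means volunteers had less to change. Averaging the score over many lines would give a rough picture of how reliable the first-pass transcriptions are, though it says nothing about whether the predicted line positions were correct.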

In our original research proposal, we wrote:

The full research effort is in four parts:

1. Training a machine-learning model for HTR using existing transcription data from the ASM project;
2. Building a data pathway for uploading machine transcription data into the Zooniverse platform;
3. Creating a new workflow on Zooniverse to combine the machine-generated transcriptions with volunteer effort, using existing tools for collaborative text transcription;
4. An experiment to test the ASM HTR model on other, similar datasets from the University of Minnesota's Archives & Special Collections.

Output

The outputs of the full effort will be the data ingest pathway (step 2 above) and an evaluation of best practices for combining human and machine effort in the production of high-quality transcription data.

Funding

This project was created with support from a 2020 Digital Extension Grant from the American Council of Learned Societies.