From the organization:

Notes from Nature - Labs

Check out Project Cybermander, where you can help us learn how salamanders change behavior to fight off fungal pathogens. Read more about the project here: https://blog.notesfromnature.org/2023/10/13/introducing-project-cybermander/

Click the orange Cybermander button at the lower right to get started!

Results

Label Babel was an experimental expedition that ran in 2018. We asked our volunteers to draw rectangles around the main labels and then used those “training” data to see if a computer could build a model using that data to find those labels itself. We had strong success with this. The IoU (Intersection over Union) value in the images below represents a measure of accuracy of our ability to detect the label. Values of 0.91 and above are considered excellent. As you can see, we can at least detect primary labels and even get a sense of typewritten or handwritten - that is good news!

In addition to the primary label, herbarium sheets often contain other labels which record important information about the specimen (barcodes, “determination labels”, which are expert identifications of the specimen, records of some processing that happened to a specimen, etc.) Rather than simply finding the primary label, the Label Babel 2 expedition asked volunteers to find all the labels, and to identify each label by type (typed text, handwritten text, both, or barcode). The data generated by the volunteers was used to refine the algorithm that detects labels, to pull the typed labels out of the images, and pull the text out of those labels using OCR (optical character recognition).

OCR has been around for a long time and this is certainly not the first attempt to do this for biodiversity specimens. There are many challenges to OCR of museum specimens (e.g. different handwritings and fonts) and no one solution has come forward to resolve these challenges. What we are striving to do is build off of what has been done in the past and develop a human-in-the-loop workflow. This means that we anticipate that some specimens can be transcribed automatically, but many will still require human eyes. In the OC - Are They Good or Not? expedition, we asked you to help us validate OCR quality by identifying whether and what type of errors were present, so we could separate labels with good OCR output from those with bad OCR output and improve our OCR process.

In the Label Babel 3 – Rise of the Machines expedition, our volunteers were asked for additional help with the label segmentation part of our workflow. Label segmentation is the process of automatically pulling the labels off of the specimen images. This process makes the next step of text extraction more effective since there is less clutter in the image. The results of Label Babel 2 were promising, but we wanted to further refine the model by asking volunteers to evaluate the results by identifying elements in the image that were incorrectly categorized by type (typed label, handwritten label, other), and typed labels that the model did not identify.

In the OCR Correction expeditions, volunteers are shown a typed label and the text that was extracted from it by OCR (optical character recognition), and asked to modify that text to correct any differences from the original label. The results will be used to further refine our OCR process.