Research

With your help, California Plants to Pixels will enable us to harness an incredible amount of information from the California plant specimens stored in our herbarium. Species names, specimen collection dates and locality, associated species - it’s all there, just waiting to be seen and used - we just have to make it digitally available first! Once available, the information in these digitized collections can be used in so many ways to better understand the world around us. Some of that work is being done right now, with new specimen-based research being published every year. You can find out more through these links detailing research done through our herbarium and at the Academy.

Not only is this research important, but it is of special significance to the state of California and neighboring bio-regions. California has over 6,500 native plant taxa and over 40% of them are endemic, meaning that they are found nowhere else in the world. In addition, many of those species (at least 1/3) are threatened or endangered. The threats to California's native flora come from many sources such as population growth, changes in land use patterns, invasive species and climate change. In order for researchers and land managers to understand how changes will affect native plants, it is necessary to have baseline data about where and how abundant these species were in the past and where things stand now - i.e., understanding our historical and current plant biodiversity.

The California Academy of Sciences (CAS) herbarium, our collection of dried plant specimens, is one of the most important resources for California plant diversity. It is the largest collection in the western U.S., and the sixth largest collection in the United States. The CAS herbarium includes approximately 2.3 million plant specimens, and an estimated 1 million of these specimens are from California and neighboring bio-regions. Sometimes the specimens necessary to complete a scientist's research are located in herbaria scattered around the world, making travel expensive and time-consuming. Meanwhile, shipping these often fragile specimens puts them at risk of damage or loss and has also become increasingly expensive. Until recently it has been very difficult for scientists to access data from our collections without physically coming to our herbarium and collecting the label data themselves, or borrowing specimens from our institution, which requires specialized storage and is only possible in special circumstances.

Recognizing these logistical hurdles underscores the need for new ways to share our collection with the scientific community and public. The key to overcoming these problems is full digitization of our collection so specimens can be viewed and shared by anyone in the world. Until very recently, only about 20% of our California specimens were fully digitized. “Digitization of a specimen” means that a specimen is imaged, the label data is transcribed into a digital format, and the locality where the specimen was collected is determined and recorded in a process called georeferencing. Although the physical specimens are accessible through our herbarium, compiling and sharing critical time series and geographic data remains difficult until our collection is digitized and fully accessible online. Fully digitized specimens allow the maximum amount of data to be extracted from the collection. For example, locality data can be used to understand species ranges, collect climate information, and record time of flowering. All of these pieces of information provide a comprehensive description of the plant and where it came from.

With generous funding from the Gordon and Betty Moore Foundation, we are tackling this problem in two simultaneous and complementary efforts (see workflow figure below). First, we are partnering with Picturae to image and generate skeletal labels of our California specimens. In an exciting and efficient twist, this imaging is being performed using a conveyor belt, accomplishing imaging of 1 million of our specimens at an unprecedented rate - on average between 3,500 and 4,500 images per day! In addition to imaging, Picturae has partnered with Alembo to transcribe several key fields for each image that will then serve as the foundation for final data transcription and georeferencing.

Key Steps in Herbarium Digitization

Second, to complete transcription and perform georeferencing, we are launching a virtual project in partnership with Notes from Nature - the reason why you’re here on this webpage, right now! The image and its label are made available online for access by a large community of people—including our herbarium and research staff, volunteers, and community scientists—to complete the label transcription using the digital image as a guide. We are excited to utilize advanced elements of the Digi-Leap toolkit (read more about that below) and georeferencing workflows to maximize efficiency, creating digital versions of our specimens that can be used far and wide.

Video showing imaging system


More about Digi-Leap

Digi-Leap is a tool that aims to automate transcription of natural history specimens. The goal is not to replace human transcription, but to make better use of human effort. For example, we know that certain kinds of information, like typewritten dates, are easier for computers to detect and classify, whereas humans are much better at reading and interpreting more complex handwritten information. The ultimate goal of Digi-Leap is to capture the core fields (e.g., scientific name, collection date) for typewritten labels in an automated way.

The Digi-Leap tool has three basic tasks. The first step is to isolate the label from the specimen sheet. The second step is to use OCR (optical character recognition) to convert a label image containing text into a machine-readable text format. The last step is to put the machine-readable text into a standardized format. For those wanting to dive in deeper, the first two steps are outlined in an open access publication called Humans in the loop: Community science and machine learning synergies for overcoming herbarium digitization bottlenecks. The last step is outlined in a publication called Ensemble automated approaches for producing high quality herbarium digital records, which is currently submitted for publication. Feel free to contact us for a current draft.

Steps 1 & 2: Isolate the labels and extract machine-readable text.

Step 3: Put machine-readable text into a standardized format.
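
For readers who like to see ideas in code, here is a minimal sketch of what the third step might look like for a single field: pulling a typewritten collection date out of OCR text and writing it in a standard form. This is not Digi-Leap's actual code. The function names, the output field names, and the sample label are all made up for illustration, and the pattern only handles dates written as day, month name, year.

```python
import re
from datetime import date

# Map three-letter month prefixes to month numbers (jan -> 1, ..., dec -> 12).
MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}

# Matches dates such as "12 May 1932" or "3 Sept. 1901" (day, month name, year).
DATE_RE = re.compile(r"\b(\d{1,2})\s+([A-Za-z]{3})[A-Za-z]*\.?,?\s+(\d{4})\b")


def parse_collection_date(ocr_text):
    """Return the first valid day-month-year date as an ISO string, or None."""
    for match in DATE_RE.finditer(ocr_text):
        day, month_name, year = match.groups()
        month = MONTHS.get(month_name.lower())
        if month is None:
            continue  # the word after the day was not a month name
        try:
            return date(int(year), month, int(day)).isoformat()
        except ValueError:
            continue  # impossible date such as 31 Feb; keep looking
    return None


def standardize_label(ocr_text):
    """Build a small record with hypothetical field names from raw OCR text."""
    lines = [line.strip() for line in ocr_text.splitlines() if line.strip()]
    return {
        "verbatim_text": "\n".join(lines),
        "collection_date": parse_collection_date(ocr_text),
    }


if __name__ == "__main__":
    # A made-up label, not a real specimen record.
    sample = "Quercus agrifolia\nMarin Co., California\n12 May 1932\nColl. A. Collector"
    print(standardize_label(sample)["collection_date"])
```

A date that cannot be parsed comes back as None rather than a guess, which is the kind of case that would be left for a human transcriber.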

Most of the Digi-Leap process takes place outside of the Zooniverse system before the subjects get loaded into the system. However, during the development of Digi-Leap, the developers at the Zooniverse built two new tools to help with the overall process. We are likely to use some of these tools on the subjects in Plants to Pixels, so we wanted to introduce them here. The first is called Text from Subject and the second is called the Highlighter tool.

The Text from Subject task allows volunteers to correct the OCR text.

The Highlighter tool allows volunteers to select a category for the text and then highlight text that belongs to that category.
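
To give a rough idea of how highlighted text could become labeled data, the sketch below turns a list of highlights into fields. It is an illustration only: the `Highlight` structure, its character offsets, and the category names are assumptions made for this example, not the format the Zooniverse tools actually produce.

```python
from dataclasses import dataclass


@dataclass
class Highlight:
    """One hypothetical volunteer highlight: a category plus character offsets."""
    category: str
    start: int  # index of the first highlighted character
    end: int    # index one past the last highlighted character


def fields_from_highlights(text, highlights):
    """Group highlighted snippets by category, in the order they appear in the text."""
    grouped = {}
    for highlight in sorted(highlights, key=lambda h: h.start):
        snippet = text[highlight.start:highlight.end].strip()
        if snippet:
            grouped.setdefault(highlight.category, []).append(snippet)
    # Join multiple snippets of the same category with a space.
    return {category: " ".join(parts) for category, parts in grouped.items()}


if __name__ == "__main__":
    # A made-up line of label text, not a real specimen record.
    text = "Quercus agrifolia, 12 May 1932"
    name_start = text.index("Quercus agrifolia")
    date_start = text.index("12 May 1932")
    highlights = [
        Highlight("collection_date", date_start, date_start + len("12 May 1932")),
        Highlight("scientific_name", name_start, name_start + len("Quercus agrifolia")),
    ]
    print(fields_from_highlights(text, highlights))
```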