
Recent Comments on Data processing

Pmason

May 14th 2025, 8:23 pm

The point (x, y) values reported in the data export are in pixel units of the uploaded subject image, not of what is displayed to the volunteer. So if your uploaded images are consistently the same size in pixels AND the placement of the spectrogram on the image is consistent, then you do not need to change your ymin and ymax by subject or volunteer.

If the placement of the spectrogram frequency axis varies relative to the image borders, you have an issue that can not be resolved except by knowing the frequency origin's position and scale in each subject. I would be very surprised if this is the case - I would expect that the way you converted the spectrograms to subject images was consistent, and that ymin, fmin and ymax, fmax are the same for all your subjects.

On a quick look it appears your images are 2107 x 1719 pixels with ymin = 1527 and ymax = 127 (within ±1 pixel). If ALL your images have been produced in a consistent way and are the same size and placement, then

freq <- 100 - (y - 127) * 100/1400, or more simply freq <- 100 - (y - 127)/14. (To verify, check that y = 127 gives frequency = 100 and y = 1527 gives frequency = 0.)

Again, where "y" is the y value of the reported points in the export for each classification.
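
Putting that together as a tiny R helper (the 127 and the 1400 pixel span come from my quick measurement above, so check them against your own images before relying on the output):

# Map a reported y pixel value to frequency, assuming y = 127 is the 100 kHz
# line and y = 1527 is the 0 kHz line, i.e. 1400 pixels for 100 frequency units
y_to_freq <- function(y) {
  100 - (y - 127) / 14
}

y_to_freq(127)   # should give 100
y_to_freq(1527)  # should give 0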

Jim O’Donnell
In reply to AnneListens's comment

May 14th 2025, 7:39 pm

The x,y pixel values for drawn marks should be in a coordinate frame that’s independent of screen size, where the top left corner is (0,0) and the bottom right corner is (naturalWidth, naturalHeight).

naturalWidth and naturalHeight are the intrinsic pixel width and height of each image, and should be included with each classification. They are fixed properties of the subject image, so they do not vary from one volunteer to another.

Each classification should also include clientWidth and clientHeight. Those are the pixel width and height of the image as it was shown on the screen, and can vary from one volunteer to another.
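
If you ever do find marks recorded in on-screen (client) pixels rather than natural pixels, converting between the two is just a matter of scaling by the ratio of the sizes. A rough R sketch (the argument names are illustrative, based on the dimension fields described above):

# Scale a point from on-screen (client) pixels to intrinsic (natural) pixels
client_to_natural <- function(x, y, clientWidth, clientHeight,
                              naturalWidth, naturalHeight) {
  c(x = x * naturalWidth / clientWidth,
    y = y * naturalHeight / clientHeight)
}

In the normal case the exported marks are already in natural pixels, so this is only useful as a sanity check.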

AnneListens

May 14th 2025, 6:59 pm

@Pmason, Yes, you're correct. That was just a poor choice of variable names on my part.
The more proper notation would be:

# R code to translate points to real-world frequency values
ymin <- 1041.521
ymax <- 86.21665
fmin <- 0
fmax <- 100

# Function to map pixel value to kHz (0 to 100)
pixel_to_kHz <- function(y) {
  freq <- ((y - ymin) / (ymax - ymin)) * (fmax - fmin) + fmin
  return(freq)
}

Now that I've collected data from a variety of users, it seems like we don't have consistent image viewing parameters. I'm guessing the "ymin" and "ymax" variables are different for different users, is that correct?

I'm not sure how to account for different screen sizes when translating pixels to actual frequency values. Do you have any guidance?

Pmason
In reply to AnneListens's comment

May 1st 2025, 6:28 pm

Are you sure that the x values in your formulas are not actually the y values of the points (x, y) that are reported in the data export?
ie replace the x's with y's everywhere, given the points are reported as (x, y), where x is the time axis and y is the frequency axis (pointing in the opposite direction)

AnneListens

April 30th 2025, 9:45 pm

Thank you so much @Pmason! Knowing the data is in a left-handed coordinate system was crucial!

For anyone else who might find this in the future, here's what I was able to figure out:

My spectrogram is only part of the uploaded image. The actual pixel numbers for my min and max values (0 kHz and 100 kHz) were defined by using the point tool to extract 5 separate points at both 0 and 100 kHz. The mean value of those 5 points was used to define my xmin and xmax.

# R code to translate points to real-world frequency values
xmin <- 1041.521
xmax <- 86.21665
fmin <- 0
fmax <- 100

# Function to map pixel value to kHz (0 to 100)
pixel_to_kHz <- function(x) {
  freq <- ((x - xmin) / (xmax - xmin)) * (fmax - fmin) + fmin
  return(freq)
}

Pmason

April 30th 2025, 3:30 am

Drawing marks such as points have x and y values that are in pixels of the uploaded subject image. The axes run from an origin in the top left corner, with x increasing to the right and y increasing down. As you note, the subject metadata has naturalWidth and naturalHeight, which are the overall pixel dimensions of the subject image, though hopefully your uploaded images are all the same size and you do not need to scale each one individually.

Though this does not matter in your case, the pixel axes and the normal out of the page form a left-handed coordinate system, which can introduce negative signs or complementary angles if you apply standard trig functions to calculate angles between lines defined by multiple points.
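
As a concrete illustration, here is a small R sketch of measuring the angle of a line between two marked points: negating the y difference (i.e. flipping the y axis back) recovers the usual anticlockwise-positive convention.

# Angle (degrees) of the line from point 1 to point 2.
# Image y increases downward, so negate the y difference to get
# the conventional mathematical (anticlockwise-positive) angle.
angle_deg <- function(x1, y1, x2, y2) {
  atan2(-(y2 - y1), x2 - x1) * 180 / pi
}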

I am not familiar with the panoptes aggregator, but generally I use a DBSCAN clustering algorithm to group and resolve the points in pixel space, then convert the best-fit/median point to the scaled values they correspond to in the science case. In the simplest case, such as I would expect for your project, the scaling is linear - an offset to the origin and a single scale factor for each axis.

So for a particular image, if the origin of the (time, frequency) space is at pixel (xo, yo), the time scale factor is ts in pixels per second, and the frequency scale factor is fs in pixels per Hertz, then a drawn point at (x, y) maps to (x - xo) / ts seconds and (yo - y) / fs Hertz. Note fs was defined as a positive number, so the order of yo and y is reversed from that of xo and x to account for the downward-pointing y axis. If all your images are the same size then xo, yo, ts and fs are constants; otherwise things get a bit messier - the easiest approach is to first scale your (x, y) values to a fixed overall image size and then use the origin and scale factors for that image size.
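
As a rough R sketch of that approach, using the dbscan package for the clustering (the eps and minPts values are placeholders to tune to your data, and the column names are assumed):

library(dbscan)

# points: data frame of marked points with pixel columns x and y
cluster_and_scale <- function(points, xo, yo, ts, fs, eps = 15, minPts = 3) {
  cl <- dbscan(as.matrix(points[, c("x", "y")]), eps = eps, minPts = minPts)
  points$cluster <- cl$cluster                      # cluster 0 is noise
  kept <- points[points$cluster > 0, ]
  # median x and y per cluster, then apply the linear scaling described above
  agg <- aggregate(cbind(x, y) ~ cluster, data = kept, FUN = median)
  agg$time_s  <- (agg$x - xo) / ts                  # seconds
  agg$freq_Hz <- (yo - agg$y) / fs                  # Hertz (y axis points down)
  agg
}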

Again, though this does not apply to your case, sometimes (example: aerial images where the imaged surface is at an angle to the camera axes) an affine transformation is required, which usually needs two or more diagonally distant points in the image with known coordinates, or additional information about the angle and orientation of the surface coordinates relative to the camera axes. Sometimes (example: satellite imagery which has been corrected for camera angle and aligned North-South) a single scale factor for x and y will suffice. Sometimes, if one has several known points across the image, one can do curve fitting to find the best linear or even higher-order scaling across the image.
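
For the curve-fitting case, a minimal R sketch with a few control points whose real-world coordinates are known (all the numbers here are invented for illustration):

# Control points: pixel (x, y) with known real-world time (s) and frequency (Hz)
ctrl <- data.frame(x = c(150, 800, 1500, 2000),
                   y = c(1500, 1100, 600, 200),
                   time_s  = c(0.5, 3.8, 7.3, 9.8),
                   freq_Hz = c(2, 30, 66, 95))
fit_time <- lm(time_s  ~ x + y, data = ctrl)   # full linear (affine) mapping
fit_freq <- lm(freq_Hz ~ x + y, data = ctrl)

# Map newly marked points through the fitted transform
new_pts <- data.frame(x = c(500, 1200), y = c(1300, 400))
predict(fit_time, new_pts)
predict(fit_freq, new_pts)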

AnneListens

April 29th 2025, 10:03 pm

I am working on the Ocean Voices project. I've defined a workflow to allow users to mark "spectral bands" (areas of interest) on images, and those points will correspond to frequency values in the spectrogram. I am able to download the data and use the Panoptes aggregator to extract the point values (y-values will indicate desired frequency values), but I'm not sure how to translate/map the points to actual frequency values. I noticed in the classification CSV file, each subject also has metadata fields including the subject dimensions, with both client width/height and natural width/height.

What are the extracted points relative to?

Thanks so much for any help or guidance you can provide!

eecanning

April 17th 2025, 8:57 am

Hi both, thanks so much!

@Pmason - following your suggestion, I compared the subject data with the manifest by numerical sort order, and this works! I can recover the information, and am so grateful; it did not occur to me that the orders would match, since that was the order in which the images were uploaded and turned into subjects. That was indeed the only way to match up the lists.

Just to add, in case anybody else encounters a similar thing in the future: I did double check, and the column header in the manifest CSV is indeed present, so it's still a mystery to me why it isn't in the metadata. However, I am fine with doing the reconciliation between the two lists based on numerical sort order, and am just so happy that I won't need to try to manually match things up! Thanks again 😃

am.zooni
In reply to Pmason's comment

April 17th 2025, 12:21 am

Peter, this is probably totally unrelated, but we had a couple of workflows in NfN-CalBug where the manifest file didn't have a header record (according to Michael). In the metadata window, the right side had the values that belonged to the subject, but the left side, where the field (column) names should be, contained the values for a different subject. It was a long time ago, so maybe all the subjects had the same left-side values, but I don't remember. Since classifiers didn't need the metadata info in those workflows, Michael didn't fix it. I assume someone on the back end figured out how to match up the classifications with the correct subjects so the data populated the right records in the museum's database.

See my discussion with Michael the second time it happened (2021). There's a link in that thread to an earlier case (2019) but most of that was me interacting with the museum researcher who hadn't even seen the metadata, and neither of us knew anything about how the metadata got to where we classifiers saw it.

Pmason

April 16th 2025, 11:56 pm

I can not think of any way this can happen without some weird situation - the manifest lists the file names, in order, for the media to be uploaded, so they have to be in the manifest, yet somehow they have no metadata field.
Perhaps there was no header on the file name column in the manifest? The column header is used as the metadata field name for the content in that column for each subject. I would be surprised if the subjects would upload without a header on that column, but maybe that is what happens?

In any case recovery is the big question. Do you have any link between the zooniverse subject number and the line in the manifest that generated the subject? Obviously there is a one-to-one match by numerical sort order, but it would be good if there were other fields to key from...
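
If it comes to that, a rough R sketch of the sorted-order match (this assumes the subject export has been filtered down to just the subjects from this manifest, that nothing was added or removed after upload, and the file names are placeholders):

# subjects: subject export rows for this upload; manifest: original manifest csv
subjects <- read.csv("my-project-subjects.csv")
manifest <- read.csv("manifest.csv")

subjects <- subjects[order(subjects$subject_id), ]   # numerical sort order
stopifnot(nrow(subjects) == nrow(manifest))           # row counts must agree
matched <- cbind(subjects["subject_id"], manifest)    # line the rows up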

It is possible to go back and modify subject metadata after the fact if we can build a solid link between the zooniverse subject number to update and the full metadata that should go with that subject.

There is also the possibility that the zooniverse subject (if it is a single image that has not been modified too much) may have EXIF data that contains the file name as assigned by the camera. That would depend on the camera make and model and what was done to the image to prepare it for upload. Still, it is likely worth the effort to save an image from a zooniverse subject and put it through one of those image analysis sites that show all the EXIF data for the file.

DM me if you want to discuss possible recovery options further.
Peter

eecanning

April 16th 2025, 2:35 pm

Hi, hoping someone can offer some advice! I have downloaded the subjects data export for my project, but the metadata field doesn't seem to contain all of the information that I uploaded in the manifest - namely, the name of the original image .png file isn't exported although this data was included in the original manifest. In fact, the test images that I had uploaded to a test workflow do have this information, but those actually used in a workflow don't, even though I did the exact same thing to upload both, with the same information attached, so I'm extra confused.

Has this happened to anybody else, or is there a way to access this information in another manner? I would really like to be able to match the original image file names to the subject IDs, so any help would be great! Thanks so much.

KathleenLonia

April 12th 2025, 7:51 am

Thank you very much for your detailed and prompt reply! I have passed it on to my superiors, and we have decided to include subject set names in the manifest when we upload sets, so that we can then easily filter on them.

Have a nice day!

Kathleen Lonia

Pmason

April 7th 2025, 5:07 pm

For the owner of a small project like yourself there are two ways of exporting the task responses, which are immutable - ie they can not be modified or changed. These are 1) the full data export, which gives you EVERY classification ever done, including those made while in Development mode and excepting only those done in Demo mode, and 2) the export by WORKFLOW, which gives you every classification done for the specified workflow, with the same caveats.

So, no, there is no way to get the export with only certain subject sets. However, it is a trivial issue to use Python to filter the exports on any of the fields that are in the export. So one can filter by workflow version, or subject_id, or, with a bit more effort, any of the subject metadata fields. But one thing is for some reason NOT present in the export: one can NOT filter by subject set unless you added that as a subject metadata field, ie the export does not have any record of which subject set the subject was in. Note that a subject can be in several subject sets, and there is no record in the classification record of which subject sets were linked to which workflow either.
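
The same filtering is only a few lines in R as well (column names as in the standard classification export; the version and subject_id cutoffs are placeholders):

classifications <- read.csv("my-project-classifications.csv")

# keep a single workflow version, e.g. 16.55...
keep <- classifications[classifications$workflow_version == 16.55, ]

# ...or only classifications of subjects above a chosen subject_id
# (assumes one subject per classification, so subject_ids is a single number)
keep <- keep[as.numeric(keep$subject_ids) >= 50000000, ]

write.csv(keep, "filtered-classifications.csv", row.names = FALSE)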

Plus, once a subject has been classified, commented on, or saved in someone's favourites or collections, it is not actually deleted when you delete a subject set it was in. The subject is still there, linked to your project, though it becomes difficult to find - we refer to these subjects as "orphaned". They can be listed using a Python script, and if you know the subject_id they can be displayed just as if they were still attached to a subject set.

In the project builder, up to 200 subject sets can be shown on the workflow builder page where you select which subject sets are to be active for the workflow.

Bottom line - there is little advantage to deleting subject sets unless you have a large number of them. There is an advantage, though, to adding subject set info to your subject metadata, copying workflows to new numbers at different stages of the project to keep the exports small, and using workflow version or subject_id ranges to further split out project stages at the data analysis stage.

This script provides an easily modified filter to select records from an export to be flattened (simplified). Various other blocks of code can be added to extract the responses from various tasks, or the filtered file can be passed on to any aggregation script just like the main export.

This script can do the same for a simple survey task, complete with aggregation and filtering for consensus.

If needed, I can help you with any of the scripts found in that repository.

KathleenLonia

April 7th 2025, 2:13 pm

Hello Zooniverse team!

I've looked carefully through the talk but it seems no one has addressed this question yet, and an answer from you would be invaluable.

My “Camera traps in Chinko” team was wondering if it's possible for deleted data from a subject set (photos that would have already been tagged but have been deleted) to no longer appear in the “Classification export” csv file?

Our goal: every week we'd like to create a new subject set of around 200 images and collect the data from the images processed at the end of the week. Once the data has been collected, we would delete it. On the other hand, we'd like to avoid having all the data from previous tags deleted each time in the “Classification export” csv file. Is there a way of doing this?

Thank you very much,

Kathleen Lonia

Pmason

April 4th 2025, 11:41 pm

I have added an additional script that reconciles the data from a Line transcription task directly from the data export, rather than from the caesar extracts. The output from this script is the same as that from the caesar extracts, but it uses the raw data export for the workflow, which is easier to request and download. It also does not need a cross reference to be built as a separate step, since the metadata is available directly in the data export.

Coleman Krawczyk

March 3rd 2025, 2:30 pm

That's great to hear! Also good to know that the Windows version still works 😄

hannah.slesinski

March 3rd 2025, 2:28 pm

@cmk24 Good morning! It is working now! My issue was that my numpy package was still corrupt, and I had to delete more folders and reinstall it again. But I was able to open the GUI! Thank you for all of your help. I really appreciate your fast responses!

Coleman Krawczyk
In reply to hannah.slesinski's comment

March 3rd 2025, 9:57 am

OK, this is looking promising.

It looks like it might be a path issue now. To check that, could you run the command

panoptes_aggregation --help

in your powershell terminal? If you see the help message that starts with

usage: panoptes_aggregation [-h] {config,extract,reduce} ...

the path is set up correctly. The good news is that even if the GUI does not work, it is just a wrapper for running this command with different arguments.

If not, make sure the path you posted above is on your path (see https://www.architectryan.com/2018/03/17/add-to-the-path-on-windows-10/ for instructions).

If the path is correct, the command panoptes_aggregation_gui should work. It looks like you have already tried this and did not see any error messages on the screen. This makes me think it is a bug in the aggregation code that I will need to look into. Long story short, there are a few extra hoops the code needs to jump through on a Windows system that I have not tested in several years and that might have changed with newer versions of Python (my development computer is a Mac, so the Windows version of the code does not get tested very often).

hannah.slesinski

February 28th 2025, 6:35 pm

Oh, I forgot to add I am pretty sure I installed Python from python.org, but it was a while ago so I can't be 100% sure. However, I have never heard of conda, so I don't think I would have downloaded it from there.

hannah.slesinski

February 28th 2025, 4:56 pm

@cmk24 Thank you so much for your assistance! Unfortunately, it is still not working. I followed the directions in the link you provided and deleted 3 folders. I then made sure numpy was installed, which I think it is. I then uninstalled the panoptes aggregation package and reinstalled it. When I installed it, I got a new warning:

  • WARNING: The script panoptes_aggregation.exe is installed in 'C:\Users\tuu16701\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\Scripts' which is not on PATH. Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.

I asked chatGPT what this means, and learned I needed to add a new path to my system's environment variables, which I did. I then restarted PowerShell and tried to install the aggregation package again. This time, it seemingly worked with no error or warning! Then, I entered the code to install the GUI, also with no error or warning.

Then, I entered the code to open the GUI, and nothing happened. Like it seemingly accepted the code, because it gave me the option to enter a new line, but nothing opened on my computer and there were no errors, warnings, or any text at all. I tried restarting powershell and doing this over again, but I got the same result.

Any advice on where to go from here? Thank you for your help so far.
