A place for chatting about how to process data from your Zooniverse project, and to share scripts for doing so.
10 Participants
16 Comments
Hi both, thanks so much!
@Pmason - following your suggestion, I compared the subject data with the manifest by numerical sort order, and this works! I can recover the information, and am so grateful. It did not occur to me that the orders would match, since that was the order in which the images were uploaded and turned into subjects. That was indeed the only way to match up the lists.
Just to add, in case anybody else encounters a similar thing in the future: I did double-check, and the column header in the manifest CSV is indeed present, so it is still a mystery to me why it isn't in the metadata. However, I am fine with doing the reconciliation between the two lists based on numerical sort order, and am just so happy that I won't need to match things up manually! Thanks again
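For anyone wanting to automate this kind of reconciliation, here is a minimal Python sketch of matching by numerical sort order, under the assumption (as above) that both lists were created in the same upload order. The filenames, column names, and the numeric_key helper are all hypothetical stand-ins, not taken from a real project.

```python
# Hypothetical example: recover manifest metadata for exported subjects
# by matching on numerical upload order. All field names are illustrative.
manifest_rows = [
    {"filename": "img_2.jpg", "archive_id": "A-002"},
    {"filename": "img_10.jpg", "archive_id": "A-010"},
    {"filename": "img_1.jpg", "archive_id": "A-001"},
]
subjects = [
    {"subject_id": 903, "filename": "img_10.jpg"},
    {"subject_id": 901, "filename": "img_1.jpg"},
    {"subject_id": 902, "filename": "img_2.jpg"},
]

def numeric_key(name):
    """Sort 'img_10.jpg' after 'img_2.jpg' by its embedded number,
    rather than alphabetically."""
    digits = "".join(ch for ch in name if ch.isdigit())
    return int(digits) if digits else 0

manifest_sorted = sorted(manifest_rows, key=lambda r: numeric_key(r["filename"]))
subjects_sorted = sorted(subjects, key=lambda s: numeric_key(s["filename"]))

# Pair the two lists positionally and carry the manifest metadata across.
reconciled = [
    {**subj, "archive_id": man["archive_id"]}
    for subj, man in zip(subjects_sorted, manifest_sorted)
]
print(reconciled[0])  # {'subject_id': 901, 'filename': 'img_1.jpg', 'archive_id': 'A-001'}
```

In practice you would read both lists from CSV and key on whatever filename or index column your manifest actually contains; the positional zip is only safe if you are confident both lists really share the same order.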
3 Participants
4 Comments
Thank you very much for your detailed and prompt reply! I have passed it on to my superiors, and we have decided to assign subject set names when we upload sets via the manifest, so that we can then easily filter them.
Have a nice day!
Kathleen Lonia
2 Participants
3 Comments
I have added an additional script that reconciles the data from a Line transcription task directly from the data export, rather than from the caesar extracts. The output from this script is the same as that from the caesar extracts, but it uses the raw data export for the workflow, which is easier to request and download. It also does not need a cross-reference to be built as a separate step, since the metadata is available directly in the data export.
1 Participant
5 Comments
That's great to hear! Also good to know that the Windows version still works
2 Participants
7 Comments
Hi, @leslieazwell . I'd suggest that a private email address is best not posted here where all can access it. The Zooniverse message system is a safer choice.
3 Participants
4 Comments
3 Participants
4 Comments
Hi @SophieMu -- I was able to investigate this CLI installation issue, and my best hypothesis at the moment is that the git+https based installation (pip install git+https://github.com/zooniverse/panoptes-cli.git) is causing an issue of some sort.
Testing on a Windows machine with Python 3.9 or 3.11, using clean conda environments, and using pip to install packages (i.e., not using conda), I was able to replicate the behavior you described (hang when executing panoptes --help) after installing via the git+https option. Installing the CLI via pip (w/ default PyPI-hosted release) resulted in normal, expected behavior. Therefore, I recommend installing via the default pip call: pip install panoptescli. I'm not sure about the underlying cause, but I was happy to at least replicate and understand the conditions triggering your problem.
Regarding python-magic and python-magic-bin package dependencies: I confirmed that image subject uploads work without the additional installation of python-magic-bin. Without the additional python-magic-bin installation, users will see the installation warning (i.e., "WARNING:panoptes_client:Broken libmagic installation detected.") and not be able to upload non-image filetypes (e.g., video, audio), but the primary image upload functionality is operational.
For cases where full libmagic is desired -- either to upload non-image data or simply to stop the WARNING from printing on each execution -- it is assumed that additional, OS-specific installation is required. See the Panoptes Client docs for info, but briefly: installation of the libmagic libraries is recommended via Homebrew for Macs, and via python-magic-bin for Windows. I'll note that there's an ongoing discussion on the python-magic repo regarding the annoyance of secondary library installation, and the potential for bundling libraries (particularly for Windows) as part of the main python-magic package. I will keep a close eye on how that conversation pans out, and pursue improvements regarding libmagic installation (via adding the python-magic-bin install by default as part of Client install on Windows, or improving printed warnings and docs re: secondary install instructions).
3 Participants
8 Comments
As I understand it, the problem with working with R and the Zooniverse data export is that the JSON packages for R expect a slightly different form of JSON. Although JSON is formally specified, implementations differ in practice in several details. However, the JSON strings present in the subject data and annotations columns have a very simple structure that can be parsed even without a JSON decoder.
Basically, the annotations entry is a list of dictionaries, one for each task, with a key-value pair "task: TX" where X is the number of the task, and a further key-value pair "value: 'whatever the responses to that task were'". One can simply use string methods to find the response value for each task TX, and hence build a list of tasks and responses. In some cases the responses are a list, and for combo tasks a list of even further dictionaries, but the project owner knows exactly what the order and structure of the annotations column is, and it is quite simple to parse a known structure using string methods alone. In the simplest case, just search the string for the substring 'task: ' and grab the next part of the string, 'TX,' (i.e. up to the comma), then continue to find the next 'value: ' and extract the response string - wash, rinse, and repeat. If the value begins with a '[' then you collect everything up to the next ']'. If the value contains '{' then you are working with a combo task (which the project owner would know anyway) and the process has to be repeated within each dictionary in the combo task response.
The subject data can be parsed similarly, though in that case, if you just want to recover the metadata, you already know the metadata field names you want and hence what strings to search for. The metadata values are simply the strings inside a set of double quotes immediately after the 'fieldname:' text.
It sounds more complicated than it is: it can be done in only a few lines of code in Python, and I would assume the same for R, once you are using string methods. All in all, the code is not much more complex than decoding the JSON string and then using the resulting data structure to pull out tasks and values.
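To illustrate the string-method approach described above, here is a minimal Python sketch. The sample annotations string, task labels, and response values are illustrative only, not taken from a real export, and the function handles just the simple cases discussed (a short-answer task and a list-valued task).

```python
def parse_annotations(ann):
    """Pull (task, value) pairs out of an annotations JSON string
    using only string methods - no JSON decoder required."""
    pairs = []
    pos = 0
    while True:
        t = ann.find('"task":', pos)
        if t == -1:
            break
        start = ann.find('"', t + len('"task":')) + 1  # opening quote of the label
        end = ann.find('"', start)
        task = ann[start:end]
        v = ann.find('"value":', end) + len('"value":')
        if ann[v] == '[':        # list response: collect everything to the ']'
            close = ann.find(']', v)
            value = ann[v:close + 1]
        elif ann[v] == '"':      # simple string response
            close = ann.find('"', v + 1)
            value = ann[v + 1:close]
        else:                    # bare number/boolean: read to the next ',' or '}'
            close = v
            while ann[close] not in ',}':
                close += 1
            value = ann[v:close]
        pos = close
        pairs.append((task, value))
    return pairs

# A simplified annotations string (illustrative only):
sample = '[{"task":"T0","value":"Yes"},{"task":"T1","value":["Cat","Dog"]}]'
print(parse_annotations(sample))  # [('T0', 'Yes'), ('T1', '["Cat","Dog"]')]
```

As noted above, a combo task (a value containing '{') would need the same process repeated within each inner dictionary, which a project owner could add since they know the structure of their own workflow.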
8 Participants
23 Comments
It has become common for camera trap projects to have early retirement for "Nothing There", or occasionally for species that are not of interest in the specific project. Sometimes these early retirement limits are set very low - for example, as few as 3 votes.
It has also been noted recently that there has been an increase in malicious classifications, or at the very least in volunteers who produce garbage for every classification they submit. For example, Where's Walleye had 10 users who contributed approximately 2000 classifications, or about 7% of the total, which were complete garbage: either every classification done by that user was "Nothing There" (many dozens of subjects per user, not just a few), or, in a few cases, every possible species was selected for every subject classified, and in some cases the choices of both species and the question responses were completely random.
It is relatively easy to detect and remove these classifications; however, the caesar early retirement rules make it difficult to collect more valid classifications.
Firstly, with a low retirement limit, one malicious classification has a significant impact on whether the subject is retired or not - e.g. possibly as much as a third of the input. This is especially an issue with users who classify several dozen subjects, always choosing "Nothing There".
Secondly, if the subject is unretired and sent back for more classifications, the caesar rules may still count the malicious vote and retire the subject no matter what the replacement vote is. It depends on how the rule was stated and what filters were set, but a subject that should go on to receive the normal retirement limit of votes may be retired early again after only one more vote, even if that vote would "break" the early retirement rule. This is especially true for count reducers, where the simple count of "Nothing There" responses triggers the action.
Finally, even if a subject that was retired too early can be unretired and sent on for its full quota of classifications, it will be well behind its peers. At the end of the data run there will be a small group of subjects still needing several classifications each. This is similar to having a very small subject set, and it has the same issues: volunteers see "already seen" banners very quickly, and it is difficult to cleanly finish the set with your core volunteers.
For these reasons, I advise project owners to think very carefully about the structure of their caesar rules and the effect a significant fraction of malicious activity would have under those rules, and to be prepared to lose some subjects if there is a combination of garbage classifiers and very early retirement conditions.
2 Participants
6 Comments