Well, maybe you are missing out ... because Brazil has even more results than we do: currently 1,970,000 results for "projetos de ciência cidadã" (citizen science projects): https://www.google.com.br/?gws_rd=ssl#q=projetos+de+ciência+cidadã It might take some elbow grease, but if you go country by country, Googling in each country's language, there might be some awesome surprises and results.
Edit:
Weird, Google changes the results count every time the page is accessed anew. There are now 2,190,000 results for Brazil ...
There are certainly cases where real-time reconciliation of free-transcription fields, performed as volunteers complete each classification, could be used to eliminate a significant number of classifications. However, this is not yet practical. Zooniverse has a limited ability to compare volunteer responses in near real time using a tool called Caesar, and to take various actions based on the result (such as retiring a subject and/or adding it to a second subject set used in a different workflow), but the response comparison does not currently allow much processing, nor does it handle the minor differences so common in transcription.
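To make the difficulty concrete, here is a minimal sketch (in Python, purely illustrative - the threshold is arbitrary and this is not anything Caesar actually supports today) of the kind of fuzzy comparison a near-real-time reducer would need in order to tolerate those minor differences:

```python
# A minimal sketch of a "close enough" comparison for two transcriptions.
# Exact string equality fails on the small spacing/punctuation differences
# common in transcription; difflib's similarity ratio tolerates them.
# The 0.95 threshold is an arbitrary choice for illustration.
from difflib import SequenceMatcher

def transcriptions_agree(a: str, b: str, threshold: float = 0.95) -> bool:
    """True if two transcriptions are similar enough to count as a match."""
    a, b = a.strip().casefold(), b.strip().casefold()
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(transcriptions_agree("John A. Smith", "John A Smith"))   # True: punctuation only
print(transcriptions_agree("John A. Smith", "Jane B. Smith"))  # False: real disagreement
```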
It is possible to export the classifications every 24 hours and attempt reconciliation of the subjects that have hit some limit in the last day, but this requires significant management and becomes very difficult as the workflow nears completion.
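As a rough illustration, the daily check might amount to little more than counting classifications per subject in the standard classification export (the file name and the limit of 3 here are assumptions):

```python
# Count classifications per subject in a Zooniverse classification export
# and flag the subjects that have reached the retirement limit. The
# "subject_ids" column is part of the standard export format.
import csv
from collections import Counter

RETIREMENT_LIMIT = 3  # assumed limit for this illustration

counts = Counter()
with open("project-classifications.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        counts[row["subject_ids"]] += 1

ready = [sid for sid, n in counts.items() if n >= RETIREMENT_LIMIT]
print(f"{len(ready)} subjects have {RETIREMENT_LIMIT}+ classifications to reconcile")
```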
There is also the danger that while two transcriptions agree to some level, BOTH may be wrong. Note this is not made any better with a retirement limit of three: the two matching but incorrect text strings are still accepted as good. Setting the retirement limit higher does reduce this risk, and is the reason we chose 5 in some phases. It is a double-edged sword: more transcriptions increase the computational effort and allow more opportunity for variation, and if the reconciliation algorithm tries to keep additional text when presented with two versions, one of which has added text (as reconcile.py does), then many of the final reconciled texts will contain material that may not be valid.
One problem is that, normally, transcriptions of several fields are combined in the same workflow. To retire a subject would require that all the fields have been transcribed and successfully reconciled (i.e. an exact match, or "close enough", with only small differences in spacing or punctuation that are judged to be acceptable).
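A minimal sketch of such a per-field test follows; the normalisation rules (ignore case, spacing, and punctuation) are just one possible definition of "close enough":

```python
# Retire a subject only when every field on it has reconciled, where
# "reconciled" means all transcriptions match after normalisation.
import re

def normalise(text: str) -> str:
    text = re.sub(r"[^\w\s]", "", text.casefold())  # drop punctuation
    return re.sub(r"\s+", " ", text).strip()        # collapse spacing

def field_reconciled(versions: list[str]) -> bool:
    """All transcriptions of one field agree after normalisation."""
    return len({normalise(v) for v in versions}) == 1

def subject_retirable(fields: dict[str, list[str]]) -> bool:
    return all(field_reconciled(versions) for versions in fields.values())

print(subject_retirable({
    "fullname": ["John A. Smith", "John A Smith", "john a. smith"],
    "address":  ["12 High St.", "12 High St", "12 High Street"],  # genuine mismatch
}))  # False: the address field has not reconciled
```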
We have considered this for the WWI burial cards... not so much for reducing classifications where there were good matches, but for those cases where, after three classifications, there was NOT a match. For a number of reasons, so far we have proceeded in the normal pattern: setting a fairly low retirement limit (generally 3, occasionally 5), completing the workflow, then pulling out the remaining issues and feeding those back into secondary workflows or private review for resolution.
So far we have only run a few of the verification and resolution workflows, and as we have proceeded through the various phases (around twelve so far) we are building a rather daunting pile of work that remains. Below is the summary for one phase of the project, the Emergency address section. This phase is fairly typical, except for the "other" field, which I am still trying to sort out (most of the "other" single transcripts are bits found somewhere else in the other volunteers' transcriptions). As you can see, if all the single-transcript and no-match fields remain to be resolved, there is a significant amount of work that still needs to be done:
**Reconciliation Summary** (the columns from Unanimous Matches through One Transcript were grouped as "Reconciled"; Mean Mode Range is blank since all of these fields are text)

| Field | Type | Unanimous Matches | Majority Matches | Mean Mode Range | Fuzzy Matches | All Blank | One Transcript | Total | No Matches |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| fullname | text | 42,666 | 29,528 | | 1,215 | 1,179 | 40 | 78,442 | 3 |
| address | text | 40,475 | 30,707 | | 1,494 | 1,238 | 41 | 78,440 | 5 |
| other | text | 3,708 | 5,350 | | 2,224 | 61,181 | 3,773 | 78,293 | 152 |
| notified_raw | text | 47,525 | 20,350 | | 1,172 | 6,411 | 190 | 78,435 | 10 |
| notified_regex | text | 53,540 | 15,276 | | 1,772 | 6,407 | 194 | 78,434 | 11 |
| sketch | text | 31,135 | 1,674 | | 22 | 45,178 | 412 | 78,445 | 0 |
| photo | text | 57,848 | 5,855 | | 81 | 14,204 | 289 | 78,443 | 2 |
Almost all "one transcript" and "no match" cases indicate a transcription error of some sort - often simply things put in the wrong place, or information from some other area mistakenly transcribed into a field - but what has become fairly obvious is that, in many cases, there is an issue with the card itself. For the WWI burial cards, issues include erasures, information out of place on the card, various typographical errors (corrected or not by volunteers), and odd punctuation/spacing/formats/short forms. Resolving some issues requires the card to be located and viewed, and some editorial authority, and is not something that will be easy to set up in a workflow for volunteers to resolve.
I've run into something like this when using the Panoptes JS client, in NodeJS, to make hundreds of API requests, so maybe this isn't a Python problem? In these cases, retrying a failed request always succeeds.
In the logs for a job that reads ~11,000 subjects from ~190 subject sets (one request per subject set, I think), I see four failed requests. These are requests where www.zooniverse.org either timed out or dropped the connection. In each case, retrying the failed request once succeeds.
#30 215.0 retrying /subjects?subject_set_id=98241&page_size=100&page=1, attempt: 1
#30 215.0 retrying /subjects?subject_set_id=98908&page_size=100&page=1, attempt: 1
#30 217.3 { id: '98908', subjects: 43 }
#30 217.4 { id: '98241', subjects: 94 }
#30 285.7 retrying /subjects?subject_set_id=98904&page_size=100&page=1, attempt: 1
#30 285.9 retrying /subjects?subject_set_id=111058&page_size=100&page=1, attempt: 1
#30 287.8 { id: '98904', subjects: 41 }
#30 288.2 { id: '111058', subjects: 36 }
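For comparison, the same retry-once idea is easy to sketch with the Python panoptes-client too. This is only a sketch: the subject-set ID comes from the log above, and exactly which exceptions the client raises on a timeout or dropped connection is an assumption:

```python
# Fetch the subjects in a subject set, retrying once on failure, in the
# spirit of the log output above. Anonymous access works for public data;
# authentication via Panoptes.connect() is omitted here.
from panoptes_client import SubjectSet
from panoptes_client.panoptes import PanoptesAPIException

def fetch_subjects(subject_set_id: str, retries: int = 1):
    for attempt in range(retries + 1):
        try:
            subject_set = SubjectSet.find(subject_set_id)
            return list(subject_set.subjects)
        except (PanoptesAPIException, ConnectionError) as err:
            if attempt == retries:
                raise
            print(f"retrying subject set {subject_set_id}, attempt: {attempt + 1} ({err})")

subjects = fetch_subjects("98241")
print({"id": "98241", "subjects": len(subjects)})
```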