Hello everyone,
I'm writing my doctoral thesis on speech synthesis and want to evaluate how synthesized speech samples are perceived in terms of naturalness and intelligibility. To that end, I designed a study in which workers listen to eight audio files and rate both aspects on a scale from 1 to 5.
I'm new to mTurk, and my first run of the study unfortunately didn't go well, as almost all submissions appeared to come from bots. For this reason I had to reject ~95% of the assignments.
The failed study
I set the acceptance rate to >=85% and the country to USA, because only native speakers of American English should participate (for linguistic reasons). I didn't place any other restrictions on participation. One HIT took around 90 s to complete, and I provided 60 HITs with a maximum of 9 assignments each. There were three bonus tiers, depending on the number of HITs completed and the quality of the ratings relative to all workers:
- base reward: $0.10 (~$4/hour)
- bonus of $0.10/HIT (for a total of $0.20/HIT, ~$8/hour) if you submit 20 or more HITs or
- bonus of $0.30/HIT (for a total of $0.40/HIT, ~$16/hour) if you submit 20 or more HITs and your results are among the top 50% or
- bonus of $0.50/HIT (for a total of $0.60/HIT, ~$24/hour) if you submit 20 or more HITs and your results are among the top 10%.
After I launched the survey, submissions came in almost immediately that were clearly non-human, since one has to read the instructions for at least 5 minutes to know what to do. All 9 assignments per HIT (540 assignments in total) were completed after about an hour!
I filtered out and rejected all bot-like submissions (which was almost every assignment). For the rest, I couldn't find a justifiable reason to reject them, but I suspect they were bot responses as well.
The new study
In order to improve my study, I made some adjustments:
- raising the acceptance rate requirement to >98%
- adding a requirement: >=1000 accepted HITs
- adding a qualification test: this test determines the demographic group and whether someone is eligible for the study (the first sketch after this list shows how these requirements map onto the API)
- adding a hearing ability test (~10 min): three types of listening tasks: solving a spoken math exercise, selecting the numbers that were said in the audio, and deciding which of two files has the better sound quality
- adding an English proficiency test (~5 min): the test contains questions where the worker has to complete a sentence by selecting one out of four words
- adding trapping questions: one audio file in each HIT is a CAPTCHA containing the audio 'This is an interruption. Please select the score 3 on both scales to confirm your attention now.' If the worker doesn't select the specified score, the assignment is automatically rejected and republished for another worker
- adding gold questions: speech samples whose score I already know; if a worker's rating is off by more than 1, the assignment is rejected and also republished (see the second sketch after this list)
- adding a requirement: the whole audio file must be played before rating is possible
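For the technically inclined: here is a minimal sketch of how these worker requirements could be expressed with boto3 against the MTurk API (sandbox endpoint). The locale, approval-rate, and approved-HITs IDs are MTurk's documented system qualifications; SCREENING_QUAL_ID is a placeholder for my custom qualification test, not a real ID.

```python
import boto3

# Sandbox endpoint so nothing goes live; drop endpoint_url for production.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# Placeholder ID of my custom screening/qualification test.
SCREENING_QUAL_ID = "3EXAMPLECUSTOMQUALID"

qualification_requirements = [
    {   # Located in the USA (system qualification Worker_Locale).
        "QualificationTypeId": "00000000000000000071",
        "Comparator": "EqualTo",
        "LocaleValues": [{"Country": "US"}],
        "ActionsGuarded": "DiscoverPreviewAndAccept",
    },
    {   # Acceptance (approval) rate strictly above 98%.
        "QualificationTypeId": "000000000000000000L0",  # PercentAssignmentsApproved
        "Comparator": "GreaterThan",
        "IntegerValues": [98],
    },
    {   # At least 1000 accepted HITs.
        "QualificationTypeId": "00000000000000000040",  # NumberHITsApproved
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [1000],
    },
    {   # Must have passed the custom qualification test.
        "QualificationTypeId": SCREENING_QUAL_ID,
        "Comparator": "Exists",
        "ActionsGuarded": "DiscoverPreviewAndAccept",
    },
]
```

The custom qualification itself would be created once with create_qualification_type, attaching the test as QuestionForm XML plus an answer key so MTurk grades it automatically.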
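And a rough sketch of the automatic trap/gold check, assuming an HTMLQuestion layout where all form fields come back as FreeText; the question identifiers (trap_naturalness, trap_intelligibility) and the gold_scores mapping are placeholders from my own HIT design, not anything MTurk prescribes:

```python
import xml.etree.ElementTree as ET

TRAP_EXPECTED = 3   # Score the trapping audio asks for on both scales
GOLD_TOLERANCE = 1  # Maximum allowed deviation on gold questions

NS = {"a": "http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas"
           "/2005-10-01/QuestionFormAnswers.xsd"}

def parse_answers(answer_xml):
    """Parse an assignment's QuestionFormAnswers XML into a dict."""
    root = ET.fromstring(answer_xml)
    return {
        a.find("a:QuestionIdentifier", NS).text: a.find("a:FreeText", NS).text
        for a in root.findall("a:Answer", NS)
    }

def passes_checks(answers, gold_scores):
    """gold_scores: placeholder mapping question_id -> known-good score."""
    # Trapping question: both scales must be exactly the requested score.
    if int(answers["trap_naturalness"]) != TRAP_EXPECTED:
        return False
    if int(answers["trap_intelligibility"]) != TRAP_EXPECTED:
        return False
    # Gold questions: rating may deviate by at most GOLD_TOLERANCE.
    return all(
        abs(int(answers[qid]) - expected) <= GOLD_TOLERANCE
        for qid, expected in gold_scores.items()
    )

def review_hit(mturk, hit_id, gold_scores):
    """Approve valid assignments; reject failures and reopen the slot.
    Ignores pagination for brevity."""
    resp = mturk.list_assignments_for_hit(
        HITId=hit_id, AssignmentStatuses=["Submitted"]
    )
    for asg in resp["Assignments"]:
        answers = parse_answers(asg["Answer"])
        if passes_checks(answers, gold_scores):
            mturk.approve_assignment(AssignmentId=asg["AssignmentId"])
        else:
            mturk.reject_assignment(
                AssignmentId=asg["AssignmentId"],
                RequesterFeedback="Failed the attention/gold-standard checks.",
            )
            # Republish the slot for another worker.
            mturk.create_additional_assignments_for_hit(
                HITId=hit_id, NumberOfAdditionalAssignments=1
            )
```

Rejecting fully automatically is risky, so in practice I'd probably review borderline cases by hand before calling reject_assignment.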
I also changed the payments in the following way (a sketch of the bonus payout follows the list):
- base reward: $0.40 (approx. $12/hour)
- bonus of $0.10/HIT (for a total of $0.50/HIT, approx. $15/hour) if you submit 15 or more HITs or
- bonus of $0.20/HIT (for a total of $0.60/HIT, approx. $18/hour) if you submit 15 or more HITs and your results are among the top 20%
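A rough sketch of how I'd pay these bonuses afterwards with send_bonus; worker_assignments and top20_workers are placeholders for data I'd compute offline from the approved assignments:

```python
def pay_bonuses(mturk, worker_assignments, top20_workers):
    """
    worker_assignments: placeholder dict worker_id -> list of approved
        assignment IDs; top20_workers: set of worker IDs whose ratings
        rank in the top 20% (computed offline).
    """
    for worker_id, assignment_ids in worker_assignments.items():
        if len(assignment_ids) < 15:
            continue  # No bonus below 15 submitted HITs
        per_hit = "0.20" if worker_id in top20_workers else "0.10"
        for assignment_id in assignment_ids:
            # send_bonus expects the amount as a string in USD.
            mturk.send_bonus(
                WorkerId=worker_id,
                AssignmentId=assignment_id,
                BonusAmount=per_hit,
                Reason="Thank you! Bonus for completing 15+ HITs.",
            )
```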
I lowered the number of HITs required for the first bonus from 20 to 15 because I read that some workers only do a small number of HITs for new requesters, to see whether they mass-reject. Also, workers may feel less pressure to rush through their HITs, since they need fewer assignments to reach the bonus threshold before none are left.
I increased the estimated duration of a HIT from 90 seconds to two minutes, because I added the trapping question and because the previous estimate felt a bit too short.
The demographic groups are all combinations of female/male and ages 18-29/30-49/50+ (six combinations, which will result in six HIT types). Each HIT will have a maximum of one assignment, i.e., each HIT can only be completed once within a demographic group.
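Here is a sketch of how I plan to publish the six HIT types, assuming the qualification test assigns each worker a group-specific custom qualification (the GROUP_QUALS IDs below are made-up placeholders):

```python
# Placeholder custom qualification IDs, one per demographic group,
# assigned to each worker based on the qualification-test answers.
GROUP_QUALS = {
    ("female", "18-29"): "3GROUPQUALF1829XXXXX",
    ("female", "30-49"): "3GROUPQUALF3049XXXXX",
    ("female", "50+"):   "3GROUPQUALF50PXXXXXX",
    ("male", "18-29"):   "3GROUPQUALM1829XXXXX",
    ("male", "30-49"):   "3GROUPQUALM3049XXXXX",
    ("male", "50+"):     "3GROUPQUALM50PXXXXXX",
}

def publish_hits(mturk, question_xml, base_requirements):
    """Create one copy of the HIT per demographic group,
    each accepting a single assignment."""
    for (gender, age), qual_id in GROUP_QUALS.items():
        requirements = base_requirements + [{
            "QualificationTypeId": qual_id,  # restrict to this group
            "Comparator": "Exists",
            "ActionsGuarded": "DiscoverPreviewAndAccept",
        }]
        mturk.create_hit(
            Title=f"Rate speech samples ({gender}, {age})",
            Description="Listen to 8 short audio files and rate "
                        "naturalness and intelligibility.",
            Keywords="audio, listening, speech, survey",
            Reward="0.40",
            MaxAssignments=1,  # one assignment per demographic group
            AssignmentDurationInSeconds=15 * 60,
            LifetimeInSeconds=7 * 24 * 3600,
            Question=question_xml,
            QualificationRequirements=requirements,
        )
```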
I published the study to the worker sandbox with three example HITs for each HIT type:
https://workersandbox.mturk.com/requesters/AO6KAGU2WT9Y1/projects
My questions to you
- Do you think these measures will prevent bots from participating? If not, what should I change?
- Do you think these measures will make participation too demanding for workers? If so, what should I change?
- Are there any further improvements you would make to the study?
Thank you for your time!