Is my study bot-safe?

Stefan Taubert · Apr 21, 2023

Hello everyone,

I'm writing my doctoral thesis on the topic of speech synthesis and want to evaluate how synthesized speech samples are perceived in terms of their naturalness and intelligibility. Therefore, I designed a study in which the workers have to listen to eight audio files and rate both aspects on a scale from 1 to 5.

I'm new to mTurk, and my first attempt at the study unfortunately didn't go well: almost all submissions came from bots, so I had to reject ~95% of the assignments.

The failed study
I set the acceptance rate to >=85% and the country to USA because only native American English speakers should participate (for linguistic reasons). I didn't place any other restrictions on participation. One HIT took around 90s to complete and I provided 60 HITs with a maximum of 9 assignments for each HIT. There were three bonuses depending on the number of HITs completed and the quality of ratings compared to all workers:

base reward: $0.10 (~$4/hour)

bonus of $0.10/HIT (for a total of $0.20/HIT, ~$8/hour) if you submit 20 or more HITs or

bonus of $0.30/HIT (for a total of $0.40/HIT, ~$16/hour) if you submit 20 or more HITs and your results are among the top 50% or

bonus of $0.50/HIT (for a total of $0.60/HIT, ~$24/hour) if you submit 20 or more HITs and your results are among the top 10%.
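The hourly figures in parentheses follow from the pay per HIT and the assumed 90 seconds per HIT. A minimal sketch of that arithmetic (the function name is only illustrative):

```python
def hourly_rate(pay_per_hit: float, seconds_per_hit: float) -> float:
    """Effective hourly pay if every HIT takes `seconds_per_hit` seconds."""
    return pay_per_hit * 3600 / seconds_per_hit


# Pay per HIT of the first run (base reward plus bonus), at about 90 s per HIT.
for total in (0.10, 0.20, 0.40, 0.60):
    print(f"${total:.2f}/HIT is about ${hourly_rate(total, 90):.0f}/hour")
```

The same function with 120 seconds per HIT gives the hourly figures of the revised pay scheme.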

After I started the survey, new assignments came in almost immediately that were clearly non-human, as one has to read the instructions for at least 5 minutes to know what to do. All 9 assignments per HIT (540 assignments in total) were done after about an hour!

I filtered out and rejected all bot-like submissions (these were almost all assignments). For the rest, I couldn't find a justifiable reason to reject them, but I suspect they were bot responses as well.

The new study
In order to improve my study, I made some adjustments:

raising the acceptance rate to >98%

adding requirement: >=1000 accepted HITs

adding a qualification test: this test determines the demographic group and if someone is eligible for the study

adding a hearing ability test (~10min): I added three types of listening tests: solving a math exercise, selecting the numbers that were said in the audio and determining the file having a better sound quality out of two choices

adding an English proficiency test (~5min): the test contains questions where the worker has to complete a sentence by selecting one out of four words

adding trapping questions: one of the audio files in each HIT is a CAPTCHA which contains the audio 'This is an interruption. Please select the score 3 on both scales to confirm your attention now.' If the worker doesn't select the specified score, the assignment is automatically rejected and republished for another worker.

adding gold questions: questions where I know the answer to a speech sample, and if the score is off by more than 1, the assignment will be rejected and also republished.

adding requirement: listening to the whole audio file is necessary before rating is possible
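The trap and gold checks from this list could be sketched roughly as follows. This is only an illustration under assumptions: the class and function names are made up, and it assumes the gold tolerance of 1 applies to both scales.

```python
from dataclasses import dataclass


@dataclass
class Rating:
    naturalness: int      # score from 1 to 5
    intelligibility: int  # score from 1 to 5


TRAP_SCORE = 3        # score requested by the trap audio
GOLD_TOLERANCE = 1    # a gold rating may differ by at most this much


def passes_trap(given: Rating) -> bool:
    """The trap audio asks for score 3 on both scales."""
    return (given.naturalness == TRAP_SCORE
            and given.intelligibility == TRAP_SCORE)


def passes_gold(given: Rating, expected: Rating) -> bool:
    """Assumption: both scales must be within the tolerance."""
    return (abs(given.naturalness - expected.naturalness) <= GOLD_TOLERANCE
            and abs(given.intelligibility - expected.intelligibility) <= GOLD_TOLERANCE)


def should_reject(trap: Rating, gold: Rating, gold_expected: Rating) -> bool:
    """Reject (and republish) if either check fails."""
    return not passes_trap(trap) or not passes_gold(gold, gold_expected)
```

For example, `should_reject(Rating(3, 3), Rating(4, 5), Rating(5, 5))` is `False`, because the trap is answered correctly and the gold rating is off by only 1.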

I also changed the payments in the following way:

base reward: $0.40 (approx. $12/hour)

bonus of $0.10/HIT (for a total of $0.50/HIT, approx. $15/hour) if you submit 15 or more HITs or

bonus of $0.20/HIT (for a total of $0.60/HIT, approx. $18/hour) if you submit 15 or more HITs and your results are among the top 20%
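To make the tiers concrete, here is a small sketch of how the payout per worker could be computed. The names are illustrative, amounts are in cents to avoid rounding issues, and how "top 20%" is determined is left to the caller.

```python
BASE_REWARD_CENTS = 40


def bonus_cents_per_hit(hits_submitted: int, in_top_20_percent: bool,
                        min_hits: int = 15) -> int:
    """Bonus per HIT: none below the threshold, else 10 or 20 cents."""
    if hits_submitted < min_hits:
        return 0
    return 20 if in_top_20_percent else 10


def total_payout_cents(hits_submitted: int, in_top_20_percent: bool) -> int:
    """Base reward plus bonus, summed over all submitted HITs."""
    per_hit = BASE_REWARD_CENTS + bonus_cents_per_hit(hits_submitted,
                                                      in_top_20_percent)
    return hits_submitted * per_hit


print(total_payout_cents(15, False))  # 15 HITs at 50 cents each
```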

I lowered the threshold for the first bonus from 20 to 15 HITs because I read that some workers only do a small number of HITs for new requesters to see whether they mass-reject. With a lower threshold, workers may also feel less pressure to rush through their HITs in order to reach the required number before none are left.

I increased the estimated duration of an HIT from 90 seconds to two minutes because I added the trapping question and because I felt it was perhaps a bit too short before.

The demographic groups are: all combinations of female/male and ages 18-29/30-49/50+ (six combinations which will result in six HIT types). Each HIT type will have one max assignment, i.e., each HIT can only be assigned once within a demographic group.
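As a quick sketch, the six groups and the mapping from a worker's age to an age bracket could look like this (the labels and function name are only illustrative):

```python
from itertools import product

GENDERS = ("female", "male")
AGE_GROUPS = ("18-29", "30-49", "50+")

# One HIT type per (gender, age group) combination.
GROUPS = list(product(GENDERS, AGE_GROUPS))


def age_group(age: int) -> str:
    """Map an age in years to one of the three brackets."""
    if age < 18:
        raise ValueError("workers must be at least 18")
    if age <= 29:
        return "18-29"
    if age <= 49:
        return "30-49"
    return "50+"


print(len(GROUPS))  # number of HIT types
```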

I published the study at the worker sandbox with three example HITs for each HIT type: https://workersandbox.mturk.com/requesters/AO6KAGU2WT9Y1/projects

My questions to you

Do you think these measures will prevent bots from participating? If not, what should I change?

Do you think these measures will make it too hard for workers to participate? If yes, what should I change?

Are there any further improvements you would make to the study?

Thank you for your time!

Kadauchi · Apr 21, 2023


1. The tests along with bringing it to 98% and 1000 should get rid of most of the bad submissions.
2. Captcha is probably going to be annoying if you have to do one every HIT, gold/trap questions should be enough. So I'd probably drop the captcha and keep the rest.
3. Other than the captcha, seems like you'll be good.

Stefan Taubert · Apr 25, 2023

Kadauchi said: ↑

1. The tests along with bringing it to 98% and 1000 should get rid of most of the bad submissions.
2. Captcha is probably going to be annoying if you have to do one every HIT, gold/trap questions should be enough. So I'd probably drop the captcha and keep the rest.
3. Other than the captcha, seems like you'll be good.

Thank you for your response!

I got the recommendation to increase the number of accepted HITs to 10,000 to eliminate Farmers. Would you agree? I have the feeling that it might be too restrictive. I have no idea how many accepted HITs are common these days.
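For reference, the approval-rate, approved-HITs and country restrictions discussed here could be expressed as MTurk qualification requirements roughly like this. It is a sketch under assumptions: the three IDs are the system qualification type IDs as I understand them from the MTurk documentation and should be checked against the current docs, and the function name is made up.

```python
# System qualification type IDs (assumption: verify in the MTurk docs).
PERCENT_APPROVED = "000000000000000000L0"
NUMBER_HITS_APPROVED = "00000000000000000040"
LOCALE = "00000000000000000071"


def build_requirements(min_approved_hits: int = 1000,
                       approval_rate_above: int = 98) -> list[dict]:
    """Build a requirements list: approval rate, approved HITs, US locale."""
    return [
        {"QualificationTypeId": PERCENT_APPROVED,
         "Comparator": "GreaterThan",
         "IntegerValues": [approval_rate_above],
         "ActionsGuarded": "Accept"},
        {"QualificationTypeId": NUMBER_HITS_APPROVED,
         "Comparator": "GreaterThanOrEqualTo",
         "IntegerValues": [min_approved_hits],
         "ActionsGuarded": "Accept"},
        {"QualificationTypeId": LOCALE,
         "Comparator": "EqualTo",
         "LocaleValues": [{"Country": "US"}],
         "ActionsGuarded": "Accept"},
    ]


# The stricter variant asked about above: at least 10,000 approved HITs.
strict = build_requirements(min_approved_hits=10_000)
```

Such a list would typically be passed as `QualificationRequirements` when creating a HIT or HIT type, for example through the boto3 MTurk client.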

malysa · Apr 25, 2023

You should have no problem getting enough qualified participants. Your pay scale sounds very reasonable for batch type work.

WillowWolf · Apr 25, 2023

Stefan Taubert said: ↑

increase the number of accepted HITs to 10,000

This would be a good idea if you want more experienced workers.

Stefan Taubert said: ↑

I have the feeling that it might be too restrictive?

That's actually not very restrictive.

Stefan Taubert said: ↑

I have no idea how many accepted HITs are common these days.

You'll find that most workers with even a moderate amount of experience will greatly exceed 10k approved HITs. I have over 1 million approved, just to give you an idea. You will not have a hard time finding workers with more than 10k approved.

Stefan Taubert said: ↑

I increased the estimated duration of an HIT from 90 seconds to two minutes because I added the trapping question and because I felt it was perhaps a bit too short before.

I would strongly advise increasing the time on your HITs. You might think two minutes is adequate, and it likely is, but Mturk returns a "website unavailable" error to workers several times per day, and each occurrence can last several minutes. If this happens, you're going to have a lot of HITs in queues expiring on workers before they have the chance to complete them. I would suggest considering a minimum of a 10-minute timer to eliminate frustration on both ends. This also prevents workers from feeling the need to rush to beat the timer when the site becomes available again, which will contribute to better-quality answers.

Stefan Taubert said: ↑

each HIT is a CAPTCHA

As @Kadauchi said, a CAPTCHA is going to be really annoying, and I will add that it will likely discourage experienced workers from doing these HITs. A gold/trap question is sufficient.

Best of luck with your project!

Stefan Taubert · Apr 26, 2023

WillowWolf said: ↑

That's actually not very restrictive.

WillowWolf said: ↑

You'll find that most workers with even a moderate amount of experience will greatly exceed 10k approved HITs. I have over 1 million approved, just to give you an idea. You will not have a hard time finding workers with more than 10k approved.

That's great

WillowWolf said: ↑

I would strongly advise increasing the time on your HITs. You might think two minutes is adequate, and it likely is, but Mturk returns a "website unavailable" error to workers several times per day, and each occurrence can last several minutes. If this happens, you're going to have a lot of HITs in queues expiring on workers before they have the chance to complete them. I would suggest considering a minimum of a 10-minute timer to eliminate frustration on both ends. This also prevents workers from feeling the need to rush to beat the timer when the site becomes available again, which will contribute to better-quality answers.

WillowWolf said: ↑

As @Kadauchi said, a CAPTCHA is going to be really annoying, and I will add that it will likely discourage experienced workers from doing these HITs. A gold/trap question is sufficient.

Yes, I only used the two minutes for calculating the expected hourly wage; the time allotted for one HIT is set to 10 minutes. I will increase it to 15 minutes because I've added two more "normal" audio files per HIT in order to decrease the total number of CAPTCHAs to solve. It may still be annoying, but after the disastrous first run, I need a strong measure to filter out random answers.

WillowWolf said: ↑

Best of luck with your project!

Thank you!

Stefan Taubert · May 11, 2023

A quick update: I've now split the study into two HIT groups. The first group contains all qualification tests (~15 min of effort; they must be passed in order to submit one empty HIT with a $3 reward), and the second group is the main study (48 HITs). This ensures that people get paid for the effort of taking the qualification tests. From the workers who successfully submitted the qualification HIT, I will randomly choose some and grant them a qualification that lets them participate in the main study. This ensures that not too many workers are working on the main study at the same time, so that everyone can achieve the bonus for completing 12 HITs.

My plan is to collect 9 workers using the first HIT group from which I will choose 3 to work on all 48 available HITs. If some workers do not do enough HITs, I will select more workers from the remaining 6 workers until all HITs are done.
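The selection in rounds could be sketched like this. The worker IDs, the function name and the fixed seed are hypothetical, and the even split at the end ignores the demographic groups.

```python
import random


def invite_next(qualified, invited, n, rng):
    """Pick up to n workers at random from those not invited yet."""
    remaining = [w for w in qualified if w not in invited]
    chosen = rng.sample(remaining, min(n, len(remaining)))
    invited.update(chosen)
    return chosen


rng = random.Random(2023)  # fixed seed so the draw can be reproduced
qualified = [f"WORKER_{i}" for i in range(1, 10)]  # 9 qualified workers
invited = set()

first_round = invite_next(qualified, invited, 3, rng)
# Later, if HITs are left over, draw more from the remaining 6 workers.
top_up = invite_next(qualified, invited, 2, rng)

# With 48 HITs and 3 active workers, an even split is 16 HITs each,
# which is above the 12 HITs needed for the bonus.
hits_per_worker = 48 // 3
```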

My concerns with that approach are:

1. Maybe some workers just want to do the qualification HIT but aren't interested in participating in the main study. This could result in too few people doing the main study. But the main study is not really hard to do, so why should one prefer not to work on the task?
2. Some workers might not try to qualify because they are not guaranteed to be chosen for the main study. But since they get rewarded for the qualification HIT, this might not be an issue?
3. Maybe some workers have a problem with the time gap between qualifying and being able to participate in the main study. Since I have to collect enough workers, I think the gap between qualifying and being able to do the main study will be at most 7 days. I will notify the workers when they have been qualified so that they don't miss the chance to participate.

I would be very happy if someone could give me feedback again about my updated approach!

WillowWolf · May 11, 2023

Stefan Taubert said: ↑

From the workers who successfully submitted the qualification HIT, I will randomly choose some and grant them a qualification that lets them participate in the main study. This ensures that not too many workers are working on the main study at the same time, so that everyone can achieve the bonus for completing 12 HITs.

This sounds like a good multistep process to ensure you get the right people on your project.

Stefan Taubert said: ↑

Maybe some workers just want to do the qualification HIT but aren't interested in participating in the main study. This could result in too few people doing the main study. But the main study is not really hard to do, so why should one prefer not to work on the task?

It's very unlikely that this would happen. The biggest reason for workers bailing on closed qualification work is a decrease in pay. For example, a qualification pays generously, but then the main task is underpaid or the requester drops the pay later in the project, so it is no longer worth their while. As long as your hourly rate is consistent and workers know what to expect, you should not see a problem with qualified people choosing not to participate in the main study.

Stefan Taubert said: ↑

Some workers might not try to qualify because they are not guaranteed to be chosen for the main study. But since they get rewarded for the qualification HIT, this might not be an issue?

This is perfectly okay and workers should approach every qualification knowing that they might or might not be selected. Mturk is designed for requesters to find workers to fulfill their exact needs and I would assume that the vast majority of experienced workers know that just because they complete a qualification, does not mean they'll be chosen for the main task. I think we're all used to this process by now. As long as their honest work on the qualification is paid for, this should not be an issue.

Stefan Taubert said: ↑

Maybe some workers have a problem with the time gap between qualifying and being able to participate in the main study. Since I have to collect enough workers, I think the gap between qualifying and being able to do the main study will be at most 7 days. I will notify the workers when they have been qualified so that they don't miss the chance to participate.

This is also okay. Many projects are set up with multiple parts with time gaps between them. I would just make sure that everyone knows that the qualification they are filling out won't lead to the task immediately, so they know upfront. It's definitely a good idea to send folks a message to notify them when the main study will launch. So, it's great that you're already factoring that into your thought process.

I don't see any potential issues with your approach. I think the most important thing is transparency. It's helpful for workers to know what to expect, especially during a process that could be delayed for a week or so between tasks.

Stefan Taubert · May 12, 2023

Thank you very much for your fast and detailed feedback! I'm grateful for the insights.

WillowWolf said: ↑

I don't see any potential issues with your approach. I think the most important thing is transparency. It's helpful for workers to know what to expect, especially during a process that could be delayed for a week or so between tasks.

I'm happy that there is no problem with the approach. For the purpose of transparency, I plan to describe in the Qualification HIT the exact process, including the number of HITs and assignments of each HIT group. There, it will also be described that I will initially only allow a subset of the qualified workers to take part in the main study.

WillowWolf · May 12, 2023

Stefan Taubert said: ↑

For the purpose of transparency, I plan to describe in the Qualification HIT the exact process, including the number of HITs and assignments of each HIT group. There, it will also be described that I will initially only allow a subset of the qualified workers to take part in the main study.

It sounds like you're doing things right and it's really nice to see this much thought put into setting up a project.

Stefan Taubert said: ↑

Thank you very much for your fast and detailed feedback!

You're welcome! I truly hope you have a better experience this time around with your study and best of luck writing your thesis!
