Is my study bot-safe?

  1. Stefan Taubert

    Hello everyone,

    I'm writing my doctoral thesis on the topic of speech synthesis and want to evaluate how synthesized speech samples are perceived in terms of their naturalness and intelligibility. Therefore, I designed a study in which the workers have to listen to eight audio files and rate both aspects on a scale from 1 to 5.

    I'm new to mTurk, and my first attempt at the study unfortunately went badly: almost all of the submitted HITs came from bots, so I had to reject ~95% of the assignments.

    The failed study
    I set the acceptance rate to >=85% and the country to USA because only native American English speakers should participate (for linguistic reasons). I didn't place any other restrictions on participation. One HIT took around 90s to complete and I provided 60 HITs with a maximum of 9 assignments for each HIT. There were three bonuses depending on the number of HITs completed and the quality of ratings compared to all workers:

    • base reward: $0.10 (~$4/hour)
    • bonus of $0.10/HIT (for a total of $0.20/HIT, ~$8/hour) if you submit 20 or more HITs or
    • bonus of $0.30/HIT (for a total of $0.40/HIT, ~$16/hour) if you submit 20 or more HITs and your results are among the top 50% or
    • bonus of $0.50/HIT (for a total of $0.60/HIT, ~$24/hour) if you submit 20 or more HITs and your results are among the top 10%.
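
    The hourly figures in the list above follow from the reward per HIT and the ~90 s it took to complete one. A minimal sketch of that arithmetic (the function and constant names are only illustrative):

```python
def hourly_rate(pay_per_hit: float, seconds_per_hit: float) -> float:
    """Return the effective hourly wage in dollars for a given per-HIT pay."""
    hits_per_hour = 3600 / seconds_per_hit
    return pay_per_hit * hits_per_hour


BASE_REWARD = 0.10                  # dollars per HIT
BONUSES = [0.00, 0.10, 0.30, 0.50]  # no bonus, then the three bonus tiers
SECONDS_PER_HIT = 90                # estimated duration of one HIT

# Round to cents so floating-point noise does not show up in the result.
rates = [round(hourly_rate(BASE_REWARD + b, SECONDS_PER_HIT), 2) for b in BONUSES]
print(rates)  # [4.0, 8.0, 16.0, 24.0]
```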
    After I started the survey, assignments came in almost immediately that were clearly non-human, since one has to read the instructions for at least 5 minutes to know what to do. All 9 assignments per HIT (540 assignments) were done after about an hour!

    I filtered out and rejected all bot-like submissions (these were almost all assignments). For the rest, I couldn't find a justifiable reason to reject them, but I suspect they were bot responses as well.

    The new study
    In order to improve my study, I made some adjustments:
    1. raise acceptance rate to >98%
    2. adding requirement: >=1000 accepted HITs
    3. adding a qualification test: this test determines the demographic group and if someone is eligible for the study
    4. adding a hearing ability test (~10min): I added three types of listening tests: solving a math exercise, selecting the numbers that were said in the audio and determining the file having a better sound quality out of two choices
    5. adding an English proficiency test (~5min): the test contains questions where the worker has to complete a sentence by selecting one out of four words
    6. adding trapping questions: one of the audio files in each HIT is a CAPTCHA which contains the audio 'This is an interruption. Please select the score 3 on both scales to confirm your attention now.' If the worker doesn't select the specified score, the assignment is automatically rejected and republished for another worker.
    7. adding gold questions: questions where I know the answer to a speech sample, and if the score is off by more than 1, the assignment will be rejected and also republished.
    8. adding requirement: listening to the whole audio file is necessary before rating is possible
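
    The trap check (point 6) and the gold check (point 7) could be sketched roughly as follows. This is only an illustration: the `Rating` class and the function names are made up for this sketch, and it assumes the gold sample has a known score on both scales.

```python
from dataclasses import dataclass


@dataclass
class Rating:
    naturalness: int       # 1 to 5
    intelligibility: int   # 1 to 5


TRAP_SCORE = 3       # score the trap audio asks the worker to select
GOLD_TOLERANCE = 1   # a gold rating may deviate by at most this much


def passes_trap(rating: Rating) -> bool:
    """The trap audio asks for the score 3 on both scales."""
    return (rating.naturalness == TRAP_SCORE
            and rating.intelligibility == TRAP_SCORE)


def passes_gold(rating: Rating, expected: Rating) -> bool:
    """A gold rating fails if it is off by more than 1 on either scale."""
    return (abs(rating.naturalness - expected.naturalness) <= GOLD_TOLERANCE
            and abs(rating.intelligibility - expected.intelligibility) <= GOLD_TOLERANCE)


def should_reject(trap: Rating, gold: Rating, gold_expected: Rating) -> bool:
    """Reject (and republish) the assignment if either check fails."""
    return not passes_trap(trap) or not passes_gold(gold, gold_expected)
```

    For example, `should_reject(Rating(3, 3), Rating(4, 5), Rating(5, 5))` is `False`, because the trap was answered correctly and the gold rating is within one point on both scales.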
    I also changed the payments in the following way:
    • base reward: $0.40 (approx. $12/hour)
    • bonus of $0.10/HIT (for a total of $0.50/HIT, approx. $15/hour) if you submit 15 or more HITs or
    • bonus of $0.20/HIT (for a total of $0.60/HIT, approx. $18/hour) if you submit 15 or more HITs and your results are among the top 20%
    I lowered the number of HITs required for the first bonus from 20 to 15 because I read that some workers only do a small number of HITs for new requesters to see whether they will mass-reject. Also, workers may feel less pressure to rush through their HITs in order to reach the required number before none are left.
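
    To make the new payment rules concrete, here is a small sketch of how the payout per worker could be computed. It works in cents to avoid rounding issues. How "top 20%" is determined is an assumption here: workers are ranked by quality (rank 1 = best), and a worker counts as top 20% if the rank is at most 20% of the number of workers.

```python
BASE_REWARD_CENTS = 40     # base reward per HIT
MIN_HITS_FOR_BONUS = 15    # HITs needed before any bonus applies
TOP_SHARE = 0.20           # share of workers who get the higher bonus


def bonus_cents_per_hit(hits_submitted: int, rank: int, n_workers: int) -> int:
    """Return the bonus per HIT in cents (0, 10 or 20)."""
    if hits_submitted < MIN_HITS_FOR_BONUS:
        return 0
    # Assumption: rank 1 is the best worker; top 20% means rank <= 0.2 * n.
    in_top = rank <= TOP_SHARE * n_workers
    return 20 if in_top else 10


def total_payout_cents(hits_submitted: int, rank: int, n_workers: int) -> int:
    """Base reward plus bonus, summed over all submitted HITs."""
    per_hit = BASE_REWARD_CENTS + bonus_cents_per_hit(hits_submitted, rank, n_workers)
    return hits_submitted * per_hit


print(total_payout_cents(14, 1, 10))  # 560: too few HITs, base reward only
print(total_payout_cents(15, 5, 10))  # 750: first bonus
print(total_payout_cents(15, 2, 10))  # 900: top 20% bonus
```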

    I increased the estimated duration of an HIT from 90 seconds to two minutes because I added the trapping question and because I felt it was perhaps a bit too short before.

    The demographic groups are: all combinations of female/male and ages 18-29/30-49/50+ (six combinations which will result in six HIT types). Each HIT type will have one max assignment, i.e., each HIT can only be assigned once within a demographic group.
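
    The six HIT types are simply the cross product of the two attributes. A short sketch (the naming scheme for the HIT types and the helper `age_group` are only illustrative):

```python
from itertools import product

GENDERS = ("female", "male")
AGE_GROUPS = ("18-29", "30-49", "50+")

# One HIT type per demographic group, e.g. "female_18-29".
hit_types = [f"{gender}_{ages}" for gender, ages in product(GENDERS, AGE_GROUPS)]


def age_group(age: int) -> str:
    """Map a worker's age to one of the three age groups."""
    if age < 18:
        raise ValueError("workers must be at least 18")
    if age <= 29:
        return "18-29"
    if age <= 49:
        return "30-49"
    return "50+"


print(len(hit_types))  # 6
```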

    I published the study at the worker sandbox with three example HITs for each HIT type:

    My questions to you
    1. Do you think these measures will prevent bots from participating? If not, what should I change?
    2. Do you think these measures will make it too hard for workers to participate? If yes, what should I change?
    3. Are there any further improvements you would make to the study?
    Thank you for your time!
  2. Kadauchi

    1. The tests along with bringing it to 98% and 1000 should get rid of most of the bad submissions.
    2. Captcha is probably going to be annoying if you have to do one every HIT, gold/trap questions should be enough. So I'd probably drop the captcha and keep the rest.
    3. Other than the captcha, seems like you'll be good.
  3. Stefan Taubert

    Thank you for your response!

    I got the recommendation to increase the number of accepted HITs to 10,000 to eliminate Farmers. Would you agree? I have the feeling that it might be too restrictive. I have no idea how many accepted HITs are common these days.
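
    For reference, this is roughly how the three requirements could be written in the `QualificationRequirements` format of the MTurk API, as far as I understand it. The sketch only builds the data structure and makes no API call. The system qualification type IDs are the ones I believe the documentation lists for approval rate, approved HITs and locale, so please verify them before use. The function name is made up.

```python
def build_requirements(min_approved_hits: int) -> list:
    """Build the worker requirements: approval rate >98%, N approved HITs, US only."""
    return [
        {   # approval rate strictly greater than 98%
            "QualificationTypeId": "000000000000000000L0",
            "Comparator": "GreaterThan",
            "IntegerValues": [98],
        },
        {   # at least `min_approved_hits` approved HITs
            "QualificationTypeId": "00000000000000000040",
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [min_approved_hits],
        },
        {   # worker located in the USA
            "QualificationTypeId": "00000000000000000071",
            "Comparator": "EqualTo",
            "LocaleValues": [{"Country": "US"}],
        },
    ]


requirements = build_requirements(10_000)
```

    Switching between 1,000 and 10,000 accepted HITs would then only mean changing the argument.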
  4. malysa

    You should have no problem getting enough qualified participants. Your pay scale sounds very reasonable for batch type work.
  5. WillowWolf

    This would be a good idea if you want more experienced workers.

    That's actually not very restrictive.

    You'll find that most workers with even a moderate amount of experience will greatly exceed 10k approved HITs. I have over 1 million approved, just to give you an idea. You will not have a hard time finding workers with more than 10k approved.

    I would strongly advise increasing the time on your HITs. You might think two minutes is adequate, and it likely is, but Mturk returns a "website unavailable" error to workers several times per day, and each occurrence can last several minutes. If this happens, you're going to have a lot of HITs in queues expiring on workers before they have the chance to complete them. I would suggest considering a minimum of a 10-minute timer to eliminate frustration on both ends. This also prevents workers from feeling the need to rush to beat the timer when the site becomes available again, which will contribute to better-quality answers.

    As @Kadauchi said, a CAPTCHA is going to be really annoying, and I will add that it will likely discourage experienced workers from doing these HITs. A gold/trap question is sufficient.

    Best of luck with your project!
  6. Stefan Taubert

    That's great 😀

    Yes, I only used the two minutes for calculating the expected hourly wage; the time allotted for one HIT is set to 10 minutes. I will increase it to 15 minutes because I've added two more "normal" audio files per HIT in order to decrease the total number of CAPTCHAs to solve. It may still be annoying, but after the disastrous first run, I need a strong measure to filter out random answers.

    Thank you! 😊
  7. Stefan Taubert

    A quick update: I've now separated the study into two HIT groups. The first HIT group contains all qualification tests (~15min effort; they need to be passed in order to submit one empty HIT with a $3 reward), and the second group is the main study (48 HITs). This ensures that people get paid for their effort in taking the qualification. From those who successfully submitted the qualification HIT, I will randomly choose workers and give them a qualification that allows them to participate in the main study. This ensures that not too many workers are working on the main study at the same time, so that everyone can achieve the bonus for completing 12 HITs.

    My plan is to collect 9 workers using the first HIT group from which I will choose 3 to work on all 48 available HITs. If some workers do not do enough HITs, I will select more workers from the remaining 6 workers until all HITs are done.
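
    The selection procedure could look roughly like this. It is only a sketch: the worker IDs and function names are made up, and the last helper assumes that each of the 48 HITs is done by exactly one of the selected workers.

```python
import random


def pick_initial(qualified: list, k: int, rng: random.Random):
    """Randomly choose k workers; the others stay in reserve."""
    chosen = rng.sample(qualified, k)
    reserve = [w for w in qualified if w not in chosen]
    return chosen, reserve


def top_up(reserve: list, n: int, rng: random.Random):
    """Draw up to n more workers from the reserve if HITs remain open."""
    extra = rng.sample(reserve, min(n, len(reserve)))
    remaining = [w for w in reserve if w not in extra]
    return extra, remaining


def max_workers_reaching_bonus(total_hits: int, hits_for_bonus: int) -> int:
    """How many workers can all reach the bonus threshold at the same time."""
    return total_hits // hits_for_bonus


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
qualified = [f"worker_{i}" for i in range(1, 10)]  # 9 hypothetical worker IDs
chosen, reserve = pick_initial(qualified, 3, rng)
print(len(chosen), len(reserve))           # 3 6
print(max_workers_reaching_bonus(48, 12))  # 4
```

    With 48 HITs and a threshold of 12, at most four workers could all reach the bonus, so starting with three leaves some room.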

    My concerns with that approach are:

    1. Maybe some workers just want to do the qualification HIT but aren't interested in participating in the main study. This could result in too few people doing the main study. But the main study is not really hard to do, so why should one prefer not to work on the task?
    2. Some workers might not attempt the qualification because it is not guaranteed that they will be chosen for the main study. But because they get rewarded for the qualification HIT, this might not be an issue?
    3. Maybe some workers have a problem with the time gap between qualifying and being able to participate in the main study. Since I have to collect enough workers first, I think the gap between qualifying and being able to do the main study is at most 7 days. I will notify the workers when they have been qualified so that they do not miss participating.

    I would be very happy if someone could give me feedback again about my updated approach!
  8. WillowWolf

    This sounds like a good multistep process to ensure you get the right people on your project.

    It's very unlikely that this would happen. The biggest reason for workers bailing on closed qualification work is a decrease in pay. For example, a qualification pays generously, but then the main task is underpaid or the requester drops the pay later in the project, which then becomes not worth their while to work on it. As long as your hourly rate is consistent and workers know what to expect, you should not see a problem with qualified people choosing to not participate in the main study.

    This is perfectly okay and workers should approach every qualification knowing that they might or might not be selected. Mturk is designed for requesters to find workers to fulfill their exact needs, and I would assume that the vast majority of experienced workers know that completing a qualification does not mean they'll be chosen for the main task. I think we're all used to this process by now. As long as their honest work on the qualification is paid for, this should not be an issue.

    This is also okay. Many projects are set up with multiple parts with time gaps between them. I would just make sure that everyone knows that the qualification they are filling out won't lead to the task immediately, so they know upfront. It's definitely a good idea to send folks a message to notify them when the main study will launch. So, it's great that you're already factoring that into your thought process.

    I don't see any potential issues with your approach. I think the most important thing is transparency. It's helpful for workers to know what to expect, especially during a process that could be delayed for a week or so between tasks.
  9. Stefan Taubert

    Thank you very much for your fast and detailed feedback! I'm grateful for the insights.

    I'm happy that there is no problem with the approach. For the purpose of transparency, I plan to describe in the Qualification HIT the exact process, including the number of HITs and assignments of each HIT group. There, it will also be described that I will initially only allow a subset of the qualified workers to take part in the main study.
  10. WillowWolf

    It sounds like you're doing things right and it's really nice to see this much thought put into setting up a project.

    You're welcome! I truly hope you have a better experience this time around with your study and best of luck writing your thesis!
