- Scale AI's Project Xylophone asks contractors to improvise conversations based on hundreds of topics.
- The data aims to help xAI's models sound less robotic and more like a real person.
- Contractors are being paid a few dollars per task to do this work, they told Business Insider.
What would you take from your house if there were a zombie apocalypse? What type of person would you like to live on Mars with?
These are some of the questions being used to train AI voice models for Elon Musk’s xAI, alongside everyday topics like DIY plumbing and trip planning, documents obtained by Business Insider show.
Freelancers for data-labeling company Scale AI are being paid to record conversations with other contractors about things like colonizing Mars — a goal of Musk’s — and superheroes, in a bid to make xAI’s voice models sound less like a robot and more like a real person.
As of April, Scale AI was running at least 10 generative AI projects for xAI, according to an internal dashboard seen by BI. The dashboard lists over 100 AI training projects for xAI and other clients, including Apple, Google DeepMind, and Meta.
Scale AI’s work comes as companies across the industry are pushing to make their bots more conversational and human-like to help compete for users who might pay for their premium versions.
Scale AI and xAI did not respond to requests for comment from Business Insider.
Inside ‘Project Xylophone’
Business Insider obtained four Scale AI documents — two sets of project instructions, a set of instructions for reviewers who check submissions, and a conversation topic guide — that outline how “Project Xylophone” works for xAI.
The documents do not state which xAI model is being trained. In late February, Musk announced the beta rollout of a voice mode for Grok, the company’s only publicly known AI model.
The Scale AI project dashboard shows contractors working on Project Xylophone are asked to record short conversations, focusing on “audio quality and natural fluency.” They are especially encouraged to join if they have experience with voice acting. The dashboard says the project is aiming for “engaging scripts, great voice acting, and high quality audio.” Scale’s dashboard is not accessible to contractors, who may not know who the client is.
For Project Xylophone, gig workers around the world can pick from hundreds of conversation topics spanning ethics, philosophy, business, and travel, and record answers in a variety of languages for a few dollars per task. The project splits the work between an invite-only project called “Conversations,” which gig workers do in three-person teams, and “Grasslands,” which they do solo.
“Conversations” teams are asked to set up realistic conversations with each other over Zoom. Contributors take turns asking questions from a prompt spreadsheet, which was still active earlier this week. The sheet includes more than 700 conversation starters on a wide variety of topics, including postapocalyptic survival tactics, planning trips to India, and managing anxiety and panic attacks.
“If you were designing the 'culture' for the first Mars settlement, what Earth tradition would you definitely want to recreate, and what would you be excited to leave behind forever?” reads one prompt.
BI found that about 10% of the conversation prompts in the document it reviewed are science fiction-related.
Other questions are about the US political and judicial systems, but the set does not include hot-button political issues.
In the “Conversations” arm, instructions for “good” conversations are explicit: “The recording must sound extremely natural, as if you were having a casual conversation with a friend. This includes being emotional, having varied intonations, and interrupting each other! Please avoid sounding like an interview.”
In the “Grasslands” arm, solo workers are asked to create unscripted, natural-sounding recordings in their native language. Each worker is given a conversation type and subcategory, and is told to let the conversation flow, in any setting they like, with background noise encouraged.
There are dozens of subcategories, such as “Socratic questioning,” “reflective storytelling,” “courtly love scenarios,” “hero-villain confrontations,” and “collaborative puzzle-solving,” some of which require specific accents, sound effects, or invented linguistic patterns.
Fast and accurate
Three Scale AI contractors, who asked not to be named because they signed nondisclosure agreements, said that projects are assigned to contractors based on their skill sets.
Two of the contractors said that payment for the Grasslands project, which was assigned based on contractors’ location and language expertise, started at $3 per task and was cut to $1 roughly a month later. Contractors have five minutes to complete each task, and each task is one recording.
Once contractors have recorded an audio file, they upload it to a Scale AI contributor platform and transcribe it manually, with the Grasslands document asking for filler words such as “uh” to be left in. “If someone has a slight pause, we should include a comma, even if grammatically that comma is incorrect,” one of the contractors told BI.
Large language models require vast amounts of quality data to improve. Recreating real-world scenarios, such as natural-sounding conversations between people, is one way to generate suitable data to feed into those models.
Training Grok
Project Xylophone is an example of a larger push by AI companies to inject personality into their AIs and stand out in an increasingly crowded space.
BI reported last month that Meta ran a project via Scale AI in which gig workers trained its AI to adopt different personas, such as “a wise and mystical wizard” or a “hyper-excited music theory student.”
OpenAI’s Sam Altman said in late April that the latest GPT-4o had become “too sycophant-y and annoying,” prompting the company to roll back the update to make its replies sound more natural.
xAI has marketed Grok as a politically edgier chatbot than rivals Musk has called “woke,” with training methods that sometimes lean heavily on right-wing or contrarian views, BI previously reported. Alongside xAI’s outsourced work, the company has hundreds of in-house “AI tutors” and plans to hire thousands more, BI reported in February, showing the huge human effort involved in training AI.
xAI has also ramped up its efforts to control Grok’s unpredictable side. New hires are “red teaming” Grok, stress-testing it for unsafe or policy-violating replies, especially on controversial topics and in “NSFW” or “unhinged” modes, BI reported in April.
The safety push follows high-profile incidents, including a feature in March that allowed users to prompt Grok to use racial slurs, and, most recently, unprompted responses about “white genocide” in South Africa. xAI blamed the latter issue on an unauthorized prompt modification. The company promised stricter code review and around-the-clock monitoring.