Prompting Text-to-Speech Models in 2026

Back in the “good old days” of 2024, following best practices for prompting LLMs to produce high-quality text for TTS models meant including fairly basic instructions to make the LLM “aware” that its text would be run through a TTS engine.
The primary focus was to ensure the model generated concise speech and rendered numbers, names, and acronyms appropriately for a TTS engine.
By way of example, that meant including something in the prompt along the lines of:
<output_style>
Be aware that the text you are generating will be run through a TTS engine.
This primarily means you should be concise in your speech, avoid long lists, and avoid markdown in your response.
You should write out numbers and words the way they should be spoken. Pay attention to how you want the TTS engine to render phone numbers, credit card numbers, etc.
For example, instead of generating “1387”, you should choose between “one thousand three hundred and eighty-seven” (for numeric amounts) and “one three eight seven” (for credit card numbers, phone numbers, etc.).
When you refer to our company (ABC_STUDIO), generate “A B C Studio” to ensure that the audio renders appropriately.
</output_style>
Since the models were more restricted in those days, we also used techniques like replacing the text after the LLM generated it. For example, you could automatically convert any number that looked like a phone number (based on the count of consecutive digits) into a list of words ready to be spoken aloud. Or, in the case of a difficult-to-spell company name, you could apply an automatic text replacement (ABC_STUDIO → A B C Studio) to everything the LLM generates. These techniques are still valid, but usually aren’t needed when you are building the first version of a voice agent.
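The post-generation replacement technique described above can be sketched in a few lines. This is a minimal illustration, not a library: the function names, the seven-digit threshold for “looks like a phone number”, and the replacement table are all assumptions you would tune for your own agent.

```python
import re

# Fixed replacements for hard-to-pronounce names (as described above).
NAME_REPLACEMENTS = {"ABC_STUDIO": "A B C Studio"}

DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def spell_out_digit_runs(text: str, min_digits: int = 7) -> str:
    """Replace long digit runs (likely phone/card numbers) with spoken digits."""
    def replace(match: re.Match) -> str:
        return " ".join(DIGIT_WORDS[d] for d in match.group(0))
    return re.sub(rf"\d{{{min_digits},}}", replace, text)

def prepare_for_tts(text: str) -> str:
    """Apply name replacements, then spell out long digit runs."""
    for raw, spoken in NAME_REPLACEMENTS.items():
        text = text.replace(raw, spoken)
    return spell_out_digit_runs(text)

print(prepare_for_tts("Call ABC_STUDIO on 5551234567."))
# "Call A B C Studio on five five five one two three four five six seven."
```

Note the deliberate minimum-length threshold: short numbers like “42” are left alone, since the LLM usually handles those fine on its own.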
Fast-forward to March 2026, and the range of control has exploded, with some new capabilities specific to individual models or providers.
At a high level, these changes give the LLM more control over the way the speech is generated (and often, as a consequence, introduce a requirement for new TTS-model-specific prompt instructions for the LLM).
The new controls broadly fall into four categories:
- Pacing
- Pronunciation
- Emphasis
- Including non-word sounds
1. Pacing
Voice agents often have pacing issues: rattling through information at the wrong moment and failing to pause where a human would.
Model providers are introducing more explicit control over pacing through punctuation (which was effective before, but now gives more precise control) and through tokens that create a pause in the text (such as pause-related SSML tags).
To account for the pacing instructions, you may consider adding something like the following to your system prompt:
Inworld-style model (no SSML tags)
<output_style>
…
Pay attention to how you use punctuation, as this has a large impact on how the TTS engine will render speech.
Think about the various traits of human speech and attempt to express these with punctuation.
Patterns to consider include the fact that humans tend to pause when thinking, or pause (briefly) after saying a very important word to punctuate it.
Think about how humans read text when they see a comma or period, and use those strategically in your generation. An ellipsis (…) can also be used with this model to have the speech render as trailing off / hesitating slightly.
</output_style>
And if you have access to SSML Tags you might replace some of the instructions with something like:
<output_style>
…
Make careful use of the pacing SSML tags we have available. You can use <speed ratio="x.y"/> and <break time="xs"/>
Example usage:
- “<speed ratio="0.8"/> I am speaking more slowly”
- “Here is a pause <break time="1s"/> and now I’m speaking again.”
</output_style>
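A practical caveat: if you later point the same agent at a TTS model that does not understand these tags, the markup will be read aloud. A rough pre-synthesis guard can strip them out. This is a sketch under the assumption that your pacing tags look exactly like the `<speed .../>` and `<break .../>` examples above; the function name is illustrative.

```python
import re

# Matches self-closing pacing tags like <speed ratio="0.8"/> and <break time="1s"/>
# (the tag shapes shown in the prompt example above).
PACING_TAG = re.compile(r'<(?:speed|break)\b[^>]*/>\s*')

def strip_pacing_tags(text: str) -> str:
    """Remove SSML-style pacing tags for models that don't support them."""
    return PACING_TAG.sub("", text).strip()

print(strip_pacing_tags('Here is a pause <break time="1s"/> and now more speech.'))
# "Here is a pause and now more speech."
```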
2. Pronunciation
A number of models now support IPA (International Phonetic Alphabet) notation for words where you want control over the pronunciation (e.g. “/kriːt/” instead of “Crete”). This can protect against embarrassing voice output when speaking about key product or company names with unusual pronunciations (or heteronyms, whose pronunciation is ambiguous — e.g., “bass”, which is pronounced differently depending on whether you are talking about the fish or the instrument).
Example usage in an agent’s system prompt:
<output_style>
…
You are a voice agent in a UK context. Please use IPA notation when rendering brand names to ensure the listener hears them correctly.
Example: when discussing any Nike shoes in stock, please render the brand as “/naɪk/”. (Our TTS model by default pronounces the word “Nike” as /ˈnaɪ.ki/ if we don’t use IPA notation here.)
</output_style>
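As with the 2024-era name replacements, you can also enforce IPA pronunciations deterministically after generation, as a belt-and-braces step in case the LLM forgets the prompt instruction. A minimal sketch, assuming a hand-maintained pronunciation table (the names here are illustrative):

```python
# Hypothetical post-processing step: force IPA pronunciations for known
# brand names, regardless of whether the LLM followed the prompt.
IPA_PRONUNCIATIONS = {
    "Nike": "/naɪk/",  # UK pronunciation, per the example above
}

def apply_ipa(text: str) -> str:
    """Replace known brand names with their IPA renderings before TTS."""
    for word, ipa in IPA_PRONUNCIATIONS.items():
        text = text.replace(word, ipa)
    return text

print(apply_ipa("These Nike trainers are in stock."))
# "These /naɪk/ trainers are in stock."
```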
3. Non-Word Vocalizations
When people talk, they tend to make a number of non-word vocalizations to either better deliver the meaning of their words or simply as a consequence of being a creature with a body.
Various model providers (such as Inworld and ElevenLabs) have introduced the ability to generate these sounds by including tags in the generated text. The tag itself is not spoken aloud but is replaced by the corresponding non-word sound.
To be perfectly honest, a lot of these still sound odd in voice, but some have their uses and can really improve the experience of talking with a voice agent. Example tags include [laugh], [sigh], and [breath].
A number of the tags also control the style in which the following speech is delivered, such as the audio tags for ElevenLabs’ new V3 model. These include tags such as [jumping in], [dismissive], and [whisper], which impact the delivery of the audio.
If your TTS model supports the tags, an example set of instructions may look like:
<output_style>
…
We love the idea of creating a realistic voice experience for our users, so we’ve selected a model that has the ability to generate non-word vocalizations through various tags. Please use these (sparingly) to better communicate to the user and improve the overall quality of their experience. You can use the following tags: [jumping in] and [excited].
Example usage: “[excited] Alright, you’re all booked in.”
</output_style>
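Because tag support varies by provider, it can be worth filtering the generated text against an allow-list before synthesis, so that a tag the LLM invents (or one from a different provider’s vocabulary) is never sent to the TTS model. A sketch, assuming the two tags permitted in the prompt above:

```python
import re

# Only the tags the chosen TTS model actually supports (per the prompt above).
ALLOWED_TAGS = {"jumping in", "excited"}

TAG = re.compile(r"\[([^\]]+)\]\s*")

def filter_vocal_tags(text: str) -> str:
    """Keep bracketed tags on the allow-list; silently drop any others."""
    def keep_or_drop(match: re.Match) -> str:
        return match.group(0) if match.group(1) in ALLOWED_TAGS else ""
    return TAG.sub(keep_or_drop, text)

print(filter_vocal_tags("[excited] Alright! [shrug] See you soon."))
# "[excited] Alright! See you soon."
```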
4. Emphasis
Various ways to emphasize text have been introduced, most notably through the finer control afforded by punctuation and pause tokens. Some models are also starting to support dedicated emphasis tokens. For the new Inworld models, this means wrapping a key word in single *asterisks* for emphasis.
Example usage in a prompt:
<output_style>
…
Please enclose important words in single asterisks; these words should be given vocal emphasis when spoken.
Example: It’s *extremely* important that travelers arrive sixty minutes before departure in order to get checked in.
</output_style>
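The same portability caveat applies here: a model without asterisk support will read the punctuation literally or render it oddly. A small sketch of a fallback that strips the markers while keeping the words (the function name is illustrative):

```python
import re

# Matches single-asterisk emphasis of the form *word or phrase*.
EMPHASIS = re.compile(r"\*([^*\n]+)\*")

def strip_emphasis_markers(text: str) -> str:
    """Remove *emphasis* markers for TTS models that would read them literally."""
    return EMPHASIS.sub(r"\1", text)

print(strip_emphasis_markers("It's *extremely* important to arrive early."))
# "It's extremely important to arrive early."
```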
Further Materials
Providers of leading TTS models now include prompting techniques and best practices in their docs. Since each TTS model renders text differently, it’s well worth reading the latest docs to make sure your agent will be rendering text in a way that makes the most out of your TTS model.
LiveKit has recently posted a great guide about prompting for voice agents, and I’ve linked below some pages from three of the best TTS model providers: