The NAF uses Voice-to-Voice (V2V) backward prompting to control multimodal voice models during customer conversations. This technique "pushes" generated thoughts into the V2V model after each turn to guide its behavior and ensure it follows the correct conversation flow.
Why Backward Prompting?
The Challenge: Voice-to-Voice models take sound as input and produce sound as output, with no intermediate text representation. This makes them difficult to control, since there is no text to validate or modify.
The Solution: Instead of only providing instructions upfront (forward prompting), additional context is pushed back into the model by an external system (the Observer) during the conversation. As a result, the AI Employee rarely misses steps and follows scenarios accurately.
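To make the mechanism concrete, here is a minimal sketch of the push primitive in Python. Everything in it (`V2VSession`, `Observer`, `push_context`) is a hypothetical stand-in for illustration, not a real Newo or vendor API:

```python
# Minimal sketch of backward prompting: guidance is injected into a live
# session, steering the *next* generation rather than the first one.
# All names here are hypothetical.
from dataclasses import dataclass, field


@dataclass
class V2VSession:
    """Stands in for a live voice-to-voice model session."""
    injected_context: list[str] = field(default_factory=list)

    def push_context(self, thought: str) -> None:
        # Backward prompting: guidance arrives after the session has started.
        self.injected_context.append(thought)


class Observer:
    """External system that watches conversation state and emits thoughts."""

    def think(self, conversation_state: dict) -> str:
        # Real scenario logic would go here; this just surfaces the next step.
        next_step = conversation_state["pending_steps"][0]
        return f"Before answering, complete this step: {next_step}"


session = V2VSession()
observer = Observer()
state = {"pending_steps": ["confirm the customer's callback number"]}
session.push_context(observer.think(state))  # pushed back mid-conversation
```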
Evolution of Voice Architecture
Traditional Approach: Zero-Shot + Text-to-Speech
Process Flow:
Generate full prompt with complete conversation context
LLM generates text response based on user's last message
Pass generated text to Text-to-Speech provider (e.g., ElevenLabs)
TTS converts text to audio
Audio played to the customer
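For comparison, below is a sketch of this traditional pipeline. The OpenAI chat call stands in for any text LLM; `synthesize` and `play_audio` are hypothetical placeholders for the TTS provider (e.g., ElevenLabs) and the audio output path:

```python
# Sketch of the traditional pipeline: LLM text generation, then a separate
# TTS step. `synthesize` and `play_audio` are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()


def synthesize(text: str) -> bytes:
    """Hypothetical TTS provider call: text in, raw audio bytes out."""
    raise NotImplementedError("Wire up your TTS provider here")


def play_audio(audio: bytes) -> None:
    """Hypothetical audio sink that plays bytes to the caller."""
    raise NotImplementedError("Wire up your telephony/audio stack here")


def respond(conversation: list[dict], user_message: str) -> None:
    # Steps 1-2: full prompt with conversation context -> text response
    conversation.append({"role": "user", "content": user_message})
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=conversation,
    )
    text = completion.choices[0].message.content

    # Steps 3-5: hand the text to TTS and play the result to the customer
    play_audio(synthesize(text))
```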
Limitations:
Multi-step process (text generation → text-to-speech conversion)
Less natural-sounding conversation flow
Newo Approach: Voice-to-Voice with Backward Prompting
Process Flow:
System prompt uploaded at the beginning of the conversation
User speaks (audio input sent to V2V model)
Observer system generates thoughts based on conversation state
Thoughts are pushed to the V2V model
V2V model processes audio input + thoughts guidance
V2V model directly generates audio output (no text intermediary)
Audio response played to the customer
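Putting these steps together, here is a sketch of a single turn under this flow. The `V2VModel` interface is assumed purely for illustration (no public API is implied), and the `Observer` is the same idea as in the earlier sketch:

```python
# Sketch of one Newo-style V2V turn. Both classes are hypothetical.
class Observer:
    """External scenario engine producing per-turn guidance."""

    def think(self, state: dict) -> str:
        return f"Next required step: {state['pending_steps'][0]}"


class V2VModel:
    """Hypothetical voice-to-voice session: audio in, audio out, no text step."""

    def __init__(self, system_prompt: str):
        # Step 1: the system prompt is uploaded once, at conversation start.
        self.context: list[str] = [system_prompt]

    def push_thought(self, thought: str) -> None:
        # Step 4: backward prompting appends the Observer's thought to the
        # model's working context between turns.
        self.context.append(thought)

    def generate_audio(self, user_audio: bytes) -> bytes:
        # Steps 5-6: provider-specific inference; audio in, audio out,
        # with no text intermediary.
        raise NotImplementedError("V2V provider call goes here")


def run_turn(model: V2VModel, observer: Observer,
             state: dict, user_audio: bytes) -> bytes:
    thought = observer.think(state)   # Step 3: Observer reads conversation state
    model.push_thought(thought)       # Step 4: thought pushed into the V2V model
    return model.generate_audio(user_audio)  # Steps 5-7: respond in audio
```

The key difference from the traditional pipeline is that guidance arrives between turns rather than only in the upfront prompt, and there is no text intermediary to validate or modify.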
