Newo Agent Framework (NAF): V2V "Backward prompting"

The NAF uses Voice-to-Voice (V2V) backward prompting to control multimodal voice models during customer conversations. This technique "pushes" generated thoughts into the V2V model after each turn to guide its behavior and ensure it follows the correct conversation flow.

Why Backward Prompting?

The Challenge: Voice-to-Voice models take sound as input and produce sound as output (no intermediate text representation). This makes them difficult to control since there's no text to validate or modify.

The Solution: Instead of only providing instructions upfront (forward prompting), additional context is pushed back into the model by an external system (the Observer) during the conversation. As a result, the AI Employee rarely misses steps and follows scenarios accurately.

Evolution of Voice Architecture

Traditional Approach: Zero-Shot + Text-to-Speech

Process Flow:

  1. Generate full prompt with complete conversation context

  2. LLM generates text response based on user's last message

  3. Pass generated text to Text-to-Speech provider (e.g., ElevenLabs)

  4. TTS converts text to audio

  5. Audio played to the customer
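The traditional pipeline above can be sketched as a simple chain of calls. This is a minimal illustration only: `generate_reply` and `synthesize_speech` are hypothetical stubs standing in for an LLM call and a TTS provider call, not real SDK functions.

```python
def generate_reply(system_prompt: str, history: list, user_message: str) -> str:
    """Stub for the LLM call (step 2): full context in, text reply out."""
    return f"Reply to: {user_message}"

def synthesize_speech(text: str) -> bytes:
    """Stub for a TTS provider such as ElevenLabs (steps 3-4): text in, audio bytes out."""
    return text.encode("utf-8")  # placeholder for real audio data

def handle_turn(system_prompt: str, history: list, user_message: str) -> bytes:
    text = generate_reply(system_prompt, history, user_message)  # step 2
    audio = synthesize_speech(text)                              # steps 3-4
    return audio                                                 # step 5: play to customer

audio = handle_turn("You are a helpful booking agent.", [], "I'd like a table for two.")
```

The key limitation is visible in the structure itself: the audio can only be as natural as the intermediate text, and each hop adds latency.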

Limitations:

  • Multi-step process (text generation → text-to-speech conversion)

  • Less natural-sounding conversation flow

Newo Approach: Voice-to-Voice with Backward Prompting

Process Flow:

  1. System prompt uploaded at the beginning of the conversation

  2. User speaks (audio input sent to V2V model)

  3. Observer system generates thoughts based on conversation state

  4. Thoughts are pushed to the V2V model

  5. V2V model processes audio input + thoughts guidance

  6. V2V model directly generates audio output (no text intermediary)

  7. Audio response played to the customer
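The seven steps above can be sketched as a backward-prompting loop. `V2VModel` and `Observer` here are illustrative stand-ins (not the Newo SDK), assuming a session object that accepts pushed context between turns.

```python
class V2VModel:
    """Stand-in for a voice-to-voice model session."""
    def __init__(self, system_prompt: str):
        self.context = [system_prompt]  # step 1: system prompt uploaded at start

    def push_thought(self, thought: str) -> None:
        # step 4: a generated thought is pushed back into the model's context
        self.context.append(f"[thought] {thought}")

    def respond(self, user_audio: bytes) -> bytes:
        # steps 5-6: audio input + thought guidance -> audio output,
        # with no text intermediary exposed to the caller
        self.context.append(f"[audio] {len(user_audio)} bytes")
        return b"audio-reply"  # placeholder for generated audio

class Observer:
    """External system that watches conversation state and generates thoughts."""
    def next_thought(self, state: dict) -> str:
        # step 3: e.g. remind the model which scenario step comes next
        return f"Next step: {state['pending_step']}"

model = V2VModel("You are a restaurant booking agent.")
observer = Observer()

user_audio = b"\x00\x01"  # step 2: user speaks (audio input)
model.push_thought(observer.next_thought({"pending_step": "confirm party size"}))
reply_audio = model.respond(user_audio)  # steps 5-7: audio played to customer
```

The design point is that the Observer sits outside the model: it inspects conversation state after every turn and steers the next turn by injecting context, rather than by rewriting any text output (which a V2V model does not produce).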
