This document outlines a plan to enhance task completion by improving the system's ability to handle transcript errors and missing transcripts. It delves into the current challenges faced from both the End Customer (EC) and engineering perspectives, envisions the ideal scenario, identifies constraints, defines the scope and core of the problem, and proposes initial steps to address these issues.
Incorrect Responses:
While the agent is completing a task T to help the customer fulfill their intent, the end customer receives a response that does not match what they actually said. This increases customer effort and, when it happens repeatedly, leads to drop-offs.
Examples:
When the agent asks about gender preference, a customer mentions "co-ed properties", but due to a transcription error it is recognized as "co-vid policies", leading the agent to discuss COVID policies instead.
Impact:
Increase in customer effort as the customer has to clarify that they meant something else.
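One lightweight way to catch mis-transcriptions like the one above is to score the transcript against the answers the current task expects. The helper below is a hypothetical sketch using Python's `difflib`, not the production identifier; the expected-answer list and the 0.8 threshold are illustrative assumptions.

```python
# Hypothetical sketch: score a transcript against the answers a task expects,
# and flag likely mis-transcriptions for clarification instead of acting on them.
from difflib import SequenceMatcher

def best_match(transcript: str, expected_answers: list[str]) -> tuple[str, float]:
    """Return the closest expected answer and its similarity score (0..1)."""
    scored = [(ans, SequenceMatcher(None, transcript.lower(), ans.lower()).ratio())
              for ans in expected_answers]
    return max(scored, key=lambda pair: pair[1])

# Task: collecting gender preference for a property search (illustrative list).
expected = ["male", "female", "co-ed properties"]
answer, score = best_match("co-vid policies", expected)
# "co-vid policies" lands closest to "co-ed properties", but with a low score,
# so the agent should ask a clarifying question rather than discuss COVID policies.
needs_clarification = score < 0.8
```

A real identifier would also weigh the conversation so far and the customer's intent, but even this string-level check would catch the "co-vid policies" example.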
Silent Responses:
While the agent is completing a task T to help the customer fulfill their intent, the end customer receives no response at all to what they said, which increases effort and leads to drop-offs.
Examples:
When the agent asks about gender preference, a customer says "Male", but the STT fails to transcribe it, so the agent gives no response.
Audio input is processed by the Speech-to-Text (STT) system.
The STT system returns a transcript that doesn't fit the task being performed (collecting name, email, property area, etc.). The agent has no action available to recognize the transcript as invalid for the task, so it chooses the wrong action to complete the task.
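The missing capability can be sketched as a per-task validation step that runs before the agent acts. The task names (`collect_email`, etc.) and the simple rules below are illustrative assumptions, not the production schema.

```python
# Hedged sketch of the missing validation step: before choosing an action,
# check whether the transcript plausibly answers the current task.
import re

VALIDATORS = {
    "collect_email": lambda t: re.search(r"[^@\s]+@[^@\s]+\.[^@\s]+", t) is not None,
    "collect_name": lambda t: bool(re.fullmatch(r"[A-Za-z .'-]{2,}", t.strip())),
    "collect_gender_preference": lambda t: any(
        word in t.lower() for word in ("male", "female", "co-ed")),
}

def transcript_fits_task(task: str, transcript: str) -> bool:
    """Return True when the transcript looks like a valid answer for the task."""
    validator = VALIDATORS.get(task)
    return validator(transcript) if validator else True  # unknown task: pass through
```

With a check like this, "co-vid policies" would fail the gender-preference task and route to clarification instead of the wrong action.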
Inadequate Responses:
The agent provides inappropriate replies or remains silent because there's no mechanism to check transcript accuracy against the conversation context.
Absence of Observation Component:
No system in place to observe and assess whether the transcript makes sense before generating a response.
The agent always provides appropriate replies that resolve the customer's intent and helps the customer navigate the conversation with minimal effort.
Error Handling:
If the end customer's voice is not audible, the agent can ask the customer to repeat themselves by speaking louder, speaking more slowly, or moving to a quieter environment, depending on the scenario (to be decided with the solution hypothesis).
Smooth Conversation Flow:
Minimizes the need for customers to repeat themselves.
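The scenario-dependent repeat requests described under Error Handling can be sketched as a simple lookup. The issue labels and wording below are illustrative; the actual prompts would come out of the solution hypothesis mentioned above.

```python
# Hypothetical mapping from a detected audio issue to a repeat request.
REPROMPTS = {
    "low_volume": "Sorry, I couldn't hear you clearly. Could you speak a bit louder?",
    "fast_speech": "You were a little fast for me. Could you repeat that slowly?",
    "background_noise": "There seems to be some noise around you. "
                        "Could you move somewhere quieter and repeat that?",
}

def repeat_request(audio_issue: str) -> str:
    """Pick a repeat request matching the detected audio issue."""
    return REPROMPTS.get(audio_issue, "Sorry, could you please repeat that?")
```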
LLM-Triggered Identification: The LLM can only identify and trigger events once the STT provides a transcript.
High False Positives: The LLM may generate a high number of false positives during identification and triggering, which can degrade the overall conversation if the system does not accommodate for them. This is because the trigger depends on a prompt, and the prompt will not have 100% accuracy when first pushed live.
Five-Stage Process: The process involves five stages—identification, trigger, verification, reasoning, and action.
Identification: Recognizing that the transcript received is not correct considering the task being performed, the conversation until now, the customer’s intent and the business’s context/goals.
Trigger: Initiating a notification to the reasoning engine to decide the next action.
Reasoning: Deciding the next best action based on constraints and environmental factors.
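The five stages can be sketched as a linear pipeline. The stage bodies below are placeholders (real identification, verification, and reasoning would use the task, conversation history, intent, and business context); only the control flow is drawn from the text.

```python
# Skeleton of the five-stage process: identification, trigger, verification,
# reasoning, action. All stage implementations are placeholders.
from dataclasses import dataclass

@dataclass
class Turn:
    task: str
    transcript: str
    plausible: bool = True
    action: str = "respond"

def identify(turn: Turn) -> bool:
    # Placeholder: real identification weighs the task, the conversation so far,
    # the customer's intent, and the business's context/goals.
    return bool(turn.transcript.strip())

def verify(turn: Turn) -> bool:
    # Placeholder second check to keep false positives from reaching the agent.
    return True

def reason(turn: Turn) -> str:
    # Placeholder: real reasoning picks the next best action under constraints.
    return "reprompt"

def run_pipeline(turn: Turn) -> Turn:
    if not identify(turn):              # 1. identification
        turn.plausible = False          # 2. trigger: notify the reasoning engine
        if verify(turn):                # 3. verification
            turn.action = reason(turn)  # 4. reasoning
    return turn                         # 5. action: caller executes turn.action
```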
Acceptable Latency: Additional latency of 1-2 seconds is acceptable when reasoning about the action for wrong transcripts, and it can be masked with agent utterances.
Focus on Loudness and Speed: Corrective actions will primarily involve influencing the speaker's loudness and speed, along with guessing what the person might have said given the task being performed, the intent, and the context of the conversation.
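One way to mask the 1-2 seconds of reasoning latency with an agent utterance is to start reasoning in the background and play a filler while it runs. The `speak` helper below is a stand-in for the real TTS/playback path, not an actual API.

```python
# Sketch: hide reasoning latency behind a filler utterance.
import concurrent.futures

def speak(text: str) -> None:
    # Placeholder for the TTS/playback path (assumed, not the real API).
    print(f"agent: {text}")

def masked_reasoning(reason_fn, filler: str = "Give me a second while I check that."):
    """Run reasoning in the background while a filler utterance masks the pause."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(reason_fn)  # reasoning starts immediately
        speak(filler)                    # filler plays while reasoning runs
        return future.result()           # by now most of the latency is hidden
```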
LLM's Inability for Audio Triggers: The LLM cannot trigger actions based on audio inputs before a transcript is available. This prevents us from assigning the identification step to any LLM component, as the LLM would only trigger if there is a transcript.
Control Over Shared Audio: Formi has control over the audio being shared among all three parties (our reasoning engine, LLM, and Exotel).
Internal Identification and Trigger: Since the LLM cannot detect missing transcripts, identification and triggering must be handled internally by our observer.
Unaltered Customer Experience: Any influence on the audio should not affect what the customer hears; the customer's experience must remain unchanged.
Focus on Loudness and Speed: As with wrong transcripts, the agent's corrective actions will focus on prompting the customer to adjust their loudness and speed.