This document outlines a plan to enhance task completion by improving the system's ability to handle transcript errors and missing transcripts. It delves into the current challenges faced from both the End Customer (EC) and engineering perspectives, envisions the ideal scenario, identifies constraints, defines the scope and core of the problem, and proposes initial steps to address these issues.
Incorrect Responses:
While the agent is completing a task T to help the customer fulfill their intent, the end customer receives a response that does not match what they actually said. This increases customer effort and, when it happens repeatedly, leads to drop-offs.
Examples:
When the agent asks about gender preference, a customer mentions "co-ed properties", but due to a transcription error it is recognized as "co-vid policies", leading the agent to discuss COVID policies instead.
Impact:
Increase in customer effort as the customer has to clarify that they meant something else.
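One lightweight way to catch mis-transcriptions like the one above is to score the transcript against the answers the current task expects. The helper below is a hypothetical sketch using Python's `difflib`, not the production identifier; the expected-answer list and the 0.8 threshold are illustrative assumptions.

```python
# Hypothetical sketch: score a transcript against the answers a task expects,
# and flag likely mis-transcriptions for clarification instead of acting on them.
from difflib import SequenceMatcher

def best_match(transcript: str, expected_answers: list[str]) -> tuple[str, float]:
    """Return the closest expected answer and its similarity score (0..1)."""
    scored = [(ans, SequenceMatcher(None, transcript.lower(), ans.lower()).ratio())
              for ans in expected_answers]
    return max(scored, key=lambda pair: pair[1])

# Task: collecting gender preference for a property search (illustrative list).
expected = ["male", "female", "co-ed properties"]
answer, score = best_match("co-vid policies", expected)
# "co-vid policies" lands closest to "co-ed properties", but with a low score,
# so the agent should ask a clarifying question rather than discuss COVID policies.
needs_clarification = score < 0.8
```

A real identifier would also weigh the conversation so far and the customer's intent, but even this string-level check would catch the "co-vid policies" example.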
Silent Responses:
While the agent is completing a task T to help the customer fulfill their intent, the end customer receives no response at all to what they said, which increases effort and leads to drop-offs.
Examples:
When the agent asks about gender preference, a customer says "Male", but the STT fails to transcribe it, so the agent gives no response.
Audio input is processed by the Speech-to-Text (STT) system.
The STT system returns a transcript that doesn't fit the task being performed (collecting name, email, property area, etc.). The agent has no action available to recognize the transcript as invalid for the task, so it chooses the wrong action to complete the task.
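The missing capability can be sketched as a per-task validation step that runs before the agent acts. The task names (`collect_email`, etc.) and the simple rules below are illustrative assumptions, not the production schema.

```python
# Hedged sketch of the missing validation step: before choosing an action,
# check whether the transcript plausibly answers the current task.
import re

VALIDATORS = {
    "collect_email": lambda t: re.search(r"[^@\s]+@[^@\s]+\.[^@\s]+", t) is not None,
    "collect_name": lambda t: bool(re.fullmatch(r"[A-Za-z .'-]{2,}", t.strip())),
    "collect_gender_preference": lambda t: any(
        word in t.lower() for word in ("male", "female", "co-ed")),
}

def transcript_fits_task(task: str, transcript: str) -> bool:
    """Return True when the transcript looks like a valid answer for the task."""
    validator = VALIDATORS.get(task)
    return validator(transcript) if validator else True  # unknown task: pass through
```

With a check like this, "co-vid policies" would fail the gender-preference task and route to clarification instead of the wrong action.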
Inadequate Responses:
The agent provides inappropriate replies or remains silent because there's no mechanism to check transcript accuracy against the conversation context.
Absence of Observation Component:
No system in place to observe and assess whether the transcript makes sense before generating a response.
The agent always provides appropriate replies that resolve the customer's intent and helps the customer navigate the conversation with minimal effort.
Error Handling:
If the end customer's voice is not audible, the agent can ask the customer to repeat themselves by speaking louder, speaking more slowly, or moving to a quieter environment, depending on the scenario (to be decided with the solution hypothesis).
Smooth Conversation Flow:
Minimizes the need for customers to repeat themselves.
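The scenario-dependent repeat requests described under Error Handling can be sketched as a simple lookup. The issue labels and wording below are illustrative; the actual prompts would come out of the solution hypothesis mentioned above.

```python
# Hypothetical mapping from a detected audio issue to a repeat request.
REPROMPTS = {
    "low_volume": "Sorry, I couldn't hear you clearly. Could you speak a bit louder?",
    "fast_speech": "You were a little fast for me. Could you repeat that slowly?",
    "background_noise": "There seems to be some noise around you. "
                        "Could you move somewhere quieter and repeat that?",
}

def repeat_request(audio_issue: str) -> str:
    """Pick a repeat request matching the detected audio issue."""
    return REPROMPTS.get(audio_issue, "Sorry, could you please repeat that?")
```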
LLM-Triggered Identification: The LLM can only identify and trigger events once the STT provides a transcript.
High False Positives: The LLM may generate a high number of false positives during identification and triggering, which can degrade the overall conversation if the system does not accommodate for them. This is because the trigger depends on a prompt, and the prompt will not have 100% accuracy when first pushed live.
Five-Stage Process: The process involves five stages—identification, trigger, verification, reasoning, and action.
Identification: Recognizing that the transcript received is not correct considering the task being performed, the conversation until now, the customer’s intent and the business’s context/goals.
Trigger: Initiating a notification to the reasoning engine to decide the next action.
Reasoning: Deciding the next best action based on constraints and environmental factors.
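The five stages can be sketched as a linear pipeline. The stage bodies below are placeholders (real identification, verification, and reasoning would use the task, conversation history, intent, and business context); only the control flow is drawn from the text.

```python
# Skeleton of the five-stage process: identification, trigger, verification,
# reasoning, action. All stage implementations are placeholders.
from dataclasses import dataclass

@dataclass
class Turn:
    task: str
    transcript: str
    plausible: bool = True
    action: str = "respond"

def identify(turn: Turn) -> bool:
    # Placeholder: real identification weighs the task, the conversation so far,
    # the customer's intent, and the business's context/goals.
    return bool(turn.transcript.strip())

def verify(turn: Turn) -> bool:
    # Placeholder second check to keep false positives from reaching the agent.
    return True

def reason(turn: Turn) -> str:
    # Placeholder: real reasoning picks the next best action under constraints.
    return "reprompt"

def run_pipeline(turn: Turn) -> Turn:
    if not identify(turn):              # 1. identification
        turn.plausible = False          # 2. trigger: notify the reasoning engine
        if verify(turn):                # 3. verification
            turn.action = reason(turn)  # 4. reasoning
    return turn                         # 5. action: caller executes turn.action
```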
Acceptable Latency: Additional latency of 1-2 seconds is acceptable when reasoning about the action for wrong transcripts, and it can be masked with agent utterances.
Focus on Loudness and Speed: Corrective actions will primarily involve influencing the speaker's loudness and speed, along with guessing what the person might have said given the task being performed, the intent, and the context of the conversation.
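One way to mask the 1-2 seconds of reasoning latency with an agent utterance is to start reasoning in the background and play a filler while it runs. The `speak` helper below is a stand-in for the real TTS/playback path, not an actual API.

```python
# Sketch: hide reasoning latency behind a filler utterance.
import concurrent.futures

def speak(text: str) -> None:
    # Placeholder for the TTS/playback path (assumed, not the real API).
    print(f"agent: {text}")

def masked_reasoning(reason_fn, filler: str = "Give me a second while I check that."):
    """Run reasoning in the background while a filler utterance masks the pause."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(reason_fn)  # reasoning starts immediately
        speak(filler)                    # filler plays while reasoning runs
        return future.result()           # by now most of the latency is hidden
```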
LLM's Inability for Audio Triggers: The LLM cannot trigger actions based on audio inputs before a transcript is available. This prevents us from assigning the identification step to any LLM component, as the LLM would only trigger if there is a transcript.
Control Over Shared Audio: Formi has control over the audio being shared among all three parties (our reasoning engine, LLM, and Exotel).
Internal Identification and Trigger: Since the LLM cannot detect missing transcripts, identification and triggering must be handled internally by our observer.
Unaltered Customer Experience: Any influence on the audio should not affect what the customer hears; the customer's experience must remain unchanged.
Focus on Loudness and Speed: As with wrong transcripts, the agent's corrective actions will focus on prompting the customer to adjust their loudness and speed.