Conversations
Import, analyze, and review conversations.
How does conversation analysis work?
Import conversations from Firestore, run AI-powered analysis using your configured evaluators, then review quality ratings for each conversation.
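As a rough sketch of that flow, the snippet below reads conversation documents from a Firestore collection and writes evaluator ratings back onto each document. The collection name, field names, and the `runEvaluators` helper are illustrative assumptions, not the app's actual schema or API.

```ts
import { initializeApp } from "firebase-admin/app";
import { getFirestore } from "firebase-admin/firestore";

initializeApp();
const db = getFirestore();

// Hypothetical shape of an imported conversation document.
interface Conversation {
  id: string;
  clientId: string;
  protocolId: string;
  transcript: string;
}

// Stand-in for the AI analysis step: returns one rating per configured evaluator.
async function runEvaluators(transcript: string): Promise<Record<string, string>> {
  // ...call the configured model with the framework prompt and rubrics...
  return { safety: "good", compliance: "acceptable" }; // placeholder ratings
}

// 1. Import: read conversations for a client from Firestore.
async function importConversations(clientId: string): Promise<Conversation[]> {
  const snap = await db
    .collection("conversations") // assumed collection name
    .where("clientId", "==", clientId)
    .get();
  return snap.docs.map((d) => ({ id: d.id, ...(d.data() as Omit<Conversation, "id">) }));
}

// 2. Analyze: rate each conversation and store the ratings for later review.
async function analyzeAll(clientId: string): Promise<void> {
  for (const convo of await importConversations(clientId)) {
    const ratings = await runEvaluators(convo.transcript);
    await db.collection("conversations").doc(convo.id).update({ ratings, analyzedAt: new Date() });
  }
}
```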
| Status | Client ID | Client Name | Protocol ID | Protocol Name | Conversation ID | Work Unit | Type | Call Status | Conv. Date | Imported | Analyzed | Duration | Score | Issues | Rev. | Audio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Reports
Analyze patterns across your evaluated conversations and generate quality insights.
What can reports tell you?
Reports analyze patterns across your evaluated conversations -- recurring quality issues, trends over time, and actionable insights. Generate reports filtered by client or protocol to focus on specific areas.
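Conceptually, a report boils down to an aggregation like the sketch below, which counts recurring issue labels for a chosen client or protocol. The field names are illustrative, not the app's real data model.

```ts
// Illustrative shapes; the app's real fields may differ.
interface EvaluatedConversation {
  clientId: string;
  protocolId: string;
  issues: string[]; // issue labels found during analysis
  score: number;    // overall quality score
}

// Count recurring issues for one client or protocol to surface patterns.
function recurringIssues(
  conversations: EvaluatedConversation[],
  filter: { clientId?: string; protocolId?: string }
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const c of conversations) {
    if (filter.clientId && c.clientId !== filter.clientId) continue;
    if (filter.protocolId && c.protocolId !== filter.protocolId) continue;
    for (const issue of c.issues) {
      counts.set(issue, (counts.get(issue) ?? 0) + 1);
    }
  }
  return counts;
}
```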
Evaluators
Configure evaluators, scoring, and profiles.
What are evaluators?
Evaluators are the quality criteria the AI uses to rate your conversations. Each evaluator checks for a specific aspect -- like safety, compliance, or communication quality -- and assigns a rating based on its rubric.
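As an illustration, an evaluator can be thought of as a small configuration object like the one below; the field names and the example rubric are assumptions, not the app's exact schema.

```ts
// Illustrative evaluator definition.
interface Evaluator {
  id: string;
  name: string;     // e.g. "Safety", "Compliance"
  category: string;
  enabled: boolean;
  rubric: string;   // criteria the AI applies when assigning a rating
}

// Example: a communication-quality evaluator rated on the configured scoring scale.
const communication: Evaluator = {
  id: "communication-quality",
  name: "Communication quality",
  category: "Communication",
  enabled: true,
  rubric:
    "Rate how clearly and respectfully the agent communicates, " +
    "from lowest to highest on the configured scoring scale.",
};
```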
Evaluators List
| Evaluator | Category | Status |
|---|---|---|
Prompt Template
- `<<SCHEMA_JSON>>`: where the JSON schema will be inserted
- `<<EVALUATOR_RUBRICS>>`: where the rubric definitions will be inserted
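To picture how the template is resolved at analysis time, the sketch below substitutes both placeholders before the prompt is sent. The template text and the `buildPrompt` helper are purely illustrative.

```ts
// Fill both placeholders in the framework prompt template (illustrative helper).
function buildPrompt(template: string, schemaJson: string, evaluatorRubrics: string): string {
  return template
    .replace("<<SCHEMA_JSON>>", schemaJson)
    .replace("<<EVALUATOR_RUBRICS>>", evaluatorRubrics);
}

const prompt = buildPrompt(
  "Rate the conversation using these rubrics:\n<<EVALUATOR_RUBRICS>>\n" +
    "Answer with JSON matching this schema:\n<<SCHEMA_JSON>>",
  JSON.stringify({ type: "object" }),   // placeholder schema
  "- Safety: ...\n- Compliance: ..."    // placeholder rubric definitions
);
```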
Scoring Scale
Profiles
What are profiles?
Profiles save a snapshot of your current evaluator configuration — which evaluators are enabled, the framework prompt, and the scoring scale — so you can switch between setups quickly.
- Evaluators — enabled/disabled state & order
- Framework prompt
- Scoring scale
- Create a profile from your current setup
- Bind it to specific clients or protocols (optional)
- Apply it manually, or set bindings to activate it automatically
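The sketch below shows one way a profile snapshot and its bindings could be represented and resolved. The field names, and the assumption that a bound profile takes priority over the manually applied one, are illustrative rather than the app's documented behavior.

```ts
// Illustrative profile snapshot; real field names may differ.
interface Profile {
  name: string;
  evaluators: { id: string; enabled: boolean; order: number }[];
  frameworkPrompt: string;
  scoringScale: string;
  bindings?: { clientIds?: string[]; protocolIds?: string[] }; // optional auto-activation
}

// Pick the profile to apply; assume a bound profile wins over the manually selected one.
function resolveProfile(
  profiles: Profile[],
  manuallyApplied: Profile,
  clientId: string,
  protocolId: string
): Profile {
  const bound = profiles.find(
    (p) =>
      p.bindings?.clientIds?.includes(clientId) ||
      p.bindings?.protocolIds?.includes(protocolId)
  );
  return bound ?? manuallyApplied;
}
```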
No profiles yet
Create a profile to save your current evaluation setup and switch between configurations quickly.
Alignment
Measure and improve the accuracy of your AI evaluators by comparing them against human reviews.
How do you know your evaluators are accurate?
The Alignment section helps you answer this question. You manually review a sample of conversations, then compare your human ratings against the AI's ratings. The agreement rate tells you how much you can trust each evaluator.
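Conceptually, the agreement rate is the share of manual reviews whose rating matches the AI's. The sketch below computes it per evaluator, assuming an exact rating match counts as agreement, and applies the same warning rule described further down (below 75% agreement with 5 or more reviews).

```ts
// Illustrative record of a manual review compared with the AI rating.
interface ReviewPair {
  evaluatorId: string;
  aiRating: string;    // a label on the scoring scale
  humanRating: string;
}

// Agreement rate = matching reviews / total reviews, per evaluator.
function agreementByEvaluator(pairs: ReviewPair[]): Map<string, { rate: number; reviews: number }> {
  const stats = new Map<string, { matches: number; reviews: number }>();
  for (const p of pairs) {
    const s = stats.get(p.evaluatorId) ?? { matches: 0, reviews: 0 };
    s.reviews += 1;
    if (p.aiRating === p.humanRating) s.matches += 1;
    stats.set(p.evaluatorId, s);
  }
  const result = new Map<string, { rate: number; reviews: number }>();
  for (const [id, s] of stats) {
    result.set(id, { rate: s.matches / s.reviews, reviews: s.reviews });
  }
  return result;
}

// Flag evaluators below 75% agreement with 5+ reviews (the warning shown in this section).
function needsRubricReview(rate: number, reviews: number): boolean {
  return reviews >= 5 && rate < 0.75;
}
```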
Filter calibration metrics by protocol and/or dataset.
Agreement by Evaluator
Click an evaluator to view its detailed confusion matrix below.
No manual reviews yet.
To see calibration data, open a conversation's detail view and manually review the AI ratings.
⚠️ Below 75% agreement with 5+ reviews — consider reviewing the rubric
Confusion Matrix
Select an evaluator above to view its confusion matrix.
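For reference, a confusion matrix for one evaluator can be built as in the sketch below, with human ratings as rows and AI ratings as columns; the orientation and the input shape are assumptions for illustration.

```ts
// Build a confusion matrix: each cell counts how often a (human, AI) rating pair occurred.
function confusionMatrix(
  pairs: { aiRating: string; humanRating: string }[],
  scale: string[] // ratings in scoring-scale order
): number[][] {
  const index = new Map(scale.map((label, i) => [label, i]));
  const matrix = scale.map(() => scale.map(() => 0));
  for (const p of pairs) {
    const row = index.get(p.humanRating);
    const col = index.get(p.aiRating);
    if (row !== undefined && col !== undefined) matrix[row][col] += 1;
  }
  return matrix;
}
```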
No test datasets yet
Datasets are named collections of conversations that you use to test evaluator accuracy. Group conversations by protocol or any dimension you want to measure against.
After creating a dataset, add conversations to it, then manually review them in the Conversations section. Come back to the Metrics tab to see how the AI's ratings compare with your human reviews.
Compare Cohorts
Compare two sets of conversations side by side to detect quality differences across clients, protocols, or time periods.
How does cohort comparison work?
Define two groups of conversations using filters (clients, protocols, date ranges), then compare their quality metrics side by side. This helps you spot trends, measure improvements, or identify regressions.
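The sketch below illustrates that idea: two filters select the cohorts, and their average scores are compared side by side. The field names and the choice of average score as the metric are assumptions for illustration.

```ts
// Illustrative cohort filter and comparison.
interface CohortFilter {
  clientIds?: string[];
  protocolIds?: string[];
  from?: Date;
  to?: Date;
}

interface EvaluatedConvo {
  clientId: string;
  protocolId: string;
  date: Date;
  score: number;
}

// Select the conversations that match a cohort's filters.
function selectCohort(all: EvaluatedConvo[], f: CohortFilter): EvaluatedConvo[] {
  return all.filter(
    (c) =>
      (!f.clientIds || f.clientIds.includes(c.clientId)) &&
      (!f.protocolIds || f.protocolIds.includes(c.protocolId)) &&
      (!f.from || c.date >= f.from) &&
      (!f.to || c.date <= f.to)
  );
}

// Compare average scores; a negative delta suggests cohort B scores lower than cohort A.
function compareCohorts(a: EvaluatedConvo[], b: EvaluatedConvo[]) {
  const avg = (xs: EvaluatedConvo[]) =>
    xs.length ? xs.reduce((s, c) => s + c.score, 0) / xs.length : 0;
  return { avgA: avg(a), avgB: avg(b), delta: avg(b) - avg(a) };
}
```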
Cohort A
Cohort B
No comparison yet
Select filters for two cohorts above and click Compare Cohorts to see side-by-side quality metrics.
Evaluator Details
Select a cohort to view conversation details.
Settings
Configure how conversations are analyzed, transcribed, and processed. Changes affect all future analyses.
How do settings affect your analyses?
These settings control the AI model, reasoning depth, and transcription pipeline used when you analyze conversations. Changes only affect future analyses -- previously analyzed conversations keep their original settings.
The AI model used for analysis. Flash models are faster and cheaper; Pro models are more capable.
Language for AI evaluation output: evaluator names, descriptions, rubrics, the framework prompt, and the transcription prompt.
Higher values make output more random and creative. Lower values make it more focused and deterministic.
Controls reasoning depth for 2.5 models. Higher budgets may improve quality for complex analysis tasks.
Number of conversations to download or analyze simultaneously (1-10). Higher values complete batches faster but consume more API quota per minute.
When enabled, audio files will be transcribed during pipeline extraction. This adds processing time and API costs.
The multimodal model used to transcribe audio. This can be different from the analysis model.
Flash models are faster and recommended for transcription tasks.
Transcription always runs with Thinking Off and Temperature 0.0.
Instructions sent to the model when converting audio to text. Use the ES/EN toggle to edit each language version — the version matching the active language above is sent during transcription.
Configure the cost per million tokens for each model. These rates are used to estimate the cost shown in conversation details.
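Taken together, these options can be pictured as a settings object like the one below, alongside the assumed cost formula: a single per-model rate per million tokens, as described on this page (real pricing may distinguish input and output tokens). All names are illustrative, not the app's exact keys.

```ts
// Illustrative settings object mirroring the options on this page.
interface AnalysisSettings {
  analysisModel: string;       // Flash: faster and cheaper; Pro: more capable
  outputLanguage: "ES" | "EN";
  temperature: number;         // higher = more random, lower = more deterministic
  thinkingBudget: number;      // reasoning depth for 2.5 models
  concurrency: number;         // 1-10 conversations processed at once
  transcribeAudio: boolean;
  transcriptionModel: string;  // may differ from the analysis model
  transcriptionPrompt: { ES: string; EN: string };
}

// Assumed cost estimate: tokens used, scaled by the model's configured rate per million tokens.
function estimateCost(totalTokens: number, ratePerMillionUsd: number): number {
  return (totalTokens / 1_000_000) * ratePerMillionUsd;
}

// Example: 128,000 tokens at $1.25 per million tokens = $0.16.
const estimated = estimateCost(128_000, 1.25);
```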
Admin
Manage admin access and permissions.
Admins can edit evaluators, settings, categories, framework prompt, and scoring scale. Regular users can only manage their own profiles.