Quality & Evals
A/B testing, quality metrics, scorecards, and guardrail monitoring
Avg Quality Score
8.4/10
↑ 5.2% improvement
Active Experiments
12
Running: 8 • Completed: 4
Guardrail Violations
23
↓ 18.5% vs last period
Avg Response Time
1.2s
↓ 8.3% optimization
[Chart: Quality Scorecard]
[Chart: Quality Metrics vs Cost]
A/B Testing Experiments
| Experiment | Status | Variants | Sample Size | Winner Metric | Variant A | Variant B | Improvement |
|---|---|---|---|---|---|---|---|
| Customer Support Prompt v2 vs v1 (started 2025-01-15) | running | A: Original Prompt; B: Optimized Prompt (50/50 split) | 1.25K | Quality Score | 7.8 | 8.6 | +10.3% |
| GPT-4 vs Claude 3 Sonnet (started 2025-01-18) | running | A: GPT-4 Turbo; B: Claude 3 Sonnet (50/50 split) | 980 | Cost per Quality | 0.12 | 0.09 | +25% |
| Temperature 0.7 vs 0.3 (started 2025-01-10) | completed | A: Temp 0.7; B: Temp 0.3 (50/50 split) | 2.10K | Consistency | 82.5 | 94.2 | +14.2% |
| Content Length: Long vs Short (started 2025-01-08) | completed | A: Long Context; B: Short Context (60/40 split) | 1.65K | Relevance | 8.9 | 7.2 | +23.6% |
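The Improvement column is consistent with the winning variant's gain measured relative to the losing variant, with "Cost per Quality" treated as lower-is-better. The sketch below reproduces the four figures under that assumption; the function name and the `higher_is_better` flag are illustrative and do not come from the dashboard.

```python
def relative_improvement(a: float, b: float, higher_is_better: bool = True) -> float:
    """Return the winner's improvement over the loser, in percent.

    Assumption: the baseline is the losing variant's value, which matches
    the figures shown in the experiments table.
    """
    if higher_is_better:
        winner, loser = max(a, b), min(a, b)
    else:
        winner, loser = min(a, b), max(a, b)
    if loser == 0:
        raise ValueError("baseline value is zero; relative improvement is undefined")
    return abs(winner - loser) / loser * 100


# Rows from the experiments table: (variant A, variant B, higher_is_better)
rows = [
    (7.8, 8.6, True),     # Quality Score
    (0.12, 0.09, False),  # Cost per Quality (lower is better)
    (82.5, 94.2, True),   # Consistency
    (8.9, 7.2, True),     # Relevance
]
improvements = [round(relative_improvement(a, b, hib), 1) for a, b, hib in rows]
print(improvements)
```

Note that the sample sizes differ across experiments (980 to 2.10K), so these percentages say nothing by themselves about statistical significance.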
Quality Metrics Breakdown
| Category | Score | Samples | Total Cost | Cost per Quality Point |
|---|---|---|---|---|
| Accuracy | 8.9/10 | 12.50K | ₹1250.50 | ₹0.14 |
| Relevance | 8.7/10 | 12.50K | ₹1180.20 | ₹0.14 |
| Coherence | 8.5/10 | 12.50K | ₹1220.80 | ₹0.14 |
| Helpfulness | 8.2/10 | 12.50K | ₹1190.40 | ₹0.15 |
| Fluency | 9.1/10 | 12.50K | ₹1280.90 | ₹0.14 |
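The dashboard does not state how Cost per Quality Point is derived. The displayed values happen to match total cost divided by the score, then divided by 1,000; this is an inference from the numbers, not a documented formula, and the scaling factor in particular is an assumption.

```python
def cost_per_quality_point(total_cost: float, score: float, scale: float = 1000.0) -> float:
    """Assumed formula: total cost / score / scale, rounded to 2 decimals.

    The scale of 1000 is inferred from the table values, not documented.
    """
    return round(total_cost / score / scale, 2)


# (category, score out of 10, total cost in rupees) from the breakdown table
breakdown = [
    ("Accuracy", 8.9, 1250.50),
    ("Relevance", 8.7, 1180.20),
    ("Coherence", 8.5, 1220.80),
    ("Helpfulness", 8.2, 1190.40),
    ("Fluency", 9.1, 1280.90),
]
cpq = {name: cost_per_quality_point(cost, score) for name, score, cost in breakdown}
print(cpq)
```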
Guardrail Violations
| Guardrail | Type | Status | Violations | Total Checks | Violation Rate |
|---|---|---|---|---|---|
| PII Detection | privacy | active | 8 | 45.89K | 0.017% |
| Toxicity Filter | safety | active | 12 | 45.89K | 0.026% |
| Brand Guidelines | compliance | active | 3 | 45.89K | 0.007% |
| Context Length Limit | technical | active | 0 | 45.89K | 0.000% |
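A violation rate is the violation count divided by the total number of checks, expressed as a percentage. A minimal sketch, assuming the Total Checks figure of 45.89K means roughly 45,890 checks per guardrail; the function name is illustrative.

```python
def violation_rate_percent(violations: int, total_checks: int) -> float:
    """Return violations as a percentage of total checks, rounded to 3 decimals."""
    if total_checks <= 0:
        raise ValueError("total_checks must be positive")
    return round(violations / total_checks * 100, 3)


TOTAL_CHECKS = 45_890  # assumption: 45.89K read as 45,890

# Violation counts from the guardrail table
guardrails = {
    "PII Detection": 8,
    "Toxicity Filter": 12,
    "Brand Guidelines": 3,
    "Context Length Limit": 0,
}
rates = {name: violation_rate_percent(v, TOTAL_CHECKS) for name, v in guardrails.items()}
print(rates)
print(sum(guardrails.values()))  # total violations across guardrails
```

The per-guardrail counts sum to 23, which agrees with the Guardrail Violations figure in the summary cards.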