The Teckel Judge

At the heart of Teckel AI is our proprietary auditing engine, the Teckel Judge. This specialized evaluation system provides rigorous, automated analysis of every response your AI generates, ensuring clear and consistent quality measurement.

How It Works

When a trace is sent to Teckel AI, it's queued for processing by our unified Teckel Judge evaluation system. This consolidated process efficiently analyzes responses in a single pass, providing both quantitative scores and actionable qualitative feedback.
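
For illustration, a trace typically bundles the user's question, the AI's answer, the retrieved chunks, and optional document metadata. The sketch below shows one plausible payload shape; the field names (query, response, retrieved_chunks, last_updated) are illustrative for this example, not the official trace schema.

```python
# Hypothetical trace payload -- field names are illustrative, not the official schema.
trace = {
    "query": "How do I reset my password?",
    "response": (
        "You can reset your password by clicking the profile icon and selecting "
        "'Security Settings'. The reset link expires in 24 hours."
    ),
    "retrieved_chunks": [
        {
            "document_id": "kb-accounts-017",  # hypothetical source document
            "text": "Password resets live under Security Settings on the profile menu.",
            "last_updated": "2024-01-15",      # optional; enables freshness tracking
        }
    ],
}
```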

Processing Times

Batch Processing (Standard)

  • Typical completion: Within 1 hour
  • Maximum turnaround: 24 hours
  • Cost-effective for production use
  • Handles thousands of traces efficiently

Realtime Processing (Premium)

  • Immediate results: 2-3 seconds
  • Available for time-sensitive applications
  • Premium pricing applies
  • Contact sales for access

The Core Evaluation Metrics

The Teckel Judge evaluates three key quality dimensions:

1. Accuracy

Accuracy measures how well the response's factual content is supported by the source documents. Each AI response is broken into discrete claims, and the score is the ratio of supported claims to total claims.

Calculation: Supported Claims ÷ Total Claims Extracted

  • 1.0: All factual claims are fully supported by the source documents.
  • 0.8-0.99: Most claims are supported, with only minor unsupported details.
  • 0.6-0.79: Several claims lack proper support or contain minor inaccuracies.
  • Below 0.6: The response contains significant unsupported claims or factual errors.
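
A minimal sketch of the ratio above (claim extraction and support checking are performed by the Teckel Judge; the counts here are illustrative):

```python
def accuracy_score(supported_claims: int, total_claims: int) -> float:
    """Supported claims / total claims extracted (no claims -> 0.0 in this sketch)."""
    return supported_claims / total_claims if total_claims else 0.0

# Example: 7 of 8 extracted claims are backed by the retrieved chunks.
print(accuracy_score(7, 8))  # 0.875 -> "most claims are supported" band
```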

2. Precision

Precision evaluates how relevant retrieved document chunks are to the user's question. This metric reveals whether your RAG system finds the most useful information for each query.

Calculation: Relevant Chunks ÷ Total Chunks Retrieved

  • 1.0: All retrieved chunks are directly relevant and useful for answering the question.
  • 0.8-0.99: Most chunks are relevant, with only minor irrelevant content retrieved.
  • 0.6-0.79: A moderate amount of irrelevant information was included.
  • Below 0.6: Many retrieved chunks are not relevant to the user's question.
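
The same ratio pattern applies here, over retrieved chunks rather than claims. A small sketch, with per-chunk relevance judgments assumed to come from the Judge:

```python
def precision_score(chunk_is_relevant: list[bool]) -> float:
    """Relevant chunks / total chunks retrieved."""
    return sum(chunk_is_relevant) / len(chunk_is_relevant) if chunk_is_relevant else 0.0

# Four chunks retrieved for the query; three were judged relevant.
print(precision_score([True, True, True, False]))  # 0.75 -> moderate irrelevant content
```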

3. Completeness

Completeness measures how well the AI's response addresses the user's specific question. It checks whether the AI stayed on topic and directly answered what was asked.

Calculation: AI evaluation of answer relevance to the original question

  • 1.0: The response directly and comprehensively addresses the user's question.
  • 0.8-0.99: The response is mostly relevant with minor tangential content.
  • 0.6-0.79: The response partially addresses the question but includes some irrelevant information.
  • Below 0.6: The response is largely off-topic or doesn't address the core question.
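
Unlike the two ratio metrics, completeness comes from an AI judgment of how directly the answer addresses the question. The sketch below is purely conceptual: llm_complete stands in for a generic judge-model call and is not the actual Teckel implementation.

```python
def completeness_score(question: str, answer: str, llm_complete) -> float:
    """Ask a judge model to rate how directly the answer addresses the question."""
    prompt = (
        "Rate from 0.0 to 1.0 how directly and comprehensively the ANSWER addresses "
        "the QUESTION. Respond with only the number.\n\n"
        f"QUESTION: {question}\nANSWER: {answer}"
    )
    return float(llm_complete(prompt))  # e.g. 0.9 -> "mostly relevant" band
```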

Claims-Based Analysis

Our unique approach breaks down every AI response into verifiable claims, providing unprecedented transparency.

What's a Claim?

A claim is any factual statement the AI makes. For example:

AI Response: "You can reset your password by clicking the profile icon and selecting 'Security Settings'. The reset link expires in 24 hours."

Claims Extracted:

  1. Password reset accessed via profile icon
  2. Reset option found in Security Settings
  3. Reset link has 24-hour expiration
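
In structured form, those claims and their verification status might look like the following (the exact output schema is Teckel's; this shape is illustrative). For the sake of the example, assume the expiry claim isn't covered by any retrieved chunk: two of the three claims are supported, so the response's accuracy score would be about 0.67.

```python
extracted_claims = [
    {"claim": "Password reset accessed via profile icon", "supported": True},
    {"claim": "Reset option found in Security Settings", "supported": True},
    {"claim": "Reset link has 24-hour expiration", "supported": False},  # no chunk states the expiry
]

supported = sum(c["supported"] for c in extracted_claims)
print(round(supported / len(extracted_claims), 2))  # 0.67 accuracy for this response
```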

Why Claims Matter

Claims-based feedback makes accuracy scoring more grounded by identifying:

  • WHICH parts are wrong
  • WHY they're unsupported
  • WHAT documentation is missing
  • HOW to fix it

Freshness Tracking

In addition to the three core metrics, Teckel AI can track the freshness of your source documents by analyzing the last_updated timestamps in your trace data, when provided. This helps you understand when your AI might be relying on outdated information.

Calculation: 1.0 - (document_age_in_days ÷ 730), a linear decay over a two-year (730-day) window

  • Fresh (1.0-0.7): Information is recent and current (0-219 days old)
  • Aging (0.7-0.3): Information is becoming stale but still usable (219-511 days old)
  • Stale (Below 0.3): Information may be significantly outdated (511+ days old)

This metric helps you proactively identify documents that may need review or updating to maintain response quality.
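
A minimal sketch of the freshness decay (the formula is as documented above; clamping the result to the 0-1 range is an assumption):

```python
from datetime import date

def freshness_score(last_updated: date, as_of: date) -> float:
    """1.0 - (age_in_days / 730), clamped to [0, 1] (clamping is an assumption)."""
    age_days = (as_of - last_updated).days
    return max(0.0, min(1.0, 1.0 - age_days / 730))

# A document last updated roughly a year ago scores about 0.5 ("Aging").
print(round(freshness_score(date(2024, 1, 15), date(2025, 1, 15)), 2))  # 0.5
```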

The Audit Result

After evaluation is complete (typically within 1 hour for batch processing), the Teckel Judge provides:

  • Core Quality Scores: Three quantitative metrics (accuracy, precision, completeness)
  • Freshness Assessment: Document age-based scoring to identify stale information if metadata is available
  • Overall Score: A weighted average providing a single quality indicator
  • Claims Analysis: Detailed breakdown of factual claims and their supporting chunks
  • Document Quality Insights: Analysis of which documents contribute to high-quality responses
  • Qualitative Feedback: Actionable recommendations for improving your documentation
  • Issue Tags: Categorized problems like "missing_details", "needs_examples", or "unclear_terminology"
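
Put together, an audit result might look something like this. The field names and the equal weighting used for overall_score are illustrative assumptions, not the exact Teckel output schema:

```python
audit_result = {
    "scores": {"accuracy": 0.67, "precision": 0.75, "completeness": 0.90},
    "freshness": 0.50,      # present only when last_updated metadata was supplied
    "overall_score": 0.77,  # weighted average of the core scores; equal weights assumed here
    "claims": [
        {"claim": "Reset link has 24-hour expiration", "supported": False},
        # ...remaining claims with their supporting chunks
    ],
    "feedback": "Document the reset-link expiration window in the Security Settings article.",
    "issue_tags": ["missing_details"],
}
```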

Actionable Feedback for Documentation Teams

Beyond scoring, the Teckel Judge provides specific feedback that documentation teams can act on.

Types of Documentation Feedback

Gap Identification
"Users frequently ask about bulk import features, but no documentation exists. Create a bulk import guide covering CSV format, field mapping, and error handling."

Ambiguity Detection
"Documentation mentions 'administrator privileges' without defining what permissions are included. List specific admin capabilities."

Contradiction Finding
"API guide says rate limit is 100/minute, but error messages show 60/minute. Verify and update correct limit."

Completeness Check
"Installation guide missing system requirements. Add minimum RAM, disk space, and supported OS versions."

How Feedback Gets Generated

  1. Pattern Recognition: We analyze failed claims across multiple responses
  2. Root Cause Analysis: Identify why claims lack support
  3. Impact Assessment: Calculate how many users are affected
  4. Solution Recommendation: Provide specific fix instructions
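
Conceptually, steps 1 and 3 reduce to grouping unsupported claims by issue and counting how many traces each issue touches. A rough sketch (the issue tags are drawn from the examples in this guide; the grouping logic is illustrative, not the actual pipeline):

```python
from collections import Counter

def prioritize_feedback(failed_claims: list[dict]) -> list[tuple[str, int]]:
    """Group unsupported claims by issue tag and rank by how many traces are affected."""
    return Counter(claim["issue_tag"] for claim in failed_claims).most_common()

failed_claims = [
    {"claim": "Reset link has 24-hour expiration", "issue_tag": "missing_details"},
    {"claim": "Bulk import supports CSV", "issue_tag": "missing_details"},
    {"claim": "Rate limit is 100/minute", "issue_tag": "contradiction"},
]
print(prioritize_feedback(failed_claims))
# [('missing_details', 2), ('contradiction', 1)]
```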

Using Feedback Effectively

Documentation teams should:

  • Review feedback weekly in priority order
  • Assign specific docs to subject matter experts
  • Track improvements through quality scores
  • Use feedback tags to categorize work

Ground Truth Testing: Proactive Knowledge Base Validation

Beyond analyzing individual responses, the Teckel Judge can proactively test your vector database to identify issues before users encounter them.

How Ground Truth Testing Works

  1. Weakness Pattern Detection: Analyzes low-scoring responses to identify recurring issues
  2. Custom Query Generation: Creates precise test queries targeting identified weaknesses
  3. Vector Database Testing: Tests your search function with these queries
  4. Root Cause Analysis: Determines if problems are:
    • Retrieval Issues: Search isn't finding existing content
    • Content Gaps: Information doesn't exist in your knowledge base
    • Content Conflicts: Contradictory information across documents
  5. Targeted Recommendations: Provides specific fixes for either search optimization or content creation

Example Ground Truth Test

Identified Weakness: Poor responses about "bulk import features"

Test Query Generated: "How do I import multiple records at once?"

Test Results:

  • Found chunks: 0
  • Maximum relevance score: 0.0
  • Diagnosis: Content gap - no documentation exists
  • Recommendation: "Create bulk import guide covering CSV format, field mapping, validation rules, and error handling"
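
A minimal sketch of the diagnosis step, assuming your search function returns chunks with relevance scores. The 0.5 threshold and the simple two-way classification are simplifications for illustration:

```python
def diagnose(query: str, search_fn, relevance_threshold: float = 0.5) -> str:
    """Classify a ground-truth test result as a content gap or a likely retrieval issue."""
    chunks = search_fn(query)  # your vector database search
    if not chunks:
        return "content gap - no documentation exists"
    best = max(chunk["relevance"] for chunk in chunks)
    if best < relevance_threshold:
        return "likely retrieval issue - content may exist but isn't being surfaced"
    return "ok - relevant content retrieved"

# The bulk-import query above returned zero chunks:
print(diagnose("How do I import multiple records at once?", lambda q: []))
# content gap - no documentation exists
```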

Benefits of Ground Truth Testing

  • Proactive Quality Assurance: Find and fix issues before they impact users
  • Clear Problem Identification: Know exactly whether to improve search or add content
  • Automated Validation: Continuous testing without manual effort
  • Measurable Improvements: Track how fixes improve response quality

This comprehensive evaluation helps you systematically improve knowledge base quality, providing a clear path to better AI performance through both reactive analysis and proactive testing.