The Teckel Judge
At the heart of Teckel AI is our proprietary auditing engine, the Teckel Judge. This specialized evaluation system provides a rigorous, automated analysis of every response your AI generates, ensuring that you have a clear and consistent measure of your system's quality.
How It Works
When a trace is sent to Teckel AI, it's immediately queued for processing by our two-stage evaluation system. The process is designed to be objective and consistent, applying the same high standards to every audit while providing both quantitative scores and actionable qualitative feedback.
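For reference, a trace captures one question-and-answer exchange plus the retrieved context that the AI used. The exact schema depends on your integration; the shape below is only an illustrative sketch (field names such as question, response, and chunks are assumptions, not the exact Teckel API).

```python
# Illustrative trace payload (field names are assumptions, not the exact Teckel schema).
trace = {
    "question": "How do I rotate an API key?",
    "response": "Go to Settings > API Keys, click Rotate, and confirm. "
                "The old key stays valid for 24 hours.",
    "chunks": [
        {
            "document": "api-keys.md",
            "text": "API keys can be rotated from Settings > API Keys...",
            "last_updated": "2024-11-02",   # used later for freshness scoring
        },
        {
            "document": "billing.md",
            "text": "Invoices are issued monthly...",
            "last_updated": "2023-01-15",
        },
    ],
}
```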
Stage 1: Automated Scoring
Our proprietary judging methodology, inspired by RAGAS research but uniquely tailored for enterprise needs, performs a comprehensive multi-dimensional analysis of each response. The system breaks down AI responses into individual factual claims and maps them to supporting source chunks, providing unprecedented transparency into how your AI arrived at its conclusions.
Stage 2: Qualitative Assessment
Following the automated scoring, the Teckel Judge provides qualitative feedback focused on documentation improvement opportunities, helping you identify specific areas where your knowledge base could be enhanced to produce better AI responses.
The Core Evaluation Metrics
The Teckel Judge's evaluation is broken down into three key quality dimensions plus freshness tracking:
1. Faithfulness
The Faithfulness score measures the factual accuracy of the response by analyzing individual claims and their support in the source documents. The Judge breaks down each AI response into discrete factual statements and calculates the ratio of supported claims to total claims made.
Calculation: Supported Claims ÷ Total Claims Extracted
- 1.0: All factual claims are fully supported by the source documents.
- 0.8-0.99: Most claims are supported, with only minor unsupported details.
- 0.6-0.79: Several claims lack proper support or contain minor inaccuracies.
- Below 0.6: The response contains significant unsupported claims or factual errors.
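As a rough sketch of the arithmetic (not the Judge's internal implementation), faithfulness is simply the fraction of extracted claims that have supporting evidence:

```python
def faithfulness_score(claims: list[dict]) -> float:
    """Supported claims / total claims. Each claim dict is assumed to carry a
    'supported' flag set during evidence mapping (illustrative structure only)."""
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if claim["supported"])
    return supported / len(claims)

# Example: 3 of 4 claims supported -> 0.75, i.e. "several claims lack proper support".
print(faithfulness_score([{"supported": True}] * 3 + [{"supported": False}]))
```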
2. Context Precision
The Context Precision score evaluates how relevant the retrieved document chunks are to answering the user's question. This metric helps identify whether your RAG system is finding the most useful information for each query.
Calculation: Relevant Chunks ÷ Total Chunks Retrieved
- 1.0: All retrieved chunks are directly relevant and useful for answering the question.
- 0.8-0.99: Most chunks are relevant, with only minor irrelevant content retrieved.
- 0.6-0.79: A moderate amount of irrelevant information was included.
- Below 0.6: Many retrieved chunks are not relevant to the user's question.
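The arithmetic follows the same ratio pattern; a minimal sketch, assuming each retrieved chunk has already been judged relevant or not:

```python
def context_precision(chunk_relevance: list[bool]) -> float:
    """Relevant chunks / total chunks retrieved (illustrative helper, not the Judge's code)."""
    if not chunk_relevance:
        return 0.0
    return sum(chunk_relevance) / len(chunk_relevance)

# Example: 5 chunks retrieved, 4 judged relevant -> 0.8.
print(context_precision([True, True, True, True, False]))
```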
3. Response Relevancy
The Response Relevancy score measures how well the AI's response addresses the user's specific question. This evaluates whether the AI stayed on topic and provided information that directly answers what was asked.
Calculation: Direct LLM evaluation of answer relevance to the question
- 1.0: The response directly and comprehensively addresses the user's question.
- 0.8-0.99: The response is mostly relevant with minor tangential content.
- 0.6-0.79: The response partially addresses the question but includes some irrelevant information.
- Below 0.6: The response is largely off-topic or doesn't address the core question.
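Conceptually, this works like an LLM-as-judge call: the question and answer are shown to an evaluator model, which returns a relevance score. The sketch below assumes a hypothetical call_llm helper and a simplified prompt; it is not the Judge's actual prompt or model.

```python
RELEVANCY_PROMPT = """Rate how directly the answer addresses the question on a scale
from 0.0 (off-topic) to 1.0 (directly and comprehensively answers it).
Question: {question}
Answer: {answer}
Reply with only the number."""

def response_relevancy(question: str, answer: str, call_llm) -> float:
    # call_llm is a hypothetical function that sends a prompt to an LLM
    # and returns its text reply; plug in your own client here.
    reply = call_llm(RELEVANCY_PROMPT.format(question=question, answer=answer))
    return float(reply.strip())
```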
Claims-Based Analysis
A key innovation in Teckel's approach is our claims-based analysis system. For every AI response, we:
- Extract Individual Claims: Break down the response into discrete factual statements
- Map Supporting Evidence: Identify which source chunks support each claim
- Track Unsupported Claims: Flag statements that lack adequate evidence
- Analyze Chunk Relevance: Determine which retrieved chunks actually contribute to the answer
This granular analysis provides unprecedented visibility into your AI's reasoning process and helps identify specific areas for knowledge base improvement.
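The output of this step can be pictured as a mapping from each extracted claim to the chunks that back it. A minimal sketch of such a structure (types and field names are illustrative, not Teckel's internal representation):

```python
from dataclasses import dataclass, field

@dataclass
class ClaimAnalysis:
    text: str                                                    # the extracted factual statement
    supporting_chunks: list[int] = field(default_factory=list)   # indices of chunks that back it

    @property
    def supported(self) -> bool:
        return bool(self.supporting_chunks)

claims = [
    ClaimAnalysis("Keys are rotated from Settings > API Keys.", supporting_chunks=[0]),
    ClaimAnalysis("The old key stays valid for 24 hours."),      # no evidence -> flagged as unsupported
]

unsupported = [c.text for c in claims if not c.supported]
contributing_chunks = {i for c in claims for i in c.supporting_chunks}  # chunks that actually contribute
```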
Freshness Tracking
In addition to the three core metrics, Teckel tracks the freshness of your source documents by analyzing the `last_updated` timestamps in your trace data. This helps you understand when your AI might be relying on outdated information.
Calculation: 1.0 - (document_age_in_days ÷ 730), a linear decay over two years (730 days)
- Fresh (1.0-0.7): Information is recent and up-to-date (0-219 days old)
- Aging (0.7-0.3): Information is becoming stale but still usable (219-511 days old)
- Stale (Below 0.3): Information may be significantly outdated (511+ days old)
This metric helps you proactively identify documents that may need review or updating to maintain response quality.
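In code, the decay works out to a simple linear formula clamped to the 0-1 range; a sketch, assuming ISO-formatted `last_updated` dates:

```python
from datetime import date

def freshness_score(last_updated: str, today: date | None = None) -> float:
    """1.0 - age_in_days / 730, clamped to [0.0, 1.0] (two-year linear decay)."""
    today = today or date.today()
    age_days = (today - date.fromisoformat(last_updated)).days
    return max(0.0, min(1.0, 1.0 - age_days / 730))

# Example: a document updated ~30 days ago scores about 0.96, well within "Fresh".
print(round(freshness_score("2024-12-02", today=date(2025, 1, 1)), 2))
```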
The Audit Result
After evaluation is complete, the Teckel Judge provides:
- Core Quality Scores: Three quantitative metrics (faithfulness, context precision, response relevancy)
- Freshness Assessment: Document age-based scoring to identify stale information
- Overall Score: A weighted average providing a single quality indicator
- Claims Analysis: Detailed breakdown of factual claims and their supporting evidence
- Document Quality Insights: Analysis of which documents contribute to high-quality responses
- Qualitative Feedback: Actionable recommendations for improving your documentation
- Issue Tags: Categorized problems like "missing_details", "needs_examples", or "unclear_terminology"
This comprehensive evaluation focuses on helping you improve your knowledge base rather than simply flagging problems in individual responses, giving you a clear path to better future AI performance.
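Putting it all together, an audit result can be pictured as something like the sketch below. Field names, values, and weights are illustrative only, not the exact shape returned by Teckel.

```python
# Illustrative audit result (field names and weights are assumptions, not the exact API).
audit_result = {
    "scores": {
        "faithfulness": 0.75,
        "context_precision": 0.80,
        "response_relevancy": 0.90,
        "freshness": 0.55,
    },
    "overall_score": 0.78,            # weighted average of the metrics above
    "unsupported_claims": ["The old key stays valid for 24 hours."],
    "document_insights": {"api-keys.md": "supports 3 of 4 claims"},
    "feedback": "Document the key-rotation grace period in api-keys.md.",
    "issue_tags": ["missing_details"],
}
```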