The Teckel Judge
At the heart of Teckel AI is our proprietary auditing engine, the Teckel Judge. This specialized evaluation system provides a rigorous, automated analysis of every response your AI generates, ensuring that you have a clear and consistent measure of your system's quality.
How It Works
When a trace is sent to Teckel AI, it's immediately queued for processing by our two-stage evaluation system. The process is designed to be objective and consistent, applying the same high standards to every audit while providing both quantitative scores and actionable qualitative feedback.
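For reference, a trace captures one question-and-answer exchange plus the retrieved context that the AI used. The exact schema depends on your integration; the shape below is only an illustrative sketch (field names such as question, response, and chunks are assumptions, not the exact Teckel API).

```python
# Illustrative trace payload (field names are assumptions, not the exact Teckel schema).
trace = {
    "question": "How do I rotate an API key?",
    "response": "Go to Settings > API Keys, click Rotate, and confirm. "
                "The old key stays valid for 24 hours.",
    "chunks": [
        {
            "document": "api-keys.md",
            "text": "API keys can be rotated from Settings > API Keys...",
            "last_updated": "2024-11-02",   # used later for freshness scoring
        },
        {
            "document": "billing.md",
            "text": "Invoices are issued monthly...",
            "last_updated": "2023-01-15",
        },
    ],
}
```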
Stage 1: Automated Scoring
Our proprietary judging methodology, inspired by RAGAS research but uniquely tailored for enterprise needs, performs a comprehensive multi-dimensional analysis of each response. The system breaks down AI responses into individual factual claims and maps them to supporting source chunks, providing unprecedented transparency into how your AI arrived at its conclusions.
Stage 2: Qualitative Assessment
Following the automated scoring, the Teckel Judge provides qualitative feedback focused on documentation improvement opportunities, helping you identify specific areas where your knowledge base could be enhanced to produce better AI responses.
The Core Evaluation Metrics
The Teckel Judge's evaluation is broken down into three key quality dimensions plus freshness tracking:
1. Faithfulness
The Faithfulness score measures the factual accuracy of the response by analyzing individual claims and their support in the source documents. The Judge breaks down each AI response into discrete factual statements and calculates the ratio of supported claims to total claims made.
Calculation: Supported Claims ÷ Total Claims Extracted
- 1.0: All factual claims are fully supported by the source documents.
- 0.8-0.99: Most claims are supported, with only minor unsupported details.
- 0.6-0.79: Several claims lack proper support or contain minor inaccuracies.
- Below 0.6: The response contains significant unsupported claims or factual errors.
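As a rough sketch of the arithmetic (not the Judge's internal implementation), faithfulness is simply the fraction of extracted claims that have supporting evidence:

```python
def faithfulness_score(claims: list[dict]) -> float:
    """Supported claims / total claims. Each claim dict is assumed to carry a
    'supported' flag set during evidence mapping (illustrative structure only)."""
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if claim["supported"])
    return supported / len(claims)

# Example: 3 of 4 claims supported -> 0.75, i.e. "several claims lack proper support".
print(faithfulness_score([{"supported": True}] * 3 + [{"supported": False}]))
```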
2. Context Precision
The Context Precision score evaluates how relevant the retrieved document chunks are to answering the user's question. This metric helps identify whether your RAG system is finding the most useful information for each query.
Calculation: Relevant Chunks ÷ Total Chunks Retrieved
- 1.0: All retrieved chunks are directly relevant and useful for answering the question.
- 0.8-0.99: Most chunks are relevant, with only minor irrelevant content retrieved.
- 0.6-0.79: A moderate amount of irrelevant information was included.
- Below 0.6: Many retrieved chunks are not relevant to the user's question.
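The arithmetic follows the same ratio pattern; a minimal sketch, assuming each retrieved chunk has already been judged relevant or not:

```python
def context_precision(chunk_relevance: list[bool]) -> float:
    """Relevant chunks / total chunks retrieved (illustrative helper, not the Judge's code)."""
    if not chunk_relevance:
        return 0.0
    return sum(chunk_relevance) / len(chunk_relevance)

# Example: 5 chunks retrieved, 4 judged relevant -> 0.8.
print(context_precision([True, True, True, True, False]))
```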
3. Response Relevancy
The Response Relevancy score measures how well the AI's response addresses the user's specific question. This evaluates whether the AI stayed on topic and provided information that directly answers what was asked.
Calculation: Direct LLM evaluation of answer relevance to the question
- 1.0: The response directly and comprehensively addresses the user's question.
- 0.8-0.99: The response is mostly relevant with minor tangential content.
- 0.6-0.79: The response partially addresses the question but includes some irrelevant information.
- Below 0.6: The response is largely off-topic or doesn't address the core question.
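Conceptually, this works like an LLM-as-judge call: the question and answer are shown to an evaluator model, which returns a relevance score. The sketch below assumes a hypothetical call_llm helper and a simplified prompt; it is not the Judge's actual prompt or model.

```python
RELEVANCY_PROMPT = """Rate how directly the answer addresses the question on a scale
from 0.0 (off-topic) to 1.0 (directly and comprehensively answers it).
Question: {question}
Answer: {answer}
Reply with only the number."""

def response_relevancy(question: str, answer: str, call_llm) -> float:
    # call_llm is a hypothetical function that sends a prompt to an LLM
    # and returns its text reply; plug in your own client here.
    reply = call_llm(RELEVANCY_PROMPT.format(question=question, answer=answer))
    return float(reply.strip())
```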
Claims-Based Analysis
A key innovation in Teckel's approach is our claims-based analysis system. For every AI response, we:
- Extract Individual Claims: Break down the response into discrete factual statements
- Map Supporting Evidence: Identify which source chunks support each claim
- Track Unsupported Claims: Flag statements that lack adequate evidence
- Analyze Chunk Relevance: Determine which retrieved chunks actually contribute to the answer
This granular analysis provides unprecedented visibility into your AI's reasoning process and helps identify specific areas for knowledge base improvement.
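The output of this step can be pictured as a mapping from each extracted claim to the chunks that back it. A minimal sketch of such a structure (types and field names are illustrative, not Teckel's internal representation):

```python
from dataclasses import dataclass, field

@dataclass
class ClaimAnalysis:
    text: str                                                    # the extracted factual statement
    supporting_chunks: list[int] = field(default_factory=list)   # indices of chunks that back it

    @property
    def supported(self) -> bool:
        return bool(self.supporting_chunks)

claims = [
    ClaimAnalysis("Keys are rotated from Settings > API Keys.", supporting_chunks=[0]),
    ClaimAnalysis("The old key stays valid for 24 hours."),      # no evidence -> flagged as unsupported
]

unsupported = [c.text for c in claims if not c.supported]
contributing_chunks = {i for c in claims for i in c.supporting_chunks}  # chunks that actually contribute
```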
Freshness Tracking
In addition to the three core metrics, Teckel tracks the freshness of your source documents by analyzing the `last_updated` timestamps in your trace data. This helps you understand when your AI might be relying on outdated information.
Calculation: 1.0 - (document_age_in_days ÷ 730), a linear decay over two years (730 days)
- Fresh (1.0-0.7): Information is recent and up-to-date (0-219 days old)
- Aging (0.7-0.3): Information is becoming stale but still usable (219-511 days old)
- Stale (Below 0.3): Information may be significantly outdated (511+ days old)
This metric helps you proactively identify documents that may need review or updating to maintain response quality.
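In code, the decay works out to a simple linear formula clamped to the 0-1 range; a sketch, assuming ISO-formatted `last_updated` dates:

```python
from datetime import date

def freshness_score(last_updated: str, today: date | None = None) -> float:
    """1.0 - age_in_days / 730, clamped to [0.0, 1.0] (two-year linear decay)."""
    today = today or date.today()
    age_days = (today - date.fromisoformat(last_updated)).days
    return max(0.0, min(1.0, 1.0 - age_days / 730))

# Example: a document updated ~30 days ago scores about 0.96, well within "Fresh".
print(round(freshness_score("2024-12-02", today=date(2025, 1, 1)), 2))
```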
The Audit Result
After evaluation is complete, the Teckel Judge provides:
- Core Quality Scores: Three quantitative metrics (faithfulness, context precision, response relevancy)
- Freshness Assessment: Document age-based scoring to identify stale information
- Overall Score: A weighted average providing a single quality indicator
- Claims Analysis: Detailed breakdown of factual claims and their supporting evidence
- Document Quality Insights: Analysis of which documents contribute to high-quality responses
- Qualitative Feedback: Actionable recommendations for improving your documentation
- Issue Tags: Categorized problems like "missing_details", "needs_examples", or "unclear_terminology"
This comprehensive evaluation focuses on helping you improve your knowledge base rather than simply flagging problems in individual responses, giving you a clear path to better future AI performance.
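Putting it all together, an audit result can be pictured as something like the sketch below. Field names, values, and weights are illustrative only, not the exact shape returned by Teckel.

```python
# Illustrative audit result (field names and weights are assumptions, not the exact API).
audit_result = {
    "scores": {
        "faithfulness": 0.75,
        "context_precision": 0.80,
        "response_relevancy": 0.90,
        "freshness": 0.55,
    },
    "overall_score": 0.78,            # weighted average of the metrics above
    "unsupported_claims": ["The old key stays valid for 24 hours."],
    "document_insights": {"api-keys.md": "supports 3 of 4 claims"},
    "feedback": "Document the key-rotation grace period in api-keys.md.",
    "issue_tags": ["missing_details"],
}
```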