Evaluation Framework API Reference

Deprecated since version 1.0: The honeyhive.evaluation module is deprecated. Use the decorator-based evaluators in Evaluators instead (@evaluator, @aevaluator). The classes below are retained for backwards compatibility.

The HoneyHive evaluation framework provides built-in evaluators for assessing LLM outputs. All evaluators follow a standard interface: they accept inputs, outputs, and an optional ground_truth dict, and return a metrics dict.

Available built-in evaluators:

from honeyhive.evaluation import (
    ExactMatchEvaluator,
    F1ScoreEvaluator,
    LengthEvaluator,
    SemanticSimilarityEvaluator,
    BaseEvaluator,
)

# Built-in evaluator keys (for use with evaluate_batch)
# "exact_match", "f1_score", "length", "semantic_similarity"

Base Classes

BaseEvaluator

Note

The BaseEvaluator class has been deprecated in favor of the decorator-based approach. Please use the @evaluator decorator instead. See Evaluators for details.

The abstract base class for all evaluators.

Abstract Methods:

BaseEvaluator.evaluate(inputs: dict, outputs: dict, ground_truth: dict | None = None, **kwargs) -> dict

Evaluate outputs given inputs and optional ground truth.

Parameters:

  • inputs (dict) – Input data (e.g. {"expected": "...", "prompt": "..."})

  • outputs (dict) – Output data (e.g. {"response": "..."})

  • ground_truth (Optional[dict]) – Optional ground truth data

Returns:

  Metrics dict (varies by evaluator)

Return type:

  dict

Custom Evaluator Example:

from honeyhive.evaluation import BaseEvaluator

class MyEvaluator(BaseEvaluator):
    def __init__(self, **kwargs):
        super().__init__("my_evaluator", **kwargs)

    def evaluate(self, inputs, outputs, ground_truth=None, **kwargs):
        expected = inputs.get("expected", "")
        actual = outputs.get("response", "")
        score = float(expected.strip().lower() == actual.strip().lower())
        return {"score": score, "passed": score >= 0.7}

Built-in Evaluators

ExactMatchEvaluator

Evaluates whether the output exactly matches the expected value (case-insensitive, whitespace-trimmed).

Initialization:

from honeyhive.evaluation import ExactMatchEvaluator

evaluator = ExactMatchEvaluator()

Expected input format:

  • inputs["expected"] — the expected string

  • outputs["response"] — the actual output string

Returns:

{
    "exact_match": 1.0,   # 1.0 if match, 0.0 if not
    "expected": "...",
    "actual": "..."
}

Usage Example:

result = evaluator.evaluate(
    inputs={"expected": "Paris"},
    outputs={"response": "paris"},
)
print(result["exact_match"])  # 1.0

F1ScoreEvaluator

Computes the token-level F1 score between the output and expected value. Useful for extractive QA and information retrieval tasks.
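For reference, token-level F1 in its standard form can be sketched as follows. This is a hypothetical re-implementation for illustration only; the library's actual tokenization (e.g. punctuation and casing handling) may differ.

```python
from collections import Counter

def token_f1(expected: str, actual: str) -> float:
    """Standard token-level F1: harmonic mean of token precision and recall."""
    exp_tokens = expected.lower().split()
    act_tokens = actual.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(exp_tokens) & Counter(act_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(act_tokens)
    recall = overlap / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("The quick brown fox", "A quick brown dog"))  # 0.5
```

With this definition, "quick" and "brown" are the only shared tokens (precision 2/4, recall 2/4), giving F1 = 0.5.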

Initialization:

from honeyhive.evaluation import F1ScoreEvaluator

evaluator = F1ScoreEvaluator()

Expected input format:

  • inputs["expected"] — the expected string

  • outputs["response"] — the actual output string

Returns:

{"f1_score": 0.85}

Usage Example:

result = evaluator.evaluate(
    inputs={"expected": "The quick brown fox"},
    outputs={"response": "A quick brown dog"},
)
print(result["f1_score"])  # 0.5

LengthEvaluator

Note

The LengthEvaluator class has been deprecated in favor of the decorator-based approach. Please implement custom length evaluators using the @evaluator decorator.

Evaluates response length characteristics. Returns character, word, and line counts; does not apply pass/fail thresholds.
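The three counts can be reproduced with standard string operations. A minimal sketch, not the library's implementation (edge cases such as empty strings may be handled differently):

```python
def length_metrics(response: str) -> dict:
    """Character, whitespace-delimited word, and line counts for a string."""
    return {
        "char_count": len(response),
        "word_count": len(response.split()),
        "line_count": len(response.splitlines()),
    }

print(length_metrics("This is a sample response."))
# {'char_count': 26, 'word_count': 5, 'line_count': 1}
```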

Initialization:

from honeyhive.evaluation import LengthEvaluator

evaluator = LengthEvaluator()

Expected input format:

  • outputs["response"] — the output string to measure

Returns:

{
    "char_count": 142,
    "word_count": 27,
    "line_count": 3
}

Usage Example:

result = evaluator.evaluate(
    inputs={},
    outputs={"response": "This is a sample response."},
)
print(result["word_count"])  # 5

SemanticSimilarityEvaluator

Evaluates semantic similarity between the output and expected value using word-overlap and sentence-structure heuristics.
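The word-overlap component alone can be sketched as a Jaccard similarity over word sets. This is a hypothetical illustration of the idea, not the library's actual scoring, which also incorporates sentence-structure heuristics:

```python
def word_overlap_similarity(expected: str, actual: str) -> float:
    """Jaccard similarity over lowercased, whitespace-delimited word sets."""
    exp = set(expected.lower().split())
    act = set(actual.lower().split())
    if not exp or not act:
        return 0.0
    # |intersection| / |union| ranges from 0.0 (disjoint) to 1.0 (identical sets)
    return len(exp & act) / len(exp | act)

print(word_overlap_similarity("a b c", "a b d"))  # 0.5
```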

Initialization:

from honeyhive.evaluation import SemanticSimilarityEvaluator

evaluator = SemanticSimilarityEvaluator()

Expected input format:

  • inputs["expected"] — the reference string

  • outputs["response"] — the actual output string

Returns:

{"semantic_similarity": 0.72}  # 0.0-1.0

Usage Example:

result = evaluator.evaluate(
    inputs={"expected": "Photosynthesis converts sunlight into energy."},
    outputs={"response": "Plants use sunlight to produce energy via photosynthesis."},
)
print(result["semantic_similarity"])  # e.g. 0.65

Batch Evaluation

evaluate_batch() runs a list of evaluators over a dataset concurrently.

from honeyhive.evaluation.evaluators import evaluate_batch, ExactMatchEvaluator, F1ScoreEvaluator

dataset = [
    {"inputs": {"expected": "Paris"}, "outputs": {"response": "paris"}},
    {"inputs": {"expected": "London"}, "outputs": {"response": "Paris"}},
]

results = evaluate_batch(
    evaluators=[ExactMatchEvaluator(), F1ScoreEvaluator()],
    dataset=dataset,
    max_workers=4,
)
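Conceptually, a batch runner of this shape fans each dataset item out to every evaluator across a thread pool. The sketch below is a hypothetical illustration using plain callables in place of evaluator instances; it is not the internals of evaluate_batch:

```python
from concurrent.futures import ThreadPoolExecutor

def run_batch(evaluators, dataset, max_workers=4):
    """Run each callable evaluator over each item; one merged metrics dict per item."""
    def run_item(item):
        metrics = {}
        for ev in evaluators:
            metrics.update(ev(item["inputs"], item["outputs"]))
        return metrics

    # pool.map preserves dataset order in the returned results
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_item, dataset))

def exact(inputs, outputs):
    expected = inputs.get("expected", "")
    actual = outputs.get("response", "")
    return {"exact_match": float(expected.strip().lower() == actual.strip().lower())}

dataset = [
    {"inputs": {"expected": "Paris"}, "outputs": {"response": "paris"}},
    {"inputs": {"expected": "London"}, "outputs": {"response": "Paris"}},
]
print(run_batch([exact], dataset))
# [{'exact_match': 1.0}, {'exact_match': 0.0}]
```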

Integration Patterns

With Manual Evaluation

from honeyhive.evaluation import ExactMatchEvaluator, F1ScoreEvaluator, LengthEvaluator

evaluators = {
    "exact_match": ExactMatchEvaluator(),
    "f1": F1ScoreEvaluator(),
    "length": LengthEvaluator(),
}

inputs = {"expected": "The Eiffel Tower is in Paris."}
outputs = {"response": "The Eiffel Tower is located in Paris, France."}

results = {name: ev.evaluate(inputs, outputs) for name, ev in evaluators.items()}
for name, result in results.items():
    print(f"{name}: {result}")

Migrating to Decorator-Based Evaluators

The preferred approach for new code is the @evaluator decorator:

# OLD (deprecated)
from honeyhive.evaluation import ExactMatchEvaluator
evaluator = ExactMatchEvaluator()
result = evaluator.evaluate(inputs, outputs)

# NEW (preferred)
from honeyhive.experiments import evaluator

@evaluator
def exact_match(outputs, inputs, ground_truth=None):
    expected = inputs.get("expected", "")
    actual = outputs.get("response", "")
    return {"exact_match": float(expected.strip().lower() == actual.strip().lower())}

See Evaluators for the full decorator-based API.

See Also

  • Evaluators - Decorator-based evaluators (preferred)