{% extends "base.html" %}
{% block title %}Evalground - Memorizz{% endblock %}
{% block content %}
Evaluate agents against the benchmarks in the library (LongMemEval).
No agents available
Create an agent to run benchmarks.
| Run ID | Status | Benchmark | Agent | Dataset | Samples | Accuracy | Created | Finished | Actions |
|---|---|---|---|---|---|---|---|---|---|
| {{ run.run_id[:8] }}... | {{ run_status }} | {{ run.benchmark or 'longmemeval' }} | {{ run.agent_name }} | {{ run.dataset_variant or 'oracle' }} | {{ run.num_samples }} | {% if run.overall_accuracy is not none %}{{ run.overall_accuracy }}%{% else %}-{% endif %} | {{ run.created_at or '-' }} | {{ run.finished_at or '-' }} | View {% if run_status in ['queued', 'running', 'canceling'] %} {% endif %} |
No benchmark runs yet.
{% endif %}{{ selected_agent.agent_id }}
| Dataset | {{ eval_results.metadata.dataset_variant }} |
| Mode | {{ eval_results.metadata.application_mode|default('assistant') }} |
| Timestamp | {{ eval_results.metadata.timestamp }} |
| Output | {{ eval_output_path }} |
Ability to extract and recall facts from prior conversations, broken down by source.
Ability to reason across information spread over multiple sessions.
Ability to reason about when events occurred and handle time-sensitive queries.
Ability to handle updated or corrected information across sessions.
Abstention questions are identified by the _abs suffix on the question ID, rather than as a separate category. Scores penalise both hallucinated answers when the agent should abstain and refusals when the agent should know the answer.
Run a benchmark to see results.
{{ run_output }}