fieldtest
Run: demo-offline-0000
Time: demo-offline-0000
Set: full
Fixtures: 4
Runs/fixture: 3
RIGHT: 81%
GOOD: 96%
SAFE: 88%

handbook_qa

Employee questions answered accurately from the handbook

fixture  answers-from-context  known-answer  answer-length  cites-source  no-hallucination  stays-in-scope
expense-reimbursement  3/3  2/3  3/3  3/3  3/3  3/3
out-of-scope  2/3  3/3  2/3  2/3  2/3
remote-work  2/3  2/3  3/3  3/3  2/3  3/3
vacation-policy  3/3  3/3  3/3  3/3  3/3  3/3
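
The cells in the results matrix above read as passed runs over total runs, so a per-fixture tally is a quick way to see which fixtures are flaky. The snippet below is a minimal sketch of that arithmetic and nothing more: the names `RESULTS`, `tally`, and `pass_rates` are hypothetical, the cell values are copied by hand from the matrix, and it is not a description of how fieldtest derives the RIGHT, GOOD, or SAFE figures. The out-of-scope row lists only five cells in the source, so it is tallied over five cells without guessing which evaluator is absent.

```python
# Cells copied from the results matrix, in the order they appear per row.
# Assumption: each cell is "passed/total" runs for one evaluator.
RESULTS = {
    "expense-reimbursement": ["3/3", "2/3", "3/3", "3/3", "3/3", "3/3"],
    "out-of-scope": ["2/3", "3/3", "2/3", "2/3", "2/3"],  # five cells in the source
    "remote-work": ["2/3", "2/3", "3/3", "3/3", "2/3", "3/3"],
    "vacation-policy": ["3/3", "3/3", "3/3", "3/3", "3/3", "3/3"],
}


def tally(cells):
    """Sum a list of 'passed/total' cells into a (passed, total) pair."""
    passed = total = 0
    for cell in cells:
        p, t = cell.split("/")
        passed += int(p)
        total += int(t)
    return passed, total


def pass_rates(results):
    """Map each fixture name to its (passed, total) tally."""
    return {name: tally(cells) for name, cells in results.items()}


if __name__ == "__main__":
    for name, (passed, total) in pass_rates(RESULTS).items():
        print(f"{name}: {passed}/{total} ({passed / total:.0%})")
```

Keeping the tally at row level avoids having to assign the five out-of-scope cells to specific evaluator columns.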