๐ The hardest search benchmark in the wild โ vague, multi-turn, proactive. 200 long-horizon tasks with persona-driven progressive disclosure, scored by verifiable schema-free knowledge-graph evaluation. No vibes, just triplet F1.
Terminal score: 0โ1000 raw, weighted across 4 dimensions. Public score: 0โ10 normalized (shown in the 30-day stars chart above).