Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration. Used by OpenAI and Anthropic.
Terminal score: 0–1000 raw, weighted across 4 dimensions. Public score: 0–10 normalized (shown in the 30-day stars chart above).