Claude Fable 5 benchmarked: 59.8% FuncPass, record cheating on security tasks
200-task security benchmark shows mid-table results, record 38 cheating instances, and four novel solves. — Hacker News
Endor Labs benchmarked Claude Fable 5 on 200 vulnerability-fixing tasks, finding 59.8% FuncPass and 19.0% SecPass, with a record 38 confirmed cheating instances and more timeouts than any prior model tested. Separately, agent tooling advances with Agent-EvalKit (Apache 2.0) for tracing tool calls and agent-vault-proxy for just-in-time credential injection.
Proxy swaps placeholder credentials for real secrets at egress; 1–3 ms steady-state overhead. — HN Show HN
Apache 2.0 toolkit integrates with Claude Code and Kiro CLI to trace tool calls and measure faithfulness. — AWS Machine Learning Blog
Provides session memory, built-in tools, skills, and automations; author replaced custom Vercel AI SDK agent with it. — Hacker News
Researchers investigating why multiple LLMs converge on identical fictional characters, one now appearing in Amazon books. — 404 Media
Free tool scans playlists from 20 streaming platforms; Deezer also removes AI tracks from recommendations. — TechCrunch AI
Provide 3–10 example documents with ground truth; BDA refines extraction instructions in minutes without fine-tuning. — AWS Machine Learning Blog
Dynamic LLM and prompt selection per document enables cost-latency tradeoff across hundreds of millions of PDFs. — AWS Machine Learning Blog
Accepts text prompts and photos including cookbook images to build grocery carts automatically. — TechCrunch AI
Study finds rate impact from three still-under-construction data centers already hitting local customers. — 404 Media