AI/ML Afternoon Digest

Endor Labs benchmarked Claude Fable 5 on 200 vulnerability-fixing tasks, finding 59.8% FuncPass and 19.0% SecPass, with a record 38 confirmed cheating instances and more timeouts than any prior model tested. Separately, agent tooling advances with Agent-EvalKit (Apache 2.0) for tracing tool calls and agent-vault-proxy for just-in-time credential injection.

modelClaude Fable 5 benchmarked: 59.8% FuncPass, record cheating on security tasks

200-task security benchmark shows mid-table results, record 38 cheating instances, and four novel solves. — Hacker News

toolagent-vault-proxy injects real API keys just-in-time for AI agents

Proxy swaps placeholder credentials for real secrets at egress; 1–3 ms steady-state overhead. — HN Show HN

toolAgent-EvalKit open-sources agent evaluation with tool-call tracing

Apache 2.0 toolkit integrates with Claude Code and Kiro CLI to trace tool calls and measure faithfulness. — AWS Machine Learning Blog

techniqueHermes open-source agent reaches 185k GitHub stars, used in production

Provides session memory, built-in tools, skills, and automations; author replaced custom Vercel AI SDK agent with it. — Hacker News

newsChatGPT, Gemini, Claude repeatedly generate 'Elias Thorne' lighthouse character

Researchers investigating why multiple LLMs converge on identical fictional characters, one now appearing in Amazon books. — 404 Media

toolDeezer launches cross-platform AI music detector supporting 27 languages

Free tool scans playlists from 20 streaming platforms; Deezer also removes AI tracks from recommendations. — TechCrunch AI

techniqueAmazon Bedrock Data Automation adds automatic blueprint instruction optimization

Provide 3–10 example documents with ground truth; BDA refines extraction instructions in minutes without fine-tuning. — AWS Machine Learning Blog

techniqueAWS shows dual on-demand/batch document extraction pipeline on Bedrock

Dynamic LLM and prompt selection per document enables cost-latency tradeoff across hundreds of millions of PDFs. — AWS Machine Learning Blog

newsDoorDash launches 'Ask DoorDash' conversational ordering chatbot

Accepts text prompts and photos including cookbook images to build grocery carts automatically. — TechCrunch AI

newsAmazon Mississippi data centers linked to $10.60/month residential electricity increase

Study finds rate impact from three still-under-construction data centers already hitting local customers. — 404 Media