5 comments

  • woadwarrior01 8 minutes ago
    Interesting benchmark.

    I can't help but notice that they're benchmarking Opus 4.6 (Anthropic's latest and greatest model) against GPT-5.2 (which is three generations behind OpenAI's latest coding models: GPT-5.2-Codex, GPT-5.3-Codex and the latest GPT-5.4).

    • aurareturn 7 minutes ago
      As far as I know, OpenAI did not release 5.3 Codex in their API. You can only use it with Codex CLI or app.
  • KronisLV 1 hour ago
    > The benchmark comprises 100 tasks, each corresponding on average to an evolution history spanning 233 days and 71 consecutive commits in a real-world code repository.

    This seems like a really cool thing to benchmark! Technically it'd be possible to take GitHub repos that the AI orgs probably already have, cross-reference the code against the issues and regressions, and train/validate on that (rough sketch of the data collection at the end of this comment).

    The dataset would need to be way bigger to get close to the likes of SWE-bench: https://www.swebench.com/original.html

    "Vibe coded stuff gets hard to maintain and will end up buggy." Yeah, so make models that deal with that better, optimize for maintainability and consistency.

    Cool to see Claude doing decently though!
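
    Just to make that concrete, here's a rough sketch (my own, not the paper's pipeline) of pulling the raw material from the GitHub REST API with requests; "owner", "repo" and GITHUB_TOKEN are placeholders. It pairs a repo's commit history with its closed issues, which could then be cross-referenced against regressions:

      # Sketch only: fetch a repo's commit history and its closed issues so
      # commits can later be cross-referenced against issues/regressions.
      import os
      import requests

      API = "https://api.github.com"
      HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}

      def fetch_all(url, params=None):
          # Follow GitHub's pagination (Link headers) and yield every item.
          params = dict(params or {}, per_page=100)
          while url:
              resp = requests.get(url, headers=HEADERS, params=params)
              resp.raise_for_status()
              yield from resp.json()
              url = resp.links.get("next", {}).get("url")
              params = None  # the 'next' URL already carries the query

      def collect(owner, repo):
          commits = list(fetch_all(f"{API}/repos/{owner}/{repo}/commits"))
          issues = [i for i in fetch_all(f"{API}/repos/{owner}/{repo}/issues",
                                         {"state": "closed"})
                    if "pull_request" not in i]  # drop PRs, keep real issues
          return commits, issues

      commits, issues = collect("owner", "repo")
      print(len(commits), "commits,", len(issues), "closed issues")

    The hard part isn't the fetching, it's deciding which later commits count as "regressions caused by" an earlier change, which is presumably where most of the benchmark-construction effort goes.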

    • woadwarrior01 6 minutes ago
      > Cool to see Claude doing decently though!

      The scales do seem to be tipped in its favor (cf. my other comment in this thread).

  • challengerVIE 1 hour ago
    As someone who uses agents daily: keeping the long-term vision with maintainability in mind is really what separates us humans from agents, so I like the idea. However, evaluating changes that average only ~500 LoC doesn't sound like long-term maintainability is actually being measured here.
  • verdverm 2 hours ago
    A genuinely long-term task benchmark showing significant improvements in very recent models, while also showing really bad regression rates across the board.
  • devcraft_ai 2 hours ago
    [dead]