Toward automated verification of unreviewed AI-generated code
- phailhaus - 6586 seconds ago
Using FizzBuzz as your proxy for "unreviewed code" is extremely misleading. It has practically no complexity, and it's completely self-contained and easy to verify. In any codebase of even modest complexity, the challenge shifts from "does this produce the correct outputs" to "is this going to let me grow the way I need it to in the future" and thornier questions like "does this have the performance characteristics that I need".
- jryio - 5860 seconds ago
This is a naïve approach, not just because it uses FizzBuzz, but because it ignores the fundamental complexity of software as a system of abstractions. Testing often involves understanding those abstractions and testing for and against them.
For those of us with decades of experience who use coding agents for hours per day, we have learned that even with extensive context engineering, these models do not magically cover more than 50% of the testing space.
If you asked your coding agent to develop a memory allocator, it would not also "automatically verify" the memory allocator against all failure modes. It is your responsibility as an engineer to bring long-term learning and regular contact with the real world to bear on the testing approach.
- tedivm - 7907 seconds ago
While I understand why people want to skip code reviews, I think it is an absolute mistake at this point in time. I think AI coding assistants are great, but I've seen them fail or go down the wrong path often enough (even with things like spec-driven development) that I don't think it's reasonable to skip reviewing the code. Everything from development code paths left in production, to improper implementations, to security risks: all of these are just as likely to happen with an AI as with a human, and any team that lets humans push to production without a review would absolutely be ridiculed for it.
Again, I'm not opposed to AI coding. I know a lot of people are. I have multiple open source projects that were 100% created with AI assistants, and I wrote a blog post about it that you can see in my post history. I'm not anti-AI, but I do think that developers bear some responsibility for the code they create with these tools.
- jghn - 8532 seconds ago
I do think that GenAI will lead to a rise in mutation testing, property testing, and fuzzing. But it's worth keeping in mind that there are reasons these aren't already ubiquitous. Among other issues, they can be computationally expensive, especially mutation testing.
- sharkjacobs - 6296 seconds ago
I'm having a hard time wrapping my head around how this can scale beyond trivial programs like simplified FizzBuzz.
- duskdozer - 4450 seconds ago
So are we finally past the stage where people pretend they're actually reading any of the code their LLMs are dumping out?
- pron - 4019 seconds ago
> The code must pass property-based tests
Who writes the tests? It can be ok to trust code that passes tests if you can trust the tests.
There are, however, other problems. I frequently see agents write code that's functionally correct but that they won't be able to evolve for long. That's also what happened with Anthropic's attempt to have agents write a C compiler. They had thousands of human-written tests, but at some point the agents couldn't get the software to converge. Fixing a bug created another.
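For readers unfamiliar with the term, a property-based test for FizzBuzz might look like the following hand-rolled sketch (stdlib Python; Hypothesis is the usual library, but this avoids the dependency). Note how it illustrates the point above: the properties simply restate the divisibility rules, so trusting them means trusting a second implementation of the same spec.

```python
# Sketch: randomized property-based checks for FizzBuzz.
# The properties are invariants over random inputs, not example outputs.
import random

def fizzbuzz(n):
    if n % 15 == 0: return "FizzBuzz"
    if n % 3 == 0: return "Fizz"
    if n % 5 == 0: return "Buzz"
    return str(n)

def check_properties(trials=1000, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(1, 10**9)
        out = fizzbuzz(n)
        # Property 1: "Fizz" appears iff n is divisible by 3.
        assert ("Fizz" in out) == (n % 3 == 0), n
        # Property 2: "Buzz" appears iff n is divisible by 5.
        assert ("Buzz" in out) == (n % 5 == 0), n
        # Property 3: otherwise the output round-trips to n.
        if n % 3 and n % 5:
            assert int(out) == n, n
    return True
```

Whether such properties count as "verification" or just a restated spec is exactly the question the comment raises.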
- davemp - 601 seconds ago
So often these AI articles miss or ignore the Test Oracle Problem. Generating correct tests is at least as hard as generating the correct answers (often harder).
I’m actually starting to get annoyed about how much material is getting spread around about software analysis / formal methods by folks ignorant about the basics of the field.
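One standard partial answer to the Test Oracle Problem is metamorphic testing: instead of asserting what an output *is* (which would require the oracle you don't have), assert relations that must hold between related runs. A hedged stdlib-Python sketch for a sort routine; the relations chosen here are illustrative:

```python
# Sketch: metamorphic relations for a sort function. We never state the
# expected sorted output directly; we only check relations between runs.
import random
from collections import Counter

def metamorphic_sort_check(sort, trials=200, seed=1):
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
        out = sort(xs)
        # Relation 1: permuting the input must not change the output.
        shuffled = xs[:]
        rng.shuffle(shuffled)
        assert sort(shuffled) == out
        # Relation 2: the output is a multiset permutation of the input.
        assert Counter(out) == Counter(xs)
        # Relation 3: appending a strictly larger element appends it
        # to the tail of the sorted output.
        big = (max(xs) + 1) if xs else 1
        assert sort(xs + [big]) == out + [big]
    return True
```

This dodges the oracle rather than solving it: someone still has to judge that these relations characterize sorting well enough, which is the davemp comment's point.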
- phillipclapham - 1365 seconds ago
There's a layer above this that's harder to automate: verifying that the architectural decision was right, not just the implementation. You can lint for correctness, run the tests, catch the bug classes. But "this should've been a stateless function, not a microservice" or "this abstraction is wrong for the problem": well, that's not in the artifact. An agent can happily produce code that passes every automated check and still represent a fundamentally wrong design choice.
The thread's hitting on this with "who writes the tests" but I think it undersells the scope. You're not just shifting responsibility, you're also hitting a ceiling: test specs can verify behavior, not decisions. Worth thinking about what it'd even mean to verify the decision trail that produced the code, not just the code itself.
- boombapoom - 2723 seconds ago
Production-ready "FizzBuzz" code. lol. I can't even continue typing this response.
- Ancalagon - 7298 seconds ago
Even with mutation testing, doesn't this still require review of the test code?
- morpheos137 - 3050 seconds ago
I think we need to move toward provably correct code.
- otabdeveloper4 - 4122 seconds ago
This one is pretty easy!
Just write your business requirements in a clear, unambiguous and exhaustive manner using a formal specification language.
Bam, no coding required.
- ventana - 4585 seconds ago
I might be missing the point of the article, but from what I understand, the TL;DR is "cover your code with tests", be it unit tests, functional tests, or mutation tests.
Each of these approaches is just fine and widely used, and none of them can be called "automated verification", which, if my understanding of the term is correct, is more about mathematical proof that the program works as expected.
The article mostly talks about automatic test generation.
- andai - 4844 seconds ago
...in FizzBuzz
- aplomb1026 - 8063 seconds ago
[dead]
- rigorclaw - 3339 seconds ago
[flagged]