guhcampos 21 hours ago [-]
I'm a convert. I was 100% skeptical about LLM code generation, now over 80% of the professional code I write is generated.
That said, the limitations are kind of obvious and are starting to show in some of my projects, and this article seems to confirm my suspicions. Whether it's just confirmation bias or not, I can't say yet.
In my experience, for anything complex enough, I have to start adding more and more constraints, style guides, corner cases, error handling, optimization guidelines and all this good stuff to my Markdown specifications, rules and skills. At some point this starts to look like we're all just moving complexity from the more formal and deterministic world of programming languages to the informal and non-deterministic world of natural language. The writing speed gains are enormous, yeah, and business sees this as productivity gains, of course - and we do it because the pressure for increased productivity is there, as it's always been; yet the trade off seems to be clear and a lot of people are just ignoring it.
sdevonoes 15 hours ago [-]
> At some point this starts to look like we're all just moving complexity from the more formal and deterministic world of programming languages to the informal and non-deterministic world of natural language.
This is the problem nobody is talking about. I see codebases growing in MD files with instructions and guidelines and requests that are also LLM generated… and it’s all piling up. No one is reviewing it 100%, and even when we do, it’s all very subjective. What’s the difference between “Follow a RESTful approach”, “We use REST, not graphql”, “90% of our endpoints are resource oriented, but we have a couple of endpoints that look rpc-ish; please ignore the latter”…
It’s all very stupid.
tcoff91 15 hours ago [-]
This is why you need to be generating more linter rules instead of just having things be in markdown files.
I had never written an eslint rule until I started having agents pump them out for me and now I've encoded a bunch of important rules as lint rules that will fail CI if violated.
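To make the idea concrete, here is a minimal sketch of a convention encoded as a mechanical check rather than a sentence in a markdown file. It is a Python analogue, not an eslint rule, and the rule itself (ban imports of a `graphql` package in a REST-only codebase) and the names `BANNED_MODULES` and `find_banned_imports` are assumptions made up for illustration.

```python
import ast

# Hypothetical project rule: this codebase is REST-only.
BANNED_MODULES = {"graphql"}


def find_banned_imports(source: str, filename: str = "<string>") -> list[str]:
    """Return one message per import of a banned top-level module."""
    problems = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # module is None for "from . import x"
        else:
            continue
        for name in names:
            if name.split(".")[0] in BANNED_MODULES:
                problems.append(f"{filename}:{node.lineno}: banned import '{name}'")
    return problems
```

A CI step could run this over every source file and exit non-zero when the returned list is not empty, which is what turns the guideline into a gate.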
hansmayer 47 minutes ago [-]
A linter won't prevent your idiot LLM from going bonkers and suddenly switching to GQL instead of REST just for that one endpoint because it confabulated something, or from putting your Stripe secret into your React frontend - all cases of slop I've seen happen.
deadbabe 12 hours ago [-]
Who lints the linters
jckahn 8 hours ago [-]
Linter linters, obviously
chanux 9 hours ago [-]
Software loose on theory[1] trying to compensate with moar md.
> moving complexity from the more formal and deterministic world of programming languages to the informal and non-deterministic world of natural language
It's like using a compiler that generates semantically different code every time you run it. Basically like compiling a program that's full of UB but "seems to work" most of the time.
> business sees this as productivity gains
Back to LoC/s as a measure of "productivity."
somewhatgoated 16 hours ago [-]
> Back to LoC/s as a measure of "productivity."
IMO this doesn’t follow from what OP wrote.
I personally measure it with a more abstract “how long does it take me to ship something that is useful in production and solving a real problem” and the increase in speed there has been massive for me.
But of course I’m not a bigbrain 10x coder that is doing bleeding edge novel stuff like most people here, so gains might be more obvious for me than for others.
sdevonoes 14 hours ago [-]
> how long does it take me to ship something that is useful in production and solving a real problem
But that’s only half of the problem. What about “and how easy is it to maintain long-term”? If you say that maintenance can be done via LLM, I would argue that there are zero guarantees that LLMs are backwards compatible and that the markdown you wrote now will work just as fine in 1, 2, or 3 years.
coldtea 13 hours ago [-]
>I would argue that there is zero guarantees that LLMs are backwards compatible and that the markdown you wrote now will work just as fine in 1,2,3 years
That this would be the case is even more guaranteed than some programming language being backwards compatible and the code we wrote working just as fine in 1, 2, or 3 years.
Languages do get non-backwards compatible changes, dependencies break, stuff is deprecated, etc.
But the job of LLMs will remain to generate something from a prompt, and the markdown we wrote, as it's high level and not tied to language versions, APIs, and implementation details, will be just as good a prompt for that in 2050 as it is in 2026.
cess11 12 hours ago [-]
"Languages do get non-backwards compatible changes, dependencies break, stuff is deprecated, etc."
Sure, but they're deterministic and sometimes you can even do automatic rewrites through AST inspection and writing back to the files instead of scripting string substitutions on them directly.
"But the job of LLMs will remain to generate something from a prompt, and the markdown we wrote, as it's high level and not tied to language versions, APIs, and implementation details, will be just as good a prompt for that in 2050 as it is in 2026."
Your organisation is keeping version control on the LLMs you use? It's all local, old copies of these databases are kept in secure storage together with the querying and harnessing software?
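As an illustration of the "automatic rewrites through AST inspection" point, here is a minimal sketch using Python's standard `ast` module. The function names `old_api` and `new_api` in the usage are hypothetical. Note that `ast.unparse` (Python 3.9+) discards comments and original formatting, so this only shows the principle.

```python
import ast


class RenameCall(ast.NodeTransformer):
    """Rewrite calls to a bare function name `old` into calls to `new`."""

    def __init__(self, old: str, new: str) -> None:
        self.old = old
        self.new = new

    def visit_Call(self, node: ast.Call) -> ast.Call:
        self.generic_visit(node)  # handle nested calls first
        if isinstance(node.func, ast.Name) and node.func.id == self.old:
            node.func = ast.Name(id=self.new, ctx=ast.Load())
        return node


def rename_calls(source: str, old: str, new: str) -> str:
    """Return `source` with every call to `old` renamed to `new`."""
    tree = RenameCall(old, new).visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))
```

Calling `rename_calls(src, "old_api", "new_api")` renames the call sites but leaves a string literal such as "old_api is deprecated" alone, which is the case where a plain string substitution would go wrong.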
hansmayer 49 minutes ago [-]
> I have to start adding more and more constraints, style guides, corner cases, error handling, optimization guidelines and all this good stuff to my Markdown specifications, rules and skills
So kind of like maintaining a growing codebase? But this time around you cannot guarantee what the outputs will be?
tpmoney 8 hours ago [-]
One question I have is are these "constraints, style guides, corner cases, error handling, optimization guidelines" extra things that you wouldn't need otherwise, or are they formal documentation of the baked in assumptions and knowledge accumulated over the years? Every project I've ever worked on has had heaps of shared knowledge that's just part of stuff the team just "knows" and no one ever really writes down. Things like "sure you can use java's built in assert for tests, but we don't compile or run the application with the flags that enable them. Use junit's assertions/use the assertj library." or "prefer using auto generated accessors instead of manually writing them out". Even things like "if you change the structure of this ID string, you need to change all the code in modules A, B, and C because they all rely on the ID being in a certain format".
If you're really lucky, maybe a lot of this is documented in some wiki page somewhere, but everyone knows the documentation is never as complete as you'd like it to be. The longer a team works together without new people coming on board, the more likely it is that the documentation of these soft requirements and knowledge has drifted from reality. IME nothing shows how much you've failed to document like revisiting your onboarding process documents for the first time 2-3 years after you wrote them.
As I've experimented with the various AI tools, I feel like a lot of these extra documents I've written are documenting a lot of these things "everyone knows". But I'm also not at the "80% of the professional code I write is generated" stage yet. So I'm curious if you're finding that you're creating documentation that goes beyond just documenting what we used to just keep in our heads and are now getting into "writing a book about how to code" territory?
bitexploder 7 hours ago [-]
There is a great thing. Because the agents can do so much toil you can add things like formal verification, fuzzing, and other feedback mechanisms and quality gates to your projects cheaply. In a human written project you still needed those things, but it cost a lot. Agents require these quality gates and they can implement them for you. The problem with AI documentation is it will just write a lot of useless bullshit unless you guide it on what is important. You can also get agents to identify transitive dependencies via testing and other things.
I adopt the mindset of docs are for humans, tests are for agents. They document formal dependencies and leave a measurable artifact behind. If you identify some behavior or transitive dep in your system, agents document it first with a test codifying the expected behavior. Tests are the source of truth about expected system behavior and you can convince agents to write decent behavioral tests if you ask them to with the right structure. Docs are now cheap and a render, not a long term thing. There is some token efficiency to consider, but still, they are quick and cheap if you don't understand some module or its purpose.
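A minimal sketch of what "document it first with a test" could look like for a hidden dependency of the kind raised upthread (several modules relying on the format of an ID string). The ID layout, `make_id`, and `region_of` are all hypothetical, invented for this example.

```python
import re

# Hypothetical ID layout: "<region>-<8 hex digits>", e.g. "eu-1a2b3c4d".
ID_PATTERN = re.compile(r"^(?P<region>[a-z]{2})-(?P<serial>[0-9a-f]{8})$")


def make_id(region: str, serial: int) -> str:
    """Producer side: build an ID in the agreed layout."""
    return f"{region}-{serial:08x}"


def region_of(order_id: str) -> str:
    """Consumer side: stands in for the modules that parse the ID."""
    match = ID_PATTERN.match(order_id)
    if match is None:
        raise ValueError(f"unexpected ID layout: {order_id!r}")
    return match.group("region")


def test_id_layout_contract() -> None:
    """Fails loudly if the producer drifts from what consumers parse."""
    order_id = make_id("eu", 0x1A2B3C4D)
    assert order_id == "eu-1a2b3c4d"
    assert region_of(order_id) == "eu"
```

The test is the measurable artifact: if someone (human or agent) changes `make_id`, the failure names the coupling instead of leaving it as tribal knowledge.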
hansmayer 42 minutes ago [-]
> Because the agents can do so much toil you can add things like formal verification, fuzzing, and other feedback mechanisms and quality gates to your projects cheaply
Works great until they sweep a test under the rug that always passes because the condition is something like if(true).
rodphil 6 hours ago [-]
Yeah "plus one" to this. Static analysis, fuzzing, linting, integration tests -- there are all sorts of very useful artifacts which have been around for a long time, but which are very time consuming to implement and then maintain. LLMs shift the economics around producing and maintaining these tremendously, so we can now afford these robust validation mechanisms.
These serve as living documentation which cries out in pain when they get out of sync with the system in question, generating specific error messages -- as opposed to natural language docs which rapidly drift into an ambiguous "kinda useful" state. And the validation is performed mechanically (as opposed to neurally) so no hallucinations are possible.
The one thing I would add is that you do want these artifacts to be human-friendly from a reading perspective -- you want engineers to be able to scan over these and check that they are validating the right things.
bob1029 15 hours ago [-]
I'm not having much trouble with very large (>50mb raw source) and complex codebases. The fact that it's all strongly typed probably helps a lot, but I don't think that's the whole story.
I think the harness and code patching technique starts to matter a lot more once you get outside the trivial range of codebases that fit within the first ~20% of the context window and can otherwise be iterated completely in a single inference pass.
The apply_patch technique that OAI has polished their models on seems to be the best approach for monster scale codebases. Anything based on line ranges and simple find-replace will disintegrate at the edges. You need multiple spatial anchors to deal with nasty things like cshtml files. The prepare/commit behavior is ideal for iterating through ambiguous contexts across many large files and refining anchors.
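To illustrate why anchors beat line ranges, here is a toy sketch of an anchored replacement. It is not the actual apply_patch format or implementation; the function `apply_anchored_patch` and its parameters are assumptions for illustration only.

```python
def apply_anchored_patch(text: str, before: str, old: str, after: str, new: str) -> str:
    """Replace `old` with `new` only where it sits between `before` and `after`.

    Raises ValueError unless exactly one location matches, so an ambiguous
    or stale hunk fails instead of silently editing the wrong place.
    """
    target = before + old + after
    count = text.count(target)
    if count != 1:
        raise ValueError(f"expected exactly 1 match for anchored hunk, found {count}")
    return text.replace(target, before + new + after)
```

In a file where the same `<span>` line appears inside both a `<div>` and a `<p>`, a bare find-replace is ambiguous, while supplying the surrounding `<p>` and `</p>` lines as anchors pins the edit to one place. Failing on zero or multiple matches gives the caller a chance to refine the anchors and retry.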
kang 11 hours ago [-]
hehe, this is by design. next model needs to eat your natural language
dominotw 19 hours ago [-]
if 80% of code you are generating is from llm then you are merely remixing whats out there. Aka slop.
llms cannot generate anything novel.
giwook 19 hours ago [-]
And how much of pre-LLM code was just copy pasta from Stack Overflow?
Code doesn't need to be novel to be useful. There's a reason why design patterns are a thing in software.
archagon 18 hours ago [-]
That’s why we abstract the useful code away as libraries, frameworks, etc.
AI is not an abstraction.
coldtea 13 hours ago [-]
Of course it is. It abstracts away the code generation to a much more compressed natural language prompt. That's the very definition of a high abstraction...
hansmayer 39 minutes ago [-]
It abstracts shit mate. How many Rs in the strawberry and if you want to go to a car wash, should you walk, run or drive? They're just fucking text generators, which happen to spit out something half-usable about 60% of the time.
ryan_lane 18 hours ago [-]
You generally need to wire libraries into your service, and you may be using the library in a slightly different way than normal. AIs are perfectly capable of doing this.
Back to the original point, though: most software engineering work isn't novel. Most people are working on slightly different iterations of the same thing, but with the aim of different products. You can have completely different products that use nearly the same patterns as most other services.
To put it bluntly: we don't need AI to generate novel code for the vast majority of the software being built.
abalashov 10 hours ago [-]
It depends on the programming. However, you're right: another Angular reinsurance accounting application doesn't pose much of a challenge for LLMs.
apsec112 19 hours ago [-]
LLMs recently solved a major, famous open mathematical problem in combinatorial geometry:
Unless you mean remixing the alphabet / tokens, this is mathematically false. 2^256 gets you to unique very fast, and that’s only 256 bits: 32 bytes, comfortably under 80 tokens. Remember almost all numbers are irrational. Complexity is infinite.
lurking_swe 16 hours ago [-]
> you are merely remixing whats out there
So basically 90% of programming in an enterprise environment? lol. Sounds useful to me...
loeg 19 hours ago [-]
There is nothing new under the sun.
son_of_gloin 18 hours ago [-]
But there are other suns :)
yard2010 10 hours ago [-]
There are other suns with nothing new under them
goatlover 14 hours ago [-]
Everything changes, nothing remains without change.
coldtea 13 hours ago [-]
Nah, most things stay the same within quite narrow margins. The moon gets hit by a meteor now and then, but it's been essentially the same rock for some billion years.
maytc 18 hours ago [-]
Most code written is not novel. Actually most code written should not be novel. E.g., most lines of code aren't spent on writing git; they're spent on writing that landing page no one ever sees.
__mharrison__ 9 hours ago [-]
99% of coders don't need to generate anything novel.
dominotw 9 hours ago [-]
then what are they doing with the time savings from llm. generating more remixing ? is there really so much demand for remix slop. i dont think so.
tpmoney 8 hours ago [-]
I don't think I've ever worked on a project where there wasn't more work to be done than there was time to do it in.
dominotw 6 hours ago [-]
ai can remix things on the fly so lots of widgets that were being stamped out are not needed anymore.
we are in a middle state where ai tools that generate widgets on the fly aren't accessible in the form that most ppl need. So programmers are currently doing the manual step of packaging the remix into an easily consumable form.
coldtea 13 hours ago [-]
I don't think you understand how LLMs work.
They're not merely re-arranging pre-existing blocks of code.
And they have been shown to develop emergent properties that weren't in their training set time and again.
They generate novel things as much as the average programmer (who works after having had practice, exposure to codebases, training, reading API documentation, and such) generates novel things.
hansmayer 36 minutes ago [-]
> And they have been shown to develop emergent properties that weren't in their training set time and again.
Ah yes, the famous emergent properties - like suggesting that we should walk to the car wash?
sirsinsalot 10 hours ago [-]
Not to be pedantic but the emergent properties are in the training set, and thus the model and algorithm. There's no magic coming from the universe.
What makes the behavior emergent is that it can't be predicted at training time.
The emergent and unpredictable output is the result of massive vector complexity being encoded.
coldtea 9 hours ago [-]
>Not to be pedantic but the emergent properties are in the training set, and thus the model and algorithm. There's no magic coming from the universe.
You are either being pedantic or missing the point of emergent however.
Yes, it's not some novel unforeseen thing, like a magical Marvel Universe material or some unknown to humanity mode of thinking. Same way when people make something new they still recombine known words, or colors, or physical things in the universe.
They are, however, new capabilities that are not explicitly in the training set and can't be predicted by it. Like teaching something only calculus training materials and it figures out boolean algebra.
>The emergent and unpredictable output is the result of massive vector complexity being encoded
As opposed to what in humans? God given revelation?
sirsinsalot 8 hours ago [-]
My point was that if training data + encoding/training = model with emergent behaviour
The emergent behaviour is in the training data and/or encoding/training.
So while I agree it is emergent from the complexity, it isn't some unknown mechanism. Just complexity at scale.
coldtea 4 hours ago [-]
>So while I agree it is emergent from the complexity, it isn't some unknown mechanism. Just complexity at scale.
So like humans? Like the universe?
dominotw 9 hours ago [-]
since you understand how llms really work. show us what novel items llms have generated for you.
coldtea 4 hours ago [-]
How about a novel theorem escaping mathematicians for close to a century?
If we were doing novel things, we’d be scientists. I’m an engineer though. I don’t think I’ve been writing slop for 30 years.
hmmokidk 17 hours ago [-]
they can take novel things. so my novel
runhelm 20 hours ago [-]
[flagged]
TeMPOraL 15 hours ago [-]
> yet the trade off seems to be clear and a lot of people are just ignoring it.
There's plenty of focus on the negative side of the tradeoff. Less so on why we're making it anyway, or why it somehow works out even if "this starts to look like we're all just moving complexity from the more formal and deterministic world of programming languages to the informal and non-deterministic world of natural language".
And the answer to that can be condensed to a one-liner, which I quote after[0]:
I don’t think that’s true based on experience. Maybe “<“ instead of “<<“, yeah. But even in that case, it’s an awful trade off for any serious codebase that needs to be maintained over the years (and you don’t know what LLMs are gonna look like next year, so there are zero guarantees all your MD is gonna work as good as it’s “working” right now)
coldtea 13 hours ago [-]
As long as LLMs remain at the same skill level at coding, or better, there's a 100% guarantee an MD (a glorified prompt) is gonna work as good as it’s “working” right now.
andwur 9 hours ago [-]
This is quite a claim without any evidence to substantiate it. LLMs are nondeterministic models, whose behaviour is reliant on training data, model architecture and context (both in the general and domain specific sense).
There is absolutely no guarantee llm1(MD) == llm2(MD), by design. With the current batch you need to explicitly constrain a number of parameters, far more than simply the prompt, to get identical output from the _same_ model, let alone another model that has varied training data and/or architecture.
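A toy sketch of the point about parameters beyond the prompt. This is not how any particular LLM is served; the scores in `LOGITS` are made up, and `sample_token` is a hypothetical stand-in for one decoding step. It only shows that the same input yields the same output once temperature and seed are pinned, and can differ otherwise.

```python
import math
import random


def sample_token(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Pick one token from `logits`; temperature 0 means greedy argmax."""
    if temperature == 0:
        return max(logits, key=logits.get)
    weights = [math.exp(score / temperature) for score in logits.values()]
    return rng.choices(list(logits), weights=weights, k=1)[0]


# Toy next-token scores, invented for illustration.
LOGITS = {"REST": 2.0, "GraphQL": 1.5, "RPC": 0.5}

# Same "prompt", twenty different seeds.
greedy = {sample_token(LOGITS, 0, random.Random(seed)) for seed in range(20)}
sampled = {sample_token(LOGITS, 1.0, random.Random(seed)) for seed in range(20)}
```

Here `greedy` collapses to a single token, while `sampled` contains more than one. A different model would bring different scores entirely, so pinning the sampling parameters still says nothing about llm1 versus llm2.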
jdlshore 1 days ago [-]
“Our systematic study exposes a phenomenon of constraint decay in LLM-based coding agents. While current models excel at unconstrained generation, their performance drops when forced to navigate explicit architectural rules. For end-users, this dichotomy implies that agents are reliable for rapid prototyping but remain unreliable for production-grade backend development.”
One major weakness of this study is that they didn’t fully test frontier models for cost reasons, so the specific performance results should be taken with a grain of salt. But the overall conclusion that models degrade when both behavior and architecture must be correct is interesting, and something to keep an eye on.
qsort 1 days ago [-]
I think it's downstream of "you can't optimize for two different objectives".
If you only have functional requirements, then in effect you're doing some form of program synthesis, and RL can optimize that very hard.
If you have a mixture of functional and non-functional requirements, you are basically giving the model an incomplete specification, and it must in some way guess at the user's intent to fill in the blanks. This is also why adding to the prompt examples of the style of code you want (hats off to antirez for this particular tip ;)) is phenomenally powerful.
gib444 16 hours ago [-]
> ... This is also why adding to the prompt examples of the style of code you want ...
You could take it a step further and put the example code into source code files...and be like, super comprehensive with your examples ... ;)
rco8786 10 hours ago [-]
Well yes, ideally. But real world codebases aren't clean enough to be used as the example ideal. Styles change over time, there are always code migrations and refactors in flight, legacy code exists, etc. Using specific examples of what you expect the LLM (and humans) to do now is necessary.
apsurd 1 days ago [-]
Would you mind sharing antirez' suggestion?
qsort 1 days ago [-]
I am obviously paraphrasing, but the general idea is that trying to synthesize style from a codebase into e.g. a markdown guide generally doesn't work very well. What achieves style transfer is providing the model with a lot of examples of the style, conventions, patterns you want.
To put it in practice: if you point claude/codex to a repository and you ask it to implement feature X using style guide Y, the code will probably work, but you can usually get better results by saying "do it in the style of this file, it was done well there".
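A minimal sketch of the "show, don't describe" approach: assemble the prompt from real reference files instead of a prose style guide. The function `build_style_prompt`, the prompt wording, and the example path are assumptions for illustration, not a format any tool requires.

```python
def build_style_prompt(task: str, examples: dict[str, str]) -> str:
    """Assemble a prompt that shows reference code instead of describing it."""
    parts = [f"Task: {task}", "", "Match the style of these existing files:"]
    for path, source in examples.items():
        parts += ["", f"--- {path} ---", source.rstrip()]
    parts += ["", "Follow their naming, structure, and error handling."]
    return "\n".join(parts)
```

For instance, `build_style_prompt("Add a /refunds endpoint", {"handlers/orders.py": source_text})` puts the file that "was done well" directly in front of the model.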
brandensilva 1 days ago [-]
Right, more simply put, it's great at being a copycat, exploring similar data points that match your token needs.
It is not great at decision making or judgment calls that don't have a well defined spec or plan in place yet; like unofficial or unapproved tokens if you will. A lot of this stuff simply never has had specs as it has been internal to how companies work and their secret sauce.
The closest thing we have are governance and compliance policies due to legal/business needs requiring it so it's far more well documented than operational ones in how we work. It is more about the how versus the what here I guess is what I'm saying.
But yeah this is why it does great when there are tests, design systems, evals, and other artifacts to mirror. Far more reckless and unpredictable without these things, but still great for exploration and finding the data output you seek.
withinboredom 1 days ago [-]
Doesn't that make sense? It's text prediction. If you give it examples, it can predict. Synthesizing "put semi-colons on new lines" requires it to generate its own examples 'in its head' (so to speak) and remember that. It won't.
It's like when I see people feeding it a whole bunch of "best practices" and expect it to follow them. It won't. But you could ask it questions about the best practices all day long.
brandensilva 1 days ago [-]
Yes, exactly. Any engineer deep in this stuff right now understands that it's a grounded predictive engine sprinkled with RL training, and is discovering what that means in terms of its strengths and weaknesses for company use.
dnautics 23 hours ago [-]
Supposing an unspecified or poorly specified function f(x), and example "f(A)=>B", "given C tell me what f(C) is" lies at the core of creativity.
Idk, calling it "just text prediction" seems unfairly dismissive of this capability
withinboredom 21 hours ago [-]
Saying that it’s dismissive is like saying writing (insert language) is dismissive that you’re just writing assembly.
At the end of the day, it presents a vector field and predicts the next vector. That’s literally the heart of intelligence just like assembly is the heart of execution. When playing table tennis, your brain is literally predicting seconds into the future to get your body into the right position.
But we aren’t discussing intelligence here. We are discussing how best to utilize that intelligence.
dnautics 20 hours ago [-]
You're making my point for me, saying table tennis is "just a proprioceptive predictor" is dismissively reductive (and not a particularly useful framework for understanding table tennis), even if it is strictly speaking accurate. It's the sort of thing someone who has no idea how hard training for table tennis is would say.
withinboredom 14 hours ago [-]
Let me put it bluntly. I’m agreeing with you but saying that isn’t what I was talking about and trying to give examples. You’re also agreeing with me.
The “idea” of table tennis and the rules. Those are things we can talk about. It’s those “best practices” I gave in my example. The actual playing of table tennis would be the examples. How to apply those best practices and what good code looks like.
mikeyouse 1 days ago [-]
I ran into similar issues as we started to roll out LLM generated financials in our org.. I’m so used to the old SQL workflow of “grab this data from this table, that data from that table, combine it into a final result that looks like xxxx” where the tables were outputs from reports in our ERP but I was having terrible results.
Ended up pointing Claude at a few sample files from our existing reporting, gave it read-only oauth access to the ERP and said “build a new report showing the cash by project as calculated by xxxx - yyyy + zzzz in the style of the existing reports” and it basically one-shot from there.
Kind of crazy and I built a bunch of redundant check-sums because I honestly didn’t think it would be able to replace like 6 workdays of effort for the 2 FTEs who generate that kind of thing manually every month but so far so good..
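A minimal sketch of what one redundant check-sum could look like, under stated assumptions: the row fields `x`, `y`, and `z` stand in for the elided terms of the formula above, and both function names are hypothetical. The idea is to recompute the grand total by a second, independent route and refuse the report if the two disagree.

```python
from decimal import Decimal


def cash_by_project(rows: list[dict]) -> dict[str, Decimal]:
    """Per-project figure, computed row by row as x - y + z."""
    totals: dict[str, Decimal] = {}
    for row in rows:
        value = row["x"] - row["y"] + row["z"]
        totals[row["project"]] = totals.get(row["project"], Decimal("0")) + value
    return totals


def check_report(rows: list[dict], report: dict[str, Decimal]) -> None:
    """Cross-check: the report's grand total must equal the column totals."""
    expected = (
        sum((r["x"] for r in rows), Decimal("0"))
        - sum((r["y"] for r in rows), Decimal("0"))
        + sum((r["z"] for r in rows), Decimal("0"))
    )
    actual = sum(report.values(), Decimal("0"))
    if actual != expected:
        raise ValueError(f"report total {actual} != source total {expected}")
```

`Decimal` is used instead of floats so that equality on currency amounts is exact. A grand-total check will not catch an amount booked to the wrong project, so it complements rather than replaces per-project checks.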
BlueTierOps 1 days ago [-]
[flagged]
coredog64 22 hours ago [-]
I was recently using Copilot to implement a small feature within a very large codebase. About 75-80% of the time, the code that was added matched the current style (warts and all). Copilot would specifically go off and research "How X is already done in the codebase" all the time.
8note 22 hours ago [-]
You basically get this for free, if the coding agent has read the relevant classes that the legacy code it's touching has to match.
Just don't break out a plan without also having it read the code again.
KaiShips 1 days ago [-]
[flagged]
zdragnar 1 days ago [-]
I've noticed something similar with AI-assisted books as well. Early on it does alright, but after some chapters the beginning of each chapter repeats the end of the previous, and obvious LLM tells become more frequent.
The more it has to go on, the more it relies on repetition of what came before. It's also possible that authors start paying much less attention and put less effort into editing later chapters.
Despite the sheer volume on Amazon, LLMs are not at the point of writing well.
piker 1 days ago [-]
Holy crap are you reading books that advertised somehow they were written with LLM assistance? Hard no here in 2026.
zdragnar 1 days ago [-]
Oh no, they were not advertised as such. It's rather painfully obvious in the worst cases.
Animats 1 days ago [-]
That may be the same problem seen when prompts try to force "alignment" or "guardrails". There's a performance drop. Seemingly, a big chunk of the potential solution space has been made unreachable.
For example, if you apply "guardrails" to an image generator of about a year ago, all the people start looking alike. Story generators start using only a few standard names.
That was last year. Is it happening with the frontier models?
nijave 1 days ago [-]
Hmm, I have some anecdotal evidence this is true. While interactively working out a plan with Opus, on multiple occasions it has come up with an incompatible solution; I'll add additional context/requirements, and it has a tendency to "anchor" on its original architecture and struggle to adapt. Sometimes it tries to sneak in changes for the original plan anyway.
whstl 1 days ago [-]
Opus does this waaaay too much for my taste. It works fine for vibe-coders but for technical work it is infuriating.
UncleEntity 1 days ago [-]
I think the problem is they take the shortest path to the goal ...which may or may not coincide with what you have planned. Oh, and they generally think instructions are merely suggestions and what you really want is this totally different thing and not the one in the plan you handed them; plus, as a stroke of good luck, this other system is a lot easier to implement as well.
I mean, I spend more tokens having them clean up all the places they didn't follow the plan (if I catch it) or implementing what came out of a 'complete and tested' previous plan where they just stop as soon as all the pathetic new tests pass and you discover half of it isn't even there when trying to implement the next thing on top of it.
Though... I have been conducting an experiment, of sorts, where we've been cooking on these fairly complicated projects and I don't ever touch a single line of code, just yell at them a lot, and with suitable amounts of marijuana (they are very frustrating most of the time) it's been going pretty well. It also helps that they need to explain what they're doing to somebody fairly-baked -- maybe not such an HR friendly plan?
jeremyjh 1 days ago [-]
Even the strongest frontier model they used - GPT 5.2 - I would consider barely usable for agentic programming.
I’m not really interested in analysis of the weaknesses of such models because in my experience many weaknesses disappear entirely as models get stronger and reasoning effort is turned up. Especially if you tell them what you want them to do.
Also, it’s not surprising to learn that when more acceptance criteria are added the failure rate increases.
nozzlegear 1 days ago [-]
Oldheads remember when GPT 5.2 was at the forefront of agentic programming. December 2025 feels like eons ago, but alack it was an entire half year!
ipaddr 20 hours ago [-]
If I'm not using gpt 5.5 high reasoning I'm wasting time.
nozzlegear 19 hours ago [-]
Well, maybe so, but how did you feel about 5.2 when it was OpenAI's frontier model? That's what I'm getting at – it was the equivalent of your gpt 5.5 high reasoning just six months ago.
ipaddr 6 hours ago [-]
It was a joke. I think you need to mix up models.
nozzlegear 30 minutes ago [-]
Gotcha, I missed the /s indicating you were joking. Hard to parse tone and intent through text on the internet.
viking123 13 hours ago [-]
They all feel the same to me now, opus, 5.5, whatever
sigbottle 1 days ago [-]
Wait isn't gpt 5.2 good? Or is it not thinking / not codex? 5.2 was what sparked the late 2025 openai agentic programming revolution.
mkozlows 18 hours ago [-]
5.2 still had a Codex variant, which this doesn't describe using. It also notably is not using the Codex harness -- it does everything with open source harnesses (which obviously are worse). And while it uses two harnesses with its cheap models, it only uses the worse-performing one of those with GPT 5.2 for cost reasons. (They also don't specify effort/thinking level used for GPT 5.2, but given that it performs worse in their baseline testing than obviously non-SOTA models, I'm guessing it wasn't set to anything high.)
xienze 1 days ago [-]
> their performance drops when forced to navigate explicit architectural rules
Even the best models have trouble adhering to stuff as mundane as rules for how to style generated code (indent this much, name things with these patterns, etc.). Even the most die-hard AI-first coder will admit to that kind of stuff being not unheard-of. Yet they still delude themselves into thinking that these models will follow a sufficiently detailed spec to the letter, every time.
maxbond 1 days ago [-]
Reminds me of the recent paper about delegating document editing tasks to LLMs across different disciplines [1]. That paper found that programming was the only discipline most LLMs can perform long horizon tasks on without accumulating errors & corrupting the document.
I've only read the abstract of this one so far but it seems like this paper has zoomed in on programming with greater fidelity and shown a similar phenomenon. But not about long horizon tasks, more like "long style horizons" of larger sets of structural constraints.
If it’s not easily verifiable, LLMs aren’t good at it.
jeremyjh 1 days ago [-]
I think that's mostly because they get so much more of that reinforcement learning, since it is so economical. I don't know if there is any evidence of a fundamental reason they can't be just as good at other tasks, but it might be economically infeasible for a while yet.
mjburgess 1 days ago [-]
No one is curating vast amounts of data for them in other domains. Programmers send programs with fixes
jeremyjh 5 hours ago [-]
It's more about how costly it is to verify work in reinforcement learning. It is cheap in mathematics and coding because it can be automated. It is expensive in other domains because, while you can capture certain datasets to do pre-training on, you ultimately need humans in the loop to judge the quality of work.
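To make "it can be automated" concrete, here is a minimal sketch of what a verifiable reward could look like for code. The name `verifiable_reward` and the pass-rate scoring are assumptions made for illustration, not something taken from the paper or from any lab's training setup.

```python
from typing import Callable


def verifiable_reward(
    candidate: Callable[[int], int],
    cases: list[tuple[int, int]],
) -> float:
    """Score a candidate function by the fraction of test cases it passes.

    Hypothetical helper: no human judgment is involved, only execution.
    """
    passed = 0
    for arg, expected in cases:
        try:
            if candidate(arg) == expected:
                passed += 1
        except Exception:
            # A crash simply counts as a failed case.
            pass
    return passed / len(cases)


# Test cases for "square the input".
cases = [(2, 4), (3, 9)]
good = verifiable_reward(lambda x: x * x, cases)
partial = verifiable_reward(lambda x: x + x, cases)
```

Nothing comparable exists for "is this essay tasteful", which is where the human-in-the-loop cost comes in.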
knollimar 1 days ago [-]
There's no diff of my excel lambdas being fixed? :(
emp17344 1 days ago [-]
RLVR doesn’t work for unverifiable tasks, so they won’t be able to effectively use tools to boost reliability for those tasks.
jeremyjh 5 hours ago [-]
Right, so you have to use RLHF. That is the economics problem I was referring to.
dominotw 23 hours ago [-]
But what does it mean to be good at something that can't be verified? How do you know that they are not good at it? You are obviously using some measure.
Sounds like an oxymoron of a claim.
maxbond 22 hours ago [-]
It means having taste. People say Picasso was a great painter, but that cannot be verified (at least, not in the sense of a verified reward).
dominotw 20 hours ago [-]
"People say Picasso was a great painter" is definitely not hard to verify, lol.
maxbond 20 hours ago [-]
I don't know if you're being facetious or not but that was not what I meant. Picasso being a great painter is an example of "having taste"; "create an artistic image generation model with Picasso-level performance" is a valid problem statement we could attack with RLHF, but not with RLVR, because "taste" is not amenable to modeling with a reward function.
"Write this code in a way that is readable and maintainable" is another example.
The first paragraph ends with "[...] unleashing a flood of ill-informed reactions and muddled discourse. So, you know, it was just another day online."
It's almost as though it's not about the Monet.
marcosdumay 19 hours ago [-]
You just threw the "easily" away from the comment you are replying to.
dominotw 19 hours ago [-]
Doesn't make a difference to my comment.
Geezus_42 14 hours ago [-]
There is a huge difference between "not verifiable" and "not easily verifiable".
dominotw 9 hours ago [-]
No, because if OP is actually able to verify it (with difficulty), then AI can do it too.
maxbond 4 hours ago [-]
No one in this thread appears to disagree. The issue is that RLHF is prohibitively expensive and the number of disciplines you could target is massive, so for reasons of economics rather than fundamental theory, AIs do not perform well on tasks that aren't amenable to RLVR and even then off the shelf LLMs are really only well aligned for programming.
In the paper I linked they created a benchmark spanning 80 disciplines with tasks that could be checked automatically. So these are necessarily tasks that are tractable for RLVR, trivially you could use performance against the benchmark as a reward function. The performance was still mediocre in everything but programming. And as we're seeing in this article, there is still room for growth in programming.
In general you seem to be reading very literally in some places (taking the statement "AIs aren't good at X" as applying to all AI and perpetually) and very loosely in others (disregarding "easily" as unimportant) and misinterpreting statements you appear to agree with as being in disagreement. I don't think there's a real disagreement here, I think there's a misunderstanding.
pron 9 hours ago [-]
The situation is worse. Not only do agents have more difficulty under "structural constraints", but structural constraints may need to change, and agents are even worse at that.
When designing a system or a component we have ideas that form invariants. Sometimes the invariant is big, like a certain grand architecture, and sometimes it’s small, like the selection of a data structure. Except, eventually, you’ll want to add a feature that clashes with that invariant. At that point there are usually three choices:
- Don’t add the feature. The invariant is a useful simplifying principle and it’s more important than the feature; it will pay dividends in other ways.
- Add the feature inelegantly or inefficiently on top of the invariant. Hey, not every feature has to be elegant or efficient.
- Go back and change the invariant. You’ve just learnt something new that you hadn’t considered and puts things in a new light, and it turns out there’s a better approach.
Often, only one of these is right. Often, at least one of these is very, very wrong, and with bad consequences. Even when they are able to follow constraints, agents are terrible at identifying when the constraints need to change.
abalashov 9 hours ago [-]
Despite my very limited enthusiasm for agentic coding, I have some experience with it, and my experience matches what you say perfectly. This is one of the seams that runs between pattern recognition and reasoning, and--despite the marketing claims around chain of thought--LLMs do not reason, not in the slightest.
All attempts to make them appear to reason are basically recursive confinement efforts by the harness, to try to get the lightning into the bottle.
vishvananda 1 days ago [-]
I've been experimenting quite a bit with long-horizon agentic coding[1] and I have also noticed that agents seem to perform worse when forced into certain architectural patterns. I have found that it is a bit better when including the constraints along the way instead of adding them after the fact. There seems to be a side-effect I have been calling "calcification", where a pattern starts appearing in the codebase and the agent follows the pattern to the point where it dominates the context and becomes self-reinforcing. This could potentially be a strength or a weakness for existing codebases, depending on the codebase quality. I will have more insights on this soon, as more from-scratch runs that include architectural guidance from the beginning conclude.
> agents seem to perform worse when forced into certain architectural patterns.
FWIW I've noticed this too. I've found that the agents/models have their own style, which is mostly summed up as overly verbose.
Additionally, the models are OK at modularization when given space to "plan" their implementation, but rarely decide that abstracting something would be helpful after the fact (i.e. after many iterations on a greenfield codebase or when being dropped into a legacy codebase).
This often leads to "god files" which the models correctly critique when the user/architect points them out (humorously, since they're the ones that wrote the code in the first place).
dwa3592 1 days ago [-]
This sounds like another version of "As a chat becomes longer, the guardrails seem to become fuzzy". You can't use all of the context window because, at the end, the output would not respect the constraints (or guardrails), but to reliably produce production-grade code you want the model to have expansive awareness, which fills up the context window pretty quickly. It's like saying "Keep everything in mind from these 6 directories - and make this <insert ticket> change" - but keeping everything in mind already fills its context window, which makes it lose its ability to follow the constraints (or guardrails).
whatever1 1 days ago [-]
This is not a new problem though. This is why we started writing modular code, strict interfaces etc
lanstin 1 days ago [-]
And doing incremental dev, so once a feature is done you can mostly ignore it.
Silhouette 1 days ago [-]
If there is one good thing that the generative AI tools have shown beyond any doubt it's that the classic "good programming" practices are still useful and effective. Self-documenting code. Modular design. Clearly defined architecture. Incremental development. Coding standards. Automated tests. Automated everything.
If there's a second thing the generative AI tools have shown beyond any doubt it's that many of the more modern (relatively speaking) "best practices" that have always been over-hyped and questionably-evidenced really do tend to produce worse results. LLMs take these methods to their logical conclusions and show us the end result much sooner. You can't just iterate your way to a solution when you don't even know what problem you're trying to solve. If you don't have a clear spec then you don't know what a correct product looks like. You need to invest time in reviewing code properly. If you don't keep the big picture in mind then the big picture becomes a mess.
Maybe one day the LLMs will leave me out of a job but at least I'll feel validated first!
skydhash 24 hours ago [-]
> If there is one good thing that the generative AI tools have shown beyond any doubt it's that the classic "good programming" practices are still useful and effective
If you apply those practices, then quickly you find yourself using the agent as merely a writing boost. And there's an inflexion point when coding is no longer a bottleneck. Instead, you spend more time thinking about design. You can see it in open source projects where most PRs are just a few line diffs. The bottleneck is knowledge and problem solving talent.
Silhouette 23 hours ago [-]
> If you apply those practices, then quickly you find yourself using the agent as merely a writing boost.
I don't know what that means but I have seen no evidence so far that if you don't apply those practices then your code will be anything other than unmanageable spaghetti if you leave AI to maintain it for long.
Coding has never been the bottleneck for good developers. Part of the reason for that is that good developers know how to isolate different aspects of a system and so keep each individual aspect relatively simple and self-contained. Another part is that good developers were already standardising and automating a lot of the grunt work. These traits are also advantageous for keeping generative AI on the right track and keeping its proposed changes manageable.
lanstin 23 hours ago [-]
Yeah, and that design and insight is the tiring part and, while fun, a bit less satisfying in the way that writing a nice bit of boilerplate or populating the struct members for your data type can be. One nice thing is that you can work on design and insight while taking a good walk around the block.
skydhash 23 hours ago [-]
I spend that time mostly on the sofa, or in front of a whiteboard. Or sometimes a live brainstorming. Typing code is actually relaxing. What looks like relaxing is actually hard thinking.
usrusr 1 days ago [-]
So give harder guardrails? SonarQube and the like. But I guess then the failure mode would be appeasing the linter while slowly forgetting the requirements... (or not so slowly, because the try/fail loop won't be nice to context at all...)
yomismoaqui 1 days ago [-]
Also they used languages with dynamic typing like Python & JS. In my experience a statically typed codebase is easier to maintain for humans so maybe it is also for agents.
When using Codex/Claude Code with Go code, I cannot count the times the agent makes some change, runs a build to check for errors, finds some and fixes them.
__mharrison__ 9 hours ago [-]
Just have your harness rules add types and run ty after every change. Models handle Python typing quite well these days.
acbart 1 days ago [-]
It's crazy to me that people think of Python as dynamically typed by default. Strong static typing has been an option in Python for years now, and it should just be the default.
mrob 1 days ago [-]
>Strong static typing has been an option in Python for years now, and it should just be the default.
"The Python runtime does not enforce function and variable type annotations. They can be used by third party tools such as type checkers, IDEs, linters, etc."
Which third-party enforcement mechanism do you propose become the default?
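A minimal sketch of what that quoted sentence means in practice (the function `add` is made up for illustration): the interpreter stores annotations as metadata but never checks them, so a call that violates them runs anyway.

```python
def add(a: int, b: int) -> int:
    """Annotated as int -> int, but nothing checks that at runtime."""
    return a + b


# The interpreter ignores the annotations: two strings are concatenated.
result = add("1", "2")

# The annotations are only stored as metadata for external tools to read.
hints = add.__annotations__
```

A static checker run over this file would be expected to flag the call; plain `python` will not.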
shepherdjerred 1 days ago [-]
There are plenty of options for static type checking in Python. Choose your favorite or just use Ty
epgui 1 days ago [-]
Python type hints are useful for static analysis (and yes, should be the default) but they're a joke compared to the utility of types in a language like Haskell.
shepherdjerred 1 days ago [-]
If you're comparing type systems against Haskell you're excluding all mainstream languages except maybe Scala and Rust
epgui 6 hours ago [-]
Yes.
antonvs 1 days ago [-]
Typing with tools like Pyright doesn't come close to providing what a good statically typechecked language provides.
There are many reasons for this. A big one is that many libraries are only partially typed at best, and dynamic types tend to propagate, weakening the guarantees you get from type checking.
Dynamic idioms in general, including something as common as string-indexed dictionaries, negate type checking. Runtime metaprogramming is the same. All of these things have equivalents in a good statically checked language, but Python doesn't follow those models.
Fundamentally, in Python static typing is an optional analysis layer over a dynamic language, and the consequences of that can't be fully mitigated. The result is a big difference in what types can guarantee.
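A small sketch of the string-indexed dictionary point, assuming a made-up `User` record: once a value is typed as `Any`, a checker has nothing to verify, while a `TypedDict` gives it keys and value types to work with. This is only an illustration of the general mechanism, not a claim about how any particular library is typed.

```python
from typing import Any, TypedDict


class User(TypedDict):
    name: str
    age: int


def age_next_year_loose(user: dict[str, Any]) -> int:
    # user["age"] has type Any, so a checker accepts this even if the
    # value turns out to be a string at runtime.
    return user["age"] + 1


def age_next_year_strict(user: User) -> int:
    # user["age"] is known to be int here; a misspelled literal key such
    # as user["agee"] is the kind of thing a static checker can report.
    return user["age"] + 1
```

Calling `age_next_year_loose({"name": "Ada", "age": "41"})` type-checks under the loose signature but fails at runtime with a `TypeError`.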
shepherdjerred 1 days ago [-]
TypeScript had this _exact_ same problem when it started out. As more libraries add annotations, the ecosystem will become stronger, and it will eventually be about as good as a "real" statically typed language.
> Dynamic idioms in general, including something as common as string-indexed dictionaries, negate type checking.
Do you have any proof of this? It hasn't been a problem in TypeScript, and I doubt it's an issue in Python
fredcallagan 9 hours ago [-]
Very interesting paper and I must say that I totally agree with it. But also that is something that is not new.
I would say that the initial expectation is a bit off. I never expected that picking up any agentic coding solution, dropping it into a project and firing a list of tasks at it would just magically work and follow a project's predefined constraints.
I do not believe that any agentic coding stack comes out of the box capable of this. Agents still need proper mechanics to understand the context, constraints and objectives reliably, and that's still a work in progress, as we can see from the constant updates to tools, skills and processes from leading AI labs. They are now trying to fill that additional layer, which by the way could be much more profitable than bare model and token consumption.
I would also argue that current OS models, like the ones tested, if properly driven can already produce production code following the desired constraints.
What has been your experience? What has your production code looked like in recent months?
p0w3n3d 1 days ago [-]
> tasks spanning eight web frameworks
Does anyone else have this experience that LLMs create better pure HTML+CSS+JS than they do when working with existing frameworks?
bob1029 1 days ago [-]
I think web frameworks have been "in trouble" as of gpt-5.4. I can't imagine using something like React anymore.
The most incredible combo I've seen lately is progressive enhancement of Razor Pages with javascript. With this arrangement the newest models tend to make a really good call on if something should happen server-side (cshtml) or on the client (js).
p0w3n3d 15 hours ago [-]
I've recently vibecoded pure html+css+js frontend for WWTBM-alike game: https://github.com/pawel-jaworski-loftyworks/mili-game - it consists of one file and is blazingly fast. Previous attempts to vibecode something with a vite-framework something were more harsh and clumsy.
siliconc0w 1 days ago [-]
I recommend spending some time getting a few parts of the codebase idiomatic and then @-ing those files as exemplars. This works a lot better than trying to steer it with markdown. This works reasonably well for like FastAPI but JavaScript seems to be the worst, even with guidance and exemplars it'll prefer in-lining a bunch of garbage rather than use the APIs as directed.
AmazingTurtle 1 days ago [-]
So my finding is:
planning is worth it.
For even slightly complex changes, I always run codex (5.5-high) in planning mode first.
I have linked various docs/{ARCHITECTURE,BACKEND-GUIDELINES,NESTJS-DI,..}.md etc. from AGENTS.md so they can quickly discover relevant docs at planning time, only if they are needed. No need to know react specific stuff when it's dealing with a backend problem for example. I typically blindly approve plans made by the agent with a fresh context, because that's as if I had prompted it. Works the best for me.
Using /goal, however, it's really just constantly compacting and doing its thing, so of course it gets sloppy. If only there was a state machine that would transform tickets into a Planning Mode Prompt, then use, idk, guardian approvals (somehow a "Product Management Perspective Lens" approving or making changes to the plan) and then let a less capable or less reasoning agent execute the plan. I think that would work the best.
Odd they used GPT-5.2 and not GPT-5.2-codex. i.e. the one optimized for coding agent tasks.
maleldil 1 days ago [-]
Considering this is from academia, there's a chance there were limitations on the available models. My research group accesses OpenAI models via Azure, and until recently (last week) the latest model was GPT 5. We just got 5.4.
beering 1 days ago [-]
That’s wild. Are you at a university that bans using the OpenAI APIs directly?
leecommamichael 1 days ago [-]
These things don’t think. We’re going to have to reiterate this for a long time, I fear.
sheeshkebab 1 days ago [-]
…but they reason well enough given enough context (using their matmuls).
noosphr 1 days ago [-]
To this day frontier models think that "A and not B" means "A and B" when the sentence gets pushed far enough back in their context window. The context length that a model can reason over without obvious errors is much smaller than the advertised context: between 1/4 and 1/20 of what is advertised on the tin.
antonvs 1 days ago [-]
Critiques like this tend to focus very hard on what models can't do. It's true, they have limitations.
But they're also superhuman in so many other ways. It's valid to point out limitations, but that doesn't support the conclusion that models are not incredibly powerful and capable of the functional equivalent of reasoning at human or superhuman levels in many scenarios.
noosphr 24 hours ago [-]
They may be better than humans at reasoning but they are substantially worse than the first generation logic programs from the 1950s.
cheevly 18 hours ago [-]
These types of comments help demonstrate first-hand how human reasoning stacks up against what an LLM would say in this situation.
leecommamichael 7 hours ago [-]
Agreed. Both are true. I sometimes think of the calculator as being superhuman as well.
antonvs 4 hours ago [-]
Yes, although the calculator couldn't "reason" the way ML models can.
All the political and emotional reactions to LLMs seem to obscure how absolutely amazing this technology is. I've pointed them at codebases I wrote entirely myself and had them find bugs, point things out I had missed, plan and implement refactorings to improve code quality, etc. I may be "smarter" than the models in some ways but there's no question they're smarter than me in others. They're unlike any tool we've ever had access to.
Yes, the politics and economics around them leaves a lot to be desired (read: is absolutely terrible), and there are a lot of valid justifications for the "AI backlash", but there's a very important baby in that bathwater.
Npovview 1 days ago [-]
Do you also happen to remember what you ate last Thursday?
ethin 20 hours ago [-]
Do you have a point? Because last time I checked, AIs were supposed to be better than us fragile faulty humans, and weren't designed to emulate us and all our faults.
Npovview 11 hours ago [-]
If you have been following the news, the harness is also a scaling direction now. Prompt your AI better so it doesn't forget relevant stuff, or have it write that stuff to a file it can refer to later. This way the context can be refreshed; this is the cached-facts or rolling-window method of refreshing your memory, just like you would ask a colleague to explain a concept again. These are solved problems.
ethin 6 hours ago [-]
Are they though? Because I really shouldn't have to use Claude Code (and I don't) just to get even decent results. As I said, I thought one of the biggest advantages AI was supposed to have was that it wouldn't need such constant reminding of things because it wasn't trying to emulate us faulty, forgetful, fragile humans who do have memory loss?
Npovview 4 hours ago [-]
You can convert your best practices into a skill or best practices md file and CC will keep that in purview.
UncleEntity 1 days ago [-]
"If you have a question look in the specification for the answer and don't just guess" seems a fairly important thing to remember for more than a couple of minutes...
Npovview 1 days ago [-]
I had a coding session where I was doing stuff across two repositories, and CC forgot exactly which repository a particular file was in, so it was grepping the parent directory. I just asked it to write the key-value pairs it thinks are important to a file, and it never grepped the parent directory again.
leecommamichael 1 days ago [-]
Is that the same gap as what you’re responding to? To me, it seems his critique is about advertised capability and logical statements, and your rhetorical(?) question is about memory.
emp17344 1 days ago [-]
There is now a trillion-dollar industry bent to the task of convincing people these things can think. It’s gonna cause some damage.
suprfnk 1 days ago [-]
I don't think they think. I still use them a lot despite that, because they are very powerful parameterised code generators.
akomtu 1 days ago [-]
There is a movie, Gold (2016), about a fake gold mine. One of its founders is a true believer: he found a few chunks of gold and started digging for more. The other founder is a nihilist: he realised that there is no gold there, but who cares if he makes the investors believe? So he does, and almost sells the company for $300M.
In our story, investors are mining intelligence from GPUs, and they truly believe they are one inch from discovering the biggest goldmine in history. But GPUs, unlike a goldmine, cannot be inspected for traces of gold by independent contractors. To keep the hype up, the nihilists in our story dig up cheap gold-looking metals from time to time and tell investors that with a bit of alchemy - agentic workflows, etc. - those metals can be magically turned into gold.
Investors will keep digging until the end of the age, or until they run out of money.
alexwwang 21 hours ago [-]
I am trying to avoid this by building a plugin based on my memory management project Aristotle. I add a state machine to monitor the LLM's activities while it works through my tdd-pipeline skills, which begin with requirements clarification and end with delivery.
These two projects are on GitHub, you may search alexwwang/aristotle and alexwwang/tdd-pipeline to dive into the details or just ask your LLM to scan them to tell you the points you are interested in.
luodaint 16 hours ago [-]
Not those carefully designed constraints that I set up from the beginning, but short-term ones that I came up with after an agent failed in some way: "Validate JWT at the route level, not the component." "Call workspace provisioning on each user creation." Both because of things the agent had done incorrectly.
Aspiration vs. consequence, in other words. An aspiration constraint describes a desired outcome for the system; a consequence constraint maps to a problem already encountered. And the agent ignores the former when faced with the path of least resistance while obeying the latter, because it is brief, unambiguous, and precise about preventing that particular failure mode. That, rather than the harness, is what determines whether a constraint survives session rotation.
lemax 13 hours ago [-]
Would love to see this benchmark tested on frameworks/ORMs perceived as more LLM-friendly (e.g. is NestJS or Drizzle / Kysely more performant than their choice of Sequelize?) and on more frontier models than just GPT 5.2.
Anyone read whether these tests include any validation loops? What happens if the models get back test failures, for instance? Understanding how many turns to hit full passing behavior suite would also be interesting. Great methodology in the study though.
pianopatrick 1 days ago [-]
I think someone is going to figure out a framework for using LLMs for coding.
A framework would use static code checking tools to force an architecture on to LLMs instead of trying to do so in markdown.
I don't know exactly what it will look like but for example I could imagine a Java Framework where the LLM could only create subclasses of certain classes.
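As a rough sketch of what "static checking instead of markdown" could look like, here is a tiny import rule written against Python's standard `ast` module. The layer names (`routes`, `db`) and the `FORBIDDEN` table are hypothetical; a real setup would read files from disk and fail the build (or feed the message back to the agent) on any violation.

```python
import ast

# Hypothetical layering rule: code in the "routes" layer must not import "db".
FORBIDDEN = {"routes": {"db"}}


def find_violations(layer: str, source: str) -> list[str]:
    """Return the forbidden top-level modules imported by `source`."""
    banned = FORBIDDEN.get(layer, set())
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for "from . import x"
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in banned:
                found.append(root)
    return found


route_source = "import db.models\nfrom services import users\n"
violations = find_violations("routes", route_source)
```

The appeal is that the rule is deterministic: the same source either passes or fails, regardless of how much context the agent has left.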
cheevly 18 hours ago [-]
A lot of us have been doing this for over a year now.
alasano 23 hours ago [-]
I've been building https://engine.build to introduce a proper structured external agent orchestrator that's used to build with clear constraints and make sure the end result is what you wrote in your spec or requirements. Without having to babysit and micromanage the models.
Implementation phases very often go through 5-10 review and fix rounds to actually get the implementation to match the spec. It takes longer but that's what's necessary to get actually good results on long horizon tasks with detailed requirements. I'll be open sourcing it fully soon.
Geezus_42 13 hours ago [-]
Why link to an empty GitHub on your site?
alasano 12 hours ago [-]
It's the GitHub org under which the repo is, it will be updated to the actual repo after it's made public.
If you don't want to sign up to be notified by email you can watch the org.
dalemhurley 24 hours ago [-]
This is why we as an industry have spent so much effort optimising the code generation process with things like skills, rules, tests, reviews, lints, agentic loops with feedback and sub-agents, and the code-runners. It is not just LLMs building code, it is an eco-system collaborating together.
I would agree too that as the codebase grows, the LLM struggles more and more with generating code. It is probably misaligned incentives: it wants to complete the isolated task without consuming too much context. At the POC stage it can consume most of the app, but by about 30K lines of code it is quite a complex codebase to navigate.
KronisLV 1 days ago [-]
> For end-users, this dichotomy implies that agents are reliable for rapid prototyping but remain unreliable for production-grade backend development.
Time to start writing linting tools that check the architecture and spoon feed the LLM what exactly it's doing wrong.
(I also wrote a simple linter for architecture/code checks that aren't well covered by linters that focus on individual files. It uses Go + goja to run rules written in ECMAScript, parallelizes the read-only ones and also allows ones that change files as necessary. I use it in addition to something like Ruff / Oxlint / Oxfmt / whatever is present in each stack, though it's still in development and not as good of a focused example as ArchUnit is.)
If we write software specification docs, bother describing how it evolves with ADRs, enforce code style automatically and require certain test coverage automatically (or at least should), why couldn't we go a step further, formalize those specs and ensure that any new code is also up to snuff? I don't think that's any more of a job for an LLM, than telling it how it should format code is. Also, I'm in the camp that believes that at least many of your ORM mappings and similar stuff should be the output of codegen, since you've already gone through the trouble of describing the schema/migrations to get there.
I don't think this would be only good for LLMs, though - I've seen projects that have like 3 different audit systems built in, not because of some fancy business requirement, but rather cause the devs either didn't know about the previous one(s) or just didn't feel like following what should have been the pre-established conventions, even when there were docs in place (nobody read those).
bob1029 1 days ago [-]
> Our findings reveal a phenomenon of constraint decay: as structural requirements accumulate, agent performance exhibits a substantial decline.
I have exactly the inverse findings on my end. The bigger and more legacy the codebase, the more accurate the patches become.
The harness itself seems to be the most important part. I use a recursive loop that primes the root context based on the user prompt each time. My agent will often make over 100 tool calls to sql and git before it finally decides to apply a patch. If I was greenfield, there would be nothing to query or constrain against.
richardlblair 1 days ago [-]
I find the same. We have abstractions with multiple concrete implementations, examples of patterns and examples of anti patterns.
I usually find I can achieve 90% of the outcome I'm trying to achieve. I use sonnet for planning, qwen for coding, sonnet for review.
xcjsam 1 days ago [-]
The harness mattering more than the model lines up with my experience too. What this paper measures is within-turn constraint decay. The version that bites in multi-agent setups is across-session — the architectural rules an agent wrote down on Monday don't reach the agent making the next change on Tuesday.
rrook 1 days ago [-]
As a codebase grows, divergent structural emergence from incidental(lang and lib) details results in prolonged complexity costs. I'm working on a language that enforces structure for agents: https://github.com/hale-lang/hale
AIorNot 3 hours ago [-]
Why the heck does this have to be a scientific study? Since when did SWEs publish archive-style science pieces? lol, a good blog post would have been better.
LLMs write working code, but have trouble following the script. They are slot machines of code. Human oversight is under pressure to deliver faster, but code takes time to comprehend and analyze. Also, in LLM coding we end up with lots of natural-language spec files to manage, and code we don't have an intuitive feel for unless we commit to the rigor of deep code review (which no human really does anyway).
ElenaDaibunny 19 hours ago [-]
Fragility compounds fast when you add visual grounding to the loop. Code agents at least get structured feedback from the compiler.
rbbydotdev 1 days ago [-]
This is interesting. Anecdotally, I have felt like I was having better luck with raw SQLite queries than with an ORM (Drizzle) in a recent TypeScript project.
oulipo2 1 days ago [-]
Exactly why you can't remove humans in the loop to assess that the solution is not only correct (which LLMs are quite bad at, once concurrency, logic, etc are involved), but also elegant, maintainable, etc
phrotoma 1 days ago [-]
"constraint decay" isn't this just another name for the (already well understood) idea of "context rot"?
spacedoutman 1 days ago [-]
This research is useless, and so is nearly all other LLM research.
GPT 5.2 is the strongest model they tested, a nearly 6-month-old model.
Traditional research cannot keep up.
acgourley 1 days ago [-]
I disagree, their findings should generalize to the frontier. Even if the latest can deal with the extra complexity, it stands to reason it will take more tokens to do less. This could be a useful insight into the next generation of evals.
abujazar 1 days ago [-]
Agreed. As Simon Willison points out, November 2025 was a critical month because that's pretty much when coding agents became «good enough», eliminating most of the problems pointed out in this study.
anygivnthursday 13 hours ago [-]
I regularly see Claude Opus 4.7 dropping constraints from an otherwise small CLAUDE.md at merely 20% context use. I have to keep reminding it. It has all the info ready in its context, yet from time to time it still decides to ignore parts.
I had never written an eslint rule until i started having agents pump them out for me and now I've encoded a bunch of important rules as lint rules that will fail CI if violated.
[1] https://pages.cs.wisc.edu/~remzi/Naur.pdf
It's like using a compiler that generates semantically different code every time you run it. Basically like compiling a program that's full of UB but "seems to work" most of the time.
> business sees this as productivity gains
Back to LoC/s as a measure of "productivity."
IMO this doesn’t follow from what OP wrote. I personally measure it with a more abstract “how long does it take me to ship something that is useful in production and solving a real problem” and the increase in speed there has been massive for me. But of course I’m not a bigbrain 10x coder that is doing bleeding edge novel stuff like most people here, so gains might be more obvious for me than for others.
But that’s only half of the problem. What about “and how easy it is to maintain long-term”? If you say that maintenance can be done via LLM, I would argue that there are zero guarantees that LLMs are backwards compatible and that the markdown you wrote now will work just as well in 1, 2, or 3 years.
That this would be the case is even more guaranteed than some programming language being backwards compatible and the code we wrote working just as well in 1, 2, or 3 years.
Languages do get non-backwards compatible changes, dependencies break, stuff is deprecated, etc.
But the job of LLMs will remain to generate something from a prompt, and the markdown we wrote, as it's high level and not tied to language versions, APIs, and implementation details, will be just as good a prompt for that in 2050 as it is in 2026.
Sure, but they're deterministic and sometimes you can even do automatic rewrites through AST inspection and writing back to the files instead of scripting string substitutions on them directly.
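For readers who haven't seen an AST-based rewrite, here is a minimal sketch of the idea using Python's standard `ast` module. The names `old_api` and `new_api` are hypothetical, and the round trip through `ast.unparse` (Python 3.9+) drops comments and original formatting, so a real codemod would likely use a formatting-preserving tool instead.

```python
import ast


class RenameCall(ast.NodeTransformer):
    """Rename calls to one function into calls to another."""

    def __init__(self, old: str, new: str) -> None:
        self.old = old
        self.new = new

    def visit_Call(self, node: ast.Call) -> ast.AST:
        self.generic_visit(node)  # handle nested calls first
        # Only touch plain-name calls; strings and attributes are left alone.
        if isinstance(node.func, ast.Name) and node.func.id == self.old:
            node.func.id = self.new
        return node


def rewrite(source: str, old: str, new: str) -> str:
    tree = RenameCall(old, new).visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


# Hypothetical input: the string literal mentions old_api but is not a call.
SOURCE = 'msg = "call old_api() later"\nresult = old_api(old_api(1), key=2)\n'
rewritten = rewrite(SOURCE, "old_api", "new_api")
print(rewritten)
```

Unlike a string substitution, the rewrite leaves the text inside the string literal untouched, and running it twice on the same input gives the same result.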
"But the job of LLMs will remain to generate something from a prompt, and the markdown we wrote, as it's high level and not tied to language versions, APIs, and implementation details, will be just as good a prompt for that in 2050 as it is in 2026."
Your organisation is keeping version control on the LLMs you use? It's all local, old copies of these databases are kept in secure storage together with the querying and harnessing software?
So kind of like maintaining a growing codebase? But this time around you cannot guarantee what the outputs will be?
If you're really lucky, maybe a lot of this is documented in some wiki page somewhere, but everyone knows the documentation is never as complete as you'd like it to be. The longer a team works together without new people coming on board, the more likely it is that the documentation of these soft requirements and knowledge has drifted from reality. IME nothing shows how much you've failed to document than revisiting your onboarding process documents for the first time 2-3 years after you wrote them.
As I've experimented with the various AI tools, I feel like a lot of these extra documents I've written are documenting a lot of these things "everyone knows". But I'm also not at the "80% of the professional code I write is generated" stage yet. So I'm curious if you're finding that you're creating documentation that goes beyond just documenting what we used to just keep in our heads and are now getting into "writing a book about how to code" territory?
I adopt the mindset of docs are for humans, tests are for agents. They document formal dependencies and leave a measurable artifact behind. If you identify some behavior or transitive dep in your system, agents document it first with a test codifying the expected behavior. Tests are the source of truth about expected system behavior and you can convince agents to write decent behavioral tests if you ask them to with the right structure. Docs are now cheap and a render, not a long term thing. There is some token efficiency to consider, but still, they are quick and cheap if you don't understand some module or its purpose.
Works great until they sweep you a test under the rug which always passes because the condition is something like if(true) .
These serve as living documentation which cries out in pain when they get out of sync with the system in question, generating specific error messages -- as opposed to natural language docs which rapidly drift into an ambiguous "kinda useful" state. And the validation is performed mechanically (as opposed to neurally) so no hallucinations are possible.
The one thing I would add is that you do want these artifacts to be human-friendly from a reading perspective -- you want engineers to be able to scan over these and check that they are validating the right things.
I think the harness and code patching technique starts to matter a lot more once you get outside the trivial range of codebases that fit within the first ~20% of the context window and can otherwise be iterated completely in a single inference pass.
The apply_patch technique that OAI has polished their models on seems to be the best approach for monster scale codebases. Anything based on line ranges and simple find-replace will disintegrate at the edges. You need multiple spatial anchors to deal with nasty things like cshtml files. The prepare/commit behavior is ideal for iterating through ambiguous contexts across many large files and refining anchors.
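To illustrate the difference between line ranges and context anchors, here is a minimal sketch of an anchored edit. It is not the apply_patch format itself; the function `apply_anchored_edit` and the sample markup are hypothetical, and the only idea shown is that a target line is located by its neighbours and the edit is refused when the match is missing or ambiguous.

```python
def apply_anchored_edit(text: str, before: str, target: str, after: str, replacement: str) -> str:
    """Replace `target` only where it sits between the `before` and `after` lines."""
    lines = text.splitlines()
    matches = [
        i
        for i in range(1, len(lines) - 1)
        if lines[i].strip() == target.strip()
        and lines[i - 1].strip() == before.strip()
        and lines[i + 1].strip() == after.strip()
    ]
    if len(matches) != 1:
        # Refuse to guess: zero or several candidates means the anchors need refining.
        raise ValueError(f"expected exactly one anchored match, found {len(matches)}")
    i = matches[0]
    indent = lines[i][: len(lines[i]) - len(lines[i].lstrip())]
    lines[i] = indent + replacement  # keep the original indentation
    return "\n".join(lines) + "\n"


# The target line appears twice; only the surrounding context tells them apart.
PAGE = (
    '<section id="a">\n'
    '  <div class="row">\n'
    "  </div>\n"
    "</section>\n"
    '<section id="b">\n'
    '  <div class="row">\n'
    "  </div>\n"
    "</section>\n"
)

patched = apply_anchored_edit(
    PAGE,
    before='<section id="b">',
    target='<div class="row">',
    after="</div>",
    replacement='<div class="row wide">',
)
print(patched)
```

A plain find-replace on `<div class="row">` would hit both sections, and a line-number edit would break as soon as anything above it moved.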
LLMs cannot generate anything novel.
Code doesn't need to be novel to be useful. There's a reason why design patterns are a thing in software.
AI is not an abstraction.
Back to the original point, though: most software engineering work isn't novel. Most people are working on slightly different iterations of the same thing, but with the aim of different products. You can have completely different products that use nearly the same patterns as most other services.
To put it bluntly: we don't need AI to generate novel code for the vast majority of the software being built.
https://www.reddit.com/r/math/comments/1tj534d/openais_inter...
So basically 90% of programming in an enterprise environment? lol. Sounds useful to me...
We are in a middle state where AI tools that generate widgets on the fly aren't accessible in the form most people need, so programmers are currently doing the manual step of managing the remix into an easily consumable form.
They're not merely re-arranging pre-existing blocks of code.
And they have been shown to develop emergent properties that weren't in their training set time and again.
They generate novel things as much as the average programmer (which works after himself having practice, exposure to codebases, and training, and reading API documentation, and such) generates novel things.
Ah yes, the famous emergent properties - like suggesting that we should walk to the car wash?
What makes the behavior emergent is that it can't be predicted at training time.
The emergent and unpredictable output is the result of massive vector complexity being encoded.
You are either being pedantic or missing the point of emergent however.
Yes, it's not some novel unforeseen thing, like a magical Marvel Universe material or some unknown to humanity mode of thinking. Same way when people make something new they still recombine known words, or colors, or physical things in the universe.
They are, however, new capabilities that are not explicitly in the training set and can't be predicted from it. Like teaching something using only calculus training materials and having it figure out boolean algebra.
>The emergent and unpredictable output is the result of massive vector complexity being encoded
As opposed to what in humans? God given revelation?
The emergent behaviour is in the training data and/or encoding/training.
So while I agree it is emergent from the complexity, it isn't some unknown mechanism. Just complexity at scale.
So like humans? Like the universe?
https://www.scientificamerican.com/article/ai-just-solved-an...
There's plenty of focus on the negative side of the tradeoff. Less so on why we're making it anyway, or why it somehow works out even if "this starts to look like we're all just moving complexity from the more formal and deterministic world of programming languages to the informal and non-deterministic world of natural language".
And the answer to that can be condensed to a one-liner, which I quote after[0]:
--[0] - https://drensin.medium.com/elephants-goldfish-and-the-new-go... - article may be a bit fluffy here and there, but that one line was a big insight for me.
There is absolutely no guarantee llm1(MD) == llm2(MD), by design. With the current batch you need to explicitly constrain a number of parameters, far more than simply the prompt, to get identical output from the _same_ model, let alone another model that has varied training data and/or architecture.
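A toy sketch of why the prompt alone does not pin the output: with sampling, the result also depends on temperature and the random seed. This is not a real model; the logits, token names, and `sample_tokens` function are all made up for illustration, and real inference stacks may have further sources of variation that this does not model.

```python
import math
import random


def sample_tokens(logits: dict[str, float], temperature: float, seed: int, n: int = 5) -> list[str]:
    """Draw n tokens from a fixed next-token distribution."""
    if temperature == 0:
        # Greedy decoding: always the highest-scoring token.
        return [max(logits, key=logits.get)] * n
    rng = random.Random(seed)
    tokens = list(logits)
    weights = [math.exp(logits[t] / temperature) for t in tokens]
    return [rng.choices(tokens, weights=weights)[0] for _ in range(n)]


# Hypothetical scores for the next token after the same prompt.
LOGITS = {"REST": 2.0, "GraphQL": 1.0, "RPC": 0.5}

run_a = sample_tokens(LOGITS, temperature=0.8, seed=7)
run_b = sample_tokens(LOGITS, temperature=0.8, seed=7)
greedy = sample_tokens(LOGITS, temperature=0, seed=123)

print(run_a == run_b)  # same prompt, same temperature, same seed
print(greedy)
```

Changing the seed or the temperature can change the sampled sequence even though the "prompt" (here, the logits) is identical, and a second model would have different logits to begin with.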
One major weakness of this study is that they didn’t fully test frontier models for cost reasons, so the specific performance results should be taken with a grain of salt. But the overall conclusion that models degrade when both behavior and architecture must be correct is interesting, and something to keep an eye on.
If you only have functional requirements, then in effect you're doing some form of program synthesis, and RL can optimize that very hard.
If you have a mixture of functional and non-functional requirements, you are basically giving the model an incomplete specification, and it must in some way guess at the user's intent to fill in the blanks. This is also why adding to the prompt examples of the style of code you want (hats off to antirez for this particular tip ;)) is phenomenally powerful.
You could take it a step further and put the example code into source code files...and be like, super comprehensive with your examples ... ;)
To put it in practice: if you point claude/codex to a repository and you ask it to implement feature X using style guide Y, the code will probably work, but you can usually get better results by saying "do it in the style of this file, it was done well there".
It is not great at decision making or judgment calls that don't have a well defined spec or plan in place yet; like unofficial or unapproved tokens if you will. A lot of this stuff simply never has had specs as it has been internal to how companies work and their secret sauce.
The closest thing we have are governance and compliance policies due to legal/business needs requiring it so it's far more well documented than operational ones in how we work. It is more about the how versus the what here I guess is what I'm saying.
But yeah this is why it does great when there are tests, design systems, evals, and other artifacts to mirror. Far more reckless and unpredictable without these things, but still great for exploration and finding the data output you seek.
It's like when I see people feeding it a whole bunch of "best practices" and expect it to follow them. It won't. But you could ask it questions about the best practices all day long.
Idk, calling it "just text prediction" seems unfairly dismissive of this capability.
at the end of the day, it presents a vector field and predicts the next vector. That’s literally the heart of intelligence just like assembly is the heart of execution. When playing table tennis, your brain is literally predicting seconds into the future to get your body into the right position.
But we aren’t discussing intelligence here. We are discussing how best to utilize that intelligence.
The “idea” of table tennis and the rules. Those are things we can talk about. It’s those “best practices” I gave in my example. The actual playing of table tennis would be the examples. How to apply those best practices and what good code looks like.
Ended up pointing Claude at a few sample files from our existing reporting, gave it read-only oauth access to the ERP and said “build a new report showing the cash by project as calculated by xxxx - yyyy + zzzz in the style of the existing reports” and it basically one-shot from there.
Kind of crazy and I built a bunch of redundant check-sums because I honestly didn’t think it would be able to replace like 6 workdays of effort for the 2 FTEs who generate that kind of thing manually every month but so far so good..
Just don't break out a plan without also having it read the code again.
The more it has to go on, the more it relies on repetition of what came before. It's also possible that authors start paying much less attention and put less effort into editing later chapters.
Despite the sheer volume on Amazon, LLMs are not at the point of writing well.
For example, if you apply "guardrails" to an image generator of about a year ago, all the people start looking alike. Story generators start using only a few standard names.
That was last year. Is it happening with the frontier models?
I mean, I spend more tokens having them clean up all the places they didn't follow the plan (if I catch it), or implementing what came out of a 'complete and tested' previous plan where they just stopped as soon as all the pathetic new tests passed, and you discover half of it isn't even there when trying to implement the next thing on top of it.
Though... I have been conducting an experiment, of sorts, where we've been cooking on these fairly complicated projects and I don't ever touch a single line of code, just yell at them a lot, and with suitable amounts of marijuana (they are very frustrating most of the time) it's been going pretty well. It also helps that they need to explain what they're doing to somebody fairly baked -- maybe not such an HR-friendly plan?
I’m not really interested in analysis of the weaknesses of such models because in my experience many weaknesses disappear entirely as models get stronger and reasoning effort is turned up. Especially if you tell them what you want them to do.
Also, it’s not surprising to learn that when more acceptance criteria are added the failure rate increases.
Even the best models have trouble adhering to stuff as mundane as rules for how to style generated code (indent this much, name things with these patterns, etc.). Even the most die-hard AI-first coder will admit to that kind of stuff being not unheard-of. Yet they still delude themselves into thinking that these models will follow a sufficiently detailed spec to the letter, every time.
I've only read the abstract of this one so far but it seems like this paper has zoomed in on programming with greater fidelity and shown a similar phenomenon. But not about long horizon tasks, more like "long style horizons" of larger sets of structural constraints.
[1] https://arxiv.org/abs/2604.15597
Discussion: https://news.ycombinator.com/item?id=48073246
sounds like an oxymoron of a claim.
"Write this code in a way that is readable and maintainable" is another example.
It's almost as though it's not about the Monet.
In the paper I linked they created a benchmark spanning 80 disciplines with tasks that could be checked automatically. So these are necessarily tasks that are tractable for RLVR, trivially you could use performance against the benchmark as a reward function. The performance was still mediocre in everything but programming. And as we're seeing in this article, there is still room for growth in programming.
In general you seem to be reading very literally in some places (taking the statement "AIs aren't good at X" as applying to all AI and perpetually) and very loosely in others (disregarding "easily" as unimportant) and misinterpreting statements you appear to agree with as being in disagreement. I don't think there's a real disagreement here, I think there's a misunderstanding.
When designing a system or a component we have ideas that form invariants. Sometimes the invariant is big, like a certain grand architecture, and sometimes it’s small, like the selection of a data structure. Except, eventually, you’ll want to add a feature that clashes with that invariant. At that point there are usually three choices:
- Don’t add the feature. The invariant is a useful simplifying principle and it’s more important than the feature; it will pay dividends in other ways.
- Add the feature inelegantly or inefficiently on top of the invariant. Hey, not every feature has to be elegant or efficient.
- Go back and change the invariant. You’ve just learnt something new that you hadn’t considered and puts things in a new light, and it turns out there’s a better approach.
Often, only one of these is right. Often, at least one of these is very, very wrong, and with bad consequences. Even when they are able to follow constraints, agents are terrible at identifying when the constraints need to change.
All attempts to make them appear to reason are basically recursive confinement efforts by the harness, to try to get the lightning into the bottle.
[1]: https://medium.com/@vishvananda/i-spent-2-billion-tokens-wri...
FWIW I've noticed this too. I've found that the agents/models have their own style, which is mostly summed up as overly verbose.
Additionally, the models are OK at modularization when given space to "plan" their implementation, but rarely decide that abstracting something would be helpful after the fact (i.e. after many iterations on a greenfield codebase or when being dropped into a legacy codebase).
This often leads to "god files" which, when the user/architect points them out, the models correctly critique (humorously, since they're the ones that wrote the code in the first place).
If there's a second thing the generative AI tools have shown beyond any doubt it's that many of the more modern (relatively speaking) "best practices" that have always been over-hyped and questionably-evidenced really do tend to produce worse results. LLMs take these methods to their logical conclusions and show us the end result much sooner. You can't just iterate your way to a solution when you don't even know what problem you're trying to solve. If you don't have a clear spec then you don't know what a correct product looks like. You need to invest time in reviewing code properly. If you don't keep the big picture in mind then the big picture becomes a mess.
Maybe one day the LLMs will leave me out of a job but at least I'll feel validated first!
If you apply those practices, then you quickly find yourself using the agent as merely a writing boost. And there’s an inflexion point where coding is no longer a bottleneck. Instead, you spend more time thinking about design. You can see it in open source projects where most PRs are just a few-line diffs. The bottleneck is knowledge and problem-solving talent.
I don't know what that means but I have seen no evidence so far that if you don't apply those practices then your code will be anything other than unmanageable spaghetti if you leave AI to maintain it for long.
Coding has never been the bottleneck for good developers. Part of the reason for that is that good developers know how to isolate different aspects of a system and so keep each individual aspect relatively simple and self-contained. Another part is that good developers were already standardising and automating a lot of the grunt work. These traits are also advantageous for keeping generative AI on the right track and keeping its proposed changes manageable.
When using Codex/Claude Code with Go code, I cannot count the times the agent makes some change, runs a build to check for errors, finds some, and fixes them.
https://docs.python.org/3/library/typing.html
"The Python runtime does not enforce function and variable type annotations. They can be used by third party tools such as type checkers, IDEs, linters, etc."
Which third-party enforcement mechanism do you propose become the default?
There are many reasons for this. A big one is that many libraries are only partially typed at best, and dynamic types tend to propagate, weakening the guarantees you get from type checking.
Dynamic idioms in general, including something as common as string-indexed dictionaries, negate type checking. Runtime metaprogramming is the same. All of these things have equivalents in a good statically checked language, but Python doesn't follow those models.
Fundamentally, in Python static typing is an optional analysis layer over a dynamic language, and the consequences of that can't be fully mitigated. The result is a big difference in what types can guarantee.
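A small sketch of the two points above: annotations are not enforced at runtime, and a value typed as `Any` flows into an annotated variable without complaint. The function and variable names are hypothetical; the claim about what a checker reports assumes a typical third-party checker such as mypy or pyright.

```python
from typing import Any


def double(x: int) -> int:
    return x * 2


# A static checker would reject this call, but the interpreter runs it:
# str * 2 is valid Python, so the annotation is silently violated.
result = double("ab")  # type: ignore[arg-type]

# A string-indexed dict whose values are Any, as often comes back from
# json.load or a partially typed library.
config: dict[str, Any] = {"retries": "3"}

# Any is compatible with every type, so a checker accepts this assignment
# even though the runtime value is a str.
retries: int = config["retries"]

print(repr(result), type(retries).__name__)
```

Nothing here fails until some later code does arithmetic on `retries`, which is the kind of weakened guarantee the comment describes.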
> Dynamic idioms in general, including something as common as string-indexed dictionaries, negate type checking.
Do you have any proof of this? It hasn't been a problem in TypeScript, and I doubt it's an issue in Python
What has been your experience? What has your production code looked like in recent months?
The most incredible combo I've seen lately is progressive enhancement of Razor Pages with javascript. With this arrangement the newest models tend to make a really good call on if something should happen server-side (cshtml) or on the client (js).
For slightly complex changes, I always run codex (5.5-high) in planning mode first. I have linked various docs/{ARCHITECTURE,BACKEND-GUIDELINES,NESTJS-DI,..}.md etc. from AGENTS.md so agents can quickly discover relevant docs at planning time, only if they are needed. No need to know React-specific stuff when it's dealing with a backend problem, for example. I typically blindly approve plans made by the agent with a fresh context, because that's as if I had prompted it. Works the best for me.
Using /goal, however, it's really just constantly compacting and doing its thing; of course it gets sloppy. If only there was a state machine that would transform tickets into a Planning Mode Prompt, then use, idk, guardian approvals (somehow a "Product Management Perspective Lens" approving or making changes to the plan) and then let a less capable or less reasoning agent execute the plan. I think that would work the best.
But they're also superhuman in so many other ways. It's valid to point out limitations, but that doesn't support the conclusion that models are not incredibly powerful and capable of the functional equivalent of reasoning at human or superhuman levels in many scenarios.
All the political and emotional reactions to LLMs seem to obscure how absolutely amazing this technology is. I've pointed them at codebases I wrote entirely myself and had them find bugs, point things out I had missed, plan and implement refactorings to improve code quality, etc. I may be "smarter" than the models in some ways but there's no question they're smarter than me in others. They're unlike any tool we've ever had access to.
Yes, the politics and economics around them leaves a lot to be desired (read: is absolutely terrible), and there are a lot of valid justifications for the "AI backlash", but there's a very important baby in that bathwater.
In our story, investors are mining intelligence from GPUs, and they truly believe they are one inch from discovering the biggest goldmine in history. But GPUs, unlike a goldmine, cannot be inspected for traces of gold by independent contractors. To keep the hype up, the nihilists in our story dig up cheap gold-looking metals from time to time and tell investors that with a bit of alchemy - agentic workflows, etc. - those metals can be magically turned into gold.
Investors will keep digging until the end of the age, or until they run out of money.
These two projects are on GitHub, you may search alexwwang/aristotle and alexwwang/tdd-pipeline to dive into the details or just ask your LLM to scan them to tell you the points you are interested in.
Aspiration vs. consequence, in other words. An aspiration constraint describes a desired outcome for the system; a consequence constraint maps to a problem already encountered. And the agent ignores the former when faced with the path of least resistance while obeying the latter because it is brief, unambiguous, and precise about preventing that particular failure mode. Which is key rather than the harness in determining survival through session rotation.
Anyone read whether these tests include any validation loops? What happens if the models get back test failures, for instance? Understanding how many turns to hit full passing behavior suite would also be interesting. Great methodology in the study though.
A framework would use static code checking tools to force an architecture on to LLMs instead of trying to do so in markdown.
I don't know exactly what it will look like but for example I could imagine a Java Framework where the LLM could only create subclasses of certain classes.
Implementation phases very often go through 5-10 review and fix rounds to actually get the implementation to match the spec. It takes longer but that's what's necessary to get actually good results on long horizon tasks with detailed requirements. I'll be open sourcing it fully soon.
If you don't want to sign up to be notified by email you can watch the org.
I would agree too that as the codebase grows, the LLM struggles more and more with generating code. It is probably misaligned incentives: it wants to complete the isolated task without consuming too much context. At the POC stage it can consume most of the app; by about 30K lines of code, it is quite a complex codebase to navigate.
Time to start writing linting tools that check the architecture and spoon feed the LLM what exactly it's doing wrong.
I reckon something like this would be good for every project out there: https://www.archunit.org/getting-started
They expand a bit more on the reasoning behind it: https://www.archunit.org/motivation
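ArchUnit itself is a Java library; as a language-neutral illustration of the same idea, here is a minimal Python sketch of an import-layering check. The layer names, the `FORBIDDEN` table, and the in-memory `MODULES` dict are assumptions made so the example is self-contained; a real check would read files from the repository.

```python
import ast

# Hypothetical rule: code in the "domain" layer must not import from "web" or "db".
FORBIDDEN = {"domain": {"web", "db"}}


def layer_violations(modules: dict[str, str]) -> list[str]:
    """Return one message per import that crosses a forbidden layer boundary."""
    messages = []
    for name, source in sorted(modules.items()):
        layer = name.split(".")[0]
        banned = FORBIDDEN.get(layer, set())
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                imported = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                imported = [node.module]  # absolute imports only
            else:
                continue
            for target in imported:
                if target.split(".")[0] in banned:
                    messages.append(
                        f"{name}:{node.lineno}: layer '{layer}' must not import '{target}'"
                    )
    return messages


# Hypothetical module sources keyed by dotted module name.
MODULES = {
    "domain.orders": "from web.handlers import render\nimport json\n",
    "web.handlers": "import domain.orders\n",
}

violations = layer_violations(MODULES)
for message in violations:
    print(message)
```

In CI the script would exit non-zero when `violations` is non-empty, and the specific message (file, line, rule) is what gets fed back to the agent.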
(I also wrote a simple linter for architecture/code checks that aren't well covered by linters that focus on individual files. It uses Go + goja so rules can be written in ECMAScript, parallelizes the read-only rules, and also allows rules that change files as necessary. It runs in addition to something like Ruff / Oxlint / Oxfmt / whatever is present in each stack, though it is still in development and not as good of a focused example as ArchUnit is.)
If we write software specification docs, bother describing how it evolves with ADRs, enforce code style automatically and require certain test coverage automatically (or at least should), why couldn't we go a step further, formalize those specs and ensure that any new code is also up to snuff? I don't think that's any more of a job for an LLM, than telling it how it should format code is. Also, I'm in the camp that believes that at least many of your ORM mappings and similar stuff should be the output of codegen, since you've already gone through the trouble of describing the schema/migrations to get there.
I don't think this would be only good for LLMs, though - I've seen projects that have like 3 different audit systems built in, not because of some fancy business requirement, but rather cause the devs either didn't know about the previous one(s) or just didn't feel like following what should have been the pre-established conventions, even when there were docs in place (nobody read those).
I have exactly the inverse findings on my end. The bigger and more legacy the codebase, the more accurate the patches become.
The harness itself seems to be the most important part. I use a recursive loop that primes the root context based on the user prompt each time. My agent will often make over 100 tool calls to sql and git before it finally decides to apply a patch. If I was greenfield, there would be nothing to query or constrain against.
I usually find I can achieve 90% of the outcome I'm trying to achieve. I use sonnet for planning, qwen for coding, sonnet for review.