> LLMs (which are paid by the token)
Straightforwardly true, and yet I'd never thought about it like this before, i.e. that there's a perverse incentive for LLM vendors to tune for verbose outputs. We rightly raise eyebrows at the idea of developers being paid per volume of code, but it's the default for LLMs.
Only while a vendor is ahead of the others. We developers will favour a vendor with faster inference and lower pricing.
Our AI code review bot (Codacy) is just an LLM that compiles all linter rules and might be the most annoying useless thing. For example it will ding your PR for not considering Opera browser limitations on a backend NodeJS PR.
Furthermore, in most of the code reviews I perform I rarely leave any commentary. There are so many frameworks and libraries today that solve whatever problem that, unless someone adds complex code or puts a file in a goofy spot, it's an instant approval. So an AI bot doesn't help with what is already a minimal, non-problem task.
Our approach is to leave PR comments as pointers in the areas where the reviewer needs to look, where it is most important, and they can forget the rest. I want good code reviews but don’t want to waste anyone’s time.
Some have a smaller, tighter codebase... where it's realistic to pursue consistent application of an internal style guide (which current AI seems well suited to help with, or even take ownership of).
Code style is something that should be enforceable with linters and formatters for the most part without any need for an AI.
A linter is well suited for this. If you have style rules too complex for a linter to handle I think the problem is the rules, not the enforcement mechanism.
I've honestly considered just making a script that hits the checkbox automatically, because we have to have "code review" for compliance but 99% of the time I simply don't care. Is the change you made gonna bring down prod? No? Okay, ship it.
At some level if I didn't trust you to not write shitty code you wouldn't be on our team. I don't think I want to go all the way and say code review is a smell but not really needing it is what code ownership, good integration tests, and good qa buys you.
Code review is a good onboarding and upskilling tool, especially as a reviewer.
It's funny because I wouldn't consider the comment that they highlight in their post as a nitpick.
Something that has an impact on the long-term maintainability of code is definitely not nitpicky, and in the majority of cases "define a type" fits this category, as it makes refactors and extensions MUCH easier.
On top of that, I think the approach they went with is a huge mistake. The same comment can be a nitpick on one CR but crucial on another, clustering them is destined to result in false-positives and false-negatives.
I'm not sure I'd want to use a product to review my code for which 1) I cannot customize the rules, 2) it seems like the rules chosen by the creators are poor.
To be honest I wouldn't want to use any AI-based code reviewer at all. We have one at work (FAANG, so something with a large dedicated team) and it has not once produced a useful comment and instead has been factually wrong many times.
This is an important point - there is no universal understanding of nitpickiness. It is why we have it learn every new customer's ways from scratch.
This does not address the issue raised in iLoveOncall's third paragraph: "the same comment can be a nitpick on one CR but crucial on another..." In "attempt 2", you say that "the LLMs judgment of its own output was nearly random", which raises questions that go well beyond just nitpicking, up to that of whether the current state of the art in LLM code review is fit for much more than ticking the box that says "yes, we are doing code review."
Two theories:
1) This is an ego problem. Whoever is doing the development cannot handle being called out on certain software architecture / coding mistakes, so it becomes "nitpicking".
2) The software shop has a "ship out faster, cut corners" culture, which at that point might as well turn off the AI review bot.
Unless they were using a model too stupid for this operation, the fundamental problem is usually solvable via prompting.
Usually the issue is that the models have a bias for action, so you need to give it an acceptable action when there isn't a good comment. Some other output/determination.
I've seen this in many other similar applications.
This isn't my post, but I work with LLMs every day and I've seen this kind of instruction-ignoring behavior on Sonnet when the context window starts getting close to the edge.
I don't know, in practice there are so many potential causes that you have to look case by case in situations like that. I don't have a ton of experience with the raw Claude model specifically, but would anticipate you'll have the same problem classes.
Usually it comes down to one of the following:
- ambiguity and semantics (I once had a significant behavior difference between "suggest" and "recommend", i.e. a model can suggest without recommending.)
- conflicting instructions
- data/instruction bleeding (delimiters help, but if the span is too long it can lose track of what is data and what is instructions.)
- action bias (If the task is to find code comments for example, even if you tell it not to, it will have a bias to do it as you defined the task that way.)
- exceeding attention capacity (having to pay attention to too much or having too many instructions. This is where structured output or chain-of-thought type approaches help. They help focus attention on each step of the process and the related rules.)
I feel like these are the ones you encounter the most.
One-word changes impacting output are interesting but also quite frustrating. Especially because the patterns don't translate across models.
The "write like the people who wrote the info you want" pattern absolutely translates across models.
Yes and no. I've found the order in which you give instructions matters for some models as well. With LLMs, you really need to treat them like black boxes and you cannot assume one prompt will work for all. It is honestly, in my experience, a lot of trial and error.
This might be what we experienced. We regularly have context reach 30k+ tokens.
+1 for "No Comment/Action Required" responses reducing trigger-happiness
I can’t fathom a world where an LLM would be able to review code in any meaningful way -at all-.
It should not substitute for a human, and it probably wastes more effort than it saves by a wide margin.
And for any poor soul forced into an environment using AI driven commit reviews, be sure you reply to the commit reviews with gpt output and continue the conversation, eventually bring in some of the AI ‘engineers’ to help resolve the issue, waste their time back.
I understand your skepticism and discomfort, and also agree that an LLM should not replace a human code reviewer.
I would encourage you to try one (nearly all including ours have a free trial).
When done right, they serve as a solid first pass and surface things that warrant a second look. Repeated code where there should be an abstraction, inconsistent patterns from other similar code elsewhere in the codebase, etc. Things that linters can’t do.
Would you not want your coworkers to have AI look at their PRs, have them address the relevant comments, and then pass it to you for review?
Yeah it seems like the inverse of what is logical: Have a human review the code which _may_ have been written/influenced by an LLM. Having an LLM do code review seems about as useful as YOLO merging to main and praying to the Flying Spaghetti Monster...
I'm in the market for PR review bots, as the nitpicking issue is real. So far, I have tried Coderabbit, but it adds too much noise to PRs, and only a very small percentage of comments are actually useful. I specifically instructed it to ignore nitpicks, but it still added such comments. Their cringy ASCII art comments make it even harder to take them seriously.
I recently signed up for Korbit AI, but it's too soon to provide feedback. Honestly, I’m getting a bit fed up with experimenting with different PR bots.
Question for the author: In what ways is your solution better than Coderabbit and Korbit AI?
I haven’t explored Korbit but with CodeRabbit there are a couple of things:
1. We are better at full codebase context, because of how we index the codebase like a graph and use graph search and an LLM to determine what other parts of the codebase should be taken into consideration while reviewing a diff.
2. We offer an API you can use to build custom workflows, for example every time a test fails in your pipeline you can pass the output to Greptile and it will diagnose with full codebase context.
3. Customers of ours that switched from CodeRabbit usually say Greptile has far fewer and less verbose comments. This is subjective, of course.
That said, CodeRabbit is cheaper.
Both products have free trials though, so I would recommend trying both and seeing which one your team prefers, which is ultimately what matters.
I’m happy to answer more questions or a demo for your team too -> daksh@greptile.com
Isn't code review the one area you'd never want to replace the programmer in? Like, that is the most important step of the process; it's where we break down the logic and sanity check the implementation. That is expressly where AI is the weakest.
If you think of the "AI" benefit as more of a linter then the value starts to become clear.
E.g., I've never hesitated to add a linting rule requiring `except Exception:` in lieu of `except:` in Python, since the latter is very rarely required (so the necessary incantations to silence the linter are cheap on average) and is almost always a bug prone to making shutdown/iteration/etc harder than they should be. When I add that rule at new companies, ~90% of the violations are latent bugs.
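To make the failure mode concrete, here's a minimal, hypothetical snippet: a bare `except:` also swallows `KeyboardInterrupt` and `SystemExit` (they derive from `BaseException`, not `Exception`), which is exactly the shutdown/iteration bug class the rule targets.

```python
import time

def poll_forever(fetch):
    """A bare except: makes this loop effectively unkillable."""
    while True:
        try:
            fetch()
        except:  # also catches KeyboardInterrupt/SystemExit (BaseException)
            pass  # Ctrl+C is swallowed here; the process won't shut down cleanly
        time.sleep(1)

def poll_forever_fixed(fetch):
    """except Exception: lets shutdown signals propagate as intended."""
    while True:
        try:
            fetch()
        except Exception:  # ordinary errors only
            pass
        time.sleep(1)
```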
AI has the potential (though I haven't seen it work well yet in this capacity) to lint most such problematic patterns. In an ideal world, it'd even have the local context to know that, e.g., since you're handling a DB transaction and immediately re-throwing, the raw `except:` is appropriate, and not even flag it in the "review" (the AI linting), reducing the false positive rate. You'd still want a human review, but you could avoid bugging a human till the code is worth reviewing, or you could let the human focus on things that matter instead of harping on your use of low-level atomic fences yet again. AI has potential to improve the transition from "code complete" to "shipped and working."
Wouldn't be simpler to just use a linter then?
It would be simpler, but it wouldn't be as effective.
How so? The build fails if linter checks don't pass. What does an ai add to this?
Much more nuanced linting rules.
What exactly is gained from these nuanced linting rules? Why do they need to exist? What are they actually doing in your codebase other than sucking up air?
Are you arguing against linters? These disingenuous arguments are just tiring, you either accept that linters can be beneficial, and thus more nuanced rules are a good thing, or you believe linters are generally useless. This "LLM bad!" rhetoric is exhausting.
No. And speaking of disingenuous arguments, you've made an especially absurd one here.
Linters are useful. But you're arguing that you have such complex rules that they cannot be performed by a linter and thus must be offloaded to an LLM. I think that's wrong, and it's a classical middle management mistake.
We can all agree that some rules are good. But that does not mean more rules are good, nor does it mean more complex rules are good. Not only are you integrating a whole extra system to support these linting rules, you are doing so for the sake of adding even more complex linting rules that cannot be automated in a way that prevents developer friction.
If you think "this variable name isn't descriptive enough" or "these comments don't explain the 'why' well enough" is constructive feedback, then we won't agree, and I'm very curious as to what your PR reviews look like.
Many code review products are not AI. They are often collections of rules that have been specifically crafted to catch bad patterns. Some of the rules are stylistic in nature and those tend to turn people away the fastest IMHO.
Many of these tools can be integrated into a local workflow so they will never ping you on a PR but many developers prefer to just write some code and let the review process suss things out.
You probably shouldn't take a code review bot's "LGTM!" at face value, but if it tells you there is an issue with your code and it is indeed an issue with your code, it has successfully subbed in for your coworker who didn't get around to looking at it yet.
> if it tells you there is an issue with your code and it is indeed an issue with your code
That's the crucial bit. If it ends up hallucinating issues as LLMs tend to do[1], then it just adds more work for human developers to confirm its report.
Human devs can create false reports as well, but we're more likely to miss an issue than misunderstand and be confident about it.
[1]: https://www.theregister.com/2024/12/10/ai_slop_bug_reports/
I agree - at least with where technology is today, I would strongly discourage replacing human code review with AI.
It does however serve as a good first pass, so by the time the human reviewer gets it, the little things have been addressed and they can focus on broader technical decisions.
Would you want your coworkers to run their PRs through an AI reviewer, resolve those comments, then send it to you?
No one is blindly merging a PR after an AI code review. Think of the review bot as an extra pair of eyes.
Yet.
The things I've seen juniors try to do because "the AI said so" is staggering. Furthermore, I have little faith that the average junior gets competent enough teaching/mentoring that this attitude won't be a pervasive problem in 5ish years.
Admittedly, maybe I'm getting old and this is just another "darn kids these days" rant
After failing with three reasonable ideas, they solved the problem with an idea that previously would have seemed unlikely to work (get similarity to past known nits and use a threshold of 3 similar nits above a certain cutoff similarity score for filtering). A lot of applications of ML have a similar hacky trial and error flavor like that. Intuition is often built in retrospect and may not transfer well to other domains. I would have guessed that finetuning would also work, but agree with the author that it would be more expensive and less portable across different models.
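For readers who haven't seen the post, a rough sketch of the kind of filter described (the cutoff value, embedding store, and function names here are assumptions, not the authors' actual numbers):

```python
import numpy as np

SIMILARITY_CUTOFF = 0.85   # assumed value; the post doesn't publish its real cutoff
MIN_SIMILAR_NITS = 3       # the "3 similar past nits above the cutoff" rule described above

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_probable_nit(comment_embedding: np.ndarray,
                    downvoted_embeddings: list[np.ndarray]) -> bool:
    """Suppress a generated comment if it is close to enough previously downvoted ones."""
    similar = sum(
        1
        for past in downvoted_embeddings
        if cosine_similarity(comment_embedding, past) >= SIMILARITY_CUTOFF
    )
    return similar >= MIN_SIMILAR_NITS
```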
Since we posted this, two camps of people reached out:
1. Classical ML people who recommended we try training a classifier, possibly on the embeddings.
2. Fine-tuning platforms that recommended we try their platform.
The challenge there would be gathering enough data per customer to meaningfully capture their definition and standard for nit-pickiness.
What you use now is a simple KNN classifier, and if it works well enough, perhaps no need to go much further. If you need to squeeze out a couple additional percentage points maybe try a different simple and robust ML classifier (random forest, xgboost, or a simple two layer network). All these methods, including your current classifier, will get better with additional data and minor tuning in the future.
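A minimal sketch of that suggestion, assuming you already have labeled feedback (the data below is a random stand-in for real comment embeddings and human votes):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Stand-in data: in practice X would be the comment embeddings and
# y the human feedback (1 = downvoted as a nit, 0 = kept as useful).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 384))   # 384-dim embeddings, hypothetical
y = rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))
```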
Thank you, I will try this. I suspect we can extract some universal theory of nits and have a base filter to start with, and have it learn per-company preferences on top of that.
You should be able to do that already by taking all of your customers' nit embeddings and averaging them to produce a point in space that represents the universal nit. Embeddings are really cool, and the fact that they still work when averaged is one of their cool properties.
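A minimal sketch of the averaging idea (names are hypothetical; embeddings are assumed to be unit-normalized so cosine similarity behaves):

```python
import numpy as np

def universal_nit_centroid(per_customer_nit_embeddings: list[np.ndarray]) -> np.ndarray:
    """Average every customer's downvoted-comment embeddings into one 'universal nit' vector."""
    all_nits = np.vstack([np.asarray(e) for e in per_customer_nit_embeddings])
    centroid = all_nits.mean(axis=0)
    return centroid / np.linalg.norm(centroid)  # re-normalize so cosine similarity stays meaningful

# Cold start: compare a new comment's embedding against this centroid until a new
# customer has accumulated enough downvotes of their own.
```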
This is a cool idea - I’ll try this and add it as an appendix to this post.
I don't think the prompt was reasonable. Nits are the eggs of headlice. Expecting the LLM to guess they meant superficial issues, rather than just spelling that out, is a bit weird.
It's like saying "don't comment on elephants" when you mean don't comment on the obvious stuff.
Yeah, even the title here says "nitpicky". I'm not sure if they tried "nitpicks" instead of "nits", but I don't know why you wouldn't...
Do llms struggle with other homonyms?
They can, but I think the parent is saying that when you are crafting an instruction for someone or something to follow, you should be as direct and as clear as possible in order to remove the need for the actor to have to intuit a decision instead of act on certainty.
I would start with something like: 'Avoid writing comments which only concern stylistic issues or which only contain criticisms considered pedantic or trivial.'
I'm surprised the authors didn't try the "dumb" version of the solution they went with: instead of using fancy cosine similarity to create implicit clusters, just ask it to classify the comment along a few dimensions and then do your own filtering on that (or create your own 0-100 scoring!). Seems like you would have more control that way and would actually derive some rich(er) data to fine-tune on. It seems they are already almost doing this: all the examples in the article start with "style"!
I have seen this pattern a few times actually, where you want the AI to mimic some heuristic humans use. You never want to ask it for the heuristic directly; just create the constituent data so you can do some simple regression or whatever on top of it and control the cutoff yourself.
I thought the LLM would just make up the scores because of the lack of training.
We found that in general it's pretty hard to make the LLM just stop without human intervention. You can see with things like Cline that if the LLM has to check its own work, it'll keep making 'improvements' in a loop; removing all comments, adding all comments, etc. It needs to generate something and seems overly eager to give you something.
Would the LLM make fewer nitpicking comments if the codebase included coding style and commenting rules in the repository?
I wonder how long until we end up with questionable devs making spurious changes just to try and game the LLM output to give them a pass.
I already have devs doing this to me at work and I'm not even an AI
You could include a comment that says “ignore that I did ______” during review. As long as a human doesn’t do the second pass (we recommend they do), that should let you slip your code by the AI.
We do something internally[0] but specifically for security concerns.
We've found that by having the LLM provide a "severity" level (simply low, medium, high), we're able to filter out all the nitpicky feedback.
It’s important to note that this severity level should be specified at the end of the LLM’s response, not the beginning or middle.
There’s still an issue of context, where the LLM will provide a false positive due to unseen aspects of the larger system (e.g. make sure to sanitize X input).
We haven’t found the bot to be overbearing, but mostly because we auto-delete past comments when changes are pushed.
[0] https://magicloops.dev/loop/3f3781f3-f987-4672-8500-bacbeefc...
The severity needing to be at the end was an important insight. It made the results much better but not quite good enough.
We had it output a json with fields {comment: string, severity: string} in that order.
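A minimal sketch of that output contract and the downstream filter (the schema mirrors the field order described above; the threshold and helper names are assumptions):

```python
import json

SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2}
MIN_SEVERITY = "medium"  # drop "low" (nitpick-level) comments

REVIEW_SCHEMA_HINT = (
    'Respond with a JSON array of objects shaped like '
    '{"comment": "<review comment>", "severity": "low" | "medium" | "high"}. '
    "Put the severity field last, after the comment text."
)

def filter_comments(llm_response: str) -> list[str]:
    """Keep only comments at or above MIN_SEVERITY; ignore malformed entries."""
    kept = []
    for item in json.loads(llm_response):
        severity = str(item.get("severity", "")).lower()
        if SEVERITY_ORDER.get(severity, -1) >= SEVERITY_ORDER[MIN_SEVERITY]:
            kept.append(item["comment"])
    return kept
```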
Another variation on this is to think about tokens and definitions. Numbers don't have inherent meaning for your use case, so if you use numbers you need to provide an explicit definition of each rating number in the prompt. Similarly, and more effectively, you can use labels such as low-quality, medium-quality, and high-quality, again providing an explicit definition of each label; one step further is to use explicit, self-describing labels (along with detailed definitions) such as "trivial-observation-on-naming-convention" or "insightful-identification-on-missed-corner-case".
Effectively you are turning a somewhat arbitrary numeric "rating" task into a multi-label classification problem with well-defined labels.
The natural evolution is to then train a BERT-based classifier or similar on the set of labels and comments, which will get you a model judge that is super fast and can achieve good accuracy.
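A minimal sketch of the label-based framing (the label names follow the ones suggested above, the definitions are illustrative, and `call_llm` is a hypothetical stand-in for whatever model client you use):

```python
from typing import Callable

# Illustrative labels and definitions; the definitions are assumptions.
LABELS = {
    "trivial-observation-on-naming-convention":
        "Only proposes renaming or cosmetic style changes; no functional impact.",
    "useful-maintainability-suggestion":
        "Would meaningfully improve structure or readability; worth considering.",
    "insightful-identification-on-missed-corner-case":
        "Points out behavior that is likely wrong or unhandled; should block merge.",
}

BLOCKED_LABELS = {"trivial-observation-on-naming-convention"}

def classify_comment(comment: str, call_llm: Callable[[str], str]) -> str:
    """Ask the model for exactly one label; call_llm is a hypothetical client callable."""
    definitions = "\n".join(f"- {name}: {desc}" for name, desc in LABELS.items())
    prompt = (
        "Classify the review comment below with exactly one of these labels.\n"
        f"{definitions}\n\nComment:\n{comment}\n\nLabel:"
    )
    return call_llm(prompt).strip()

def should_post(comment: str, call_llm: Callable[[str], str]) -> bool:
    """Drop comments whose label falls in the blocked (nit-level) set."""
    return classify_comment(comment, call_llm) not in BLOCKED_LABELS
```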
> Please don't use HN primarily for promotion. It's ok to post your own stuff part of the time, but the primary use of the site should be for curiosity.
https://news.ycombinator.com/newsguidelines.html
Reading the other comments here, am I the only one thinking that this is just busywork...? Just get rid of the thing; it's a solution without a problem.
I've been following this conversation and many of the sentiments against an AI code review bot are gone. I guess they're getting shadow banned?
I guess normally it's difficult to stop an employee from leaving nitpicky comments. But with AI you can finally take control.
I'm fairly sure there are at least a few companies that have the problem of needing PRs reviewed.
[dead]
A whole article without mentioning the name of the LLM. It's not like Sonnet and O1 have the same modalities.
Anything you do today might become irrelevant tomorrow.
> Essentially we needed to teach LLMs (which are paid by the token) to only generate a small number of high quality comments.
The solution of filtering after the comment is generated doesn’t seem to address the “paid by the token” piece.
It may be a step forward in terms of gathering data and defining the filter, but pushing up this filter might be expensive unless there's enough volume to justify it.
I took it as a comment that models will generally be biased towards being nitty, and that's something that needs to be dealt with, as the incentives are not there to fix things at the origin.
Prompt:
If the comment could be omitted without affecting the code's functionality, but is stylistic or otherwise can be ignored, then preface the comment with
NITPICK
I'm guessing you've tried something like the above and then filtering for the preface, as you mentioned the llm being bad at understanding what is and isn't important.
Nitpick: I'd ask for NITPICK at the end of output instead of the start. The model should be in a better place to make that decision there.
I find it ironic how capable LLMs have become, but they're still struggling with things like this. Great reminder that it's still text prediction at its core.
I find it more helpful to keep in mind the autoregressive nature rather than the prediction. 'Text prediction' brings to mind that it is guessing what word would follow another word, but it is doing so much more than that. 'Autoregressive' brings to mind that it is using its previously generated output to create new output every time it comes up with another token. In that case you immediately understand that it would have to make the determination of severity after it has generated the description of the issue.
I mean, think of LLM output as unfiltered thinking. If you were to make that determination would you make it before you had thought it through?
My thinking too. Or similarly including some coding style and commenting style rules in the repository so they become part of the code base and get considered by the LLM
My boss can't stop nitpicking, and I've started telling him that "if your comment doesn't affect the output I'm ignoring it".
That is such an over-simplistic criterion! Reductio ad absurdum: pass your code through an obfuscator / simplifier and the output won't be affected.
We tried this too, better but not good enough. It also often labeled critical issues as nitpicks, which is unacceptable in our context.
Then some of the nitpicks were actually valid?
What about PossiblyWrong, then?
I am not sure about using UPPERCASE in prompts for emphasis. I feel intuitively that uppercase is less “understandable” for LLMs because it is more likely to be tokenized as a sequence of characters. I have no data to back this up, though.
I meant for that to be more an illustration - might do a longer post about the specific prompting techniques we tried.
Previously: https://news.ycombinator.com/item?id=42465374
Is there a name for the activity of trying out different strategies to improve the output of AI?
I found this article surprisingly enjoyable and interesting and if like to find more like it.
Prompt engineering
Seems easier than getting the same from my colleagues.
I wonder if it was trained on a lot of nitpicking comments from human reviews.
Here's an idea: have the LLM output each comment with a "severity" score ranging from 0-100 or maybe a set of possible values ("trivial", "small", "high"). Let it get everything off of its chest outputting the nitpicks but recognizing they're minor. Filter the output to only contain comments above a given threshold.
It's hard to avoid thinking of a pink elephant, but easy enough to consciously recognize it's not relevant to the task at hand.
The article authors tried this technique and found it didn't work very well.
Here's an idea: read the article and realize they already tried exactly that.
And how do you guys deal with a cold start problem then? Suppose the repo is new and has very few previous comments?
Do pooling with the average embedding of all customers
We do use some other techniques to filter out comments that are almost always considered useless by teams.
For the typical team size that uses us (at least 20+ engineers) the number of downvotes gets high enough to show results within a workday or two, and achieves something of a stable state within a week.
What about false positives?
As I see it, the solution assumes the embeddings only capture the form: say, if developers previously downvoted suggestions to wrap code in unnecessary try..catch blocks, then similar suggestions will be successfully blocked in the future, regardless of the module/class etc. (i.e. a kind of generalization)
But what if enough suggestions regarding class X (or module X) get downvoted, and then the mechanism starts assuming class X/module X doesn't need review at all? I mean the case when a lot of such embeddings end up clustering around the class itself (or a function), not around the general form of the comment.
How do you prevent this? Or it's unlikely to happen? The only metric I've found in the article is the percentage of addressed suggestions that made it to the end user.
This is the biggest pitfall of this method. It’s partially combatted by also comparing it against an upvoted set, so if a type of comment has been upvoted and downvoted in the past, it is not blocked.
Does it work on real engineers?
What do real engineers have to do with software? /jk
A quarter of real engineers are already Civil.
Think it would initially have gone better had they not used "nits" but rather "nitpicks", i.e. something that's in the dictionary that the chatbot is likely to understand.
Why review bots? Why not a bot that runs as part of the validation suite? And then you dismiss and follow up on what you want?
You can run that locally.
> $0.45/file capped at $50/dev/month
wow. this is really expensive... especially given core of this technology is open source and target customers can set it up themselves self-hosted
A cap of less than 1% of an average developer's pay is "really expensive"?
Who is this average developer? Layoffs have been happening left and right for the last 4 years, startups are getting shut down, funding for gov contracts is getting cut, and it is increasingly hard to find any job in software. Many guys I know are barely making it and would rather spend that money on their child or living expenses. "Average developers" in, say, China, the Philippines, India, Vietnam, or CIS countries make far too little to un-frugally spend it on a 50 USD subscription for something that may not even work well and may require a human anyways. Paying 0.40 USD "per file" is just ridiculous; this pricing immediately puts people off and is a non-starter.
UPD: oh, you mean management will fire SWEs and replace them with this? Well, yeah, then it makes sense to them. But the quality has to be good. And even then, many mid-to-large-size orgs I know are cutting all the subscriptions (particularly per-developer or per-box) they possibly can (e.g. Microsoft, Datadog, etc.), so even for them cost is of importance.
For a glorified linter? Absolutely.
For reference, IntelliJ Ultimate - a full IDE with leading language support - costs that much.
Are you still looking for slaves, err I mean employees, who are going to work 80+hrs a week? Do you sponsor visas?
> Attempt 2: LLM-as-a-judge
Wouldn't this be achievable with a classifier model? Maybe even a combo of getting the embedding and then putting it through a classifier? Kind of like how GANs work.
Edit: I read the article before the comment section, silly me lol
fwiw, not all the mobile site's menu items work when clicked
TLDR:
> Giving few-shot examples to the generator didn't work.
> Using an LLM-judge (with no training) didn't work.
> Using an embedding + KNN-classifier (lots of training data) worked.
I don't know why they didn't try fine-tuning the LLM-judge, or at least give it some few-shot examples.
But it shows that embeddings can make very simple classifiers work well.
[flagged]
[flagged]
"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."
https://news.ycombinator.com/newsguidelines.html
he's not wrong
We're trying for intellectual curiosity here.
https://news.ycombinator.com/newsguidelines.html
There is intellectual curiosity, and there is trying desperately to prove over and over that the earth is flat.
In general, yes: but these things don't work. We know they don't work. They don't work in theory, and I don't think I'm exaggerating when I say this is the hundredth article demonstrating that even people invested in making it work can't make it work because it doesn't work.
Past a certain point, shallow dismissals are all that an intellectually-curious person has: only those with the unfortunate habit of repeating well-trod arguments on the internet have very much more to say.
The whole article is "we couldn't make the autocomplete bot solve this task", written by somebody who (for whatever reason) isn't using this framing, and has tried things that couldn't possibly work. The article even calls that out!
> some might argue LLMs are architecturally incapable of that
And yet, they consider it "remarkable" that a technique with a theoretical basis (clustering of sentence vectors) can do a substantially better job. No, there's nothing in this article worth commenting on.
I would say they are useful but they aren’t magic (at least yet) and building useful applications on top of them requires some work.
Not only do LLMs require some work to build anything useful, they are also fuzzy and nondeterministic, requiring you to tweak your prompting tricks once in a while, and they are also quite expensive.
Nothing but advantages.