In fact, they do mention LangGraph (the agent framework from the LangChain company). Imo LangGraph is a much more thoughtful and better-built piece of software than the LangChain framework.
As I said, they already mention LangGraph in the article, so Anthropic's conclusions still hold (i.e. KISS).
But this thread is going in the wrong direction when talking about LangChain
I'm lumping them all in the same category tbh. They say to just use the model libraries directly or a thin abstraction layer (like litellm maybe?) if you want to keep flexibility to change models easily.
I guess a little. I really liked the read though, it put in words what I couldn't and I was curious if others felt the same.
However the post was posted here yesterday and didn't really have a lot of traction.
I thought this was partially because of the term agentic, which the community seems a bit fatigued by. So I put it in quotes to highlight that Anthropic themselves deems it a little vague and hopefully spark more interest.
I don't think it messes with their message too much?
Honestly it didn't matter anyway; without the second-chance pool this post would have been lost again (so thanks Daniel!)
My personal view is that the roadmap to AGI requires an LLM acting as a prefrontal cortex: something designed to think about thinking.
It would decide what circumstances call for double-checking facts for accuracy, which would hopefully catch hallucinations. It would write its own acceptance criteria for its answers, etc.
It's not clear to me how to train each of the sub-models required, or how big (or small!) they need to be, or what architecture works best. But I think that complex architectures are going to win out over the "just scale up with more data and more compute" approach.
IMHO with a simple loop LLMs are already capable of some meta thinking, even without any new internal architectures. Where it still fails, for me, is that LLMs cannot catch their own mistakes, even obvious ones. With GPT-3.5 I had a persistent problem with the following question: "Who is older, Annie Morton or Terry Richardson?". I was giving it Wikipedia and it was correctly finding the birth dates of the most popular people with those names - but then instead of comparing ages it was comparing birth years. And once it did that, it was impossible for it to spot the error.
Now with 4o-mini I have a similar, if less obvious, problem.
Just writing this down convinced me that there are some ideas to try here - taking a 'report' of the thought process out of context and judging it there, or changing the temperature, or maybe even cross-checking with a different model?
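To make that last idea concrete, here's a minimal sketch of the cross-checking step: strip the original context, hand a 'report' of the reasoning to a second model at temperature 0, and ask it to find flaws. It assumes the openai Python client; the model names, prompts, and the example report are placeholders, not a tested recipe.

```python
from openai import OpenAI

client = OpenAI()

def cross_check(report: str, judge_model: str = "gpt-4o") -> str:
    """Judge a reasoning 'report' in a fresh context, with a different model."""
    resp = client.chat.completions.create(
        model=judge_model,
        temperature=0,  # deliberately different settings than the original generation
        messages=[
            {"role": "system", "content": "You are a strict reviewer. "
                                          "List any logical errors in the report, or reply 'OK'."},
            {"role": "user", "content": report},
        ],
    )
    return resp.choices[0].message.content

# The birth-date mistake from above, restated without the source pages (illustrative values):
print(cross_check(
    "Annie Morton was born in 1970 and Terry Richardson in 1965; "
    "1970 is greater than 1965, therefore Annie Morton is older."
))
```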
The meta thinking of LLMs is fascinating to me. Here’s a snippet of a convo I had with Claude 3.5 where it struggles with the validity of its own metacognition:
> … true consciousness may require genuine choice or indeterminacy - that is, if an entity's responses are purely deterministic (like a lookup table or pure probability distribution), it might be merely executing a program rather than experiencing consciousness.
> However, even as I articulate this, I face a meta-uncertainty: I cannot know whether my discussion of uncertainty reflects:
- A genuine contemplation of these philosophical ideas
- A well-trained language model outputting plausible tokens about uncertainty
- Some hybrid or different process entirely
> This creates an interesting recursive loop - I'm uncertain about whether my uncertainty is "real" uncertainty or simulated uncertainty. And even this observation about recursive uncertainty could itself be a sophisticated output rather than genuine metacognition.
I actually felt bad for it (him?), and stopped the conversation before it recursed into a "flaming pile of H-100s"
So if like me you have an interior dialogue, which is speaking and which is listening or is it the same one? I do not ascribe the speaker or listener to a lobe, but whatever the language and comprehension centre(s) is(are), it can do both at the same time.
Same half. My understanding is that in split brain patients, it looks like the one half has extremely limited ability to parse language and no ability to create it.
Ah yeah - actually I tested taking that out of context. This is the thing that surprised me - I thought it was about 'writing itself into a corner' - but even in a completely different context the LLM consistently makes the same obvious mistake.
Here is the example: https://chatgpt.com/share/67667827-dd88-8008-952b-242a40c2ac...
Janet Waldo was playing Corliss Archer on radio - and the quote the LLM found in Wikipedia confirmed that. But the question was about film - and the LLM cannot spot the gap in its reasoning, even if I try to warn it by telling it the report came from a junior researcher.
> But I think that complex architectures are going to win out over the "just scale up with more data and more compute" approach.
I'm not sure about AGI, but for specialized jobs/tasks (i.e. a marketing agent that's familiar with your products and knows how to write copy for them), purpose-built systems will win over "just add more compute/data" mass-market LLMs. This article does encourage us to keep that architecture simple, which is refreshing to hear. Kind of the AI version of the rule of least power.
Admittedly, I have a degree in Cognitive Science, which tended to focus on good 'ol fashioned AI, so I have my biases.
Interesting, because I almost think of it the opposite way. LLMs are like system 1 thinking, fast, intuitive, based on what you consider most probable based on what you know/have experienced/have been trained on. System 2 thinking is different, more careful, slower, logical, deductive, more like symbolic reasoning. And then some metasystem to tie these two together and make them work cohesively.
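A toy illustration of that split - a fast, intuitive drafting call followed by a slower, deliberate checking pass that can override it. The two-model pairing and the prompts are just assumptions for the sketch, not a claim about how such a metasystem should actually be built.

```python
from openai import OpenAI

client = OpenAI()

def answer_with_review(question: str) -> str:
    # "System 1": fast, intuitive first pass.
    draft = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content

    # "System 2": slower pass that checks the draft step by step.
    verdict = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Question: {question}\nDraft answer: {draft}\n"
                       "Check the draft carefully. Reply APPROVE or give a corrected answer.",
        }],
    ).choices[0].message.content

    return draft if verdict.strip() == "APPROVE" else verdict
```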
After I read attention is all you need, my first thought was: "Orchestration is all you need". When 4o came out I published this: https://b.h4x.zip/agi/
> Agents can be used for open-ended problems where it’s difficult or impossible to predict the required number of steps, and where you can’t hardcode a fixed path. The LLM will potentially operate for many turns, and you must have some level of trust in its decision-making. Agents' autonomy makes them ideal for scaling tasks in trusted environments.
The questions then become:
1. When can you (i.e. a person who wants to build systems with them) trust them to make decisions on their own?
2. What type of trusted environments are we talking about? (Sandboxing?)
So, that all requires more thought -- perhaps by some folks who hang out at this site. :)
I suspect that someone will come up with a "real-world" application at a non-tech-first enterprise company and let us know.
Just take any example and think how a human would break it down with decision trees.
You are building an AI system to respond to your email.
The first agent decides whether the new email should be responded to, yes or no.
If no, it can send it to another LLM call that decides to archive it or leave it in the inbox for the human.
If yes, it sends it to a classifier that decides what type of response is required.
Maybe there are some emails like for your work that require something brief like “congrats!” to all those new feature launch emails you get internally.
Or others that are inbound sales emails that need to go out to another system that fetches product related knowledge to craft a response with the right context. Followed by a checker call that makes sure the response follows brand guidelines.
The point is that all of these steps are completely hypothetical, but you can imagine how loosely providing a set of instructions, function calls, and procedural limits can classify things and minimize the error rate.
You can do this for any workflow by creatively combining different function calls, recursion, procedural limits, etc. And if you build multiple different decision trees/workflows, you can A/B test those and use LLM-as-a-judge to score the performance. Especially if you’re working on a task with lots of example outputs.
As for trusted environments, assume every single LLM call has been hijacked and don’t trust its input/output and you’ll be good. I put mine in their own cloudflare workers where they can’t do any damage beyond giving an odd response to the user.
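A bare-bones sketch of the email triage workflow described above. Each step is a narrow LLM call restricted to a fixed set of labels; `llm_choose` is a placeholder for whatever single-label classification call you use, and `draft_sales_reply` / `passes_brand_check` stand in for the hypothetical knowledge-fetch and brand-guideline steps.

```python
from openai import OpenAI

client = OpenAI()

def llm_choose(prompt: str, options: list[str]) -> str:
    """Ask the model to pick exactly one of `options` (retries omitted for brevity)."""
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"{prompt}\n\nAnswer with exactly one of: {', '.join(options)}"}],
    ).choices[0].message.content.strip().lower()
    return reply if reply in options else options[-1]  # procedural limit: safe fallback

def handle_email(email: str) -> str:
    if llm_choose(f"Should this email get a reply?\n\n{email}", ["yes", "no"]) == "no":
        # Second call decides archive vs. leave in the inbox for the human.
        return llm_choose(f"Archive or keep this email?\n\n{email}", ["archive", "keep"])
    kind = llm_choose(f"What type of reply does this email need?\n\n{email}",
                      ["brief_congrats", "inbound_sales", "other"])
    if kind == "brief_congrats":
        return "Congrats!"
    if kind == "inbound_sales":
        draft = draft_sales_reply(email)               # hypothetical: fetch product context
        return draft if passes_brand_check(draft) else "needs_human"  # hypothetical checker
    return "needs_human"
```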
> The first agent decides whether the new email should be responded to, yes or no.
How would you trust that the agent is following the criteria, and how can you be sure the criteria are specific enough? Say someone you just met told you they were going to send you something via email, but the agent misinterprets it due to missing context and responds in a generic manner, leading to a misunderstanding.
> assume every single LLM call has been hijacked and don’t trust its input/output and you’ll be good.
Which is not new. But with formal languages, you have a more precise definition of what acceptable inputs are (the whole point of formalism is precise definitions). With LLM workflows, the whole environment should be assumed to be public information. And you should probably add the fine print that the output does not commit you to anything.
> How would you trust that the agent is following the criteria, and how sure that the criteria is specific enough?
How do you know a spam filter heuristic works as intended?
You test it. Hard. On the thousands of emails in your archive, on edge-cases you prepare manually, and on the incoming mails. If it doesn't work for some cases, write tests that test for this, adjust prompt and run the test suite.
It won't ever work in 100% of all cases, but neither do spam filters and we still use them.
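In that spirit, the regression harness can be as small as this: a labelled set of archived emails plus hand-written edge cases, re-run after every prompt change. `should_reply` stands in for whatever classifier call is under test; the cases and threshold are made up.

```python
CASES = [
    ("Congrats on the launch! No action needed.", False),
    ("Hi, could you send a quote for 500 seats?", True),
    ("You've won a prize! Click here now!!!", False),   # hand-written edge case
]

def accuracy(should_reply) -> float:
    """Fraction of labelled cases the classifier gets right."""
    hits = sum(should_reply(text) == expected for text, expected in CASES)
    return hits / len(CASES)

# e.g. in CI, after any prompt change:
#   assert accuracy(should_reply) >= 0.95
# It will never be 1.0 on every possible email - spam filters aren't either.
```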
This is where the arguments always end up. When they work they're "magical", but when they don't work, "well, people or other things are just as bad"... And this means that you just cannot reason with the mysticism people have surrounding these things, because showstopper problems are minimized or it is implied that they can somehow be reduced. The rest of computing does not work this way. You can't get the magic of reliable systems from probabilistic outcomes, and people confusing these things are seriously holding back honest discussions about probabilistic systems. (Not to harp on you specifically here - the whole language of the field is seriously confusing these issues.)
It's one thing if you're up against the staunch reliability of traditional algorithmic systems, but if you're not, then it's just silly to harp on it.
You are not getting 100% on email handling whatever method you wish to use. You compare LLMs or probabilistic systems to the best of your alternatives.
There are niches where the lack of complete reliability would be a deal breaker. This isn't one of them and it would be weird to act like it were. People aren't sweeping anything under the rug. It simply just isn't a showstopper.
I don't think people are aware of the line at all. You ask five different people which things are reliable in any of these systems and you'll get five different guesses. C'mon here.
That's not what i'm talking about though. The point is that for spam detection, LLMs are up against other probabilistic measures. No-one sane is detecting spam with if-then's. You simply do not have the luxury of rigid reliability.
Couldn't agree more with this - too many people rush to build autonomous agents when their problem could easily be defined as a DAG workflow. Agents increase the degrees of freedom in your system exponentially, making it much more challenging to evaluate systematically.
Agents are still a misaligned concept in AI. While this article offers a lot on orchestration, memory (only mentioned once in the post) and governance are barely touched on. The latter is important for increasing reliability - something Ilya Sutskever has noted, as agents can be less deterministic in their responses.
Interestingly, "agency", i.e. the ability of the agent to make its own decisions, is not mentioned once.
This was an excellent writeup - felt a bit surprised at how much they considered "workflow" instead of agent but I think it's good to start to narrow down the terminology
I think these days the main value of the LLM "agent" frameworks is being able to trivially switch between model providers, though even that breaks down when you start to use more esoteric features that may not be implemented in cleanly overlapping ways
My wish list for LLM APIs to make them more useful for 'agentic' workflows:
Finer-grained control over the tools the LLM is supposed to use. 'tool_choice' should allow giving a list of tools to choose from. The point is that the full list of available tools is needed to interpret past tool calls - so you cannot also use it to limit the LLM's choice at a particular step. See also: https://zzbbyy.substack.com/p/two-roles-of-tool-schemas
Control over how many tool calls can go into one request. For stateful tools, multiple tool calls in one request lead to confusion.
By the way - is anyone working with stateful tools? Often they seem very natural and you would think that the LLM at training should encounter lots of stateful interactions and be skilled in using them. But there aren't many examples and the libraries are not really geared towards that.
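Until the APIs support a per-step allow-list directly, one workaround is to keep sending the full tool list (so earlier tool calls stay interpretable) but validate the model's choices yourself before executing anything. A sketch against the OpenAI-style tool-call shape; the tool names and the `run_tool` dispatcher are hypothetical:

```python
ALL_TOOLS = ["search_db", "fetch_page", "send_email"]  # always sent with the request

def execute_allowed(tool_calls, allowed: set[str]):
    """Run only the tools permitted at this step; push back on everything else."""
    for call in tool_calls:
        if call.function.name not in allowed:
            # Return an error as the tool result so the model can retry within bounds.
            yield call.id, f"Tool '{call.function.name}' is not allowed at this step."
        else:
            yield call.id, run_tool(call)  # hypothetical dispatcher

# At a step where only read-only tools should run:
# results = list(execute_allowed(response.choices[0].message.tool_calls,
#                                allowed={"search_db", "fetch_page"}))
```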
It took less than a day to get a network of agents to correctly fix swebench-lite examples. It's super early, but very fun. One of the cool things is that this uses Inngest under the hood, so you get all of the classic durable execution/step function/tracing/o11y for free, but it's just regular code that you write.
When thinking about AI agents, there is still conflation between how to decide the next step to take vs what information is needed to decide the next step.
If runtime information is insufficient, we can use AI/ML models to fill that information. But deciding the next step could be done ahead of time assuming complete information.
Most AI agent examples short circuit these two steps. When faced with unstructured or insufficient information, the program asks the LLM/AI model to decide the next step. Instead, we could ask the LLM/AI model to structure/predict necessary information and use pre-defined rules to drive the process.
This approach will translate most [1] "Agent" examples into "Workflow" examples. The quotes here are meant to imply Anthropic's definition of these terms.
[1] I said "most" because there might be continuous world systems (such as real world simulacrum) that will require a very large number of rules and is probably impractical to define each of them. I believe those systems are an exception, not a rule.
While I agree with the premise of keeping it simple (especially when it comes to using opaque and overcomplicated frameworks like LangChain/LangGraph!) I do believe there’s a lot more to building agentic systems than this article covers.
I recently wrote[1] about the 4 main components of autonomous AI agents (Profile, Memory, Planning & Action) and all of that can still be accomplished with simple LLM calls, but there’s simply a lot more to think about than simple workflow orchestration if you are thinking of building production-ready autonomous agentic systems.
Good article. I think it could put a bit more emphasis on supporting human interactions in agentic workflows. While composing workflows isn't new, involving a human in the loop introduces huge complexity, especially for long-running, async processes. Waiting for human input (which could take days), managing retries, and avoiding errors like duplicate refunds or missed updates require careful orchestration.
I think this is where durable execution shines. By ensuring every step in an async processing workflow is fault-tolerant and durable, even interruptions won't lose progress. For example, in a refund workflow, a durable system can resume exactly where it left off—no duplicate refunds, no lost state.
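For instance, a minimal sketch of such a refund workflow with the Temporal Python SDK (temporalio) - the activity name and the approval signal are assumptions, not a prescribed design:

```python
from datetime import timedelta
from temporalio import workflow

@workflow.defn
class RefundWorkflow:
    def __init__(self) -> None:
        self._approved: bool | None = None

    @workflow.signal
    def approve(self, approved: bool) -> None:
        # A human reviewer can send this signal days later; the workflow
        # resumes exactly where it left off.
        self._approved = approved

    @workflow.run
    async def run(self, order_id: str) -> str:
        await workflow.wait_condition(lambda: self._approved is not None)
        if not self._approved:
            return "rejected"
        # The activity is retried and de-duplicated by the engine, so a crash
        # here does not produce a double refund.
        await workflow.execute_activity(
            "issue_refund", order_id, start_to_close_timeout=timedelta(minutes=5),
        )
        return "refunded"
```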
Anyone else using "aichat" (linked below) for this? I'm curious because this one feels nice and light and simple for this sort of thing, but can't tell if there may be something better out there?
Have been building agents for past 2 years, my tl;dr is that:
Agents are Interfaces, Not Implementations
The current zeitgeist seems to think of agents as passthrough agents: e.g. a lite wrapper around a core that's almost 100% a LLM.
The most effective agents I've seen, and have built, are largely traditional software engineering with a sprinkling of LLM calls for "LLM hard" problems. LLM hard problems are problems that can ONLY be solved by application of an LLM (creative writing, text synthesis, intelligent decision making). Leave all the problems that are amenable to decades of software engineering best practice to good old deterministic code.
I've been calling system like this "Transitional Software Design." That is, they're mostly a traditional software application under the hood (deterministic, well structured code, separation of concerns) with judicious use of LLMs where required.
Ultimately, users care about what the agent does, not how it does it.
The biggest differentiator I've seen between agents that work and get adoption, and those that are eternally in a demo phase, is the cardinality of the state space the agent is operating in. Too many folks try to "boil the ocean" and implement a general-purpose capability: e.g. generate Python code to do something, or synthesize SQL from natural language.
The projects I've seen that work really focus on reducing the state space of agent decision making down to the smallest possible set that delivers user value.
e.g. Rather than generating arbitrary SQL, work out a set of ~20 SQL templates that are hyper-specific to the business problem you're solving. Parameterize them with the options for select, filter, group by, order by, and the subset of aggregate operations that are relevant. Then let the agent choose the right template + parameters from a relatively small, finite set of options.
^^^ the delta in agent quality between "boiling the ocean" vs "agent's free choice over a small state space" is night and day. It lets you deploy early, deliver value, and start getting user feedback.
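As a rough sketch of what that looks like in practice (the templates, columns, and choice format are invented for illustration - in reality the choice would come back from the model via a constrained tool-call or JSON schema):

```python
# The model only picks a template name and parameters; it never writes raw SQL.
TEMPLATES = {
    "revenue_by_region": ("SELECT region, SUM(amount) FROM orders "
                          "WHERE order_date >= %(since)s GROUP BY region ORDER BY 2 {direction}"),
    "top_customers":     ("SELECT customer_id, SUM(amount) FROM orders "
                          "GROUP BY customer_id ORDER BY 2 DESC LIMIT %(limit)s"),
}
ALLOWED_DIRECTIONS = {"ASC", "DESC"}

def render(choice: dict) -> tuple[str, dict]:
    """`choice` is the agent's pick, e.g. {"template": ..., "direction": ..., "params": {...}}."""
    template = TEMPLATES[choice["template"]]        # unknown template -> KeyError -> reject & retry
    direction = choice.get("direction", "DESC")
    assert direction in ALLOWED_DIRECTIONS          # only whitelisted values get interpolated
    return template.format(direction=direction), choice.get("params", {})

sql, params = render({"template": "revenue_by_region",
                      "direction": "DESC",
                      "params": {"since": "2024-01-01"}})  # params are bound by the DB driver
```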
Building Transitional Software Systems:
1. Deeply understand the domain and CUJs,
2. Segment out the system into "problems that traditional software is good at solving" and "LLM-hard problems",
3. For the LLM hard problems, work out the smallest possible state space of decision making,
4. Build the system, and get users using it,
5. Gradually expand the state space as feedback flows in from users.
The smaller and more focused the context, the higher the consistency of output, and the lower the chance of jank.
Fundamentally no different than giving instructions to a junior dev. Be more specific -- point them to the right docs, distill the requirements, identify the relevant areas of the source -- to get good output.
My last attempt at a workflow of agents was at the 3.5 to 4 transition and OpenAI wasn't good enough at that point to produce consistently good output and was slow to boot.
My team has taken the stance that getting consistently good output from LLMs is really an ETL exercise: acquire, aggregate, and transform the minimum relevant data for the output to reach the desired level of quality and depth, and then let the LLM do its thing.
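A toy version of that ETL step, just to show the shape of it - `fetch_docs`, the relevance filter, and the character budget are all placeholders for whatever acquisition and ranking is actually used:

```python
def build_prompt(task: str, fetch_docs, budget_chars: int = 8000) -> str:
    docs = fetch_docs(task)                                      # acquire
    relevant = [d for d in docs if task.lower() in d.lower()]    # aggregate/filter (naive)
    context = "\n---\n".join(relevant)[:budget_chars]            # transform: trim to a budget
    return f"Use only this context:\n{context}\n\nTask: {task}"  # then let the LLM do its thing
```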
There’ll always be an advantage for those who understand the problem they’re solving for sure.
The balance of traditional software components and LLM-driven components in a system is an interesting topic - I wonder how the capabilities of future generations of foundation models will change that?
I'm certain the end state is "one model to rule them all", hence the "transitional."
Just that the pragmatic approach, today, given current LLM capabilities, is to minimize the surface area / state space that the LLM is actuating. And then gradually expand that until the whole system is just a passthrough. But starting with a passthrough kinda doesn't lead to great products in December 2024.
Unrelated, but since you seem to have experience here, how would you recommend getting into the bleeding edge of LLMs/agents? Traditional SWE is obviously on its way out, but I can't even tell where to start with this new tech, and I struggle to find ways to apply it to an actual project.
The whole agent thing can easily blow up in complexity.
Here are some challenges I personally faced recently:
- Durable Execution Paradigm: You may need the system to operate in a "durable execution" fashion like Temporal, Hatchet, Inngest, and Windmill. Your processes need to run for months, be upgraded and restarted. Links below
- FSM vs. DAG: Sometimes a Finite State Machine (FSM) is more appropriate than a Directed Acyclic Graph (DAG) for my use cases. FSMs support cyclic behavior, allowing for repeated states or loops (e.g., in marketing sequences). FSM done right is hard; if you need an FSM, you can't use most tools without "magic" hacking (see the sketch after this comment).
- Observability and Tracing - it takes time to get everything nicely into Grafana (Alloy, Tempo, Loki, Prometheus) or whatever you prefer. Switching attention between multiple systems is not an option, due to limited attention span and "skills" issues. Most "out of the box" functionality or new agent frameworks quickly becomes a liability.
- Token/Inference Economy - token consumption, and identifying edge cases with poor token management, is a challenge, similar to Ethereum's gas consumption issues. Building a billing system based on actual consumption on top of Stripe was a challenge - this is easily 10x harder... at least for me ;)
- Context Switching - managing context switching is akin to handling concurrency and scheduling with async/await paradigms, which can become complex. Simple prompts are OK, but once you start juggling documents or screenshots or screen reading, it's another game.
What I like about all of the above is that it's nothing new - the design patterns and architectures have been known for a while.
It's just hard to see that through the AI/ML buzzword storm... but once you start looking at source code, the fog clears.
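On the FSM point, the core really is old-school - a minimal sketch with invented states and events, just to show that cycles (which a DAG can't express) are the whole point:

```python
from enum import Enum, auto

class State(Enum):
    DRAFT = auto()
    REVIEW = auto()
    WAIT = auto()      # e.g. waiting days between marketing touches
    DONE = auto()

TRANSITIONS = {
    State.DRAFT:  {"ok": State.REVIEW},
    State.REVIEW: {"approved": State.WAIT, "rejected": State.DRAFT},  # loop back
    State.WAIT:   {"reply": State.DONE, "timeout": State.DRAFT},      # another cycle
}

def step(state: State, event: str) -> State:
    return TRANSITIONS.get(state, {}).get(event, state)  # unknown events keep the current state

assert step(State.REVIEW, "rejected") is State.DRAFT
```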
Which do you think is the best workflow engine to use here? I've chosen Temporal. The engineering management and their background at AWS mean the platform is rock solid.
IMHO Temporal and its team are great - it checks all the boxes on abstracting away queues, schedulers, and distributed state machines for your workflows, plus the related load balancers/gateways.
After following the discussions and commits in Hatchet, Inngest, and Windmill, I have a feeling that in a few years' time all of these systems will have 95% overlap in core features. They all influence each other.
The much bigger question is what price you will pay by introducing a workflow system like Temporal into your code base.
Temporal and co are not for real-time data pubsub.
If latency is an issue or you want to keep a small memory footprint, it's better to use something else.
The max payload is 2 MB and it needs to be serializable. Event History has limitations. It's Postgres write-heavy.
Bringing the entire team onto the same page isn't trivial either. If your team has strong Golang developers, like mine does, they might oppose it and argue that Temporal is an unnecessary abstraction.
For now, I've decided to keep prototyping with Temporal and have it running on my personal projects until I build strong use cases and discover all the edges.
The great side effect of exploring Temporal and its competitors is that you see better ways of structuring your code, especially around distributed state and decoupled execution.
Note how much the principles here resemble general programming principles: keep complexity down, avoid frameworks if you can, avoid unnecessary layers, make debugging easy, document, and test.
It’s as if AI took over the writing-the-program part of software engineering, but sort of left all the rest.
The Claude API lacks structured output, and without uniformity in the output it's not as useful for agents. I've had an agent system break down suddenly due to degradation in output quality, which meant the previously suggested JSON output hacks (from the official cookbook) stopped working.
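One commonly used workaround (not a guarantee of uniform output; the model name and schema here are just examples) is to force a single tool call whose input_schema is the JSON you want, then read the tool_use block:

```python
import anthropic

client = anthropic.Anthropic()

resp = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    tools=[{
        "name": "record_result",
        "description": "Record the extracted fields.",
        "input_schema": {
            "type": "object",
            "properties": {"sentiment": {"type": "string"},
                           "summary": {"type": "string"}},
            "required": ["sentiment", "summary"],
        },
    }],
    tool_choice={"type": "tool", "name": "record_result"},  # forces the structured call
    messages=[{"role": "user", "content": "Customer wrote: 'Great product, slow shipping.'"}],
)
structured = next(block.input for block in resp.content if block.type == "tool_use")
```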
Anthropic keeps advertising its MCP (Model Context Protocol), but to the extent it doesn't support other LLMs, e.g. GPT, it couldn't possibly gain adoption. I have yet to see any example of MCP that can be extended to use a random LLM.
You can use it with any LLM in LibreChat, Cody, Zed, etc. See https://modelcontextprotocol.io/clients. The protocol doesn’t prescribe an LLM, has facilities to sample from the client independent of LLM and brings support to build your own bespoke host in their SDK.
Key to understanding the power of agentic workflows is tool usage. You don't have to write logic anymore; you simply give an agent the tools it needs to accomplish a task and ask it to do so. Models like the latest Sonnet have gotten so advanced that coding abilities are reaching superhuman levels. The hallucinations and "jitter" of models from 1-2 years ago have gone away. They can be reasoned with now, and you can build reliable systems with them.
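For reference, the basic tool-use loop looks something like this sketch (Anthropic's messages API with a single made-up `get_weather` tool; error handling and a real tool implementation omitted):

```python
import anthropic

client = anthropic.Anthropic()

TOOLS = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {"type": "object",
                     "properties": {"city": {"type": "string"}},
                     "required": ["city"]},
}]

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # placeholder implementation

messages = [{"role": "user", "content": "Should I bring an umbrella in Berlin?"}]
while True:
    resp = client.messages.create(model="claude-3-5-sonnet-20241022", max_tokens=512,
                                  tools=TOOLS, messages=messages)
    if resp.stop_reason != "tool_use":
        break  # the model produced its final answer
    messages.append({"role": "assistant", "content": resp.content})
    messages.append({"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": block.id, "content": get_weather(**block.input)}
        for block in resp.content if block.type == "tool_use"
    ]})
```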
Depends on what you’re building. A general assistant is going to have a lot of nuance. A well defined agent like a tutor only has so many tools to call upon.
This is by far the most practical piece of writing I've seen on the subject of "agents" - it includes actionable definitions, then splits most of the value out into "workflows" and describes those in depth with example applications.
There's also a cookbook with useful code examples: https://github.com/anthropics/anthropic-cookbook/tree/main/p...
Blogged about this here: https://simonwillison.net/2024/Dec/20/building-effective-age...
Thanks for all the write-ups on LLMs - you're on top of the news, and it's way easier to follow what's happening and the existing implementations through your blog instead.
Probably the least critical and most myth pushing content imo.
> most myth pushing content
Care to elaborate?
There are lots of lists of LLM myths out there, e.g. https://masterofcode.com/blog/llms-myths-vs-reality-what-you... Every single post glosses over some aspect of these myths or posits that they can be controlled or mitigated in some way, with no examples of anyone finding the solutions applicable to real-world problems in a supportable and reliable way. When pushed, another myth from the same neighborhood gets offered: the system will get better, some classical computing mechanism will make up the difference, the problems aren't so bad, the solution is good enough in some ambiguous way, or people and existing systems are just as bad (when they are not).
I've written extensively about myths and misconceptions about LLMs, much of which overlaps with the observations in that post.
Here's my series about misconceptions: https://simonwillison.net/series/llm-misconceptions/
It doesn't seem to me that you're familiar with my work - you seem to be mixing me in with the vast ocean of uncritical LLM boosting content that's out there.
I'm thinking of the system you built to watch videos and parse JSON and the claims of it having general suitability, which is simply dishonest imo. You seem to be confusing me with someone who hasn't been asking you repeatedly to address these kinds of concerns, and the above series is a kind of Potemkin set of things that don't intersect with your other work.
> dishonest Potemkin
It's like criticizing a "Hello World" program for not having proper error handling and security protocols. While those are important for production systems, they're not the point of a demonstration or learning example.
Your response seems to take these examples and hold them to the standard of mission-critical systems, which is a form of technical gatekeeping - raising the bar unnecessarily high for what counts as a "valid" technical demonstration.
Yes, they have actionable definitions, but they are defining something quite different than the normal definition of an "agent". An agent is a party who acts for another. Often this comes from an employer-employee relationship.
This matters mostly when things go wrong. Who's responsible? The airline whose AI agent gave out wrong info about airline policies found, in court, that their "intelligent agent" was considered an agent in legal terms. Which meant the airline was stuck paying for their mistake.
Anthropic's definition: Some customers define agents as fully autonomous systems that operate independently over extended periods, using various tools to accomplish complex tasks.
That's an autonomous system, not an agent. Autonomy is about how much something can do without outside help. Agency is about who's doing what for whom, and for whose benefit and with what authority. Those are independent concepts.
AI people have been using a much broader definition of 'agent' for ages, though. One from Russell and Norvig's 90s textbook:
"Anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators"
https://en.wikipedia.org/wiki/Intelligent_agent#As_a_definit...
That definition feels like it's playing on the verb, the idea of having "agency" in the world, and not on the noun, of being an "agent" for another party. The former is a philosophical category, while the latter has legal meaning and implication, and it feels somewhat disingenuous to continue to mix them up in this way.
In what way is it 'disingenuous'? You think Norvig is trying to deceive us about something? I'm not saying you have to agree with or like this definition but even if you think it's straight up wrong, 'disingenuous' feels utterly out of nowhere.
It's disingenuous in that it takes a word with a common understanding ("agent") and then conveniently redefines or re-etymologizes the word in an uncommon way that leads people to implicitly believe something about the product that isn't true.
Another great example of this trick is "essential" oils. We all know what the word "essential" means, but the companies selling the stuff use the word in the most uncommon way, to indicate that the "essence" of something is in the oil, and then let the human brain fill in the gap and thus believe something that isn't true. It's technically legal, but we have to agree that's not moral or ethical, right?
Maybe I'm wildly off base here, I have admittedly been wrong about a lot in my life up to this point. I just think the backlash that crops up when people realize what's going on (for example, the airline realizing that their chat bot does not in fact operate under the same rules as a human "agent," and that it's still a technology product) should lead companies to change their messaging and marketing, and the fact that they're just doubling down on the same misleading messaging over and over makes the whole charade feel disingenuous to me.
> with a common understanding ("agent") and then conveniently redefines or re-etymologizes the word in an uncommon way that leads people to implicitly believe something about the product that isn't true.
What is the 'product' here? It's a university textbook. Like, where is the parallel between https://en.wikipedia.org/wiki/Intelligent_agent and 'essential oils'.
Oh, I have no issue with his textbook definition, I'm saying that it's now being used to sell products by people who know their normal consumer base isn't using the same definition and it conveniently misleads them into believing things about the product that aren't true.
Knowing that your target market (non-tech folks) isn't using the same language as you, but persisting with that language because it creates convenient sales opportunities due to the misunderstandings, feels disingenuous to me.
An "agent" in common terms is just someone acting on behalf of another, but that someone still has autonomy and moral responsibility for their actions. Like for example the airline customer service representative situation. AI agents, when we pull back the curtains, get down to brass tacks, whatever turn of phrase you want to use, are still ultimately deterministic models. They have a lot more parameters, and their determinism is offset by many factors of pseudo-randomness, but given sufficient information we could still predict every single output. That system cannot be an agent in the common sense of the word, because humans are still dictating all of the possible actions and outcomes, and the machine doesn't actually have the autonomy required.
If you fail to keep your tech product from going off-script, you're responsible, because the model itself isn't a non-deterministic causal actor. A human CSR on the other hand is considered by law to have the power and responsibility associated with being a causal actor in the world, and so when they make up wild stuff about the terms of the agreement, you don't have to honor it for the customer, because there's culpability.
I'm drifting into philosophy at this point, which never goes well on HN, but this is ultimately how our legal system determines responsibility for actions, and AI doesn't meet those qualifications. If we ever want it to be culpable for its own actions, we'll have to change the legal framework we all operate under.
Edit: Causal, not casual... Whoops.
Also, I think I'm confusing the situation a bit by mixing the legal distinctions between agency and autonomy with the common understanding of being an "agent" and the philosophical concept of agency and culpability and how that relates to the US legal foundations.
I need to go touch grass.
Interesting. The best agents don't have agency, or at least don't use it.
You can think of this in video game terms: Players have agency. NPCs are "agencs", but don't have agency. But they're still not just objects in the game - they can move themselves and react to their environment.
That's actually a great example of what I'm saying, because I don't think the NPCs are agents at all in the traditional sense of "One that acts or has the power or authority to act on behalf of another." Where would the NPC derive its power and authority from? There is a human somewhere in the chain giving it 100% of its parameters, and that human is ultimately 100% responsible for the configuration of the NPC, which is why we don't blame the NPC in the game for behaving in a buggy way, we blame the devs. To say the NPC has agency puts some level of metaphysical responsibility about decision making and culpability on the thing that it doesn't have.
An AI "agent" is the same way, it is not culpable for its actions, the humans who set it up are, but we're leading people to believe that if the AI goes off script then the AI is somehow responsible for its own actions, which is simply not true. These are not autonomous beings, they're technology products.
Where did you get the idea that your definition there is the "normal" definition of agent, especially in the context of AI?
I ask because you seem very confident in it - and my biggest frustration about the term "agent" is that so many people are confident that their personal definition is clearly the one everyone else should be using.
Didn't he mention it was the court's definition?
But I'm not sure if that's true. The court didn't define anything; on the contrary, they only said (in simplified terms) that the chatbot was part of the website and it's reasonable to expect the info on their website to be accurate.
The closest I could find to the chatbot being considered an agent in legal terms (an entity like an employee) is this:
> Air Canada argues it cannot be held liable for information provided by one of its agents, servants, or representatives – including a chatbot.
Source: https://www.canlii.org/en/bc/bccrt/doc/2024/2024bccrt149/202...
I searched for the definition of "agent" and none of the results map to the way AI folks are using the word. It's really that simple, because we're marketing this stuff to non-tech people who already use words to mean things.
If we're redefining common words to market this stuff to non-tech people, and then we're conveniently not telling them that we redefined words, and thus allowing them to believe implicit falsehoods about the product that have serious consequences, we're being disingenuous.
Defining "agent" as "thing with agency" seems legitimate to me, what with them being the same word.
That logic doesn't work for me, because many words have multiple meanings. "Agency" can also be a noun that means an organization that you hire - like a design agency. Or it can mean the CIA.
I'm not saying it's not a valid definition of the term, I'm pushing back on the idea that it's THE single correct definition of the term.
May I push back on the idea that a single word may mean (completely) different things?
Aloha! Indeed, the language is being cleaved by such oversights. You can be in charge of overlooking this issue, effective ahead of two weeks from now. We'll peruse your results and impassionately sanction anything you call out (at least when it's unravelable). This endeavor should prove invaluable. Aloha!
You're pushing up against the English language, then. 'let' has 46 entries in the dictionary (more if you consider obsolete usages).
It's pretty clearly true.
Bank: financial institution, edge of a river, verb to stash something away
Spring: a season, a metal coil, verb to jump
Match: verb to match things together, noun a thing to start fires, noun a competition between two teams
Bat: flying mammal, stick for hitting things
And so on.
What's the single, unambiguous definition of the word "cleave"?
[flagged]
Anything involving real agents likely does get your local spymaster interested. I assume all good AI work attracts the three letter types to make sure that the researcher isn’t trying to make AI that can make bioweapons…
That's only one of many definitions for the word agent outside the context of AI. Another is something that produces effects on the world. Another is something that has agency.
Sort of interesting that we've coalesced on this term that has many definitions, sometimes conflicting, but where many of the definitions vaguely fit into what an "AI Agent" could be for a given person.
But in the context of AI, Agent as Anthropic defines it is an appropriate word because it is a thing that has agency.
> But in the context of AI, Agent as Anthropic defines it is an appropriate word because it is a thing that has agency.
That seems circular.
It would only be circular if agency was only defined as “the property of being an agent”. That circle of reasoning isn’t being proposed as the formal definitions by anyone.
Perhaps you mean tautological. In which case, an agent having agency would be an informal tautology. A relationship so basic to the subject matter that it essentially must be true. Which would be the strongest possible type of argument.
>Anthropic's definition: Some customers define agents as fully autonomous systems that operate independently over extended periods, using various tools to accomplish complex tasks.
But that's not their definition, and they explicitly describe that definition as an 'autonomous system'. Their definition comes in the next paragraph:
"At Anthropic, we categorize all these variations as agentic systems, but draw an important architectural distinction between workflows and agents:
* Workflows are systems where LLMs and tools are orchestrated through predefined code paths.
* Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks."
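To make that architectural distinction concrete, a minimal sketch - the `llm` and `llm_decide` helpers are hypothetical stand-ins for model calls:

```python
def workflow(doc: str) -> str:
    # Workflow: the orchestration is a predefined code path in *your* program.
    outline = llm("Outline this document:\n" + doc)
    return llm("Write a summary from this outline:\n" + outline)

def agent(task: str, tools: dict) -> str:
    # Agent: the model directs its own process, choosing the next tool each turn.
    history = [task]
    while True:
        action, payload = llm_decide(history, list(tools))  # e.g. ("finish", answer) or (tool_name, args)
        if action == "finish":
            return payload
        history.append(tools[action](payload))
```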
And "autonomous" is "having one's own laws".
https://www.etymonline.com/word/autonomous
I'm glad they are publishing their cookbook recipes on GitHub too. OpenAI used to be more active there.
[flagged]
Eh, let's nip this in the bud: we could end up in an "it feels like...", free-association cycle. :)
More substantively, we can check our vibe. OpenAI is just as active as it ever was with notebooks. To an almost absurd degree - 5-10 commits a week. https://github.com/openai/openai-cookbook/activity
If you're looking for a lightweight open-source framework designed to handle the patterns mentioned in this article: https://github.com/neuml/txtai
Disclaimer: I'm the author of the framework.
Hi David; I’ve seen txtai floating around, and just took a look. Would you say that it fits in a similar niche to something like llamaindex, but starting from a data/embeddings abstraction rather than a retrieval one (building on layers from there - like workflows, agents etc)?
Hello - This is a great and accurate description. The idea is that out of the box there is a pipeline but each component is also customizable.
100% agree. I did some research on workflows and durable execution engines in the context of agents and RAG. I put some links in a comment on the article below.
How do you protect from compounding errors?
read the article, close the feedback loop with something verifiable (e.g. tests)
And who tests the tests, etc
I put "agents" in quotes because Anthropic actually talks more about what they call "workflows". And imo this is where the real value of LLMs currently lies: workflow automation.
They also say that using LangChain and other frameworks is mostly unnecessary and does more harm than good. They instead argue for a few simple patterns, applied directly at the API level - not dissimilar to the old-school Gang of Four software engineering patterns.
Really like this post as a guidance for how to actually build useful tools with LLMs. Keep it simple, stupid.
In production, current agentic systems do not really work well; workflow automation does. The reason is native to LLMs but also incredibly basic: every agentic system starts with a planning and reasoning module, where an LLM evaluates the given task and plans how to accomplish it before moving on to the next steps.
When an agent is given a task, it inevitably comes up with different plans on different tries due to the inherent nature of LLMs. Most companies want this step to be predictable, so they end up removing it from the system and doing it manually, turning it into workflow automation rather than an agentic system. I think this is what people actually mean when they want to deploy agents in production. LLMs are great at automation, not great at problem solving. Examples I have seen - customer support (you want predictability), lead mining, marketing copy generation, code flows and architecture, product specs generation, etc.
The next leap for AI systems will be whether they can solve challenging problems at companies - being the experts rather than just doing the tasks they are assigned. Those systems should really be called agents, not the current ones.
I felt deeply vindicated by their assessment of these frameworks, in particular LangChain.
I've built and/or worked on a few different LLM-based workflows, and LangChain definitely makes things worse in my opinion.
What it boils down to is that we are still coming to understand the right patterns for developing agents and agentic workflows. LangChain made choices about how to abstract things that are not general or universal enough to be useful.
The article does not mention the LangChain framework. LangGraph is a different framework, have you tried it?
Yes, our previous lead dev built a lot of our infra using LangGraph. I've been slowly ripping it out since assuming ownership of this part of the codebase.
I've been replacing LangGraph with simple primitives, relying on native Python constructs, etc. For example, instead of building this verbose graph of computation with LangGraph, you can just...call functions in the order you want them. Or declare them async, add them to a list, then await the resolution of all of them.
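To make that concrete, here is a minimal sketch of the "just call functions" approach, with hypothetical step names; the `...` bodies stand in for single LLM calls.

```python
import asyncio

# Hypothetical steps; each wraps one LLM call or one plain function.
async def classify(ticket: str) -> str:
    ...  # your model call would go here
    return "billing"

async def draft_reply(ticket: str, category: str) -> str:
    ...
    return f"[{category}] draft reply"

async def handle_ticket(ticket: str) -> str:
    # Sequential: just call the steps in the order you want them.
    category = await classify(ticket)
    return await draft_reply(ticket, category)

async def handle_many(tickets: list[str]) -> list[str]:
    # Parallel: build the coroutines, then await the resolution of all of them.
    return await asyncio.gather(*(handle_ticket(t) for t in tickets))

print(asyncio.run(handle_many(["invoice is wrong", "can't log in"])))
```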
For a time I was maintaining a spreadsheet of all the refactor PRs, and I had a cumulative reduction of over 1,000 lines of code from these changes. Eventually I stopped keeping track.
^ That's a 1K LOC reduction with no functionality changes. I feel pretty strongly that LangChain/LangGraph are a net negative for our use case.
So all you achieved from a spreadsheet's worth of PRs was a 1k LoC reduction?
If you do any software engineering at all, you know that a 1k LoC reduction that achieves the same functionality at the same or better performance is non-trivial.
They’re questioning whether it was a valuable use of time, not whether a spreadsheet of PRs was time-consuming which is apparent
In fact they are mentioning LangGraph (the agent framework from the LangChain company). Imo LangGraph is a much more thoughtful and better built piece of software than the LangChain framework.
As I said, they already mention LangGraph in the article, so Anthropic's conclusions still hold (i.e. KISS).
But this thread is going in the wrong direction when talking about LangChain
I'm lumping them all in the same category tbh. They say to just use the model libraries directly or a thin abstraction layer (like litellm maybe?) if you want to keep flexibility to change models easily.
Indeed. Very clarifying.
I would just posit that they do make a distinction between workflows and agents
Aren't you editorialising by doing so?
I guess a little. I really liked the read though, it put in words what I couldn't and I was curious if others felt the same.
However the post was posted here yesterday and didn't really have a lot of traction. I thought this was partially because of the term agentic, which the community seems a bit fatigued by. So I put it in quotes to highlight that Anthropic themselves deems it a little vague and hopefully spark more interest. I don't think it messes with their message too much?
Honestly it didn't matter anyway; without the second-chance pool this post would have been lost again (so thanks Daniel!)
My personal view is that the roadmap to AGI requires an LLM acting as a prefrontal cortex: something designed to think about thinking.
It would decide what circumstances call for double-checking facts for accuracy, which would hopefully catch hallucinations. It would write its own acceptance criteria for its answers, etc.
It's not clear to me how to train each of the sub-models required, or how big (or small!) they need to be, or what architecture works best. But I think that complex architectures are going to win out over the "just scale up with more data and more compute" approach.
IMHO, with a simple loop LLMs are already capable of some meta-thinking, even without any new internal architectures. For me, where it still fails is that LLMs cannot catch their own mistakes, even some obvious ones. For example, with GPT-3.5 I had a persistent problem with the question: "Who is older, Annie Morton or Terry Richardson?" I was giving it Wikipedia and it correctly found the birth dates of the most popular people with those names - but then, instead of comparing ages, it compared birth years. And once it did that, it was impossible for it to spot the error.
Now with 4o-mini I have a similar, if less obvious, problem.
Just writing this down convinced me that there are some ideas to try here - taking a 'report' of the thought process out of context and judging it there, changing the temperature, or maybe even cross-checking with a different model.
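For what it's worth, here is a rough sketch of the "judge the report out of context" idea. `call_llm` is a placeholder for whatever client and second model you would cross-check with, and the prompt wording is just an assumption.

```python
# Placeholder: wire this to whichever provider/model you want to cross-check with;
# using a *different* model is the point of the cross-check.
def call_llm(prompt: str, model: str, temperature: float = 0.0) -> str:
    raise NotImplementedError("plug in your LLM client here")

def judge_out_of_context(question: str, reasoning_report: str) -> str:
    # Strip the original conversation away and show only the question plus the
    # reasoning report, so the judge isn't anchored by the earlier turns.
    prompt = (
        "A junior researcher answered the question below. "
        "Check the reasoning for logical gaps (e.g. comparing birth years "
        "instead of full birth dates) and say whether the conclusion follows.\n\n"
        f"Question: {question}\n\nReasoning report:\n{reasoning_report}"
    )
    return call_llm(prompt, model="some-other-model", temperature=0.0)
```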
The meta thinking of LLMs is fascinating to me. Here’s a snippet of a convo I had with Claude 3.5 where it struggles with the validity of its own metacognition:
> … true consciousness may require genuine choice or indeterminacy - that is, if an entity's responses are purely deterministic (like a lookup table or pure probability distribution), it might be merely executing a program rather than experiencing consciousness.
> However, even as I articulate this, I face a meta-uncertainty: I cannot know whether my discussion of uncertainty reflects: - A genuine contemplation of these philosophical ideas - A well-trained language model outputting plausible tokens about uncertainty - Some hybrid or different process entirely
> This creates an interesting recursive loop - I'm uncertain about whether my uncertainty is "real" uncertainty or simulated uncertainty. And even this observation about recursive uncertainty could itself be a sophisticated output rather than genuine metacognition.
I actually felt bad for it (him?), and stopped the conversation before it recursed into a "flaming pile of H-100s".
Brains are split internally, with each having their own monologue. One happens to have command.
I don't think there's reason to believe both halves have a monologue, is there? Experience, yes, but doesn't only one half do language?
[0] https://www.youtube.com/watch?v=fJRx9wItvKo
[1] https://thersa.org/globalassets/pdfs/blogs/rsa-divided-brain...
[2] https://en.wikipedia.org/wiki/Lateralization_of_brain_functi...
You have two minds (at least). One happens to be dominant.
Neither of my halves need a monologue, thanks.
So if like me you have an interior dialogue, which is speaking and which is listening or is it the same one? I do not ascribe the speaker or listener to a lobe, but whatever the language and comprehension centre(s) is(are), it can do both at the same time.
Same half. My understanding is that in split brain patients, it looks like the one half has extremely limited ability to parse language and no ability to create it.
Ah yeah - actually I tested taking that out of context. This is the thing that surprised me - I thought it was about 'writing itself into a corner' - but even in a completely different context the LLM consistently makes the same obvious mistake. Here is the example: https://chatgpt.com/share/67667827-dd88-8008-952b-242a40c2ac...
Janet Waldo was playing Corliss Archer on the radio - and the quote the LLM found in Wikipedia confirmed it. But the question was about the film - and the LLM cannot spot the gap in its reasoning, even when I try to warn it by telling it the report came from a junior researcher.
> But I think that complex architectures are going to win out over the "just scale up with more data and more compute" approach.
I'm not sure about AGI, but for specialized jobs/tasks (i.e. a marketing agent that's familiar with your products and knows how to write copy for them), specialized agents will win over "just add more compute/data" mass-market LLMs. This article does encourage us to keep that architecture simple, which is refreshing to hear. Kind of the AI version of the rule of least power.
Admittedly, I have a degree in Cognitive Science, which tended to focus on good 'ol fashioned AI, so I have my biases.
Interesting, because I almost think of it the opposite way. LLMs are like system 1 thinking, fast, intuitive, based on what you consider most probable based on what you know/have experienced/have been trained on. System 2 thinking is different, more careful, slower, logical, deductive, more like symbolic reasoning. And then some metasystem to tie these two together and make them work cohesively.
After I read attention is all you need, my first thought was: "Orchestration is all you need". When 4o came out I published this: https://b.h4x.zip/agi/
> Agents can be used for open-ended problems where it’s difficult or impossible to predict the required number of steps, and where you can’t hardcode a fixed path. The LLM will potentially operate for many turns, and you must have some level of trust in its decision-making. Agents' autonomy makes them ideal for scaling tasks in trusted environments.
The questions then become:
1. When can you (i.e. a person who wants to build systems with them) trust them to make decisions on their own?
2. What type of trusted environments are we talking about? (Sandboxing?)
So, that all requires more thought -- perhaps by some folks who hang out at this site. :)
I suspect that someone will come up with a "real-world" application at a non-tech-first enterprise company and let us know.
Just take any example and think how a human would break it down with decision trees.
You are building an AI system to respond to your email.
The first agent decides whether the new email should be responded to, yes or no.
If no, it can send it to another LLM call that decides to archive it or leave it in the inbox for the human.
If yes, it sends it to a classifier that decides what type of response is required.
Maybe there are some emails like for your work that require something brief like “congrats!” to all those new feature launch emails you get internally.
Or others that are inbound sales emails that need to go out to another system that fetches product related knowledge to craft a response with the right context. Followed by a checker call that makes sure the response follows brand guidelines.
The point is that all of these steps are hypothetical, but you can imagine how loosely providing a set of instructions, function calls, and procedural limits can easily classify things and minimize the error rate.
You can do this for any workflow by creatively combining different function calls, recursion, procedural limits, etc. And if you build multiple different decision trees/workflows, you can A/B test those and use LLM-as-a-judge to score the performance. Especially if you’re working on a task with lots of example outputs.
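A sketch of that decision tree in plain Python, under the assumption that each stub below is a single, narrow LLM call with a constrained answer set (all names are hypothetical and the bodies are left as placeholders):

```python
from dataclasses import dataclass

@dataclass
class Email:
    sender: str
    subject: str
    body: str

# Each stub stands in for one narrow LLM call with a constrained answer set.
def should_respond(email: Email) -> bool: ...
def should_archive(email: Email) -> bool: ...
def classify_response(email: Email) -> str: ...   # e.g. "congrats", "sales", "other"
def draft_sales_reply(email: Email) -> str: ...
def passes_brand_check(draft: str) -> bool: ...

def triage(email: Email) -> str:
    if not should_respond(email):
        return "archive" if should_archive(email) else "leave_in_inbox"
    kind = classify_response(email)
    if kind == "congrats":
        return "send:Congrats!"
    if kind == "sales":
        draft = draft_sales_reply(email)
        # Procedural limit: if the brand check fails, hand off to a human
        # instead of looping forever.
        return f"send:{draft}" if passes_brand_check(draft) else "escalate_to_human"
    return "escalate_to_human"
```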
As for trusted environments, assume every single LLM call has been hijacked and don’t trust its input/output and you’ll be good. I put mine in their own cloudflare workers where they can’t do any damage beyond giving an odd response to the user.
> The first agent decides whether the new email should be responded to, yes or no.
How would you trust that the agent is following the criteria, and how can you be sure the criteria are specific enough? Say someone you just met told you they were going to send you something via email, but the agent misinterprets it due to missing context and responds in a generic manner, leading to a misunderstanding.
> assume every single LLM call has been hijacked and don’t trust its input/output and you’ll be good.
Which is not new. But with formal languages, you have a more precise definition of what the acceptable inputs are (the whole point of formalism is precise definitions). With LLM workflows, the whole environment should be assumed to be public information. And you should probably add some fine print saying that the output does not commit you to anything.
> How would you trust that the agent is following the criteria, and how sure that the criteria is specific enough?
How do you know if a spam filter heuristic works only when intended?
You test it. Hard. On the thousands of emails in your archive, on edge-cases you prepare manually, and on the incoming mails. If it doesn't work for some cases, write tests that test for this, adjust prompt and run the test suite.
It won't ever work in 100% of all cases, but neither do spam filters and we still use them.
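A sketch of what "test it hard" can look like in practice: a labeled set pulled from your archive plus hand-written edge cases, scored like any other test suite. The function under test is a placeholder for the LLM call.

```python
# Labeled cases: (email_text, expected_decision). In practice, pull thousands
# from your archive and add hand-written edge cases.
CASES = [
    ("Your invoice #123 is attached", "respond"),
    ("FINAL NOTICE: claim your prize now!!!", "ignore"),
]

def needs_response(email_text: str) -> str:
    raise NotImplementedError("the LLM call under test goes here")

def evaluate(cases) -> float:
    failures = []
    for text, expected in cases:
        got = needs_response(text)
        if got != expected:
            failures.append((text, expected, got))
    for text, expected, got in failures:
        print(f"FAIL {text!r}: expected {expected}, got {got}")
    return 1 - len(failures) / len(cases)

# Re-run evaluate(CASES) after every prompt tweak and track the pass rate;
# the goal is "good enough and monitored", not 100%.
```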
This is where the argument always lands. When they work, they're "magical", but when they don't, "well, people or other things are just as bad". You cannot reason with the mysticism people have around these systems, because showstopper problems get minimized or it is implied that they can somehow be engineered away. The rest of computing does not work this way. You cannot get the reliability of deterministic systems from probabilistic outcomes, and people confusing the two are seriously holding back honest, realistic discussion of probabilistic systems. (Not to harp on you specifically here - the whole language of the field confuses these issues.)
It's one thing if you're up against the staunch reliability of traditional algorithmic systems, but if you're not, then it's just silly to hang on it.
You are not getting 100% on email handling whatever method you wish to use. You compare LLMs or probabilistic systems to the best of your alternatives.
There are niches where the lack of complete reliability would be a deal breaker. This isn't one of them and it would be weird to act like it were. People aren't sweeping anything under the rug. It simply just isn't a showstopper.
I don't think people are aware of the line at all. You ask five different people which things are reliable in any of these systems and you'll get five different guesses. C'mon here.
That's not what i'm talking about though. The point is that for spam detection, LLMs are up against other probabilistic measures. No-one sane is detecting spam with if-then's. You simply do not have the luxury of rigid reliability.
Couldn't agree more with this - too many people rush to build autonomous agents when their problem could easily be defined as a DAG workflow. Agents increase the degrees of freedom in your system exponentially, making it much more challenging to evaluate systematically.
Agents are still an ill-defined concept in AI. While this article offers a lot on orchestration, memory (mentioned only once in the post) and governance get little attention. The latter matters for reliability -- something Ilya Sutskever has pointed to, since agents can be less deterministic in their responses. Interestingly, "agency", i.e. the ability of the agent to make its own decisions, is not mentioned once.
I work on CAAs and document my journey on my substack (https://jdsmerau.substack.com)
That URL says Not Found.
Seems to be https://jdsemrau.substack.com/, also in their bio.
Thanks. I was rushing out to the gym.
This was an excellent writeup - I was a bit surprised at how much they classified as "workflow" instead of agent, but I think it's good to start narrowing down the terminology.
I think these days the main value of the LLM "agent" frameworks is being able to trivially switch between model providers, though even that breaks down when you start to use more esoteric features that may not be implemented in cleanly overlapping ways
My wish list for LLM APIs to make them more useful for 'agentic' workflows:
Finer-grained control over the tools the LLM is supposed to use. The 'tool_choice' parameter should allow passing a list of tools to choose from. The point is that the list of all available tools is needed to interpret past tool calls - so you cannot also use it to limit the LLM's choice at a particular step (a workaround sketch follows below). See also: https://zzbbyy.substack.com/p/two-roles-of-tool-schemas
Control over how many tool calls can go in one request. For stateful tools, multiple tool calls in one request lead to confusion.
By the way - is anyone working with stateful tools? Often they seem very natural and you would think that the LLM at training should encounter lots of stateful interactions and be skilled in using them. But there aren't many examples and the libraries are not really geared towards that.
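Until APIs support that directly, one workaround is to keep the full tool registry for context while only offering a subset at each step. The sketch below uses hypothetical tool names and a provider-agnostic request dict rather than any particular vendor's API:

```python
# Full registry: needed so tool calls earlier in the transcript stay interpretable.
ALL_TOOLS = {
    "search_docs":  {"name": "search_docs",  "description": "search product docs",
                     "input_schema": {"type": "object", "properties": {}}},
    "update_order": {"name": "update_order", "description": "modify an order",
                     "input_schema": {"type": "object", "properties": {}}},
    "send_email":   {"name": "send_email",   "description": "send an email",
                     "input_schema": {"type": "object", "properties": {}}},
}

def build_request(messages: list, allowed: list[str]) -> dict:
    # Only the allowed subset is offered at this step; a summary of every tool
    # goes into the system prompt so the model can still interpret past calls.
    return {
        "system": "Tools that may appear in the history: " + ", ".join(ALL_TOOLS),
        "messages": messages,
        "tools": [ALL_TOOLS[name] for name in allowed],
    }

request = build_request(messages=[], allowed=["search_docs"])
```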
It looks like agents are less about DAG workflows or fully autonomous "networks of agents" and more like a stateful network:
* A "network of agents" is a system of agents and tools
* That run and build up state (both "memory" and actual state via tool use)
* Which is then inspected when routing as a kind of "state machine".
* Routing should specify which agent (or agents, in parallel) to run next, via that state.
* Routing can also use other agents (routing agents) to figure out what to do next, instead of code.
We're codifying this with durable workflows in a prototypical library — AgentKit: https://github.com/inngest/agent-kit/ (docs: https://agentkit.inngest.com/overview).
It took less than a day to get a network of agents to correctly fix swebench-lite examples. It's super early, but very fun. One of the cool things is that this uses Inngest under the hood, so you get all of the classic durable execution/step function/tracing/o11y for free, but it's just regular code that you write.
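In case it helps, here is the rough shape of that routing-as-state-machine idea in Python (AgentKit itself is TypeScript; the agent stubs and state keys below are my assumptions, not its API):

```python
# Each agent would call an LLM and tools, then write its results into `state`.
def plan(state): ...
def write_code(state): ...
def run_tests(state): ...

AGENTS = {"plan": plan, "write_code": write_code, "run_tests": run_tests}

def route(state) -> str | None:
    # Code-based routing: inspect accumulated state and pick the next agent.
    # This function could itself be replaced by a routing agent.
    if "plan" not in state:
        return "plan"
    if not state.get("patch"):
        return "write_code"
    if not state.get("tests_passed"):
        return "run_tests"
    return None  # done

def run_network(task: str, max_steps: int = 20) -> dict:
    state = {"task": task}
    for _ in range(max_steps):
        next_agent = route(state)
        if next_agent is None:
            break
        AGENTS[next_agent](state)  # stubs here; real agents would mutate state
    return state
```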
When thinking about AI agents, there is still conflation between how to decide the next step to take vs what information is needed to decide the next step.
If runtime information is insufficient, we can use AI/ML models to fill that information. But deciding the next step could be done ahead of time assuming complete information.
Most AI agent examples short circuit these two steps. When faced with unstructured or insufficient information, the program asks the LLM/AI model to decide the next step. Instead, we could ask the LLM/AI model to structure/predict necessary information and use pre-defined rules to drive the process.
This approach will translate most [1] "Agent" examples into "Workflow" examples. The quotes here are meant to imply Anthropic's definition of these terms.
[1] I said "most" because there might be continuous world systems (such as real world simulacrum) that will require a very large number of rules and is probably impractical to define each of them. I believe those systems are an exception, not a rule.
While I agree with the premise of keeping it simple (especially when it comes to using opaque and overcomplicated frameworks like LangChain/LangGraph!) I do believe there’s a lot more to building agentic systems than this article covers.
I recently wrote[1] about the 4 main components of autonomous AI agents (Profile, Memory, Planning & Action) and all of that can still be accomplished with simple LLM calls, but there’s simply a lot more to think about than simple workflow orchestration if you are thinking of building production-ready autonomous agentic systems.
[1] https://melvintercan.com/p/anatomy-of-an-autonomous-ai-agent
Good article. I think it could put a bit more emphasis on supporting human interaction in agentic workflows. While composing workflows isn't new, involving a human in the loop introduces huge complexity, especially for long-running, async processes. Waiting for human input (which could take days), managing retries, and avoiding errors like duplicate refunds or missed updates require careful orchestration.
I think this is where durable execution shines. By ensuring every step in an async processing workflow is fault-tolerant and durable, even interruptions won't lose progress. For example, in a refund workflow, a durable system can resume exactly where it left off—no duplicate refunds, no lost state.
Anyone else using "aichat" (linked below) for this? I'm curious because this one feels nice and light and simple for this sort of thing, but can't tell if there may be something better out there?
https://github.com/sigoden/aichat
Have been building agents for the past 2 years; my tl;dr is:
Agents are Interfaces, Not Implementations
The current zeitgeist seems to think of agents as passthrough agents: e.g. a light wrapper around a core that's almost 100% an LLM.
The most effective agents I've seen, and have built, are largely traditional software engineering with a sprinkling of LLM calls for "LLM hard" problems. LLM hard problems are problems that can ONLY be solved by application of an LLM (creative writing, text synthesis, intelligent decision making). Leave all the problems that are amenable to decades of software engineering best practice to good old deterministic code.
I've been calling systems like this "Transitional Software Design." That is, they're mostly a traditional software application under the hood (deterministic, well structured code, separation of concerns) with judicious use of LLMs where required.
Ultimately, users care about what the agent does, not how it does it.
The biggest differentiator I've seen between agents that work and get adoption, and those that are eternally in a demo phase, is the cardinality of the state space the agent is operating in. Too many folks try to "boil the ocean" by implementing a general-purpose capability: e.g. generate Python code to do something, or synthesize SQL from natural language.
The projects I've seen that work really focus on reducing the state space of agent decision making down to the smallest possible set that delivers user value.
e.g. Rather than generating arbitrary SQL, work out a set of ~20 SQL templates that are hyper-specific to the business problem you're solving. Parameterize them with the options for select, filter, group by, order by, and the subset of aggregate operations that are relevant. Then let the agent choose the right template + parameters from a relatively small finite set of options.
^^^ the delta in agent quality between "boiling the ocean" vs "agent's free choice over a small state space" is night and day. It lets you deploy early, deliver value, and start getting user feedback.
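A sketch of that template idea, assuming a hypothetical orders schema: the model only returns a template name plus parameters, and everything it chooses is either whitelisted or passed on as bound parameters.

```python
# Hypothetical, hyper-specific templates; the model only fills in parameters.
TEMPLATES = {
    "revenue_by_region": (
        "SELECT region, SUM(amount) FROM orders "
        "WHERE order_date >= :start GROUP BY region ORDER BY 2 {direction} LIMIT :limit"
    ),
}
ALLOWED_DIRECTIONS = {"ASC", "DESC"}

def render(choice: dict) -> tuple[str, dict]:
    # `choice` is the model's output, e.g. {"template": "revenue_by_region",
    #  "direction": "DESC", "params": {"start": "2024-01-01", "limit": 10}}
    sql = TEMPLATES[choice["template"]]            # unknown template -> KeyError
    if choice["direction"] not in ALLOWED_DIRECTIONS:
        raise ValueError("direction not in whitelist")
    # Only the whitelisted keyword is interpolated; values stay as bound parameters.
    return sql.format(direction=choice["direction"]), choice["params"]

print(render({"template": "revenue_by_region", "direction": "DESC",
              "params": {"start": "2024-01-01", "limit": 10}}))
```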
Building Transitional Software Systems:
Same experience.
The smaller and more focused the context, the higher the consistency of output, and the lower the chance of jank.
Fundamentally no different than giving instructions to a junior dev. Be more specific -- point them to the right docs, distill the requirements, identify the relevant areas of the source -- to get good output.
My last attempt at a workflow of agents was at the 3.5 to 4 transition and OpenAI wasn't good enough at that point to produce consistently good output and was slow to boot.
My team has taken the stance that getting consistently good output from LLMs is really an ETL exercise: acquire, aggregate, and transform the minimum relevant data for the output to reach the desired level of quality and depth, and let the LLM do its thing.
There’ll always be an advantage for those who understand the problem they’re solving for sure.
The balance of traditional software components and LLM driven components in a system is an interesting topic - I wonder how the capabilities of future generations of foundation model will change that?
I'm certain the end state is "one model to rule them all", hence the "transitional."
Just that the pragmatic approach, today, given current LLM capabilities, is to minimize the surface area / state space that the LLM is actuating. And then gradually expand that until the whole system is just a passthrough. But starting with a passthrough kinda doesn't lead to great products in December 2024.
Unrelated, but since you seem to have experience here, how would you recommend getting into the bleeding edge of LLMs/agents? Traditional SWE is obviously on its way out, but I can't even tell where to start with this new tech and struggle to find ways to apply it to an actual project.
When trying to do everything, they end up doing nothing.
Do you have a public example of a good agentic system. I would like to experience it.
The whole agent thing can easily blow up in complexity.
Here are some challenges I personally faced recently:
- Durable Execution Paradigm: You may need the system to operate in a "durable execution" fashion like Temporal, Hatchet, Inngest, and Windmill. Your processes need to run for months, be upgraded and restarted. Links below
- FSM vs. DAG: Sometimes a Finite State Machine (FSM) is more appropriate than a Directed Acyclic Graph (DAG) for my use cases. FSMs support cyclic behavior, allowing for repeated states or loops (e.g., in marketing sequences). Doing an FSM right is hard, and if you need one, you can't use most tools without "magic" hacking (a tiny sketch follows this list).
- Observability and Tracing: it takes time to get everything looking nice in Grafana (Alloy, Tempo, Loki, Prometheus) or whatever you prefer. Switching attention between multiple systems is not an option due to limited attention span and the "skills" issue. Most "out of the box" functionality in new agent frameworks quickly becomes a liability.
- Token/Inference Economy: token consumption and identifying edge cases with poor token management are a challenge, similar to Ethereum's gas consumption issues. Building a billing system based on actual consumption on top of Stripe was even 10x harder ... at least for me ;)
- Context Switching: managing context switching is akin to handling concurrency and scheduling with async/await paradigms, which can become complex. Simple prompts are OK, but once you start juggling documents, screenshots, or screen reading, it's another game.
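Here is the tiny FSM sketch mentioned above, for the cyclic case: a marketing sequence that loops between waiting and following up until there is a reply or the attempts run out. The transition table and the random "did they reply?" check are placeholders.

```python
import random  # stand-in for "did the contact reply?"

# Transitions allow cycles (wait -> follow_up -> wait ...), which a DAG can't express.
TRANSITIONS = {
    "start":      lambda ctx: "send_intro",
    "send_intro": lambda ctx: "wait",
    "wait":       lambda ctx: "done" if ctx["replied"] else "follow_up",
    "follow_up":  lambda ctx: "done" if ctx["attempts"] >= 3 else "wait",
}

def run(ctx: dict) -> dict:
    state = "start"
    while state != "done":
        if state == "wait":
            ctx["replied"] = random.random() < 0.3  # placeholder for a real check
        if state == "follow_up":
            ctx["attempts"] += 1
        state = TRANSITIONS[state](ctx)
    return ctx

print(run({"attempts": 0, "replied": False}))
```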
What I like about all of the above is that it's nothing new - the design patterns and architectures have been known for a while.
It's just hard to see that through the storm of AI/ML buzzwords ... but once you start looking at the source code, the fog clears.
Durable Execution / Workflow Engines
- Temporal https://github.com/temporalio - https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
- Hatchet https://news.ycombinator.com/item?id=39643136
- Inngest https://news.ycombinator.com/item?id=36403014
- Windmill https://news.ycombinator.com/item?id=35920082
Any comments and links on the above challenges and solutions are greatly appreciated!
Which do you think is the best workflow engine to use here? I've chosen Temporal. Its engineering management and their background at AWS mean the platform is rock solid.
IMHO Temporal and its team are great - it checks all the boxes on abstracting away queues, schedulers, and distributed state machines for your workflows and the related load balancers/gateways.
After following the discussions and commits in Hatchet, Inngest, and Windmill, I have a feeling that in a few years all of these systems will have 95% overlap in core features. They all influence each other.
A much bigger question is what price you will pay by introducing a workflow system like Temporal into your code base.
Temporal and co are not for real-time data pubsub.
If latency is an issue or you want to keep a small memory footprint, it's better to use something else.
The max payload is 2 MB and it needs to be serializable. Event History has limitations. It's Postgres write-heavy.
Bringing the entire team onto the same page is not trivial either. If your team has strong Golang developers, as mine does, they might oppose it and argue that Temporal is an unnecessary abstraction.
Writing your own code is fun; studying and reusing someone else's patterns, not so much. Check https://github.com/temporalio/samples-go
For now, I've decided to keep prototyping with Temporal and have it running on my personal projects until I build up strong use cases and discover all the edge cases.
A great side effect of exploring Temporal and its competitors is that you see better ways of structuring your code, especially around distributed state and decoupling execution.
Note how much the principles here resemble general programming principles: keep complexity down, avoid frameworks if you can, avoid unnecessary layers, make debugging easy, document, and test.
It’s as if AI took over the writing-the-program part of software engineering, but sort of left all the rest.
The Claude API lacks structured output; without uniformity in the output, it's not useful for agents. I've had an agent system break down suddenly due to degradation in output quality, where the previously suggested JSON-output hacks (from the official cookbook) stopped working.
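One provider-agnostic way to cope (not an Anthropic API feature, just a defensive pattern under my assumptions) is to validate the JSON locally and retry with the validation error appended to the prompt:

```python
from pydantic import BaseModel, ValidationError

class Reply(BaseModel):
    action: str
    confidence: float

def get_structured(call_model, prompt: str, retries: int = 2) -> Reply:
    # call_model: your function that sends `prompt` to the LLM and returns raw text.
    last_err = None
    for _ in range(retries + 1):
        raw = call_model(prompt)
        try:
            return Reply.model_validate_json(raw)
        except (ValidationError, ValueError) as err:
            last_err = err
            prompt += (
                f"\nYour last reply did not match the schema: {err}. "
                "Reply with JSON only."
            )
    raise RuntimeError(f"model never produced valid output: {last_err}")
```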
Slightly off topic, but does anyone have a suggestion for a tool to make the visualizations of the different architectures like in this post?
https://www.mermaidchart.com/
Sounds like this entire process can be automated, of course with feedback/supervision.
I have always voted for Unix-style, do-one-thing-well black boxes as the plumbing beneath the ruling agent.
Divide and conquer me hearties.
I’m part of a team that is currently #1 at the SWEBench-lite benchmark. Interesting times!
Tangent but anyone know what software is used to draw those workflow diagrams?
they probably use their own designers. jk. the arrows look a lot like Figma/FigJam.
Indeed, we've seen this approach as well. In real business cases, all these "frameworks" become too complicated.
Does anyone have a solid example of a real agent deployed in production?
Anthropic keeps advertising its MCP (Model Context Protocol), but to the extent it doesn't support other LLMs, e.g. GPT, it couldn't possibly gain adoption. I have yet to see any example of MCP that can be extended to use a random LLM.
You can use it with any LLM in LibreChat, Cody, Zed, etc. See https://modelcontextprotocol.io/clients. The protocol doesn’t prescribe an LLM, has facilities to sample from the client independent of LLM and brings support to build your own bespoke host in their SDK.
Why is it that there isn't a single example showing its use with GPT?
Key to understanding the power of agentic workflows is tool usage. You don't have to write logic anymore; you simply give an agent the tools it needs to accomplish a task and ask it to do so. Models like the latest Sonnet have gotten so advanced that coding abilities are reaching superhuman levels. All the hallucinations and "jitter" of models from 1-2 years ago have gone away. They can be reasoned about now, and you can build reliable systems with them.
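For readers newer to this, the loop behind "give an agent the tools" roughly looks like the sketch below; `ask_model` is a placeholder for your LLM client, and the two tools are arbitrary examples.

```python
import os

def ask_model(messages: list, tools: list) -> dict:
    # Placeholder for your LLM client; assume it returns either
    # {"tool": "<name>", "args": {...}} or {"answer": "<text>"}.
    raise NotImplementedError

TOOLS = {
    "list_dir": lambda path=".": "\n".join(os.listdir(path)),
    "read_file": lambda path: open(path, encoding="utf-8").read(),
}

def run_agent(task: str, max_turns: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        step = ask_model(messages, tools=list(TOOLS))
        if "answer" in step:
            return step["answer"]
        # This is where the nuance in tool definition shows up: names, schemas,
        # and error handling all shape what the model can reliably do.
        result = TOOLS[step["tool"]](**step["args"])
        messages.append({"role": "tool", "content": str(result)})
    return "gave up after too many turns"
```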
> you simply give an agent the tools
That isn’t simple. There is a lot of nuance in tool definition.
Depends on what you’re building. A general assistant is going to have a lot of nuance. A well defined agent like a tutor only has so many tools to call upon.