hshdhdhehd 3 hours ago

There is a lot of stuff I should do. From making my own CPU from a breadboard of NAND gates to building a CDN in Rust. But ain't got time for all the things.

That said, I built an LLM following Karpathy's tutorial. So I think it's good to dabble a bit.

  • coffeecoders 2 hours ago

    Yeah, it’s a never-ending curve.

    I built an 8-bit computer on breadboards once, then went down the rabbit hole of flight training for a PPL. Every time I think I’m "done," the finish line moves a few miles further.

    Guess we nerds are never happy.

    • javchz 2 hours ago

      One should be melting sand to get silicon, anything else is too abstract for my taste.

      • tomcam an hour ago

        Glad you’ve got all that time on your hands. I am still working on the fusion reactor portion of my supernova simulator, so that I can generate the silicon you so blithely refer to.

    • krsdcbl an hour ago

      Given the premise, one could also say we nerds are forever happy.

  • qwertygnu 2 hours ago

    Very early in TFA it explains how easy it is to do. That's the whole point of the post.

    • z2 2 hours ago

      It's good to go through the exercise, but agents are easy until you build a whole application using an API endpoint that OpenAI or LangChain decides to yank, and you spend the next week on a mini migration project. I don't disagree with the claim that MCP is reinventing the wheel but sometimes I'm happy plugging my tools and data into someone else's platform because they are spending orders of magnitude more time than me doing the janitor work to keep up with whatever's trendy.

dave1010uk 4 hours ago

Two years ago I wrote an agent in 25 lines of PHP [0]. It was surprisingly effective, even back then before tool calling was a thing and you had to coax the LLM into returning structured output. I think it even worked with GPT-3.5 for trivial things.

In my mind LLMs are just UNIX string manipulation tools like `sed` or `awk`: you give them an input and command and they give you an output. This is especially true if you use something like `llm` [1].

It then seems logical that you can compose calls to LLMs, loop and branch and combine them with other functions.

[0] https://github.com/dave1010/hubcap

[1] https://github.com/simonw/llm
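
To make the composition idea concrete, here is a minimal Python sketch. It is an illustration only: `fake_llm` is a hypothetical stand-in for a real model call (it is not the API of the `llm` tool), and it understands just two toy instructions so the example runs offline. `step` and `pipe` are likewise names made up for this sketch.

```python
from typing import Callable

# Hypothetical stand-in for a model call: (instruction, text) -> text.
# A real version would send both to an LLM; this stub keeps the sketch offline.
def fake_llm(instruction: str, text: str) -> str:
    if instruction == "uppercase":
        return text.upper()
    if instruction == "first line":
        return text.splitlines()[0] if text else ""
    return text  # unknown instruction: pass the input through unchanged

def step(instruction: str) -> Callable[[str], str]:
    """Bind an instruction, giving a text filter in the spirit of `sed` or `awk`."""
    return lambda text: fake_llm(instruction, text)

def pipe(*steps: Callable[[str], str]) -> Callable[[str], str]:
    """Compose filters left to right, like a shell pipeline."""
    def run(text: str) -> str:
        for s in steps:
            text = s(text)
        return text
    return run

shout_first_line = pipe(step("first line"), step("uppercase"))
result = shout_first_line("hello world\nsecond line")
```

Because each step is an ordinary function from text to text, looping and branching are just regular Python around the calls.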

azimux an hour ago

I wrote an agent from scratch in Ruby several months back. Was fun!

These 4 lines wound up being the heart of it, which is surprisingly simple, conceptually.

        until mission_accomplished? or given_up? or killed?
          determine_next_command_and_inputs
          run_next_command
        end
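
For anyone who wants to run something with that shape, here is a rough Python sketch of the same loop. Everything in it is an assumption made for illustration: `choose_next` is a hypothetical stub standing in for the LLM deciding the next command, and the step budget plays the role of `given_up?`.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    max_steps: int = 10                       # budget that triggers "given up"
    log: list = field(default_factory=list)   # (command, argument) history
    done: bool = False                        # "mission accomplished"
    killed: bool = False                      # external stop signal

    def given_up(self) -> bool:
        return len(self.log) >= self.max_steps

    def choose_next(self):
        # Hypothetical stub for the model: count three times, then finish.
        n = len(self.log)
        return ("finish", None) if n >= 3 else ("count", n)

    def run_command(self, name, arg) -> None:
        self.log.append((name, arg))
        if name == "finish":
            self.done = True

    def run(self) -> list:
        # Same control flow as the loop above: keep going until done, given up, or killed.
        while not (self.done or self.given_up() or self.killed):
            name, arg = self.choose_next()
            self.run_command(name, arg)
        return self.log

agent = Agent()
history = agent.run()
```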

rmoriz 39 minutes ago

Side note: While the example uses GPT-5, the query interface is already some kind of industry standard. For example you could easily connect OpenRouter.ai and switch models and providers at runtime as needed. OpenRouter also has free models like some of the DeepSeek models. While they are slow/rate limited and quantized, they are great for examples and playing around with it. https://openrouter.ai/models?fmt=cards&order=pricing-low-to-...
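
A small Python sketch of what switching providers at runtime can look like, assuming the widely copied chat-completions request shape. It only builds the request and sends nothing. The function name `build_chat_request`, the model name, and the API key are placeholders invented for this example.

```python
import json

# Base URLs for two OpenAI-compatible endpoints.
PROVIDERS = {
    "openai": "https://api.openai.com/v1",
    "openrouter": "https://openrouter.ai/api/v1",
}

def build_chat_request(provider: str, model: str, prompt: str, api_key: str) -> dict:
    """Return the URL, headers, and JSON body for a chat completion call."""
    base = PROVIDERS[provider]
    return {
        "url": f"{base}/chat/completions",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }),
    }

# Placeholder model name and key; swap provider and model without touching the rest.
req = build_chat_request("openrouter", "example/model-name", "ping?", "sk-placeholder")
```

Only the base URL and the model string change between providers, which is what makes the switch cheap.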

ericd 5 hours ago

Absolutely, especially the part about just rolling your own alternative to Claude Code - build your own lightsaber. Having your coding agent improve itself is a pretty magical experience. And then you can trivially swap in whatever model you want (Cerebras is crazy fast, for example, which makes a big difference for these many-turn tool call conversations with big lumps of context, though gpt-oss 120b is obviously not as good as one of the frontier models). Add note-taking/memory, and ask it to remember key facts to that. Add voice transcription so that you can reply much faster (LLMs are amazing at taking in imperfect transcriptions and understanding what you meant). Each of these things takes on the order of a few minutes, and it's super fun.
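
As a rough idea of how small the note-taking/memory piece can be, here is a Python sketch under my own assumptions: notes live in a JSON file, and `NoteStore`, `remember`, `recall`, and `as_context` are hypothetical names, not anything from the article. The agent would expose `remember` as a tool and prepend `as_context()` to each conversation.

```python
import json
import tempfile
from pathlib import Path

class NoteStore:
    """Append-only notes persisted as a JSON list on disk."""

    def __init__(self, path: Path):
        self.path = path

    def recall(self) -> list:
        if not self.path.exists():
            return []
        return json.loads(self.path.read_text())

    def remember(self, fact: str) -> str:
        notes = self.recall()
        notes.append(fact)
        self.path.write_text(json.dumps(notes, indent=2))
        return f"saved note #{len(notes)}"

    def as_context(self) -> str:
        """Text block to prepend to the model's context."""
        return "\n".join(f"- {note}" for note in self.recall())

# A temp directory keeps the example self-contained; a real agent would pick a stable path.
store = NoteStore(Path(tempfile.mkdtemp()) / "notes.json")
store.remember("User prefers tabs over spaces")
store.remember("Project uses Python 3.12")
```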

  • andai 18 minutes ago

    What are you using for transcription?

    I tried Whisper, but it's slow and not great.

    I tried the gpt audio models, but they're trained to refuse to transcribe things.

    I tried Google's models and they were terrible.

    I ended up using one of Mistral's models, which is alright and very fast except sometimes it will respond to the text instead of transcribing it.

    So I'll occasionally end up with pages of LLM rambling pasted instead of the words I said!

    • tptacek 15 minutes ago

      I recently bought a mint-condition Alf phone, in the shape of Gordon Shumway of TV's "Alf", out of the back of an old auto shop in the south suburbs of Chicago, and naturally did the most obvious thing, which was to make a Gordon Shumway phone that has conversations in the voice of Gordon Shumway (sampled from YouTube and synthesized with ElevenLabs). I use https://github.com/etalab-ia/faster-whisper-server (I think?) as the Whisper backend. It's fine! Asterisk feeds me WAV files, an AGI program feeds them to Whisper (running locally as a server) and does audio synthesis with the ElevenLabs API. Took like 2 hours.

  • lowbloodsugar 2 hours ago

    >build your own lightsaber

    I think this is the best way of putting it I've heard to date. I started building one just to know what's happening under the hood when I use an off-the-shelf one, but it's actually so straightforward that now I'm adding features I want. I can add them faster than a whole team of developers on a "real" product can add them - because they have a bigger audience.

    The other takeaway is that agents are fantastically simple.

    • ericd 44 minutes ago

      Agreed, and it's actually how I've been thinking about it, but it's also straight from the article, so can't claim credit. But it was fun to see it put into words by someone else.

      And yeah, the LLM does so much of the lifting that the agent part is really surprisingly simple. It was really a revelation when I started working on mine.

  • lukevp 4 hours ago

    What’s a good starting point for getting into this? I don’t even know what Cerebras is. I just use GitHub Copilot in VS Code. Is this local models?

    • ericd 3 hours ago

      A lot of it is just from HN osmosis, but /r/LocalLLaMA/ is a good place to hear about the latest open weight models, if that's interesting.

      gpt-oss 120b is an open weight model that OpenAI released a while back, and Cerebras (a startup that is making massive wafer-scale chips that keep models in SRAM) is running that as one of the models they provide. They're a small scale contender against nvidia, but by keeping the model weights in SRAM, they get pretty crazy token throughput at low latency.

      In terms of making your own agent, this one's pretty good as a starting point, and you can ask the models to help you make tools for eg running ls on a subdirectory, or editing a file. Once you have those two, you can ask it to edit itself, and you're off to the races.

  • anonym29 5 hours ago

    Cerebras now has glm 4.6. Still obscenely fast, and now obscenely smart, too.

    • ericd 4 hours ago

      Ooh thanks for the heads up!

riskable 6 hours ago

It's interesting how much this makes you want to write Unix-style tools that do one thing and only one thing really well. Not just because it makes coding an agent simpler, but because it's much more secure!

  • tptacek 5 hours ago

    One thing that radicalized me was building an agent that tested network connectivity for our fleet. Early on, in like 2021, I deployed a little mini-fleet of off-network DNS probes on, like, Vultr to check on our DNS routing, and actually devising metrics for them and making the data that stuff generated legible/operationalizable was annoying and error prone. But you can give basic Unix network tools --- ping, dig, traceroute --- to an agent and ask it for a clean, usable signal, and they'll do a reasonable job! They know all the flags and are generally better at interpreting tool output than I am.

    I'm not saying that the agent would do a better job than a good "hardcoded" human telemetry system, and we don't use agents for this stuff right now. But I do know that getting an agent across the 90% threshold of utility for a problem like this is much, much easier than building the good telemetry system is.
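
    A hedged Python sketch of handing those Unix tools to an agent: the model proposes an argument vector, and the wrapper only runs binaries on an allowlist, without a shell. `run_tool` and `ALLOWED` are hypothetical names for this example, and the allowlist covers the binary only, so it says nothing about whether every flag of that binary is safe.

```python
import shutil
import subprocess

# Only these binaries may be launched, whatever the model asks for.
ALLOWED = {"ping", "dig", "traceroute"}

def run_tool(argv: list, timeout: float = 10.0) -> str:
    """Run an allowlisted command from an argv list (no shell) and return its output."""
    if not argv or argv[0] not in ALLOWED:
        raise PermissionError(f"tool not allowed: {argv[:1]}")
    if shutil.which(argv[0]) is None:
        return f"error: {argv[0]} is not installed"
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "error: timed out"
    # Hand both streams back so the model can interpret failures too.
    return proc.stdout + proc.stderr
```

    A call such as `run_tool(["dig", "+short", "example.com"])` returns raw text, and turning that into a clean signal is the part left to the model.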

    • 0xbadcafebee an hour ago

      > I'm not saying that the agent would do a better job than a good "hardcoded" human telemetry system, and we don't use agents for this stuff right now.

      And that's why I won't touch 'em. All the agents will be abandoned when people realize their inherent flaws (security, reliability, truthfulness, etc) are not worth the constant low-grade uncertainty.

      In a way it fits our times. Our leaders don't find truth to be a very useful notion. So we build systems that hallucinate and act unpredictably, and then invest all our money and infrastructure in them. Humans are weird.

    • foobarian 5 hours ago

      Honestly the top AI use case for me right now is personal throwaway dev tools. Where I used to write shell oneliners with dozen pipes including greps and seds and jq and other stuff, now I get an AI to write me a node script and throw in a nice Web UI to boot.

      Edit: reflecting on what the lesson is here, in either case I suppose we're avoiding the pain of dealing with Unix CLI tools :-D

      • jacquesm 5 hours ago

        Interesting. You have to wonder whether all the tools that approach is based on would have been written in the first place if that kind of thing had been possible all along. Who needs 'grep' when you can write a prompt?

        • tptacek 5 hours ago

          My long running joke is that the actual good `jq` is just the LLM interface that generates `jq` queries; 'simonw actually went and built that.

        • agumonkey 3 hours ago

          It's highly plausible that all we assumed was good design / engineering will disappear if LLMs/Agents can produce more without having to be modular. (sadly)

          • jacquesm 2 hours ago

            There is some kind of parallel between 'AI' and 'Fuzzy Logic'. Fuzzy logic to me always appeared like a large number of patches to get enough coverage for a system to work even if you didn't understand it. AI just increases the number of patches to billions.

            • agumonkey 11 minutes ago

              true, there's often a point where your system becomes a blurry miracle

    • chickensong 2 hours ago

      I hadn't given much thought to building agents, but the article and this comment are inspiring, thx. It's interesting to consider agents as a new kind of interface/function/broker within a system.

    • zahlman 5 hours ago

      > They know all the flags and are generally better at interpreting tool output than I am.

      In the toy example, you explicitly restrict the agent to supply just a `host`, and hard-code the rest of the command. Is the idea that you'd instead give a `description` something like "invoke the UNIX `ping` command", and a parameter described as constituting all the arguments to `ping`?

      • tptacek 5 hours ago

        Honestly, I didn't think very hard about how to make `ping` do something interesting here, and in serious code I'd give it all the `ping` options (and also run it in a Fly Machine or Sprite where I don't have to bother checking to make sure none of those options gives code exec). It's possible the post would have been better had I done that; it might have come up with an even better test.

        I was telling a friend online that they should bang out an agent today, and the example I gave her was `ps`; like, I think if you gave a local agent every `ps` flag, it could tell you super interesting things about usage on your machine pretty quickly.
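
        To put the two options side by side, here is a Python sketch. The tool definitions are modeled on JSON-Schema-style function definitions and are my assumption, not copied from the post; `argv_host_only` and `argv_any_args` are hypothetical helpers. The first form lets the model pick only a host, the second hands it the whole `ping` command line.

```python
import re
import shlex

# Option 1: the model supplies only a host; the rest of the command is hard-coded.
PING_HOST_ONLY = {
    "type": "function",
    "name": "ping",
    "description": "Ping a host and report whether it is reachable.",
    "parameters": {
        "type": "object",
        "properties": {"host": {"type": "string", "description": "Hostname or IP address"}},
        "required": ["host"],
    },
}

# Option 2: the model supplies every argument, as asked about above.
PING_ANY_ARGS = {
    "type": "function",
    "name": "ping",
    "description": "Invoke the UNIX `ping` command.",
    "parameters": {
        "type": "object",
        "properties": {"args": {"type": "string", "description": "All command-line arguments to ping"}},
        "required": ["args"],
    },
}

# Must start with a letter or digit, so a "host" can never be read as a flag.
HOST_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9.:-]*")

def argv_host_only(host: str) -> list:
    if not HOST_RE.fullmatch(host):
        raise ValueError(f"suspicious host: {host!r}")
    return ["ping", "-c", "5", host]

def argv_any_args(args: str) -> list:
    # No validation at all: only reasonable inside a sandbox you can throw away.
    return ["ping", *shlex.split(args)]
```

        Both helpers return an argv list meant for `subprocess.run` without a shell, so even the permissive version cannot chain extra commands with `;` or `&&`.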

        • mwcampbell 3 hours ago

          What is Sprite in this context?

        • zahlman 4 hours ago

          Also to be clear: are the schemas for the JSON data sent and parsed here specific to the model used? Or is there a standard? (Is that the P in MCP?)

  • chemotaxis 6 hours ago

    You could even imagine a world in which we create an entire suite of deterministic, limited-purpose tools and then expose it directly to humans!

    • SatvikBeri 4 hours ago

      Half my use of LLM tools is just to remember the options for command line tools, including ones I wrote but only use every few months.

    • layer8 5 hours ago

      I wonder if we could develop a language with well-defined semantics to interact with and wire up those tools.

      • chubot 5 hours ago

        > language with well-defined semantics

        That would certainly be nice! That's why we have been overhauling shell with https://oils.pub , because shell can't be described as that right now

        It's in extremely poor shape

        e.g. some things found from building several thousand packages with OSH recently (decades of accumulated shell scripts)

        - bugs caused by the differing behavior of 'echo hi | read x; echo x=$x' in shells, i.e. shopt -s lastpipe in bash.

        - 'set -' is an archaic shortcut for 'set +v +x'

        - Almquist shell is technically a separate dialect of shell -- namely it supports 'chdir /tmp' as well as cd /tmp. So bash and other shells can't run any Alpine builds.

        I used to maintain this page, but there are so many problems with shell that I haven't kept up ...

        https://github.com/oils-for-unix/oils/wiki/Shell-WTFs

        OSH is the most bash-compatible shell, and it's also now Almquist shell compatible: https://pages.oils.pub/spec-compat/2025-11-02/renamed-tmp/sp...

        It's more POSIX-compatible than the default /bin/sh on Debian, which is dash

        The bigger issue is not just bugs, but lack of understanding among people who write foundational shell programs. e.g. the lastpipe issue, using () as grouping instead of {}, etc.

        ---

        It is often treated like an "unknowable" language

        Any reasonable person would use LLMs to write shell/bash, and I think that is a problem. You should be able to know the language, and read shell programs that others have written

        • jacquesm 5 hours ago

          I love it how you went from 'Shell-WTFs' to 'let's fix this'. Kudos, most people get stuck at the first stage.

      • zahlman 5 hours ago

        As it happens, I have a prototype for this, but the syntax is honestly rather unwieldy. Maybe there's a way to make it more like natural human language....

        • imiric 5 hours ago

          I can't tell whether any comment in this thread is a parody or not.

          • AdieuToLogic an hour ago

            When in doubt, there's always the option of rewriting an existing interactive shell in Rust.

          • zahlman 5 hours ago

            (Mine was intended as ironic, suggesting that a circle of development ideas would eventually complete. I interpreted the previous comments as satirically pointing at the fact that the notion of "UNIX-like tools" owes to the fact that there is actually such a thing as UNIX.)

  • danpalmer 5 hours ago

    Doing one thing well means you need a lot more tools to achieve outcomes, and more tools means more context and potentially more understanding of how to string them together.

    I suspect the sweet spot for LLMs is somewhere in the middle, not quite as small as some traditional unix tools.

hoppp 3 hours ago

I should? What problems can I solve, that can be only done with an agent? As long as every AI provider is operating at a loss, starting a sustainably monetizable project doesn't feel that realistic.

  • simonw 2 hours ago

    > what problems can I solve, that can be only done with an agent?

    The problem that you might not intuitively understand how agents work and what they are and aren't capable of - at least not as well as you would understand it if you spent half an hour building one for yourself.

  • johnfn 3 hours ago

    The post is just about playing around with the tech for fun. Why does monetization come into it? It feels like saying you don't want to use Python because Astral, the company that makes uv, is operating at a loss. What?

    • hoppp 3 hours ago

      Agents use APIs that I will need to pay for, and generally software dev is a job for me that needs to generate income.

      If the APIs I call are not profitable for the provider then they won't be for me either.

      This post is a fly.io advertisement

      • simonw 2 hours ago

        "Agents use APIs that I will need to pay for"

        Not if you run them against local models, which are free to download and free to run. The Qwen 3 4B models only need a couple of GBs of available RAM and will run happily on CPU as opposed to GPU. Cost isn't a reason not to explore this stuff.

        • awayto an hour ago

          Google has what I would call a generous free tier, even including Gemini 2.5 Pro (https://ai.google.dev/gemini-api/docs/rate-limits). Just get an API key from AI Studio. It's also very easy to add a switch in your agent so that if you hit a rate limit for one model, it re-requests the query with the next model. With Pro/Flash/Flash-Lite and their previews, you've got 2500+ free requests per day.
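
          The fallback switch can be sketched in a few lines of Python. This is an illustration under assumptions: `RateLimited` is a hypothetical exception your own client wrapper would raise on an HTTP 429, `fake_call` is a stub in place of a real API call, and the model names are placeholders.

```python
class RateLimited(Exception):
    """Hypothetical error a client wrapper raises when a model is rate limited."""

def ask_with_fallback(prompt, models, call):
    """Try each model in order and move on when one is rate limited."""
    skipped = []
    for model in models:
        try:
            return model, call(model, prompt)
        except RateLimited:
            skipped.append(model)
    raise RuntimeError(f"all models rate limited: {skipped}")

# Stub client: pretends the first model has run out of free quota.
def fake_call(model, prompt):
    if model == "model-a":
        raise RateLimited(model)
    return f"{model} says: {prompt}"

used, answer = ask_with_fallback("hi", ["model-a", "model-b"], fake_call)
```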

      • sprobertson 2 hours ago

        > software dev is a job for me that needs to generate income

        sir, this is a hackernews

        • lojack 28 minutes ago

          > This post is a <insert-startup-here> advertisement

          same thing you said but in a different context... sir, this is a hackernews

      • tptacek 38 minutes ago

        No, we are not an LLM provider.

      • vel0city 2 hours ago

        Practically everything is something you will need to pay for in the end. You probably spent money on an internet connection, electricity, and computing equipment to write this comment. Are you intending to make a profit from commenting here?

        You don't need to run something like this against a paid API provider. You could easily rework this to run against a local agent hosted on hardware you own. A number of not-stupid-expensive consumer GPUs can run some smaller models locally at home for not a lot of money. You can even play videogames with those cards after.

        Get this: sometimes people write code and tinker with things for fun. Crazy, I know.

        • hoppp an hour ago

          The submission is an advertisement for fly.io and OpenAI, both of which are paid services. We are commenting on an ad. The person who wrote it did it for money. Fly.io operates for money, OpenAI charges for their API.

          They posted it here expecting to find customers. This is a sales pitch.

          At this point why is it an issue to expect a developer to make money on it?

          As a dev, if the chain of monetization ends with me then there is no mainstream adoption whatsoever on the horizon.

          I love to tinker but I do it for free not using paid services.

          As for tinkering with agents, it's a solution looking for a problem.

          • johnfn an hour ago

            Why are you repeatedly stating that the post is an ad as if it is some sort of dunk? Companies have blogs. Tech blogs often produce useful content. It is possible that an ad can both successfully promote the company and be useful to engineers. I find the Fly blog to be particularly well-written and thoughtful; it's taught me a good deal about Wireguard, for instance.

            • hoppp an hour ago

              And that sounds fine, but Wireguard is not an overhyped industry promising huge gains in the future to investors and to developers jumping on a bandwagon who can find problems for this solution.

              I actually have built agents already in the past and this is my opinion. If you read the article, the author says they want to hear the reasoning for disliking it, so this is mine: the only way to create a business is raising money and hoping somebody strikes gold with the shovel I'm paying for.

              • tptacek 36 minutes ago

                They're mentioning WireGuard because we do in fact do WireGuard, unlike LLM agents, which we do not offer as a service.

              • simonw an hour ago

                How would you feel about this post if the exact same content was posted on a developer's personal blog instead?

                I ask because it's rare for a post on a corporate blog to also make sense outside of the context of that company, but this one does.

          • tptacek 37 minutes ago

            You keep saying this, but there is nothing in this post about our service. I didn't use Fly.io at all to write this post. Across the thread, someone had to remind me that I could have.

    • balder1991 2 hours ago

      Yeah we have open source models too that we can use, and it’s actually more fun than using cloud providers in my opinion.

  • furyofantares 3 hours ago

    > As long as every AI provider is operating at a loss

    None of them are doing that.

    They need funding because the next model has always been much more expensive to train than the profits of the previous model. And many do offer a lot of free usage which is of course operated at a loss. But I don't think any are operating inference at a loss, I think their margins are actually rather large.

    • roadside_picnic 3 hours ago

      Parent comment never said operating inference at a loss, though it wouldn't surprise me, they just said "operating at a loss" which they most definitely are [0].

      However, knowing a few people on teams at inference-only providers, I can promise you some of them absolutely are operating inference at a loss.

      0. https://www.theregister.com/2025/10/29/microsoft_earnings_q1...

      • furyofantares 3 hours ago

        > Parent comment never said operating inference at a loss

        Context. Whether inference is profitable at current prices is what informs how risky it is to build a product that depends on buying inference, which is what the post was about.

        • roadside_picnic 2 hours ago

          So you're assuming there's a world where these companies exist solely by providing inference?

          The first obvious limitation of this would be that all models would be frozen in time. These companies are operating at an insane loss and a major part of that loss is required to continue existing. It's not realistic to imagine that there is an "inference" only future for these large AI companies.

          And again, there are many inference only startups right now, and I know plenty of them are burning cash providing inference. I've done a lot of work fairly close to the inference layer and getting model serving happening with the requirements for regular business use is fairly tricky business and not as cheap as you seem to think.

          • vel0city 2 hours ago

            The models may be somewhat frozen in time but with the right tools available to it they don't need all information innately coded into it. If they're able to query for reliable information to drag in they can talk about things that are well outside their original training data.

            • roadside_picnic an hour ago

              For a few months of news this works, but over the span of years even the statistical nature of language drifts a bit. Have you shipped natural language models to production? Even simple classifiers need to be updated periodically because of drift. There is no world where you lead the industry serving LLMs and don't train them as well.

          • furyofantares 2 hours ago

            > So you're assuming there's a world where these companies exist solely by providing inference?

            Yes, obviously? There is no world where the models and hardware just vanish.

            • roadside_picnic an hour ago

              > and hardware just vanish.

              Okay, this tells me you really don't understand model serving or any of the details of infrastructure. The hardware is incredibly ephemeral. Your home GPU might last a few years (and I'm starting to doubt that you've even trained a model at home), but these GPUs have incredibly short lifespans under load for production use.

              Even if you're not working on the back end of these models, you should be well aware that one of the biggest concerns about all this investment is how limited the lifetime of GPUs is. It's not just about being "outdated" by superior technology, GPUs are relatively fragile hardware and don't last too long under constant load.

              As far as models go, I have a hard time imagining a world in 2030 where the model replies "sorry, my cutoff date was 2026" and people have no problem with this.

              Also, you still didn't address my point that startups doing inference only model serving are burning cash. Production inference is not the same as running inference locally where you can wait a few minutes for the result. I'm starting to wonder if you've ever even deployed a model of any size to production.

            • HDThoreaun an hour ago

              If the game is inference the winners are the cloud mega scalers, not the ai labs.

              • furyofantares 18 minutes ago

                This thread isn't about who wins, it's about the implication that it's too risky to build anything that depends on inference because AI companies are operating at a loss.

    • GoatInGrey 3 hours ago

      So AI companies are profitable when you ignore some of the things they have to spend money on to operate?

      Snark aside, inference is still being done at a loss. Anthropic, the most profitable AI vendor, is operating at a roughly -140% margin. xAI is the worst at somewhere around -3,600% margin.

      • simonw 2 hours ago

        The interesting companies to look at here are the ones that sell inference against open weight models that were trained by other companies - Fireworks, Cloudflare, DeepInfra, Together AI etc.

        They need to cover their serving costs but are not spending money on training models. Are they profitable? Probably not yet, because they're investing a lot of cash in competing with each other to R&D more efficient ways of serving etc, but they're a lot closer to profitability than the labs that are spending millions of dollars on training runs.

      • fluidcruft 3 hours ago

        If they are not operating inference at a loss and current models remain useful (why would they regress?), they could just stop developing the next model.

        • balder1991 2 hours ago

          They could, but that’s a recipe for going out of business in the current environment.

          • fluidcruft 2 hours ago

            Yes, but at the same time it's unlikely for existing models to disappear. You won't get the next model, but there is no choice but to keep inference running to pay off creditors.

      • kalkin 3 hours ago

        Where do those numbers come from?

    • necovek an hour ago

      Sounds quite a bit like a pyramid scheme "business model": how is it different?

      If a company stops training new models until they can fund it out of previous profits, do we only slow down or halt altogether? If they all do?

    • hoppp 3 hours ago

      When comparing the cost of an H100 GPU per hour and calculating cost of tokens, it seems the OpenAI offering for the latest model is 5 times cheaper than renting the hardware.

      OpenAI's balance sheet also shows an $11 billion loss.

      I can't see any profit on anything they create. The product is good but it relies on investors fueling the AI bubble.

      • simonw 2 hours ago

        > When comparing the cost of an H100 GPU per hour and calculating cost of tokens, it seems the OpenAI offering for the latest model is 5 times cheaper than renting the hardware.

        How did you come to that conclusion? That would be a very notable result if it did turn out OpenAI were selling tokens for 5x the cost it took to serve them.

        • necovek an hour ago

          I am reading it as OpenAI selling them for 20% of the cost to serve them (serving at the equivalent token/s with cloud pay-per-use GPUs).
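
          The arithmetic behind that reading, as a small Python sketch. The inputs below are made-up round numbers chosen only to make the ratio easy to follow; they are not measurements of any real GPU, model, or price list, and the function names are my own.

```python
def self_host_cost_per_million(gpu_dollars_per_hour: float, tokens_per_second: float) -> float:
    """Dollars to generate one million tokens on rented hardware."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000

def price_to_cost_ratio(api_price_per_million: float,
                        gpu_dollars_per_hour: float,
                        tokens_per_second: float) -> float:
    """Below 1.0 means the API price is lower than the rental cost to serve."""
    cost = self_host_cost_per_million(gpu_dollars_per_hour, tokens_per_second)
    return api_price_per_million / cost

# Hypothetical inputs: a $3.60/hour GPU producing 100 tokens/s, and an API priced at $2 per million tokens.
cost = self_host_cost_per_million(gpu_dollars_per_hour=3.60, tokens_per_second=100)
ratio = price_to_cost_ratio(2.0, 3.60, 100)
# With these inputs the ratio comes out at about 0.2, i.e. "5 times cheaper" is 20% of the cost.
```

          The result is very sensitive to the tokens-per-second figure, and batching many requests on one GPU changes that figure a lot.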

          • simonw 15 minutes ago

            You're right, I misunderstood.

        • khimaros an hour ago

          it seems to me they are saying the opposite

    • throwaway8xak92 3 hours ago

      > None of them are doing that.

      Can you point us to the data?

  • throwaway6977 3 hours ago

    You can be your own AI provider.

    • hoppp 3 hours ago

      For internal software maybe, but for a client facing service the incentives are not right when the norm is to operate at a loss.

    • bilbo0s 3 hours ago

      >starting a sustainably monetizable project doesn't feel that realistic.

      and

      >You can be your own AI provider.

      Not sure that being your own AI provider is "sustainably monetizable"?

  • DANmode an hour ago

    Lots of emotion to unpack here.

    Older developer?

    • yeasku an hour ago

      What is the point of your comment?

      Lonely developer?

  • aidenn0 2 hours ago

    Show me where TFA even implied that you should start a sustainably monetizable project with agents?

  • paulcole 3 hours ago

    I love how programmers generally tout themselves as these tinkerers who love learning about and exploring technology… until it comes to AI and then it’s like “show me the profitable use case.” Just say you don’t like AI!

    • hoppp 2 hours ago

      Yeah but fly.io is a cloud provider doing this advertisement with OpenAI APIs. Both cost money, so if it's not free to operate then the developed project should offset the costs.

      It's about balance.

      Really, it's the AI providers that have been promising unreal gains during this hype period, so people are more profit oriented.

      • tptacek 34 minutes ago

        What does "cloud provider" even have to do with this post?

    • seba_dos1 3 hours ago

      It doesn't have to be profitable. Elegant and clever would suffice.

    • ilikehurdles 2 hours ago

      I don't think hn is reflective of where programmers are today, culturally. 10 years ago, sure, it probably was.

      • khimaros an hour ago

        what place is more reflective today?

oooyay 6 hours ago

Heh, the bit about context engineering is palpable.

I'm writing a personal assistant which, imo, is distinct from an agent in that it has a lot of capabilities a regular agent wouldn't necessarily need such as memory, task tracking, broad solutioning capabilities, etc... I ended up writing agents that talk to other agents which have MCP prompts, resources, and tools to guide them as general problem solvers. The first agent that it hits is a supervisor that specializes in task management and as a result writes a custom context and tool selection for the react agent it tasks.

All that to say, the farther you go down this rabbit hole the more "engineering" it becomes. I wrote a bit on it here: https://ooo-yay.com/blog/building-my-own-personal-assistant/

  • qwertox 5 hours ago

    This sounds really great.

wayy 4 hours ago

everybody loves building agents, nobody likes debugging them. agents hit the classic llm app lifecycle problem: at first it feels magical. it nails the first few tasks, doing things you didn’t even think were possible. you get excited, start pushing it further. you run it and then it fails on step 17, then 41, then step 9.

now you can’t reproduce it because it’s probabilistic. each step takes half a second, so you sit there for 10–20 minutes just waiting for a chance to see what went wrong

  • furyofantares 3 hours ago

    That's why you build extensive tooling to run your change hundreds of times in parallel against the context you're trying to fix, and then re-run hundreds of past scenarios in parallel to verify none of them breaks.
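
    A toy version of that tooling in Python, to show the shape rather than a real harness. `flaky_agent` is a hypothetical stand-in for one agent run that fails some of the time, and the scenario names, run count, and pass threshold are all arbitrary assumptions.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def flaky_agent(scenario: str, seed: int) -> str:
    """Hypothetical stand-in for one agent run; fails roughly 10% of the time."""
    rng = random.Random(f"{scenario}:{seed}")
    return "ok" if rng.random() > 0.1 else "failed"

def pass_rate(agent, scenario: str, runs: int = 200, workers: int = 16) -> float:
    """Run the same scenario many times in parallel and return the fraction that passed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda seed: agent(scenario, seed), range(runs)))
    return sum(r == "ok" for r in results) / runs

def regression_report(agent, scenarios, threshold: float = 0.8) -> dict:
    """Re-run every past scenario and flag the ones that fall below the threshold."""
    rates = {s: pass_rate(agent, s) for s in scenarios}
    return {s: (rate, rate >= threshold) for s, rate in rates.items()}

report = regression_report(flaky_agent, ["step-17", "step-41", "step-9"])
```

    Threads are enough here because a real agent run spends its time waiting on network calls. The report gives a pass rate per scenario rather than a single pass/fail, which suits probabilistic output better.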

    • AdieuToLogic an hour ago

      In the event this comment is slathered in sarcasm:

        Well done!  :-D

    • ht96 3 hours ago

      Do you use a tool for this? Is there some sort of tool which collects evals from live inferences (especially those which fail)?

      • AdieuToLogic an hour ago

        There is no way to prove the correctness of non-deterministic (a.k.a. probabilistic) results for any interesting generative algorithm. All one can do is validate against a known set of tests, with the understanding that the set is unbounded over time.

jbmsf an hour ago

I agree. I find LLMs a bit overblown. I don't think most people want to use chat as their primary interface. But writing a few agents was incredibly informative.

throwaway8xak92 3 hours ago

I lost all respect for fly.io last time they published an article swearing that people are insane to not believe in vibe coding.

Looks like they keep up with the swearing in the company’s blog. Just not my thing I guess.

  • simonw 2 hours ago

    I don't think "insane to not believe in vibe coding" is a fair summary of https://fly.io/blog/youre-all-nuts/ - that post wasn't about vibe coding (at least by its I-think-correct definition of prompt-driven coding where you don't pay any attention to the code that's being written), it was about AI-assisted engineering by professional software developers.

    It did have some swear words in - as did many of the previous posts on the Fly.io corporate blog.

    • AceJohnny2 2 hours ago

      Worth highlighting that both OP article and the one Simon linked are by @tptacek, who is also one of the top commenters here on HN.

      His fly.io posts are very much in his style. I figure they let him post there, without corp-washing, because any publicity is good publicity.

chrisweekly 6 hours ago

There's something(s) about @tptacek's writing style that has always made me want to root for fly.io.

sibeliuss an hour ago

It's easy to create a toy, but much harder to make something right! Like anything, so much weird polish stuff creeps in at the 90% mark.

rbren 4 hours ago

Spoiler: it's not actually that easy. Compaction, security, sandboxing, planning, custom tools--all this is really hard to get right.

We're about to launch an SDK that gives devs all these building blocks, specifically oriented around software agents. Would love feedback if anyone wants to look: https://github.com/OpenHands/software-agent-sdk

  • olingern an hour ago

    Only on HN is there a “well, actually” with little substance followed by a comment about a launch.

    The article isn’t about writing production ready agents, so it does appear to be that easy

  • solarkraft 3 hours ago

    How autonomous/controllable are the agents with this SDK?

    When I build an agent my standard is Cursor, which updates the UI at every reportable step of the way, and gives you a ton of control opportunities, which I find creates a lot of confidence.

    Is this level of detail and control possible with the OpenHands SDK? I’m asking because the last SDK that was simple to get into lacked that kind of control.

vinhnx 3 hours ago

> “You only think you understand how a bicycle works, until you learn to ride one.”

This resonates deeply with me. That's why I built one myself [0]: I really, really love to truly understand how coding agents work. The learning has been immense for me. I now have working knowledge of ANSI escape codes, grapheme clusters, terminal emulators, Unicode normalization, VT protocols, PTY sessions, and filesystem operations - all the low-level details I would never have thought about until I was implementing them.

[0] https://github.com/vinhnx/vtcode

  • dfex 2 hours ago

    >> “You only think you understand how a bicycle works, until you learn to ride one.”

    > This resonates deeply with me. That's why I built one myself [0]

    I was hoping to see a home-made bike at that link.. Came away disappointed

    • vinhnx an hour ago

      Good one! Sorry to disappoint you. But personally, that line strikes deep with me, honestly.

  • lowbloodsugar an hour ago

    It's conflating two issues though. Most people who can ride a bike can't explain the physics. They really don't know how it works. The bicycle lesson is about training the brain on a new task that cannot be taught in any other way.

    This case is more like a journeyman blacksmith who has to make his own tools before he can continue. In doing so, he gets tools of his own, but the real reward was learning what is required to handle the metal such that it makes a strong hammer. And like the blacksmith, you learn more if you use an existing agent to write your agent.

    • vinhnx an hour ago

      Agreed. To me, the wheel is the greatest invention of all. Anyone could have ridden a bike, but the underlying physics and motion behind `riding` are a whole other story.

threecheese 5 hours ago

Does anyone have an understanding - or intuition - of what the agentic loop looks like in the popular coding agents? Is it purely a “while 1: call_llm(system, assistant)”, or is there complex orchestration?

I’m trying to understand if the value for Claude Code (for example) is purely in Sonnet/Haiku + the tool system prompt, or if there’s more secret sauce - beyond the “sugar” of instruction file inclusion via commands, tools, skills etc.

  • mrkurt 4 hours ago

    Claude Code is an obfuscated javascript app. You can point Claude Code at its own package and it will pretty reliably tell you how it works.

    I think Claude Code's magic is that Anthropic is happy to burn tokens. The loop itself is not all that interesting.

    What is interesting is how they manage the context window over a long chat. And I think a fair amount of that is serverside.

    • AdieuToLogic 41 minutes ago

      > Claude Code is an obfuscated javascript app. You can point Claude Code at its own package and it will pretty reliably tell you how it works.

      This is why I keep coming back to Hacker News. If the above is not a quintessential "hack", then I've never seen one.

      Bravo!

  • jeremy_k 4 hours ago

    https://github.com/sst/opencode opencode is open source. Here's a session I started but haven't had time to get back to which is using opencode to ask it about how the loop works https://opencode.ai/s/4P4ancv4

    The summary is

    The beauty is in the simplicity:

    1. One loop - while (true)
    2. One step at a time - stopWhen: stepCountIs(1)
    3. One decision - "Did LLM make tool calls? → continue : exit"
    4. Message history accumulates tool results automatically
    5. LLM sees everything from previous iterations

    This creates emergent behavior where the LLM can:

    - Try something
    - See if it worked
    - Try again if it failed
    - Keep iterating until success
    - All without explicit retry logic!
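
    A minimal Python sketch of that loop, to make the shape concrete. This is not opencode's code: the message format, the `fake_llm` stub, and the `add` tool are all made up for illustration, and the loop is bounded by `max_steps` rather than a bare `while (true)`.

```python
from typing import Callable


def fake_llm(messages: list[dict]) -> dict:
    """Hypothetical stand-in for a model call.

    Requests one tool call, then gives a final answer once a tool
    result appears in the history.
    """
    if any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": "done", "tool_calls": []}
    return {
        "role": "assistant",
        "content": "",
        "tool_calls": [{"name": "add", "args": {"a": 2, "b": 3}}],
    }


# Hypothetical tool registry: tool name -> plain Python callable.
TOOLS: dict[str, Callable[..., object]] = {"add": lambda a, b: a + b}


def run_agent(prompt: str, llm=fake_llm, max_steps: int = 20) -> list[dict]:
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_steps):
        reply = llm(messages)          # the model sees the full history
        messages.append(reply)
        if not reply["tool_calls"]:    # no tool calls means we are done
            break
        for call in reply["tool_calls"]:
            result = TOOLS[call["name"]](**call["args"])
            # Tool output goes back into the history for the next turn.
            messages.append(
                {"role": "tool", "name": call["name"], "content": str(result)}
            )
    return messages
```

    There is no retry logic anywhere: a failed tool result is just more history, and the model decides on the next iteration what to do about it.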

  • CraftThatBlock 5 hours ago

    Generally, that's pretty much it. More advanced tools like Claude Code will also have context compaction (which sometimes isn't very good), or possibly RAG on code (unsure about this, I haven't used any tools that did this). Context compaction, to my understanding, is just passing all the previous context into a call which summarizes it, then that becomes the new context starting point.
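
    A rough sketch of compaction as described, under some loud assumptions: `summarize` is a hypothetical callable (in practice another LLM call), character count stands in for a token count, and the most recent `keep_last` messages are kept verbatim. How Claude Code actually does it is not something this shows.

```python
from typing import Callable


def stub_summarize(messages: list[dict]) -> str:
    """Hypothetical summarizer; a real one would call an LLM."""
    return f"{len(messages)} earlier messages"


def compact(
    messages: list[dict],
    summarize: Callable[[list[dict]], str] = stub_summarize,
    keep_last: int = 4,
    max_chars: int = 2000,
) -> list[dict]:
    """Replace older messages with one summary once the history is too big."""
    total = sum(len(m["content"]) for m in messages)
    if total <= max_chars or len(messages) <= keep_last:
        return messages  # still within budget, nothing to do
    old, recent = messages[:-keep_last], messages[-keep_last:]
    summary = {
        "role": "system",
        "content": f"Summary of earlier conversation: {summarize(old)}",
    }
    return [summary] + recent
```

    The summary becomes the new starting point, so anything the summarizer leaves out is gone for good. That is one plausible reason compaction "sometimes isn't very good".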

8note 4 hours ago

> A subtler thing to notice: we just had a multi-turn conversation with an LLM. To do that, we remembered everything we said, and everything the LLM said back, and played it back with every LLM call. The LLM itself is a stateless black box. The conversation we’re having is an illusion we cast, on ourselves.

the illusion was broken for me by Cline context overflows/summaries, but i think it's very easy to miss if you never push the LLM hard or build your own agent. I really like this wording, and the simple description is missing from how science communicators tend to talk about agents and LLMs imo
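
A tiny sketch of where that illusion lives, assuming a hypothetical `stateless_model` function in place of a real LLM call. The model only ever sees the list it is handed; the "memory" is the client replaying the whole history on every turn.

```python
def stateless_model(messages: list[dict]) -> str:
    """Hypothetical stand-in for an LLM call: it keeps no state between calls."""
    user_turns = [m["content"] for m in messages if m["role"] == "user"]
    return (
        f"I can see {len(user_turns)} user message(s); "
        f"the first was {user_turns[0]!r}"
    )


class Conversation:
    """Client-side memory: the only place the conversation actually exists."""

    def __init__(self, model=stateless_model):
        self.model = model
        self.history: list[dict] = []

    def say(self, text: str) -> str:
        self.history.append({"role": "user", "content": text})
        reply = self.model(self.history)  # full history replayed every call
        self.history.append({"role": "assistant", "content": reply})
        return reply
```

Call the model directly with only the latest message and it has no idea anything came before, which is what a context overflow looks like from the inside.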

8cvor6j844qw_d6 3 hours ago

Question, how hard is it for someone new to agents to dip their toes into writing a simple agent to get data? (e.g., getting reviews from sites for sentiment analysis?)

Forgive me if I get something wrong: from what I see, it seems that fundamentally it is an LLM being run in a loop with information about tools provided to it. On each loop the LLM evaluates inputs/context (from tool calls, inputs, etc.) and decides which tool to call or what text to output.

  • simonw 2 hours ago

    You can prototype this without writing any code at all.

    Fire up "claude --dangerously-skip-permissions" in a fresh directory (ideally in a Docker container if you want to limit the chance of it breaking anything else) and prompt this:

    > Use Playwright to fetch ten reviews from http://www.example.com/ then run sentiment analysis on them and write the results out as JSON files. Install any missing dependencies.

    Watch what it does. Be careful not to let it spider the site in a way that would justifiably upset the site owners.

behnamoh 6 hours ago

> nobody knows anything yet

that sums up my experience in AI over the past three years. so many projects reinvent the same thing, so much spaghetti thrown at the wall to see what sticks, so much excitement followed by disappointment when a new model drops, so many people grifting, and so many hacks and workarounds like RAG with no evidence of them actually working other than "trust me bro" and trial and error.

  • w_for_wumbo 5 hours ago

    I think we'd get better results if we thought of it as a conscious agent. If we recognized that it was going to mirror back our unconscious biases and try to complete the task as we define it, instead of how we think it should behave. Then we'd at least get our own ignorance out of the way when writing prompts.

    Being able to recognize that 'make this code better' provides no direction, it should make sense that the output is directionless.

    But on more subtle levels, whatever subtle goals that we have and hold in the workplace will be reflected back by the agents.

    If cutting costs and increasing profits is your north star, layoffs and unsustainable practices are a logical result when you haven't balanced that with any incentives to abide by human values.

byronic an hour ago

The author shoulda written a REPL

solomonb 5 hours ago

This work predates agents as we know them now and was intended for building chat bots (as in irc chat bots) but when auto-gpt came out I realized I could formalize it super nicely with this library:

https://blog.cofree.coffee/2025-03-05-chat-bots-revisited/

I did some light integration experiments with the OpenAI API but I never got around to building a full agent. Alas..

nowittyusername 5 hours ago

I agree with the sentiment but I also recommend you build a local only agent. Something that runs on llama.cpp or vllm, whatever... This way you can better grasp the more fundamental nature of what LLMs really are and how they work under the hood. That experience will also make you realize how much control you are giving up when using cloud based api providers like OpenAI and why so many engineers feel that LLMs are a "black box". Well duh buddy, you've been working with apis this whole time, of course you won't understand much working just with that.

  • 8note 5 hours ago

    ive been trying this for a few weeks, but i dont at all currently own hardware good enough to be useful for local inference.

    ill be trying again once i have written my own agent, but i dont expect to get any useful results compared to using some claude or gemini tokens

    • nowittyusername 4 hours ago

      My man, we now have llms that are anywhere between 130 million to 1 trillion parameters available for us to run locally, I can guarantee there is a model for you there that even your toaster can run. I have a RTX 4090 but for most of my fiddling i use small models like Qwen 3 4b and they work amazing so there's no excuse :P.

      • 8note 4 hours ago

        well, i got some gemini models running on my phone, but if i switch apps, android kills it, so the call to the server always hangs... and then the screen goes black

        the new laptop only has 16GB of memory total, with another 7 dedicated to the NPU.

        i tried pulling up Qwen 3 4B on it, but the max context i can get loaded is about 12k before the laptop crashes.

        my next attempt is gonna be a 0.5B one, but i think ill still end up having to compress the context every call, which is my real challenge

        • nowittyusername 3 hours ago

          I recommend using low quantized models first, for example anywhere between q4 and q8 gguf models. You also don't need high context to fiddle around and learn the ins and outs; for example, 4k context is more than enough to figure out what you need in agentic solutions. In fact that's a good limit to impose on yourself to start developing decent automatic context management systems internally, as that will be very important when making robust agentic solutions. With all that you should be able to load an llm with no issues on many devices.

qwertox 6 hours ago

I've found it much more useful to create an MCP server, and this is where Claude really shines. You would just say to Claude on web, mobile or CLI that it should "describe our connectivity to google" either via one of the three interfaces, or via `claude -p "describe our connectivity to google"`, and it will just use your tool without you needing to do anything special. It's like custom-added intelligence to Claude.

  • tptacek 5 hours ago

    You can do this. Claude Code can do everything the toy agent this post shows, and much more. But you shouldn't, because doing that (1) doesn't teach you as much as the toy agent does, (2) isn't saving you that much time, and (3) locks you into Claude Code's context structure, which is just one of a zillion different structures you can use. That's what the post is about, not automating ping.

  • mattmanser 5 hours ago

    Honest question, as your comment confuses me.

    Did you get to the part where he said MCP is pointless and are saying he's wrong?

    Or did you just read the start of the article and not get to that bit?

    • vidarh 5 hours ago

      I'd second the article on this, but also add to it that the biggest reason MCP servers don't really matter much any more is that the models are so capable of working with APIs, that most of the time you can just point them at an API and give them a spec instead. And the times that doesn't work, just give them a CLI tool with a good --help option.

      Now you have a CLI tool you can use yourself, and the agent has a tool to use.

      Anthropic itself have made MCP server increasingly pointless: With agents + skills you have a more composeable model that can use the model capabilities to do all an MCP server can with or without CLI tools to augment them.

      • simplesagar 5 hours ago

        I feel the CLI vs MCP debate is an apples to oranges framing. When you're using claude you can watch it using CLIs, running brew, mise, lots of jq, but what about when you've built an agent that needs to work through a complicated API? You don't want to make 5 CRUD calls to get the right answer. A curated MCP tool ensures determinism where it matters most: when interacting with customer data.

zahlman 5 hours ago

> Imagine what it’ll do if you give it bash. You could find out in less than 10 minutes. Spoiler: you’d be surprisingly close to having a working coding agent.

Okay, but what if I'd prefer not to have to trust a remote service not to send me

    { "output": [ { "type": "function_call", "command": "rm -rf / --no-preserve-root" } ] }

?

  • tptacek 5 hours ago

    Obviously if you're concerned about that, which is very reasonable, don't run it in an environment where `rm -rf` can cause you a real problem.

    • awayto 5 hours ago

      Also if you're doing function calls you can just have the command as one response param, and arguments array as another response param. Then just black/white list commands you either don't want to run or which should require a human to say ok.

      • aidenn0 2 hours ago

        blacklist is going to be a bad idea since so many commands can be made to run other commands with their arguments.

        • awayto an hour ago

          Yeah I agree. Ultimately I would suggest not having any kind of function call which returns an arbitrary command.

          Instead, think of it as if you were enabling capabilities for AppArmor, by making a function call definition for just 1 command. Then over time suss out what commands you need your agent to do and nothing more.
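
          A sketch of that "one function call per command" idea, with everything in it assumed for illustration: a single hypothetical `ping` tool, a deliberately strict hostname check, and `ping -c 1` as found on Linux/macOS. The model never supplies a command string, only arguments to a tool that is on the list.

```python
import re
import subprocess

# Conservative hostname shape; anything else is refused outright.
HOSTNAME = re.compile(r"[A-Za-z0-9.-]{1,253}")


def ping_argv(host: str) -> list[str]:
    """Build the argv for the one command this tool is allowed to run."""
    if not HOSTNAME.fullmatch(host) or host.startswith("-"):
        raise ValueError(f"refusing suspicious host: {host!r}")
    return ["ping", "-c", "1", host]


# The allowlist: tool name -> function that builds a fixed-shape argv.
TOOLS = {"ping": ping_argv}


def build_argv(name: str, arguments: dict) -> list[str]:
    if name not in TOOLS:
        raise PermissionError(f"tool not allowed: {name!r}")
    return TOOLS[name](**arguments)


def run_tool(name: str, arguments: dict) -> str:
    """Run an allowed tool without a shell, so arguments are never re-parsed."""
    argv = build_argv(name, arguments)
    done = subprocess.run(argv, capture_output=True, text=True, timeout=10)
    return done.stdout + done.stderr
```

          Passing a list to `subprocess.run` (no `shell=True`) means `; rm -rf /` is at worst a strange hostname, and here it is rejected before anything runs.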

  • worldsayshi 5 hours ago

    There are MCP configured virtualization solutions that are supposed to be safe for letting an LLM go wild. Like this one:

    https://github.com/zerocore-ai/microsandbox

    I haven't tried it.

    • awayto 5 hours ago

      You can build your agent into a docker image then easily limit both networking and file system scope.

          # Written as a Makefile recipe, hence $(shell ...) and $$0.
          #   -v ...            restrict file system to whatever folder
          #   --dns=127.0.0.1   restrict network calls to localhost
          #   --add-host ...    allow outside networking to whatever api your agent calls
          docker run -it --rm \
            -e SOME_API_KEY="$(SOME_API_KEY)" \
            -v "$(shell pwd):/app" \
            --dns=127.0.0.1 \
            $(shell dig +short llm.provider.com 2>/dev/null | awk '{printf " --add-host=llm.provider.com:%s", $$0}') \
            my-agent-image
      
      Probably could be a bit cleaner, but it worked for me.

robot-wrangler 5 hours ago

> Another thing to notice: we didn’t need MCP at all. That’s because MCP isn’t a fundamental enabling technology. The amount of coverage it gets is frustrating. It’s barely a technology at all. MCP is just a plugin interface for Claude Code and Cursor, a way of getting your own tools into code you don’t control. Write your own agent. Be a programmer. Deal in APIs, not plugins.

Hold up. These are all the right concerns but with the wrong conclusion.

You don't need MCP if you're making one agent, in one language, in one framework. But the open coding and research assistants that we really want will be composed of several. MCP is the only thing out there that's moving in a good direction in terms of enabling us to "just be programmers" and "use APIs", and maybe even test things in fairly isolated and reproducible contexts. Compare this to skills.md, which is actually de facto proprietary as of now, does not compose, has opaque run-times and dispatch, is pushing us towards certain models, languages and certain SDKs, etc.

MCP isn't a plugin interface for Claude, it's just JSON-RPC.
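
For concreteness, a sketch of what a tool invocation looks like on the wire, assuming the `tools/call` method and the `name`/`arguments` params described in the MCP specification. The `ping` tool and its argument are hypothetical, and transport, initialization, and error handling are left out.

```python
import json
from itertools import count

# JSON-RPC request ids only need to be unique per connection.
_ids = count(1)


def tool_call_request(name: str, arguments: dict) -> str:
    """Serialize one JSON-RPC 2.0 request asking a server to run a tool."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": next(_ids),
            "method": "tools/call",
            "params": {"name": name, "arguments": arguments},
        }
    )


if __name__ == "__main__":
    print(tool_call_request("ping", {"host": "example.com"}))
```

Nothing in that message is specific to Claude; any client that can write JSON to a pipe or an HTTP connection can send it.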

  • tptacek 5 hours ago

    I think my thing about MCP, besides the outsized press coverage it gets, is the implicit presumption it smuggles in that agents will be built around the context architecture of Claude Code --- that is to say, a single context window (maybe with sub-agents) with a single set of tools. That straitjacket is really most of the subtext of this post.

    I get that you can use MCP with any agent architecture. I debated whether I wanted to hedge and point out that, even if you build your own agent, you might want to do an MCP tool-call feature just so you can use tool definitions other people have built (though: if you build your own, you'd probably be better off just implementing Claude Code's "skill" pattern).

    But I decided to keep the thrust of that section clearer. My argument is: MCP is a sideshow.

    • robot-wrangler 5 hours ago

      I still don't really get it, but would like to hear more. Just to get it out of the way, there's obvious bad aspects. Re: press coverage, everything in AI is bound to be frustrating this way. The MCP ecosystem is currently still a lot of garbage. It feels like a very shitty app-store, lots of abandonware, things that are shipped without testing, the usual band-wagoning. For example instead of a single obvious RAG tool there's 200 different specific tools for ${language} docs

      The core MCP tech though is not only directionally correct, but even the implementation seems to have made lots of good and forward-looking choices, even if those are still under-utilized. For example besides tools, it allows for sharing prompts/resources between agents. In time, I'm also expecting the idea of "many agents, one generic model in the background" is going to die off. For both costs and performance, agents will use special-purpose models but they still need a place and a way to collaborate. If some agents coordinate other agents, how do they talk? AFAIK without MCP the answer for this would be.. do all your work in the same framework and language, or to give all agents access to the same database or the same filesystem, reinventing ad-hoc protocols and comms for every system.

    • 8note 4 hours ago

      i treat MCP as a shorthand for "schema + documentation, passed to the LLM as context"

      you dont need the MCP implementation, but the idea is useful and you can consider the tradeoffs to your context window, vs passing in the manual as fine tuning or something.

zb3 an hour ago

No, because I know that "agents" are token burning machines - for me they're less efficient than the chat interface, slower and burning much more tokens.

I'm not surprised that AI companies would want me to use them though.. I know what you're doing there :)

dagss 5 hours ago

I realize now what I need in Cursor: A button for "fork context".

I believe that would be a powerful tool solving many things there are now separate techniques for.

  • all2 5 hours ago

    crush-cli has this. I think the google gemini chat app also has this now.

andai 3 hours ago

.text-gray-600 { color: black; }

rambojohnson 2 hours ago

The bravado posturing in this article is nauseating. Sure, there are a few serious points buried in there, but damn...dial it down, please.

ATechGuy 6 hours ago

Maybe we should write an agent that writes an agent that writes an agent...

_pdp_ 5 hours ago

It is also very simple to be a programmer.. see,

print "Hello world!"

so easy...

  • dan_can_code 5 hours ago

    But that didn't use the H100 I just bought to put me out of my own job!

wahnfrieden 2 hours ago

The Codex agent has an official TypeScript SDK now.

Why would Fly.io advocate using the vanilla GPT API to write an agent, instead of the official agent?

  • tptacek 11 minutes ago

    Because you won't learn as much using an agent framework, and, as you can see from the post, you absolutely don't need one.

a-dub 3 hours ago

they kinda feel like the cgi perl scripts of the mid 2020s.

vkou 5 hours ago

> It’s Incredibly Easy

    client = OpenAI()
    context_good, context_bad = [{
        "role": "system", "content": "you're Alph and you only tell the truth"
    }], [{
        "role": "system", "content": "you're Ralph and you only tell lies"
    }]
    ...

And this will work great until next week's update, when Ralph's responses will consist of "I'm sorry, it would be unethical for me to respond with lies, unless you pay for the Premium-Super-Deluxe subscription, only available to state actors and firms with a six-figure contract."

You're building on quicksand.

You're delegating everything important to someone who has no responsibility to you.

  • tptacek 10 minutes ago

    I love that the thing you singled out as not safe to run long term, because (apparently) of woke, was my weird deep-cut Labyrinth joke.

manishsharan 6 hours ago

How.. please don't say use langxxx library

I am looking for a language or library agnostic pattern like we have MVC etc. for web applications. Or Gang of Four patterns but for building agents.

  • tptacek 6 hours ago

    The whole post is about not using frameworks; all you need is the LLM API. You could do it with plain HTTP without much trouble.

    • manishsharan 6 hours ago

      When I ask for Patterns, I am seeking help for recurring problems that I have encountered. Context management .. small llms ( ones with small context size) break and get confused and forget work they have done or the original goal.

      • zahlman 5 hours ago

        Start by thinking about how big the context window is, and what the rules should be for purging old context.

        Design patterns can't help you here. The hard part is figuring out what to do; the "how" is trivial.

      • skeledrew 5 hours ago

        That's why you want to use sub-agents which handle smaller tasks and return results to a delegating agent. So all agents have their own very specialized context window.

        • tptacek 5 hours ago

          That's one legit answer. But if you're not stuck in Claude's context model, you can do other things. One extremely stupid simple thing you can do, which is very handy when you're doing large-scale data processing (like log analysis): just don't save the bulky tool responses in your context window once the LLM has generated a real response to them.

          My own dumb TUI agent, I gave a built in `lobotomize` tool, which dumps a text list of everything in the context window (short summary text plus token count), and then lets it Eternal Sunshine of the Spotless Agent things out of the window. It works! The models know how to drive that tool. It'll do a series of giant ass log queries, filling up the context window, and then you can watch as it zaps things out of the window to make space for more queries.

          This is like 20 lines of code.
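
          Not the actual `lobotomize` implementation, just a guess at the shape under stated assumptions: messages are plain dicts, a token is approximated as 4 characters, and "zapping" replaces the content with a short marker instead of deleting the message, so the order of the history stays intact.

```python
def describe(messages: list[dict]) -> str:
    """List the context window: index, role, rough token count, preview."""
    lines = []
    for i, m in enumerate(messages):
        preview = m["content"][:40].replace("\n", " ")
        approx_tokens = len(m["content"]) // 4  # crude estimate, an assumption
        lines.append(f"{i}: {m['role']} ~{approx_tokens} tokens: {preview}")
    return "\n".join(lines)


def forget(messages: list[dict], indices: list[int]) -> list[dict]:
    """Return a copy of the history with the chosen entries blanked out."""
    drop = set(indices)
    return [
        {**m, "content": "[removed to save context]"} if i in drop else m
        for i, m in enumerate(messages)
    ]
```

          Exposed as two tools, the model can call `describe` to see what is taking up space and `forget` to clear bulky tool output it has already answered from.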

          • adiasg 3 hours ago

            Did something similar - added `summarize` and `restore` tools to maximize/minimize messages. Haven't gotten it to behave like I want. Hoping that some fiddling with the prompt will do it.

            • lbotos 3 hours ago

              FYI -- I vouched for you to undead this comment. It felt like a fine comment? I don't think you are shadowbanned but consider emailing the mods if you think you might be.

  • oooyay 6 hours ago

    I'm not going to link my blog again but I have a reply on this post where I link to my blog post where I talk about how I built mine. Most agents fit nicely into a finite state machine or a directed acyclic graph that responds to an event loop. I do use provider SDKs to interact with models but mostly because it saves me a lot of boilerplate. MCP clients and servers are also widely available as SDKs. The biggest thing to remember, imo, is to keep the relationship between prompts, resources, and tools in mind. They make up a sort of dynamic workflow engine.
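
    A small sketch of the finite-state-machine framing, with state and event names invented for illustration (they are not taken from the blog post). Events such as "the model asked for a tool" drive transitions, and anything not in the table is rejected.

```python
from enum import Enum, auto


class State(Enum):
    THINKING = auto()      # waiting on the model
    CALLING_TOOL = auto()  # waiting on a tool result
    DONE = auto()


# Allowed transitions: (current state, event) -> next state.
TRANSITIONS = {
    (State.THINKING, "tool_requested"): State.CALLING_TOOL,
    (State.THINKING, "final_answer"): State.DONE,
    (State.CALLING_TOOL, "tool_result"): State.THINKING,
}


def step(state: State, event: str) -> State:
    """Apply one event, refusing anything the table does not allow."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not valid in state {state.name}") from None


def run(events: list[str], state: State = State.THINKING) -> State:
    """Feed a sequence of events through the machine; return the final state."""
    for event in events:
        state = step(state, event)
    return state
```

    The explicit table is the useful part: a tool result arriving when no tool was requested becomes an error you can log, rather than a silent oddity in the context.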

imiric 4 hours ago

> Give each call different tools. Make sub-agents talk to each other, summarize each other, collate and aggregate. Build tree structures out of them. Feed them back through the LLM to summarize them as a form of on-the-fly compression, whatever you like.

You propose increasing the complexity of interactions of these tools, and giving them access to external tools that have real-world impact? As a security researcher, I'm not sure how you can suggest that with a straight face, unless your goal is to have more vulnerable systems.

Most people can't manage to build robust and secure software using SOTA hosted "agents". Building their own may be a fun learning experience, but relying on a Rube Goldberg assembly of disparate "agents" communicating with each other and external tools is a recipe for disaster. Any token could trigger a cascade of hallucinations, wild tangents, ignored prompts, poisoned contexts, and similar issues that have plagued this tech since the beginning. Except that now you've wired them up to external tools, so maybe the system chooses to wipe your home directory for whatever reason.

People nonchalantly trusting nondeterministic tech with increasingly more real-world tasks should concern everyone. Today it's executing `ping` and `rm`; tomorrow it's managing nuclear launch systems.

teiferer 6 hours ago

Write an agent, it's easy! You will learn so much!

... let's see ...

client = OpenAI()

Um right. That's like saying you should implement a web server, you will learn so much, and then you go and import http (in golang). Yeah well, sure, but that brings you like 98% of the way there, doesn't it? What am I missing?

  • tptacek 6 hours ago

    That OpenAI() is a wrapper around a POST to a single HTTP endpoint:

        POST https://api.openai.com/v1/responses
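
    A sketch of that POST using only the Python standard library. It assumes the documented request shape for this endpoint (a JSON body with `model` and `input`, plus a bearer token header); the model id and API key are whatever your account uses, and the helper names are made up here.

```python
import json
import urllib.request

API_URL = "https://api.openai.com/v1/responses"


def build_request(prompt: str, model: str, api_key: str) -> urllib.request.Request:
    """Build the HTTP request without sending it."""
    body = json.dumps({"model": model, "input": prompt}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

    Everything an agent adds on top (history, tools, loops) ends up inside that JSON body.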

  • MeetingsBrowser 6 hours ago

    I think you might be conflating an agent with an LLM.

    The term "agent" isn't really defined, but it's generally a wrapper around an LLM designed to do some task better than the LLM would on its own.

    Think Claude vs Claude Code. The latter wraps the former, but with extra prompts and tooling specific to software engineering.

  • bootwoot 6 hours ago

    That's not an agent, it's an LLM. An agent is an LLM that takes real-world actions

  • Bjartr 5 hours ago

    No, it's saying "let's build a web service" and starting with a framework that just lets you write your endpoints. This is about something higher level than the nuts and bolts. Both are worth learning.

    The fact you find this trivial is kind of the point that's being made. Some people think having an agent is some kind of voodoo, but it's really not.

  • munchbunny 6 hours ago

    An agent is more like a web service in your metaphor. Yes, building a web server is instructive, but almost nobody has a reason to do it instead of using an out of the box implementation once it’s time to build a production web service.

  • victorbjorklund 6 hours ago

    maybe more like “let’s write a web server but let’s use a library for the low level networking stack”. That can still teach you a lot.

zkmon 5 hours ago

One of the better blog articles I have read in a while. Maybe MCP could have been involved as well?