There is nothing wrong with the HTTP layer, it's just a way to get a string into the model.
The problem is the industry obsession on concatenating messages into a conversation stream. There is no reason to do it this way. Every time you run inference on the model, the client gets to compose the context in any way they want; there are more things than just concatenating prompts and LLM ouputs. (A drawback is caching won't help much if most of the context window is composed dynamically)
Coding CLIs as well as web chat works well because the agent can pull in information into the session at will (read a file, web search). The pain point is that if you're appending messages a stream, you're just slowly filling up the context.
The fix is to keep the message stream concept for informal communication with the prompter, but have an external, persistent message system that the agent can interact with (a bit like email). The agent can decide which messages they want to pull into the context, and which ones are no longer relevant.
The key is to give the agent not just the ability to pull things into context, but also remove from it. That gives you the eternal context needed for permanent, daemonized agents.
I guess the short-cut is to include all the chat conversation history, and then if the history contains "do X" followed by "no actually do Y instead", then the LLM can figure that out. But isn't it fairly tricky for the agent harness to figure that out, to work out relevancy, and to work out what context to keep? Perhaps this is why the industry defaults to concatenating messages into a conversation stream?
Let's say that you have two agents running concurrently: A & B. Agent A decides to push a message into the context of agent B. It does that and the message ends up somewhere in the list of the message right at the bottom of the conversation.
The question is, will agent B register that a new message was inserted and will it act on it?
If you do this experiment you will find out that this architecture does not work very well. New messages that are recent but not the latest have little effect for interactive session. In other words, Agent A will not respond and say, "and btw, this and that happened" unless perhaps instructed very rigidly or perhaps if there is some other instrumentation in place.
Your mileage may vary depending on the model.
A better architecture is pull-based. In other words, the agent has tools to query any pending messages. That way whatever needs to be communicated is immediately visible as those are right at the bottom of the context so agents can pay attention to them.
An agent in that case slightly more rigid in a sense that the loop needs to orchestrate and surface information and there is certainly not one-size-fits-all solution here.
I hope this helps. We've learned this the hard way.
> All of these features are about breaking the coupling between a human sitting at a terminal or chat window and interacting turn-by-turn with the agent.
This means:
- less and less "man-in-the-loop"
- less and less interaction between LLMs and humans
- more and more automation
- more and more decision-making autonomy for agents
- more and more risk (i.e., LLMs' responsibility)
- less and less human responsibility
Problem:
Tasks that require continuous iteration and shared decision-making with humans have two possible options:
Getting to that point is likely going to involve a lot of (the business and personal equivalent of) Teslas electing to drive through white semitrailers.
Struggling with this at the moment too - the second you have a task that is a blend of CI style pipeline, LLM processing and openclaw handing that data back and forth, maintaining state and triggering next step gets tricky. They're essentially different paradigms of processing data and where they meet there are impedance mismatches.
Even if I can string it together it's pretty fragile.
That said I don't really want to solve this with a SaaS. Trying really hard to keep external reliance to a minimum (mostly the llm endpoint)
> The interesting thing is what agents can do while not being synchronously supervised by a human.
I vibe coded a message system where I still have all the chat windows open but my agents run a command that finished once a message meant for them comes along and then they need to start it back up again themselves. I kept it semi-automatic like that because I'm still experimenting whether this is what I want.
I don't think this is quite right. I do work for a pub/sub company that's involved in this space, but this article isn't a commercial sales pitch and we do have a product that exists.
The article is about how agents are getting more and more async features, because that's what makes them useful and interesting. And how the standard HTTP based SSE streaming of response tokens is hard to make work when agents are async.
The idea of the "session" is an interesting solution, I'll be looking forward to new developments from you on this.
I don't think it solves the other half of the problem that we've been working on, which is what happens if you were not the one initiating the work, and therefore can't "connect back into a session" since the session was triggered by the agent in the first place.
With the approach based on pub/sub channels, this is possible to do if you know the name of the session (i.e. know the name of the channel).
Of course the hard bit then is; how does the client know there's new information from the agent, or a new session?
Generally we'd recommend having a separate kind of 'notification' or 'control' pub/sub channel that clients always subscribe to to be notified of new 'sessions'. Then they can subscribe to the new session based purely on knowing the session name.
I recognize the problem statement and decomposition of it. But not the solution.
Especially saying that he sees the same problem being worked on by N people. And now that makes in N+1?
I’ve been more interested by the protocols and standard that could truly solve this for everyone in a cross-compatible way.
Some people have dabbled with atproto as the transport and “memory” storage for example.
I don't know Kitaru too well, but I do know Temporal a bit.
The pattern I describe in the article of 'channels' works really well for one of the hardest bits of using a durable execution tool like Temporal. If your workflow step is long running, or async, it's often hard to 'signal' the result of the step out to some frontend client. But using channels or sessions like in the article it becomes super easy because you can write the result to the channel and it's sent in realtime to the subscribed client. No HTTP polling for results, or anything like that.
at https://agentblocks.ai we just use Google-style LROs for this, do we really need a "durable transport for AI agents built around the idea of a session"?
Assuming LROs are "Long running operations", then you kick off some work with an API request, and get some ID back. Then you poll some endpoint for that ID until the operation is "done". This can work, but when you try and build in token-streaming to this model, you end up having to thread every token through a database (which can work), and increasing the latency experienced by the user as you poll for more tokens/completion status.
Obviously polling works, it's used in lots of systems. But I guess I am arguing that we can do better than polling, both in terms of user experience, and the complexity of what you have to build to make it work.
If your long running operations just have a single simple output, then polling for them might be a great solution. But streaming LLM responses (by nature of being made up of lots of individual tokens) makes the polling design a bit more gross than it really needs to be. Which is where the idea of 'sessions' comes in.
The problem is the industry obsession on concatenating messages into a conversation stream. There is no reason to do it this way. Every time you run inference on the model, the client gets to compose the context in any way they want; there are more things than just concatenating prompts and LLM ouputs. (A drawback is caching won't help much if most of the context window is composed dynamically)
Coding CLIs as well as web chat works well because the agent can pull in information into the session at will (read a file, web search). The pain point is that if you're appending messages a stream, you're just slowly filling up the context.
The fix is to keep the message stream concept for informal communication with the prompter, but have an external, persistent message system that the agent can interact with (a bit like email). The agent can decide which messages they want to pull into the context, and which ones are no longer relevant.
The key is to give the agent not just the ability to pull things into context, but also remove from it. That gives you the eternal context needed for permanent, daemonized agents.
This is absolutely the hardest bit.
I guess the short-cut is to include all the chat conversation history, and then if the history contains "do X" followed by "no actually do Y instead", then the LLM can figure that out. But isn't it fairly tricky for the agent harness to figure that out, to work out relevancy, and to work out what context to keep? Perhaps this is why the industry defaults to concatenating messages into a conversation stream?
Maybe there’s a way to play around with this idea in pi. I’ll dig into it.
Let's say that you have two agents running concurrently: A & B. Agent A decides to push a message into the context of agent B. It does that and the message ends up somewhere in the list of the message right at the bottom of the conversation.
The question is, will agent B register that a new message was inserted and will it act on it?
If you do this experiment you will find out that this architecture does not work very well. New messages that are recent but not the latest have little effect for interactive session. In other words, Agent A will not respond and say, "and btw, this and that happened" unless perhaps instructed very rigidly or perhaps if there is some other instrumentation in place.
Your mileage may vary depending on the model.
A better architecture is pull-based. In other words, the agent has tools to query any pending messages. That way whatever needs to be communicated is immediately visible as those are right at the bottom of the context so agents can pay attention to them.
An agent in that case slightly more rigid in a sense that the loop needs to orchestrate and surface information and there is certainly not one-size-fits-all solution here.
I hope this helps. We've learned this the hard way.
This means:
- less and less "man-in-the-loop"
- less and less interaction between LLMs and humans
- more and more automation
- more and more decision-making autonomy for agents
- more and more risk (i.e., LLMs' responsibility)
- less and less human responsibility
Problem:
Tasks that require continuous iteration and shared decision-making with humans have two possible options:
- either they stall until human input
- or they decide autonomously at our risk
Unfortunately, automation comes at a cost: RISK.
Why do you think the same will not also be true for AI steerers/managers/CEO?
In a year of two, having a human in the loop, will all of their biases and inconsistencies will be considered risky and irresponsible.
Even if I can string it together it's pretty fragile.
That said I don't really want to solve this with a SaaS. Trying really hard to keep external reliance to a minimum (mostly the llm endpoint)
I vibe coded a message system where I still have all the chat windows open but my agents run a command that finished once a message meant for them comes along and then they need to start it back up again themselves. I kept it semi-automatic like that because I'm still experimenting whether this is what I want.
But they get plenty done without me this way.
The article is about how agents are getting more and more async features, because that's what makes them useful and interesting. And how the standard HTTP based SSE streaming of response tokens is hard to make work when agents are async.
I don't think it solves the other half of the problem that we've been working on, which is what happens if you were not the one initiating the work, and therefore can't "connect back into a session" since the session was triggered by the agent in the first place.
Of course the hard bit then is; how does the client know there's new information from the agent, or a new session?
Generally we'd recommend having a separate kind of 'notification' or 'control' pub/sub channel that clients always subscribe to to be notified of new 'sessions'. Then they can subscribe to the new session based purely on knowing the session name.
The pattern I describe in the article of 'channels' works really well for one of the hardest bits of using a durable execution tool like Temporal. If your workflow step is long running, or async, it's often hard to 'signal' the result of the step out to some frontend client. But using channels or sessions like in the article it becomes super easy because you can write the result to the channel and it's sent in realtime to the subscribed client. No HTTP polling for results, or anything like that.
Having long living requests, where you submit one, you get back a request_id, and then you can poll for it's status is a 20 year old solved problem.
Why is this such a difficult thing to do in practice for chat apps? Do we need ASI to solve this problem?
> So how are folks solving this?
$5 per month dedicated server, SSH, tmux.
Obviously polling works, it's used in lots of systems. But I guess I am arguing that we can do better than polling, both in terms of user experience, and the complexity of what you have to build to make it work.
If your long running operations just have a single simple output, then polling for them might be a great solution. But streaming LLM responses (by nature of being made up of lots of individual tokens) makes the polling design a bit more gross than it really needs to be. Which is where the idea of 'sessions' comes in.