Over the past few days, since Friday, I've been designing and implementing a more lightweight, more opinionated version of Druids.

I started on this after a discussion with Leni, where we agreed that a few issues in Druids make it unwieldy and less than ideal as a base for our agent science. I think this is the right call: the new library should be a significant step up, speeding us up and making our lives easier.

The problems with the original Druids mostly stem from the complexity of it being a real user-facing product, which makes it harder and more annoying to use internally.

Namely:

1. The client-server boundary causes a lot of software complexity that makes many things more annoying
    - have this local-prod env gap
        - can't do multi-file
        - can't use dependencies easily
    - can't just run things as normal code
    - lots of internal software complexity
2. It's hard to see what the agents are doing
3. Configuring and controlling the environments in which agents run is also more annoying, with restrictions that come from all this complexity in Druids.
4. The fact that we are using ACP limits our control of the harness; instead we can use pi and pi extensions

So I set out to design a new system that would be much simpler, abolishing the client-server distinction, and also much more useful as an open source tool because it is natively lightweight.

Advantages:

- easier to develop quickly with
- easier to hack, less undue complexity
- agents will be fully visible to the user in tmux, due to advantages of pi
- all-local

Rather than having programs you upload to a server, you just write python files that call a library in arbitrary ways, and that library manages and monitors a registry of agents and the machines on which they run.
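
As a rough illustration of that shape, here is a minimal sketch of what such a program could look like. Every name in it (`Registry`, `Agent`, `spawn`, `on_machine`) is hypothetical and made up for this note, not the actual library API, and the registry is purely in-memory: it does not start real processes, tmux sessions, or pi.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical record for one agent (not the real library's type)."""
    name: str
    machine: str
    status: str = "running"


@dataclass
class Registry:
    """Hypothetical in-memory registry of agents and the machines they run on."""
    agents: dict = field(default_factory=dict)

    def spawn(self, name: str, machine: str = "local") -> Agent:
        # Refuse duplicate names so each agent can be looked up unambiguously.
        if name in self.agents:
            raise ValueError(f"agent {name!r} already exists")
        agent = Agent(name=name, machine=machine)
        self.agents[name] = agent
        return agent

    def on_machine(self, machine: str) -> list:
        # Names of all agents registered on the given machine.
        return [a.name for a in self.agents.values() if a.machine == machine]


# The "program" is ordinary Python: no upload step, just library calls.
registry = Registry()
registry.spawn("planner")
registry.spawn("coder", machine="box-1")
registry.spawn("reviewer", machine="box-1")
print(registry.on_machine("box-1"))
```

The point of the sketch is only that the program and the registry live in the same process, so multi-file code and normal dependencies come for free.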

Along the way, I reflected a bit more on the design principles of such a library, and on what made running agents hard for me in our past language.

I think this has broadly been fruitful and has revealed ideas around reliability and what patterns can be useful to get agents to do things well. It's also pretty open-ended and difficult, and there is always a tradeoff between theorizing and interacting with the world/iterating, so at times I've felt more confused. Discussions with fulcroids have been useful there.

Notably, these concerns are of a different class than the ones that initially suggested this refactor, but concrete decisions in the design of the language made it a good time to consider them, even if they slowed the project down.

Here are some example issues with current Druids. Some are fundamental, and I'm not saying we will fully solve all of them; the question is whether we can design abstractions that make them easy for the user to handle:

- Druids programs have timing issues, both in agent behavior and in how agents and machines get spawned, that can be confusing for the user to think about
- not optimized currently for fault-tolerance - it's hard to notice when an agent has failed and model that in a program, and this is super important for getting somewhere with these larger programs and unpredictable agent behavior => want to make it easy to notice errors, notice agent failure, and restart
- not very modular - it's often hard to isolate and reuse components of programs in neat ways, which makes it hard to reduce complexity
- distributed coordination problems around writing to shared state, like a codebase, a list of ideas, etc...

These are quite deep problems in software design. I've been reading, learning, and making progress, and it's been fun. But I also need to converge.

## Ideas

This section is going to be less cogent, as I'm summarizing my/our ideas and how they relate to each other.

- Using Result types more to make it easy to catch errors and retry
    - also related: callbacks with failure modes like timeouts that you can catch
- having state boundaries that make it easier to delimit state and to say which agents or processes can write to which state
    - eg agent state
    - or broader kinds of divisions
- having a better message sending API
    - make it easier to clearly interrupt, await responses, things like this
- exposing more agent callbacks, on_idle, on_stop, etc...
- taking the idea that agent programs should be modular stateful functions with an actual output more seriously, maybe with some ways to communicate with it?
    - this is my latest idea as of today, and I find it pretty interesting
    - seems better than agent state, as I think about what these agent programs will look like beyond N=1
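
To make the first idea (Result types for catching errors and retrying) a bit more concrete, here is a minimal sketch. The names `Ok`, `Err`, `retry`, and `flaky_agent_step` are all hypothetical and invented for this note; nothing here is the real library, and the "agent step" is a stand-in function that fails twice with a fake timeout before succeeding.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class Ok:
    """Hypothetical success value returned by an agent step."""
    value: object


@dataclass(frozen=True)
class Err:
    """Hypothetical failure value, e.g. a timeout or a crashed agent."""
    reason: str


Result = Union[Ok, Err]


def retry(step: Callable[[], Result], attempts: int = 3) -> Result:
    """Run `step` until it returns Ok or the attempts run out."""
    last: Result = Err("no attempts made")
    for _ in range(attempts):
        last = step()
        if isinstance(last, Ok):
            return last
    # Every attempt failed: hand the last failure back to the caller.
    return last


# Stand-in for an unreliable agent: fails twice, then succeeds.
calls = {"n": 0}


def flaky_agent_step() -> Result:
    calls["n"] += 1
    if calls["n"] < 3:
        return Err("timeout")
    return Ok("patch written")


outcome = retry(flaky_agent_step, attempts=5)
print(outcome, "after", calls["n"], "calls")
```

Because failure is an ordinary return value rather than an exception buried in a callback, the program has to decide what to do with it, which is the property the fault-tolerance concerns above are asking for.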
