VIDEO=https://youtu.be/jgzSq5YGK_Q
hello I'm going to be demonstrating a
draft for resolution to produce
self-improving agents but just a caveat
we are not at its current state and for
the within the scope of the draft it is
only addressing um how to produce a
self-improving LLM call um this is
because this solution does not address
calling tools it does not address um
dynamic dynamic workflows which are the
two tenants of an LLM agent it is only
addressing producing the best output for
an LLM call so to begin we have a key um
this key consists so so sorry I I first
going to introduce the key so the key is
going to have a target which as you can
see here is basically the LLM call that
we want to um that we want to update and
improve the reason I have this is
because there might be several LM calls
there's going to be several LM calls in
this draft then we have the color gray
gray black which indicates that this is
something that might be introduced in a
future version then yellow indicates an
action blue indicates an observation um
the reason why I have these two uh keys
is because this um the way the system is
being built is guided by reinforcement
learning where actions and observations
are a key um are a key concept um in in
in order to build re a reinforcement
learning
system so we b I basically projected
those two concepts over to my solution
so first before we even get into the
solution let's just cover what is a um
what is the common workflow for a
developer that wants to create an
LLM and who wants to create and optimize
an LLM call so first the developer is
going to configure the LLM call which
may consist of a system prompt right
consists of a system prompt the chat
history the output an output constraint
which is comprised of two aspects one is
the structure which is more often than
not um
not more often than not set and then um
persisted and then the details which are
um set and might be iterated read it on
then you have the metadata which
are which consists of temperature and
top k top p which modify the way the llm
samples things um then you have the
model itself the and then the evil
prompt and and eval rails i will address
the evil in a moment so once you have
these parameters set by the developer
you pass them over to the llm to to make
the llm call um and the LLM call is made
and if you pass in an instrument which
is um something that basically wraps the
LLM call um it it is able to observe the
parameters and the outputs of the of the
function which um
in most cases is is going to be like an
an open a an open AI um schema SDK uh
that calls some provider
So once you once you attach that
instrument um the the instrument then
keeps track of the inputs and outputs
and then once we have those inputs and
outputs the inputs being what we already
addressed over here and then the outputs
being what is generated by the by the
provider you pass those two over to the
EVO so now let's address the evil the
you have you may for for all intents and
purposes we are going to be addressing
the evil as um an LLM as a judge and
what an LM of a judge as a judge
consists of is basically you pass in a
prompt for for the LLM judge to judge
your inputs and outputs
so in in the case that you want to just
improve formatting then you're going to
have the the inputs that you've already
provided and then it's going to have an
output that may or may not be formatted
correctly and then in the evolve prompt
you're going to tell it hey this is the
format that I want um once
that's once that prompt is provided then
you have
um then you also pass in the rails which
are just to constrain the output space
to either some scalar um some discrete
number or some or some category um so in
our case we're just we would say like
either formatted or not formatted um and
so we pass all of this information into
the eval and it's going to determine
whether or not the outputs are formatted
or not for formatted and that's where
the score is going to that will produce
a score in addition arise which is going
to be our telemetry system that provided
the instrumentation also provides um an
exclamation to why the score was uh is
is what it is and so we can also tap
into that but this is this would be
optional uh this is the the the
exclamation is optional so once you have
the exclam the exclamation score inputs
outputs you pass that all to be stored
inside arise or the phoenix um uh
telemetry platform the telemetry
platform of your choice um but in our ca
but in my case I'm basing this off of
the arise platform so moving forward so
what is the actual solution that what I
oh and before I
forget so once you have that information
what the developer what a developer does
is they look at the logs from arise and
phoenix look at the inputs and outputs
but oftent times that you're talking
about hundreds of um hundreds of input
output pairs you don't have time to
really look at it so instead you'll just
look at the scores and see get a gist of
like how things are actually performing
without having to look into the
nitty-gritty details of things so the
developer observes this as needed and
then makes a decision as to what needs
to be changed so what can actually be
changed i haven't addressed the colors
here well the
developer is mo more like the developer
is going to most likely change the
system prompt
details metadata and model these are the
actions that the developer can take the
rest are just observations like the chat
the chat history is not in the
developer's control that is up to the um
that's up to the user or at least
whatever um whatever information is
being passed to the de developer the
structure that is also dictated by
whatever contract the developer needs to
adhere to um and then the eval prompt
and evil rails ideally you don't want to
change those otherwise you you're kind
of cheating your own evals um yeah so
these are the the the yellow industry
indicates what the developer can
actually change so moving forward you
have um we're going to move Oh and once
those are changed you iterate again uh
same process over and over and over the
developer iterates onto until they have
like an an optimal LLM
call then you have the autonomous flow
this autonomous
flow this
auton so this is my solution for the
autonomous flow um you have
the well the the the configuration that
you need to set up to set this up
consists of a very simple instrument for
the EVAL the only reason uh that you
need to add this is because um I'm going
to need access to the to the eval prompt
for further evaluation during the update
step so once you have that evol
instrument then you um then you're set
up um and what happens afterwards is you
basically have the user or whatever um
through whatever means you end up
populating and getting results getting
these input output pairs and getting
evaluations that's what you want in
order to update things so that they
improve um so that that's going to
happen here I illustrate it as a user
using your platform which invokes the
call provider uh the target call
provider which is um right here call um
calls LLM provider
calls LLM provider and that produces a
generation and then over and over and
over for however many times your um your
uh your inference has been called
afterwards afterwards you have
um you have the actual update step so
what the update step consists of
is sorry I had to switch rooms get a bit
of silence um so what needs to happen in
the update step the developer needs to
configure only the date time um this
indicates like from what time and onward
you want to get your samples from once
that is indicated you're going to move
on to um the data that you want to
download from from the data that you
want to download SLquery from Horizon
Phoenix or whatever your telemetry
provider is so once once you have that
data um then you start pre-processing it
so you start to narrow it down to only
the data that you need which in our
cases is going to consist of the chat
history eval score evolve exclamation
eval output generation and out output
generation and the datetime which is
what we originally filtered by um once
you have all that you basically pass it
um you pass it up and into the update
template so you're going to pass in your
chat again chat chat history chat
history output evol scores evol
reasoning you're going to pass in as
many instances of those as as you need
uh depending on the on the datetime that
you configured then you pass in the
actions and we only need one instance of
this because we're assuming that you
haven't changed those throughout the
course of this iteration so so you're
going to have a system prompt and then
you're going to have output constraint
details hyperparameters whatever else um
you you only need one instance of those
and then you have the evolve template um
which is provided by the evoling
instrument that we configured before um
so once you have that eval template uh
once you have each of these which once
you have each of these documents you
basically are going to pass them into an
eval template that will be structured
such that it basically serves as the
reward function for updating things so
what would this evolve template look
like it would basically consist of
listing out each one of the chat history
output evol scores and eval reasoning
then saying hey here are the things that
you can change in in the form of a
system prompt output constraints details
and then at the end and these were all
evaluated by this evolve template um and
then once yeah and this and these are
all evaluated by the evolve template
please update the actions so that uh we
get better results and so you pass that
all in to the evolve template you pass
that template over to make an an LLM
call ideally with a very strong model um
which will then make updates to the
actions so that may either be the system
prompt and maybe this system prompt and
output constraints and so on and so
forth and then you then you pass these
new updated actions over to the to the
same calls provider that you're trying
that you're looking to update um while
also passing in the chat history and
output into the calls provider um so
that you're running it against the same
samples that uh that you're looking to
update against which is going to produce
uh a series of outputs um then you pass
in all that information
you pass in all that information into
the EVO as before and then the EVO is
going to um with with the EVO prompt and
the EVO scoring system remaining static
um only with only the actions
change with the with the inputs meaning
the chat history the the newly generated
outputs and the same old EVA you're
gonna evaluate and then see if the evol
score improved compared to what you had
before and if it does not improve then
you go back and then you update the you
update the actions and then the evolve
scores and evolve reasoning and um and
it'll continue iterating until it
reaches a until the evaluation it ends
up
improving um
once so so yeah um and once your evol
actually does does get better then it
provides feedback back back to the
developer telling him hey I changed this
um which resulted in in um these better
results and apply them as you will um we
can in fact even close that portion of
the loop um but for this first first
initial draft version that is it um I'd
love to hear your feedback i'd love to
um know what you think any and and
respond to any comments or questions you
may have um it all helps in trying to
build this as um robustly and as
effective as possible and before I close
out what is the like uh the the the
vision going forward so you're we have
this system it is grounded on
reinforcement learning concepts we can
extrapolate this such
that we can also take into account other
LLM calls down um um downstream and your
whatever flow you're iterating you're
incorporating and we can assign um we
can assign values and we can assign um
reward values we can assign um value
functions we can and we can iterate over
time steps so that um this is scalable
over to an actual agentic system which
involves several LLM calls um but yeah
that's um that's further in the future
but this is the general idea of
self-improving
um LLM calls thank you