[0.7s] hi everyone so I've wanted to make this
[2.8s] video for a while it is a comprehensive
[5.4s] but General audience introduction to
[8.1s] large language models like Chachi PT and
[11.2s] what I'm hoping to achieve in this video
[12.6s] is to give you kind of mental models for
[14.6s] thinking through what it is that this
[17.2s] tool is it is obviously magical and
[19.7s] amazing in some respects it's uh really
[22.4s] good at some things not very good at
[23.9s] other things and there's also a lot of
[25.4s] sharp edges to be aware of so what is
[28.1s] behind this text box you can put
[29.7s] anything in there and press enter but uh
[32.2s] what should we be putting there and what
[34.6s] are these words generated back how does
[36.7s] this work and what what are you talking
[38.2s] to exactly so I'm hoping to get at all
[40.4s] those topics in this video we're going
[42.2s] to go through the entire pipeline of how
[44.0s] this stuff is built but I'm going to
[45.6s] keep everything uh sort of accessible to
[48.2s] a general audience so let's take a look
[50.5s] at first how you build something like
[51.8s] chpt and along the way I'm going to talk
[53.8s] about um you know some of the sort of
[56.9s] cognitive psychological implications of
[59.7s] the tools okay so let's build Chachi PT
[62.9s] so there's going to be multiple stages
[64.4s] arranged sequentially the first stage is
[67.0s] called the pre-training stage and the
[69.6s] first step of the pre-training stage is
[71.1s] to download and process the internet now
[73.3s] to get a sense of what this roughly
[74.6s] looks like I recommend looking at this
[76.8s] URL here so um this company called
[80.5s] hugging face uh collected and created
[83.4s] and curated this data set called Fine
[86.3s] web and they go into a lot of detail on
[88.6s] this block post on how how they
[90.1s] constructed the fine web data set and
[92.4s] all of the major llm providers like open
[94.3s] AI anthropic and Google and so on will
[96.4s] have some equivalent internally of
[98.8s] something like the fine web data set so
[101.2s] roughly what are we trying to achieve
[102.4s] here we're trying to get ton of text
[104.5s] from the internet from publicly
[105.8s] available sources so we're trying to
[108.0s] have a huge quantity of very high
[110.5s] quality documents and we also want very
[113.1s] large diversity of documents because we
[115.1s] want to have a lot of knowledge inside
[116.8s] these models so we want large diversity
[119.4s] of high quality documents and we want
[121.6s] many many of them and achieving this is
[124.0s] uh quite complicated and as you can see
[125.6s] here takes multiple stages to do well so
[128.6s] let's take a look at what some of these
[129.8s] stages look like in a bit for now I'd
[131.8s] like to just like to note that for
[133.2s] example the fine web data set which is
[134.8s] fairly representative what you would see
[136.5s] in a production grade application
[138.6s] actually ends up being only about 44
[140.4s] terabyt of dis space um you can get a
[143.6s] USB stick for like a terabyte very
[145.2s] easily or I think this could fit on a
[147.1s] single hard drive almost today so this
[149.5s] is not a huge amount of data at the end
[151.6s] of the day even though the internet is
[153.3s] very very large we're working with text
[155.5s] and we're also filtering it aggressively
[157.3s] so we end up with about 44 terabytes in
[159.4s] this example so let's take a look at uh
[162.4s] kind of what this data looks like and
[164.7s] what some of these stages uh also are so
[167.3s] the starting point for a lot of these
[168.5s] efforts and something that contributes
[170.5s] most of the data by the end of it is
[172.8s] Data from common crawl so common craw is
[176.0s] an organization that has been basically
[177.5s] scouring the internet since 2007 so as
[180.8s] of 2024 for example common CW has
[183.2s] indexed 2.7 billion web
[185.5s] pages uh and uh they have all these
[188.0s] crawlers going around the internet and
[189.8s] what you end up doing basically is you
[191.1s] start with a few seed web pages and then
[193.3s] you follow all the links and you just
[195.0s] keep following links and you keep
[196.2s] indexing all the information and you end
[197.5s] up with a ton of data of the internet
[199.0s] over time so this is usually the
[201.4s] starting point for a lot of the uh for a
[204.1s] lot of these efforts now this common C
[206.3s] data is quite raw and is filtered in
[208.0s] many many different ways
[210.2s] so here they Pro they document this is
[213.2s] the same diagram they document a little
[215.0s] bit the kind of processing that happens
[216.7s] in these stages so the first thing here
[219.5s] is something called URL
[221.4s] filtering so what that is referring to
[223.9s] is that there's these block
[227.0s] lists of uh basically URLs that are or
[230.4s] domains that uh you don't want to be
[232.6s] getting data from so usually this
[234.3s] includes things like U malware websites
[236.6s] spam websites marketing websites uh
[238.9s] racist websites adult sites and things
[241.0s] like that so there's a ton of different
[242.7s] types of websites that are just
[244.0s] eliminated at this stage because we
[246.3s] don't want them in our data set um the
[248.8s] second part is text extraction you have
[250.8s] to remember that all these web pages
[252.2s] this is the raw HTML of these web pages
[254.5s] that are being saved by these crawlers
[256.8s] so when I go to inspect
[258.6s] here this is what the raw HTML actually
[261.4s] looks like you'll notice that it's got
[263.4s] all this markup uh like lists and stuff
[266.9s] like that and there's CSS and all this
[268.8s] kind of stuff so this is um computer
[271.3s] code almost for these web pages but what
[273.7s] we really want is we just want this text
[275.6s] right we just want the text of this web
[277.2s] page and we don't want the navigation
[278.9s] and things like that so there's a lot of
[280.5s] filtering and processing uh and heris
[282.8s] that go into uh adequately filtering for
[285.2s] just their uh good content of these web
[288.3s] pages the next stage here is language
[290.5s] filtering so for example fine web
[293.8s] filters uh using a language classifier
[296.3s] they try to guess what language every
[298.6s] single web page is in and then they only
[300.5s] keep web pages that have more than 65%
[302.6s] of English as an
[304.1s] example and so you can get a sense that
[306.0s] this is like a design decision that
[307.5s] different companies can uh can uh take
[310.1s] for themselves what fraction of all
[312.8s] different types of languages are we
[314.0s] going to include in our data set because
[315.9s] for example if we filter out all of the
[317.7s] Spanish as an example then you might
[319.4s] imagine that our model later will not be
[321.2s] very good at Spanish because it's just
[322.5s] never seen that much data of that
[324.4s] language and so different companies can
[326.5s] focus on multilingual performance to uh
[328.8s] to a different degree as an example so
[330.9s] fine web is quite focused on English and
[333.4s] so their language model if they end up
[335.0s] training one later will be very good at
[336.8s] English but not may be very good at
[338.5s] other
[339.7s] languages after language filtering
[341.7s] there's a few other filtering steps and
[343.4s] D duplication and things like that um
[347.0s] finishing with for example the pii
[349.1s] removal this is personally identifiable
[352.2s] information so as an example addresses
[354.5s] Social Security numbers and things like
[356.0s] that you would try to detect them and
[357.7s] you would try to filter out those kinds
[359.0s] of web pages from the the data set as
[360.2s] well so there's a lot of stages here and
[362.7s] I won't go into full detail but it is a
[365.3s] fairly extensive part of the
[366.6s] pre-processing and you end up with for
[368.7s] example the fine web data set so when
[370.8s] you click in on it uh you can see some
[372.7s] examples here of what this actually ends
[374.2s] up looking like and anyone can download
[376.2s] this on the huging phase web page and so
[379.3s] here are some examples of the final text
[381.3s] that ends up in the training set so this
[384.4s] is some article about tornadoes in
[387.5s] 2012 um so there's some t tadoes in 2020
[390.9s] in 2012 and what
[393.0s] happened uh this next one is something
[396.3s] about did you know you have two little
[398.6s] yellow 9vt battery sized adrenal glands
[401.0s] in your body okay so this is some kind
[403.7s] of a odd medical
[406.9s] article so just think of these as
[409.0s] basically uh web pages on the internet
[411.6s] filtered just for the text in various
[413.8s] ways and now we have a ton of text 40
[416.8s] terabytes off it and that now is the
[418.8s] starting point for the next step of this
[420.8s] stage now I wanted to give you an
[422.7s] intuitive sense of where we are right
[424.1s] now so I took the first 200 web pages
[426.6s] here and remember we have tons of them
[429.2s] and I just take all that text and I just
[431.2s] put it all together concatenate it and
[433.5s] so this is what we end up with we just
[435.0s] get this just just raw text raw internet
[438.6s] text and there's a ton of it even in
[441.0s] these 200 web pages so I can continue
[442.6s] zooming out here and we just have this
[444.9s] like massive tapestry of Text data and
[448.2s] this text data has all these p patterns
[450.2s] and what we want to do now is we want to
[451.8s] start training neural networks on this
[453.4s] data so the neural networks can
[455.3s] internalize and model how this text
[459.4s] flows right so we just have this giant
[462.6s] texture of text and now we want to get
[465.5s] neural Nets that mimic it okay now
[468.4s] before we plug text into neural networks
[471.3s] we have to decide how we're going to
[472.5s] represent this text uh and how we're
[474.6s] going to feed it in now the way our
[477.2s] technology works for these neuron Lots
[478.6s] is that they expect
[480.0s] a one-dimensional sequence of symbols
[482.9s] and they want a finite set of symbols
[485.9s] that are possible and so we have to
[488.1s] decide what are the symbols and then we
[490.2s] have to represent our data as
[491.8s] one-dimensional sequence of those
[494.2s] symbols so right now what we have is a
[496.4s] onedimensional sequence of text it
[498.6s] starts here and it goes here and then it
[500.7s] comes here Etc so this is a
[502.2s] onedimensional sequence even though on
[503.8s] my monitor of course it's laid out in a
[506.1s] two-dimensional way but it goes from
[507.8s] left to right and top to bottom right so
[509.8s] it's a one-dimensional sequence of text
[512.2s] now this being computers of course
[513.9s] there's an underlying representation
[515.4s] here so if I do what's called utf8 uh
[518.2s] encode this text then I can get the raw
[521.2s] bits that correspond to this text in the
[524.2s] computer and that's what uh that looks
[526.5s] like this so it turns out that for
[530.2s] example this very first bar here is the
[533.3s] first uh eight bits as an
[536.0s] example so what is this thing right this
[539.1s] is um representation that we are looking
[541.9s] for uh in in a certain sense we have
[544.6s] exactly two possible symbols zero and
[546.9s] one and we have a very long sequence of
[550.2s] it right now as it turns out um this
[554.2s] sequence length is actually going to be
[556.5s] very finite and precious resource uh in
[559.2s] our neural network and we actually don't
[561.0s] want extremely long sequences of just
[563.1s] two symbols instead what we want is we
[565.7s] want to trade off uh this um symbol
[570.0s] size uh of this vocabulary as we call it
[572.9s] and the resulting sequence length so we
[575.3s] don't want just two symbols and
[576.4s] extremely long sequences we're going to
[578.7s] want more symbols and shorter sequences
[582.1s] okay so one naive way of compressing or
[584.8s] decreasing the length of our sequence
[586.4s] here is to basically uh consider some
[589.6s] group of consecutive bits for example
[591.9s] eight bits and group them into a single
[595.0s] what's called bite so because uh these
[597.9s] bits are either on or off if we take a
[600.0s] group of eight of them there turns out
[601.8s] to be only 256 possible combinations of
[604.3s] how these bits could be on or off and so
[606.5s] therefore we can re repesent this
[607.9s] sequence into a sequence of bytes
[610.6s] instead so this sequence of bytes will
[613.5s] be eight times shorter but now we have
[616.2s] 256 possible symbols so every number
[619.1s] here goes from 0 to
[620.8s] 255 now I really encourage you to think
[622.9s] of these not as numbers but as unique
[625.2s] IDs or like unique symbols so maybe it's
[628.0s] a bit more maybe it's better to actually
[630.4s] think of these to replace every one of
[632.1s] these with a unique Emoji you'd get
[634.0s] something like this so um we basically
[637.4s] have a sequence of emojis and there's
[639.0s] 256 possible emojis you can think of it
[641.3s] that way now it turns out that in
[644.8s] production for state-of-the-art language
[646.4s] models uh you actually want to go even
[648.4s] Beyond this you want to continue to
[650.3s] shrink the length of the sequence uh
[652.5s] because again it is a precious resource
[654.8s] in return for more symbols in your
[657.7s] vocabulary and the way this is done is
[660.1s] done by running what's called The Bite
[662.2s] pair encoding algorithm and the way this
[664.4s] works is we're basically looking for
[666.4s] consecutive bytes or symbols that are
[670.3s] very common so for example turns out
[673.8s] that the sequence 116 followed by 32 is
[677.0s] quite common and occurs very frequently
[679.0s] so what we're going to do is we're going
[680.3s] to group uh this um pair into a new
[684.5s] symbol so we're going to Mint a symbol
[686.6s] with an ID 256 and we're going to
[688.9s] rewrite every single uh pair 11632 with
[692.6s] this new symbol and then can we can
[694.5s] iterate this algorithm as many times as
[696.2s] we wish and each time when we mint a new
[698.7s] symbol we're decreasing the length and
[700.7s] we're increasing the symbol size and in
[703.6s] practice it turns out that a pretty good
[705.4s] setting of um the basically the
[708.1s] vocabulary size turns out to be about
[709.8s] 100,000 possible symbols so in
[712.3s] particular GPT 4 uses
[715.4s] 100,
[717.0s] 277 symbols
[720.0s] um and this process of converting from
[724.2s] raw text into these symbols or as we
[727.1s] call them tokens is the process called
[730.4s] tokenization so let's now take a look at
[732.8s] how gp4 performs tokenization conting
[735.6s] from text to tokens and from tokens back
[738.2s] to text and what this actually looks
[740.0s] like so one website I like to use to
[741.9s] explore these token representations is
[744.7s] called tick tokenizer and so come here
[747.1s] to the drop down and select CL 100 a
[749.4s] base which is the gp4 base model
[752.4s] tokenizer and here on the left you can
[754.3s] put in text and it shows you the
[756.1s] tokenization of that text so for example
[760.6s] heo space
[763.6s] world so hello world turns out to be
[766.1s] exactly two Tokens The Token hello which
[769.2s] is the token with ID
[771.6s] 15339 and the token space
[774.7s] world that is the token 1
[777.8s] 1917 so um hello space world now if I
[782.4s] was to join these two for example I'm
[784.8s] going to get again two tokens but it's
[786.4s] the token H followed by the token L
[789.7s] world without the
[791.7s] H um if I put in two Spa two spaces here
[795.4s] between hello and world it's again a
[796.8s] different uh tokenization there's a new
[799.2s] token 220
[802.1s] here okay so you can play with this and
[804.2s] see what happens here also keep in mind
[806.7s] this is not uh this is case sensitive so
[808.7s] if this is a capital H it is something
[810.9s] else or if it's uh hello world then
[815.2s] actually this ends up being three tokens
[816.6s] instead of just two
[821.2s] tokens yeah so you can play with this
[823.2s] and get an sort of like an intuitive
[824.7s] sense of uh what these tokens work like
[827.2s] we're actually going to loop around to
[828.4s] tokenization a bit later in the video
[830.1s] for now I just wanted to show you the
[831.2s] website and I wanted to uh show you that
[833.8s] this text basically at the end of the
[836.0s] day so for example if I take one line
[837.8s] here this is what GT4 will see it as so
[841.1s] this text will be a sequence of length
[844.4s] 62 this is the sequence here and this is
[848.2s] how the chunks of text correspond to
[851.5s] these symbols and again there's 100,
[856.6s] 27777 possible symbols and we now have
[859.4s] one-dimensional sequences of those
[861.7s] symbols so um yeah we're going to come
[864.4s] back to tokenization but that's uh for
[866.4s] now where we are okay so what I've done
[868.4s] now is I've taken this uh sequence of
[870.4s] text that we have here in the data set
[872.3s] and I have re-represented it using our
[874.1s] tokenizer into a sequence of tokens and
[877.5s] this is what that looks like now so for
[880.2s] example when we go back to the Fine web
[881.7s] data set they mentioned that not only is
[883.7s] this 44 terab of dis space but this is
[885.8s] about a 15 trillion token sequence of um
[890.6s] in this data set and so here these are
[893.7s] just some of the first uh one or two or
[896.3s] three or a few thousand here I think uh
[898.6s] tokens of this data set but there's 15
[901.3s] trillion here uh to keep in mind and
[903.9s] again keep in mind one more time that
[905.8s] all of these represent little text
[907.2s] chunks they're all just like atoms of
[909.8s] these sequences and the numbers here
[911.7s] don't make any sense they're just uh
[913.2s] they're just unique IDs okay so now we
[917.4s] get to the fun part which is the uh
[919.6s] neural network training and this is
[921.5s] where a lot of the heavy lifting happens
[923.0s] computationally when you're training
[924.6s] these neural networks so what we do here
[928.4s] in this this step is we want to model
[930.8s] the statistical relationships of how
[932.2s] these tokens follow each other in the
[933.8s] sequence so what we do is we come into
[936.1s] the data and we take Windows of tokens
[940.0s] so we take a window of tokens uh from
[943.4s] this data fairly
[944.8s] randomly and um the windows length can
[949.0s] range anywhere anywhere between uh zero
[951.6s] tokens actually all the way up to some
[954.0s] maximum size that we decide on uh so for
[957.1s] example in practice you could see a
[958.5s] token with Windows of say 8,000 tokens
[961.5s] now in principle we can use arbitrary
[963.5s] window lengths of tokens uh but uh
[967.5s] processing very long uh basically U
[971.0s] window sequences would just be very
[972.8s] computationally expensive so we just
[975.0s] kind of decide that say 8,000 is a good
[976.7s] number or 4,000 or 16,000 and we crop it
[979.9s] there now in this example I'm going to
[982.4s] be uh taking the first four tokens just
[985.5s] so everything fits nicely so these
[988.1s] tokens
[990.0s] we're going to take a window of four
[992.3s] tokens this bar view in and space single
[997.5s] which are these token
[999.1s] IDs and now what we're trying to do here
[1001.2s] is we're trying to basically predict the
[1002.6s] token that comes next in the sequence so
[1005.7s] 3962 comes next right so what we do now
[1009.3s] here is that we call this the context
[1012.0s] these four tokens are context and they
[1014.4s] feed into a neural
[1016.1s] network and this is the input to the
[1018.0s] neural network
[1019.9s] now I'm going to go into the detail of
[1021.8s] what's inside this neural network in a
[1023.2s] little bit for now it's important to
[1024.8s] understand is the input and the output
[1026.2s] of the neural net so the input are
[1028.9s] sequences of tokens of variable length
[1032.1s] anywhere between zero and some maximum
[1034.1s] size like 8,000 the output now is a
[1037.6s] prediction for what comes next so
[1041.2s] because our vocabulary has
[1043.8s] 100277 possible tokens the neural
[1046.7s] network is going to Output exactly that
[1048.3s] many numbers
[1049.4s] and all of those numbers correspond to
[1050.9s] the probability of that token as coming
[1053.8s] next in the sequence so it's making
[1055.8s] guesses about what comes
[1057.3s] next um in the beginning this neural
[1059.6s] network is randomly initialized so um
[1062.7s] and we're going to see in a little bit
[1064.2s] what that means but it's a it's a it's a
[1066.5s] random transformation so these
[1068.3s] probabilities in the very beginning of
[1069.8s] the training are also going to be kind
[1071.3s] of random uh so here I have three
[1073.6s] examples but keep in mind that there's
[1075.5s] 100,000 numbers here um so the
[1078.1s] probability of this token space
[1079.7s] Direction neural network is saying that
[1081.7s] this is 4% likely right now 11799 is 2%
[1085.8s] and then here the probility of 3962
[1088.0s] which is post is 3% now of course we've
[1091.4s] sampled this window from our data set so
[1093.9s] we know what comes next we know and
[1096.0s] that's the label we know that the
[1098.1s] correct answer is that 3962 actually
[1100.0s] comes next in the sequence so now what
[1103.0s] we have is this mathematical process for
[1105.9s] doing an update to the neural network we
[1108.1s] have the way of tuning it and uh we're
[1110.7s] going to go into a little bit of of
[1112.0s] detail in a bit but basically we know
[1114.7s] that this probability here of 3% we want
[1118.2s] this probability to be higher and we
[1120.6s] want the probabilities of all the other
[1122.1s] tokens to be
[1124.0s] lower and so we have a way of
[1126.0s] mathematically calculating how to adjust
[1128.9s] and update the neural network so that
[1131.8s] the correct answer has a slightly higher
[1133.6s] probability so if I do an update to the
[1135.7s] neural network now the next time I Fe
[1139.2s] this particular sequence of four tokens
[1140.8s] into neural network the neural network
[1142.7s] will be slightly adjusted now and it
[1144.0s] will say Okay post is maybe 4% and case
[1147.1s] now maybe is
[1148.6s] 1% and uh Direction could become 2% or
[1152.1s] something like that and so we have a way
[1154.1s] of nudging of slightly updating the
[1156.2s] neuronet to um basically give a higher
[1159.7s] probability to the correct token that
[1161.0s] comes next in the sequence and now you
[1163.0s] just have to remember that this process
[1165.6s] happens not just for uh this um token
[1169.5s] here where these four fed in and
[1171.3s] predicted this one this process happens
[1173.8s] at the same time for all of these tokens
[1176.3s] in the entire data set and so in
[1178.3s] practice we sample little windows little
[1180.3s] batches of Windows and then at every
[1182.6s] single one of these tokens we want to
[1184.8s] adjust our neural network so that the
[1186.6s] probability of that token becomes
[1188.0s] slightly higher and this all happens in
[1190.2s] parallel in large batches of these
[1192.2s] tokens and this is the process of
[1194.3s] training the neural network it's a
[1195.8s] sequence of updating it so that it's
[1198.8s] predictions match up the statistics of
[1201.5s] what actually happens in your training
[1202.7s] set and its probabilities become
[1205.3s] consistent with the uh statistical
[1208.0s] patterns of how these tokens follow each
[1209.8s] other in the data so let's now briefly
[1212.0s] get into the internals of these neural
[1213.4s] networks just to give you a sense of
[1214.9s] what's inside so neural network
[1217.4s] internals so as I mentioned we have
[1219.5s] these inputs uh that are sequences of
[1222.2s] tokens in this case this is four input
[1224.8s] tokens but this can be anywhere between
[1226.7s] zero up to let's say 8,000 tokens in
[1230.3s] principle this can be an infinite number
[1231.8s] of tokens we just uh it would just be
[1234.0s] too computationally expensive to process
[1235.9s] an infinite number of tokens so we just
[1237.7s] crop it at a certain length and that
[1239.3s] becomes the maximum context length of
[1241.2s] that uh
[1242.6s] model now these inputs X are mixed up in
[1246.2s] a giant mathematical expression together
[1248.9s] with the parameters or the weights of
[1251.8s] these neural networks so here I'm
[1253.7s] showing six example parameters and their
[1256.6s] setting but in practice these uh um
[1260.4s] modern neural networks will have
[1261.8s] billions of these uh parameters and in
[1264.8s] the beginning these parameters are
[1266.3s] completely randomly set now with a
[1269.0s] random setting of parameters you might
[1271.1s] expect that this uh this neural network
[1273.6s] would make random predictions and it
[1275.4s] does in the beginning it's totally
[1276.8s] random predictions but it's through this
[1279.6s] process of iteratively updating the
[1282.6s] network uh as and we call that process
[1284.9s] training a neural network so uh that the
[1288.1s] setting of these parameters gets
[1289.4s] adjusted such that the outputs of our
[1291.7s] neural network becomes consistent with
[1294.0s] the patterns seen in our training
[1296.3s] set so think of these parameters as kind
[1299.2s] of like knobs on a DJ set and as you're
[1301.3s] twiddling these knobs you're getting
[1303.0s] different uh predictions for every
[1305.4s] possible uh token sequence input and
[1309.2s] training in neural network just means
[1310.6s] discovering a setting of parameters that
[1312.8s] seems to be consistent with the
[1314.7s] statistics of the training
[1316.4s] set now let me just give you an example
[1318.7s] what this giant mathematical expression
[1319.9s] looks like just to give you a sense and
[1321.7s] modern networks are massive expressions
[1323.8s] with trillions of terms probably but let
[1326.0s] me just show you a simple example here
[1328.6s] it would look something like this I mean
[1330.1s] these are the kinds of Expressions just
[1331.4s] to show you that it's not very scary we
[1333.7s] have inputs x uh like X1 x2 in this case
[1337.2s] two example inputs and they get mixed up
[1340.0s] with the weights of the network w0 W1 2
[1342.9s] 3 Etc and this mixing is simple things
[1347.0s] like multiplication addition addition
[1349.5s] exponentiation division Etc and it is
[1352.8s] the subject of neural network
[1354.0s] architecture research to design
[1356.6s] effective mathematical Expressions uh
[1359.3s] that have a lot of uh kind of convenient
[1361.3s] characteristics they are expressive
[1363.0s] they're optimizable they're paralyzable
[1365.3s] Etc and so but uh at the end of the day
[1368.2s] these are these are not complex
[1369.6s] expressions and basically they mix up
[1372.1s] the inputs with the parameters to make
[1374.3s] predictions and we're optimizing uh the
[1377.6s] parameters of this neural network so
[1379.6s] that the predictions come out consistent
[1381.7s] with the training set now I would like
[1384.2s] to show you an actual production grade
[1386.2s] example of what these neural networks
[1387.9s] look like so for that I encourage you to
[1389.8s] go to this website that has a very nice
[1391.8s] visualization of one of these
[1394.0s] networks so this is what you will find
[1396.4s] on this website and this neural network
[1399.5s] here that is used in production settings
[1401.7s] has this special kind of structure this
[1404.1s] network is called the Transformer and
[1406.7s] this particular one as an example has 8
[1408.6s] 5,000 roughly
[1410.9s] parameters now here on the top we take
[1413.2s] the inputs which are the token
[1416.2s] sequences and then information flows
[1419.8s] through the neural network until the
[1421.9s] output which here are the logit softmax
[1425.2s] but these are the predictions for what
[1426.5s] comes next what token comes
[1428.7s] next and then here there's a sequence of
[1432.4s] Transformations and all these
[1434.2s] intermediate values that get produced
[1436.4s] inside this mathematical expression s it
[1438.6s] is sort of predicting what comes next so
[1441.2s] as an example these tokens are embedded
[1444.6s] into kind of like this distributed
[1446.0s] representation as it's called so every
[1448.0s] possible token has kind of like a vector
[1450.2s] that represents it inside the neural
[1451.9s] network so first we embed the tokens and
[1455.3s] then those values uh kind of like flow
[1458.2s] through this diagram and these are all
[1460.7s] very simple mathematical Expressions
[1462.1s] individually so we have layer norms and
[1464.2s] Matrix multiplications and uh soft Maxes
[1467.1s] and so on so here kind of like the
[1468.9s] attention block of this Transformer and
[1471.7s] then information kind of flows through
[1473.6s] into the multi-layer perceptron block
[1475.5s] and so on and all these numbers here
[1478.6s] these are the intermediate values of the
[1480.2s] expression and uh you can almost think
[1482.2s] of these as kind of like the firing
[1484.8s] rates of these synthetic neurons but I
[1487.9s] would caution you to uh not um kind of
[1490.4s] think of it too much like neurons
[1492.6s] because these are extremely simple
[1493.8s] neurons compared to the neurons you
[1495.3s] would find in your brain your biological
[1497.2s] neurons are very complex dynamical
[1499.0s] processes that have memory and so on
[1501.0s] there's no memory in this expression
[1502.9s] it's a fixed mathematical expression
[1504.4s] from input to Output with no memory it's
[1506.5s] just a
[1507.4s] stateless so these are very simple
[1509.3s] neurons in comparison to biological
[1510.7s] neurons but you can still kind of
[1512.2s] loosely think of this as like a
[1513.6s] synthetic piece of uh brain tissue if
[1516.0s] you if you like uh to think about it
[1517.7s] that way so information flows through
[1521.1s] all these neurons fire until we get to
[1524.3s] the predictions now I'm not actually
[1526.7s] going to dwell too much on the precise
[1528.8s] kind of like mathematical details of all
[1530.3s] these Transformations honestly I don't
[1532.0s] think it's that important to get into
[1533.9s] what's really important to understand is
[1535.3s] that this is a mathematical function it
[1538.4s] is uh parameterized by some fixed set of
[1541.5s] parameters like say 85,000 of them and
[1544.0s] it is a way of transforming inputs into
[1546.0s] outputs and as we twiddle the parameters
[1548.7s] we are getting uh different kinds of
[1550.6s] predictions and then we need to find a
[1552.6s] good setting of these parameters so that
[1554.4s] the predictions uh sort of match up with
[1556.6s] the patterns seen in training set
[1559.2s] so that's the Transformer okay so I've
[1562.2s] shown you the internals of the neural
[1563.5s] network and we talked a bit about the
[1565.1s] process of training it I want to cover
[1567.4s] one more major stage of working with
[1570.0s] these networks and that is the stage
[1571.9s] called inference so in inference what
[1574.2s] we're doing is we're generating new data
[1576.2s] from the model and so uh we want to
[1579.0s] basically see what kind of patterns it
[1581.0s] has internalized in the parameters of
[1583.2s] its Network so to generate from the
[1586.7s] model is relatively straightforward
[1588.9s] we start with some tokens that are
[1590.8s] basically your prefix like what you want
[1592.6s] to start with so say we want to start
[1594.4s] with the token 91 well we feed it into
[1597.2s] the
[1597.9s] network and remember that the network
[1599.9s] gives us probabilities right it gives us
[1603.2s] this probability Vector here so what we
[1605.3s] can do now is we can basically flip a
[1607.2s] biased coin so um we can sample uh
[1612.1s] basically a token based on this
[1614.9s] probability distribution so the tokens
[1617.2s] that are given High probability by the
[1619.4s] model are more likely to be sampled when
[1621.6s] you flip this biased coin you can think
[1623.7s] of it that way so we sample from the
[1625.8s] distribution to get a single unique
[1628.0s] token so for example token 860 comes
[1631.4s] next uh so 860 in this case when we're
[1634.1s] generating from model could come next
[1636.1s] now 860 is a relatively likely token it
[1638.7s] might not be the only possible token in
[1640.5s] this case there could be many other
[1641.7s] tokens that could have been sampled but
[1643.6s] we could see that 86c is a relatively
[1645.2s] likely token as an example and indeed in
[1647.5s] our training examp example here 860 does
[1649.8s] follow 91 so let's now say that we um
[1654.2s] continue the process so after 91 there's
[1656.7s] a60 we append it and we again ask what
[1659.4s] is the third token let's sample and
[1662.0s] let's just say that it's 287 exactly as
[1664.6s] here let's do that again we come back in
[1667.9s] now we have a sequence of three and we
[1670.0s] ask what is the likely fourth token and
[1672.6s] we sample from that and get this one and
[1675.6s] now let's say we do it one more time we
[1678.2s] take those four we sample and we get
[1680.4s] this one and this
[1682.9s] 13659 uh this is not actually uh 3962 as
[1686.9s] we had before so this token is the token
[1689.6s] article uh instead so viewing a single
[1692.8s] article and so in this case we didn't
[1695.5s] exactly reproduce the sequence that we
[1697.4s] saw here in the training data so keep in
[1700.3s] mind that these systems are stochastic
[1702.6s] they have um we're sampling and we're
[1705.8s] flipping coins and sometimes we lock out
[1708.8s] and we reproduce some like small chunk
[1710.9s] of the text and training set but
[1712.8s] sometimes we're uh we're getting a token
[1715.6s] that was not verbatim part of any of the
[1718.3s] documents in the training data so we're
[1720.4s] going to get sort of like remixes of the
[1723.2s] data that we saw in the training because
[1724.9s] at every step of the way we can flip and
[1727.1s] get a slightly different token and then
[1728.7s] once that token makes it in if you
[1730.6s] sample the next one and so on you very
[1732.6s] quickly uh start to generate token
[1735.1s] streams that are very different from the
[1737.1s] token streams that UR
[1738.7s] in the training documents so
[1740.6s] statistically they will have similar
[1742.2s] properties but um they are not identical
[1745.2s] to your training data they're kind of
[1746.7s] like inspired by the training data and
[1749.1s] so in this case we got a slightly
[1750.6s] different sequence and why would we get
[1752.8s] article you might imagine that article
[1754.8s] is a relatively likely token in the
[1756.8s] context of bar viewing single Etc and
[1761.0s] you can imagine that the word article
[1762.3s] followed this context window somewhere
[1764.7s] in the training documents uh to some
[1766.8s] extent and we just happen to sample it
[1768.8s] here at that stage so basically
[1771.2s] inference is just uh predicting from
[1773.5s] these distributions one at a time we
[1775.4s] continue feeding back tokens and getting
[1777.3s] the next one and we uh we're always
[1780.0s] flipping these coins and depending on
[1782.0s] how lucky or unlucky we get um we might
[1785.6s] get very different kinds of patterns
[1787.0s] depending on how we sample from these
[1789.4s] probability distributions so that's
[1791.6s] inference so in most common scenarios uh
[1795.3s] basically downloading the internet and
[1797.2s] tokenizing it is is a pre-processing
[1798.8s] step you do that a single time and then
[1801.9s] uh once you have your token sequence we
[1804.3s] can start training networks and in
[1806.6s] Practical cases you would try to train
[1808.6s] many different networks of different
[1810.0s] kinds of uh settings and different kinds
[1812.0s] of arrangements and different kinds of
[1813.5s] sizes and so you''ll be doing a lot of
[1815.4s] neural network training and um then once
[1818.1s] you have a neural network and you train
[1819.5s] it and you have some specific set of
[1821.6s] parameters that you're happy with um
[1824.2s] then you can take the model and you can
[1825.8s] do inference and you can actually uh
[1828.0s] generate data from the model and when
[1830.1s] you're on chat GPT and you're talking
[1831.5s] with a model uh that model is trained
[1833.8s] and has been trained by open aai many
[1836.4s] months ago probably and they have a
[1838.6s] specific set of Weights that work well
[1841.4s] and when you're talking to the model all
[1842.9s] of that is just inference there's no
[1844.6s] more training those parameters are held
[1847.1s] fixed and you're just talking to the
[1849.3s] model sort of uh you're giving it some
[1851.6s] of the tokens and it's kind of
[1853.2s] completing token sequences and that's
[1854.9s] what you're seeing uh generated when you
[1857.1s] actually use the model on CH GPT so that
[1859.6s] model then just does inference alone so
[1862.3s] let's now look at an example of training
[1864.1s] an inference that is kind of concrete
[1866.0s] and gives you a sense of what this
[1867.0s] actually looks like uh when these models
[1868.8s] are trained now the example that I would
[1871.0s] like to work with and that I'm
[1872.2s] particularly fond of is that of opening
[1874.3s] eyes gpt2 so GPT uh stands for
[1877.4s] generatively pre-trained Transformer and
[1879.6s] this is the second iteration of the GPT
[1881.5s] series by open AI when you are talking
[1884.0s] to chat GPT today the model that is
[1886.3s] underlying all of the magic of that
[1887.9s] interaction is GPT 4 so the fourth
[1890.4s] iteration of that series now gpt2 was
[1893.1s] published in 2019 by openi in this paper
[1896.4s] that I have right here and the reason I
[1899.1s] like gpt2 is that it is the first time
[1901.9s] that a recognizably modern stack came
[1904.5s] together so um all of the pieces of gpd2
[1908.8s] are recognizable today by modern
[1910.6s] standards it's just everything has
[1912.0s] gotten bigger now I'm not going to be
[1914.2s] able to go into the full details of this
[1915.8s] paper of course because it is a
[1917.1s] technical publication but some of the
[1919.5s] details that I would like to highlight
[1920.8s] are as follows gpt2 was a Transformer
[1923.6s] neural network just like you were just
[1925.7s] like the neural networks you would work
[1927.0s] with today it was it had 1.6 billion
[1930.3s] parameters right so these are the
[1932.2s] parameters that we looked at here it
[1934.4s] would have 1.6 billion of them today
[1937.0s] modern Transformers would have a lot
[1938.3s] closer to a trillion or several hundred
[1940.3s] billion
[1941.9s] probably the maximum context length here
[1945.0s] was 1,24 tokens so it is when we are
[1948.9s] sampling chunks of Windows of tokens
[1952.2s] from the data set we're never taking
[1954.4s] more than 1,24 tokens and so when you
[1956.7s] are trying to predict the next token in
[1958.0s] a sequence you will never have more than
[1960.0s] 1,24 tokens uh kind of in your context
[1963.2s] in order to make that prediction now
[1965.5s] this is also tiny by modern standards
[1967.4s] today the token uh the context lengths
[1969.5s] would be a lot closer to um couple
[1973.0s] hundred thousand or maybe even a million
[1975.0s] and so you have a lot more context a lot
[1976.8s] more tokens in history history and you
[1978.6s] can make a lot better prediction about
[1980.0s] the next token in the sequence in that
[1982.0s] way and finally gpt2 was trained on
[1984.6s] approximately 100 billion tokens and
[1986.8s] this is also fairly small by modern
[1988.6s] standards as I mentioned the fine web
[1990.3s] data set that we looked at here the fine
[1992.3s] web data set has 15 trillion tokens uh
[1994.8s] so 100 billion is is quite
[1997.0s] small
[1998.8s] now uh I actually tried to reproduce uh
[2001.4s] gpt2 for fun as part of this project
[2003.6s] called lm. C so you can see my rup of
[2007.1s] doing that in this post on GitHub under
[2010.0s] the lm. C repository so in particular
[2013.5s] the cost of training gpd2 in 2019 what
[2017.0s] was estimated to be approximately
[2019.3s] $40,000 but today you can do
[2021.3s] significantly better than that and in
[2022.8s] particular here it took about one day
[2025.6s] and about
[2027.2s] $600 uh but this wasn't even trying too
[2029.7s] hard I think you could really bring this
[2031.5s] down to about $100 today now why is it
[2035.5s] that the costs have come down so much
[2037.4s] well number one these data sets have
[2039.4s] gotten a lot better and the way we
[2041.4s] filter them extract them and prepare
[2043.3s] them has gotten a lot more refined and
[2045.7s] so the data set is of just a lot higher
[2048.0s] quality so that's one thing but really
[2050.4s] the biggest difference is that our
[2051.8s] computers have gotten much faster in
[2053.9s] terms of the hardware and we're going to
[2055.4s] look at that in a second and also the
[2057.4s] software for uh running these models and
[2060.4s] really squeezing out all all the speed
[2062.5s] from the hardware as it is possible uh
[2065.5s] that software has also gotten much
[2067.1s] better as as everyone has focused on
[2068.7s] these models and try to run them very
[2070.2s] very
[2071.5s] quickly now I'm not going to be able to
[2074.5s] go into the full detail of this gpd2
[2076.5s] reproduction and this is a long
[2077.8s] technical post but I would like to still
[2079.9s] give you an intuitive sense for what it
[2081.6s] looks like to actually train one of
[2083.2s] these models as a researcher like what
[2084.9s] are you looking at and what does it look
[2086.2s] like what does it feel like so let me
[2088.0s] give you a sense of that a little bit
[2090.0s] okay so this is what it looks like let
[2091.2s] me slide this
[2093.0s] over so what I'm doing here is I'm
[2095.6s] training a gpt2 model right now
[2098.7s] and um what's happening here is that
[2101.0s] every single line here like this one is
[2105.0s] one update to the model so remember how
[2108.8s] here we are um basically making the
[2112.0s] prediction better for every one of these
[2114.0s] tokens and we are updating these weights
[2115.9s] or parameters of the neural net so here
[2118.6s] every single line is One update to the
[2120.7s] neural network where we change its
[2122.5s] parameters by a little bit so that it is
[2124.3s] better at predicting next token and
[2126.1s] sequence in particular every single line
[2128.6s] here is improving the prediction on 1
[2132.5s] million tokens in the training set so
[2135.9s] we've basically taken 1 million tokens
[2139.1s] out of this data set and we've tried to
[2141.8s] improve the prediction of that token as
[2144.8s] coming next in a sequence on all 1
[2146.9s] million of them
[2149.1s] simultaneously and at every single one
[2151.2s] of these steps we are making an update
[2152.6s] to the network for that now the number
[2155.5s] to watch closely is this number called
[2157.9s] loss and the loss is a single number
[2160.8s] that is telling you how well your neural
[2162.8s] network is performing right now and it
[2165.3s] is created so that low loss is good so
[2168.8s] you'll see that the loss is decreasing
[2170.8s] as we make more updates to the neural
[2172.6s] nut which corresponds to making better
[2174.3s] predictions on the next token in a
[2176.4s] sequence and so the loss is the number
[2179.3s] that you are watching as a neural
[2180.7s] network researcher and you are kind of
[2182.6s] waiting you're twiddling your thumbs uh
[2184.5s] you're drinking coffee and you're making
[2186.5s] sure that this looks good so that with
[2189.0s] every update your loss is improving and
[2191.1s] the network is getting better at
[2193.0s] prediction now here you see that we are
[2196.0s] processing 1 million tokens per update
[2198.8s] each update takes about 7 Seconds
[2201.2s] roughly and here we are going to process
[2203.8s] a total of 32,000 steps of
[2207.0s] optimization so 32,000 steps with 1
[2210.3s] million tokens each is about 33 billion
[2212.7s] tokens that we are going to process and
[2214.7s] we're currently only about 420 step 20
[2218.0s] out of 32,000 so we are still only a bit
[2221.3s] more than 1% done because I've only been
[2223.4s] running this for 10 or 15 minutes or
[2225.0s] something like
[2226.3s] that now every 20 steps I have
[2229.2s] configured this optimization to do
[2231.2s] inference so what you're seeing here is
[2233.4s] the model is predicting the next token
[2235.1s] in a sequence and so you sort of start
[2237.5s] it randomly and then you continue
[2239.6s] plugging in the tokens so we're running
[2241.6s] this inference step and this is the
[2243.7s] model sort of predicting the next token
[2245.0s] in the sequence and every time you see
[2246.3s] something appear that's a new
[2249.2s] token um so let's just look at this and
[2254.4s] you can see that this is not yet very
[2255.7s] coherent and keep in mind that this is
[2257.4s] only 1% of the way through training and
[2259.6s] so the model is not yet very good at
[2261.3s] predicting the next token in the
[2262.4s] sequence so what comes out is actually
[2264.8s] kind of a little bit of gibberish right
[2267.1s] but it still has a little bit of like
[2268.6s] local coherence so since she is mine
[2271.6s] it's a part of the information should
[2273.1s] discuss my father great companions
[2276.0s] Gordon showed me sitting over at and Etc
[2279.1s] so I know it doesn't look very good but
[2281.0s] let's actually scroll up and see what it
[2284.1s] looked like when I started the
[2286.0s] optimization so all the way here at
[2290.4s] step
[2292.1s] one so after 20 steps of optimization
[2295.3s] you see that what we're getting here is
[2297.0s] looks completely random and of course
[2298.5s] that's because the model has only had 20
[2300.3s] updates to its parameters and so it's
[2302.1s] giving you random text because it's a
[2303.6s] random Network and so you can see that
[2305.9s] at least in comparison to this model is
[2307.9s] starting to do much better and indeed if
[2309.9s] we waited the entire 32,000 steps the
[2312.6s] model will have improved the point that
[2314.2s] it's actually uh generating fairly
[2316.4s] coherent English uh and the tokens
[2319.0s] stream correctly um and uh they they
[2322.0s] kind of make up English a a lot
[2324.8s] better
[2326.4s] um so this has to run for about a day or
[2329.6s] two more now and so uh at this stage we
[2332.4s] just make sure that the loss is
[2333.5s] decreasing everything is looking good um
[2336.4s] and we just have to wait
[2338.1s] and now um let me turn now to the um
[2342.1s] story of the computation that's required
[2345.1s] because of course I'm not running this
[2346.5s] optimization on my laptop that would be
[2348.5s] way too expensive uh because we have to
[2351.0s] run this neural network and we have to
[2352.9s] improve it and we have we need all this
[2354.6s] data and so on so you can't run this too
[2356.8s] well on your computer uh because the
[2358.8s] network is just too large uh so all of
[2361.3s] this is running on the computer that is
[2363.1s] out there in the cloud and I want to
[2365.1s] basically address the compute side of
[2367.0s] the store of training these models and
[2368.7s] what that looks like so let's take a
[2370.2s] look okay so the computer that I'm
[2372.2s] running this optimization on is this 8X
[2375.5s] h100 node so there are eight h100s in a
[2379.6s] single node or a single computer now I
[2382.6s] am renting this computer and it is
[2384.4s] somewhere in the cloud I'm not sure
[2385.6s] where it is physically actually the
[2387.7s] place I like to rent from is called
[2389.1s] Lambda but there are many other
[2390.2s] companies who provide this service so
[2392.3s] when you scroll down you can see that uh
[2395.7s] they have some on demand pricing for
[2397.9s] um sort of computers that have these uh
[2401.6s] h100s which are gpus and I'm going to
[2403.6s] show you what they look like in a second
[2406.0s] but on demand 8times Nvidia h100 uh
[2410.2s] GPU this machine comes for $3 per GPU
[2413.8s] per hour for example so you can rent
[2416.8s] these and then you get a machine in a
[2418.3s] cloud and you can uh go in and you can
[2420.2s] train these
[2421.6s] models and these uh gpus they look like
[2425.7s] this so this is one h100 GPU uh this is
[2429.1s] kind of what it looks like and you slot
[2430.5s] this into your computer and gpus are
[2432.7s] this uh perfect fit for training your
[2434.8s] networks because they are very
[2436.6s] computationally expensive but they
[2438.5s] display a lot of parallelism in the
[2440.6s] computation so you can have many
[2442.2s] independent workers kind of um working
[2444.7s] all at the same time in solving uh the
[2448.2s] matrix multiplication that's under the
[2450.6s] hood of training these neural
[2452.9s] networks so this is just one of these
[2454.8s] h100s but actually you would put them
[2456.8s] you would put multiple of them together
[2458.7s] so you could stack eight of them into a
[2460.4s] single node and then you can stack
[2462.3s] multiple nodes into an entire data
[2464.2s] center or an entire system
[2467.2s] so when we look at a data
[2472.3s] center can't spell when we look at a
[2475.0s] data center we start to see things that
[2476.4s] look like this right so we have one GPU
[2478.3s] goes to eight gpus goes to a single
[2479.9s] system goes to many systems and so these
[2482.4s] are the bigger data centers and there of
[2484.0s] course would be much much more expensive
[2486.4s] um and what's happening is that all the
[2488.6s] big tech companies really desire these
[2491.2s] gpus so they can train all these
[2493.1s] language models because they are so
[2495.1s] powerful and that has is fundamentally
[2497.3s] what has driven the stock price of
[2498.9s] Nvidia to be $3.4 trillion today as an
[2502.0s] example and why Nvidia has kind of
[2504.9s] exploded so this is the Gold Rush the
[2507.1s] Gold Rush is getting the gpus getting
[2510.1s] enough of them so they can all
[2512.0s] collaborate to perform this optimization
[2515.4s] and they're what are they all doing
[2516.9s] they're all collaborating to predict the
[2519.1s] next token on a data set like the fine
[2521.7s] web data
[2523.0s] set this is the computational workflow
[2525.3s] that that basically is extremely
[2526.8s] expensive the more gpus you have the
[2529.1s] more tokens you can try to predict and
[2530.9s] improve on and you're going to process
[2532.9s] this data set faster and you can iterate
[2535.2s] faster and get a bigger Network and
[2536.6s] train a bigger Network and so on so this
[2539.3s] is what all those machines are look like
[2541.0s] are uh are doing and this is why all of
[2544.4s] this is such a big deal and for example
[2546.6s] this is a
[2548.6s] article from like about a month ago or
[2550.1s] so this is why it's a big deal that for
[2551.9s] example Elon Musk is getting 100,000
[2554.6s] gpus uh in a single Data Center and all
[2558.6s] of these gpus are extremely expensive
[2560.7s] are going to take a ton of power and all
[2562.4s] of them are just trying to predict the
[2563.5s] next token in the sequence and improve
[2565.6s] the network uh by doing so and uh get
[2569.2s] probably a lot more coherent text than
[2570.8s] what we're seeing here a lot faster okay
[2573.0s] so unfortunately I do not have a couple
[2575.0s] 10 or hundred million of dollars to
[2577.2s] spend on training a really big model
[2579.2s] like this but luckily we can turn to
[2581.3s] some big tech companies who train these
[2584.0s] models routinely and release some of
[2586.5s] them once they are done training so
[2588.6s] they've spent a huge amount of compute
[2590.4s] to train this network and they release
[2592.1s] the network at the end of the
[2593.5s] optimization so it's very useful because
[2595.6s] they've done a lot of compute for that
[2598.0s] so there are many companies who train
[2599.2s] these models routinely but actually not
[2601.3s] many of them release uh these what's
[2603.4s] called base models so the model that
[2605.6s] comes out at the end here is is what's
[2607.4s] called a base model what is a base model
[2609.8s] it's a token simulator right it's an
[2612.0s] internet text token simulator and so
[2615.7s] that is not by itself useful yet because
[2618.2s] what we want is what's called an
[2619.5s] assistant we want to ask questions and
[2621.4s] have it respond to answers these models
[2623.5s] won't do that they just uh create sort
[2625.6s] of remixes of the internet they dream
[2628.8s] internet pages so the base models are
[2631.6s] not very often released because they're
[2632.9s] kind of just only a step one of a few
[2635.2s] other steps that we still need to take
[2636.4s] to get in system
[2638.3s] however a few releases have been made so
[2641.4s] as an example the gbt2 model released
[2644.4s] the 1.6 billion sorry 1.5 billion model
[2648.1s] back in 2019 and this gpt2 model is a
[2650.7s] base model now what is a model release
[2653.8s] what does it look like to release these
[2655.1s] models so this is the gpt2 repository on
[2658.2s] GitHub well you need two things
[2660.1s] basically to release model number one we
[2663.0s] need the um python code usually that
[2667.8s] describes the sequence of operations in
[2670.0s] detail that they make in their model so
[2675.0s] um if you remember
[2676.6s] back this
[2678.8s] Transformer the sequence of steps that
[2680.8s] are taken here in this neural network is
[2682.9s] what is being described by this code so
[2685.9s] this code is sort of implementing the
[2687.4s] what's called forward pass of this
[2689.0s] neural network so we need the specific
[2691.9s] details of exactly how they wired up
[2693.6s] that neural network so this is just
[2695.9s] computer code and it's usually just a
[2697.4s] couple hundred lines of code it's not
[2699.1s] it's not that crazy and uh this is all
[2701.6s] fairly understandable and usually fairly
[2703.2s] standard what's not standard are the
[2705.2s] parameters that's where the actual value
[2707.3s] is what are the parameters of this
[2709.3s] neural network because there's 1.6
[2711.3s] billion of them and we need the correct
[2713.1s] setting or a really good setting and so
[2715.6s] that's why in addition to this source
[2717.6s] code they release the parameters which
[2720.0s] in this case is roughly 1.5 billion
[2723.0s] parameters and these are just numbers so
[2725.2s] it's one single list of 1.5 billion
[2727.5s] numbers the precise and good setting of
[2730.3s] all the knobs such that the tokens come
[2732.4s] out
[2734.0s] well so uh you need those two things to
[2737.4s] get a base model
[2739.5s] release
[2741.3s] now gpt2 was released but that's
[2743.7s] actually a fairly old model as I
[2745.0s] mentioned so actually the model we're
[2746.4s] going to turn to is called llama 3 and
[2749.2s] that's the one that I would like to show
[2750.2s] you next so llama 3 so gpt2 again was
[2754.3s] 1.6 billion parameters trained on 100
[2756.0s] billion tokens Lama 3 is a much bigger
[2758.6s] model and much more modern model it is
[2760.6s] released and trained by meta and it is a
[2764.0s] 45 billion parameter model trained on 15
[2767.2s] trillion tokens in very much the same
[2769.5s] way just much much
[2771.1s] bigger um and meta has also made a
[2775.0s] release of llama 3 and that was part of
[2778.0s] this
[2779.0s] paper so with this paper that goes into
[2781.3s] a lot of detail the biggest base model
[2783.3s] that they released is the Lama 3.1 4.5
[2787.1s] 405 billion parameter model so this is
[2790.2s] the base model and then in addition to
[2792.1s] the base model you see here
[2793.5s] foreshadowing for later sections of the
[2795.3s] video they also released the instruct
[2797.4s] model and the instruct means that this
[2799.4s] is an assistant you can ask it questions
[2801.4s] and it will give you answers we still
[2803.1s] have yet to cover that part later for
[2805.2s] now let's just look at this base model
[2807.0s] this token simulator and let's play with
[2809.4s] it and try to think about you know what
[2811.1s] is this thing and how does it work and
[2813.5s] um what do we get at the end of this
[2815.1s] optimization if you let this run Until
[2817.2s] the End uh for a very big neural network
[2819.8s] on a lot of data so my favorite place to
[2822.2s] interact with the base models is this um
[2825.0s] company called hyperbolic which is
[2826.8s] basically serving the base model of the
[2829.8s] 405b Llama 3.1 so when you go to the
[2833.4s] website and I think you may have to
[2834.6s] register and so on make sure that in the
[2836.5s] models make sure that you are using
[2838.6s] llama 3.1 405 billion base it must be
[2842.2s] the base model and then here let's say
[2844.9s] the max tokens is how many tokens we're
[2846.4s] going to be gener rating so let's just
[2848.1s] decrease this to be a bit less just so
[2850.0s] we don't waste compute we just want the
[2852.2s] next 128 tokens and leave the other
[2854.7s] stuff alone I'm not going to go into the
[2856.0s] full detail here um now fundamentally
[2859.0s] what's going to happen here is identical
[2861.6s] to what happens here during inference
[2863.2s] for us so this is just going to continue
[2865.4s] the token sequence of whatever you
[2867.2s] prefix you're going to give it so I want
[2869.8s] to first show you that this model here
[2871.8s] is not yet an assistant so you can for
[2873.9s] example ask it what is 2 plus 2 it's not
[2876.2s] going to tell you oh it's four uh what
[2878.2s] else can I help you with it's not going
[2879.7s] to do that because what is 2 plus 2 is
[2882.6s] going to be tokenized and then those
[2885.3s] tokens just act as a prefix and then
[2887.9s] what the model is going to do now is
[2889.1s] just going to get the probability for
[2890.3s] the next token and it's just a glorified
[2892.3s] autocomplete it's a very very expensive
[2894.5s] autocomplete of what comes next um
[2897.4s] depending on the statistics of what it
[2899.0s] saw in its training documents which are
[2900.8s] basically web
[2902.0s] pages so let's just uh hit enter to see
[2905.2s] what tokens it comes up with as a
[2911.0s] continuation okay so here it kind of
[2912.9s] actually answered the question and
[2914.1s] started to go off into some
[2915.2s] philosophical territory uh let's try it
[2917.7s] again so let me copy and paste and let's
[2919.7s] try again from scratch what is 2 plus
[2925.6s] two so okay so it just goes off again so
[2929.6s] notice one more thing that I want to
[2930.8s] stress is that the system uh I think
[2933.7s] every time you put it in it just kind of
[2935.3s] starts from scratch
[2938.2s] so it doesn't uh the system here is
[2939.7s] stochastic so for the same prefix of
[2942.4s] tokens we're always getting a different
[2944.0s] answer and the reason for that is that
[2946.5s] we get this probity distribution and we
[2948.6s] sample from it and we always get
[2950.4s] different samples and we sort of always
[2951.9s] go into a different territory uh
[2953.9s] afterwards so here in this case um I
[2958.0s] don't know what this is let's try one
[2959.4s] more
[2963.0s] time so it just continues on so it's
[2965.4s] just doing the stuff that it's saw on
[2966.5s] the internet right um and it's just kind
[2969.2s] of like regurgitating those uh
[2971.0s] statistical
[2972.8s] patterns so first things it's not an
[2975.7s] assistant yet it's a token autocomplete
[2978.9s] and second it is a stochastic system now
[2982.2s] the crucial thing is that even though
[2984.3s] this model is not yet by itself very
[2986.4s] useful for a lot of applications just
[2989.0s] yet um it is still very useful because
[2992.7s] in the task of predicting the next token
[2994.4s] in the sequence the model has learned a
[2996.8s] lot about the world and it has stored
[2999.4s] all that knowledge in the parameters of
[3001.2s] the network so remember that our text
[3004.3s] looked like this right internet web
[3006.2s] pages and now all of this is sort of
[3009.0s] compressed in the weights of the network
[3011.4s] so you can think of um these 405 billion
[3015.1s] parameters is a kind of compression of
[3016.8s] the internet you can think of the
[3019.2s] 45 billion parameters is kind of like a
[3021.7s] zip file uh but it's not a loss less
[3025.1s] compression it's a loss C compression
[3027.2s] we're kind of like left with kind of a
[3028.8s] gal of the internet and we can generate
[3031.1s] from it right now we can elicit some of
[3034.0s] this knowledge by prompting the base
[3035.9s] model uh accordingly so for example
[3038.4s] here's a prompt that might work to
[3040.3s] elicit some of that knowledge that's
[3041.6s] hiding in the parameters here's my top
[3044.0s] 10 list of the top landmarks to see in
[3046.1s] the
[3048.0s] pairs
[3050.1s] um and I'm doing it this way because I'm
[3052.5s] trying to Prime the model to now
[3054.4s] continue this list so let's see if that
[3056.1s] works when I press
[3057.5s] enter okay so you see that it started a
[3060.6s] list and it's now kind of giving me some
[3062.2s] of those
[3063.2s] landmarks and now notice that it's
[3065.1s] trying to give a lot of information here
[3067.0s] now you might not be able to actually
[3069.2s] fully trust some of the information here
[3070.9s] remember that this is all just a
[3072.2s] recollection of some of the internet
[3074.4s] documents and so the things that occur
[3077.1s] very frequently in the internet data are
[3079.5s] probably more likely to be remembered
[3081.4s] correctly compared to things that happen
[3083.5s] very infrequently so you can't fully
[3085.6s] trust some of the things that and some
[3087.2s] of the information that is here because
[3088.4s] it's all just a vague recollection of
[3090.3s] Internet documents because the
[3092.8s] information is not stored explicitly in
[3094.9s] any of the parameters it's all just the
[3096.9s] recollection that said we did get
[3098.6s] something that is probably approximately
[3100.3s] correct and I don't actually have the
[3102.3s] expertise to verify that this is roughly
[3104.0s] correct but you see that we've elicited
[3106.4s] a lot of the knowledge of the model and
[3108.9s] this knowledge is not precise and exact
[3111.4s] this knowledge is vague and
[3113.7s] probabilistic and statistical and the
[3115.8s] kinds of things that occur often are the
[3117.8s] kinds of things that are more likely to
[3119.1s] be remembered um in the model now I want
[3122.6s] to show you a few more examples of this
[3124.0s] model's Behavior the first thing I want
[3126.0s] to show you is this example I went to
[3128.4s] the Wikipedia page for zebra and let me
[3130.7s] just copy paste the first uh even one
[3133.5s] sentence
[3134.6s] here and let me put it here now when I
[3137.6s] click enter what kind of uh completion
[3139.9s] are we going to get so let me just hit
[3143.4s] enter there are three living species
[3147.0s] etc etc what the model is producing here
[3149.7s] is an exact regurgitation of this
[3151.6s] Wikipedia entry it is reciting this
[3153.8s] Wikipedia entry purely from memory and
[3156.4s] this memory is stored in its parameters
[3159.2s] and so it is possible that at some point
[3161.4s] in these 512 tokens the model will uh
[3164.2s] stray away from the Wikipedia entry but
[3166.4s] you can see that it has huge chunks of
[3168.0s] it memorized here uh let me see for
[3170.1s] example if this sentence
[3171.7s] occurs by now okay so this so we're
[3175.3s] still on track let me check
[3178.6s] here okay we're still on
[3180.8s] track it will eventually uh stray
[3184.5s] away okay so this thing is just recited
[3187.0s] to a very large extent it will
[3188.8s] eventually deviate uh because it won't
[3191.2s] be able to remember exactly now the
[3193.1s] reason that this happens is because
[3194.5s] these models can be extremely good at
[3196.6s] memorization and usually this is not
[3198.6s] what you want in the final model and
[3200.3s] this is something called regurgitation
[3201.9s] and it's usually undesirable to site uh
[3204.8s] things uh directly uh that you have
[3206.8s] trained on now the reason that this
[3209.2s] happens actually is because for a lot of
[3211.6s] documents like for example Wikipedia
[3213.4s] when these documents are deemed to be of
[3215.2s] very high quality as a source like for
[3217.1s] example Wikipedia it is very often uh
[3220.2s] the case that when you train the model
[3222.0s] you will preferentially sample from
[3224.3s] those sources so basically the model has
[3226.5s] probably done a few epochs on this data
[3228.7s] meaning that it has seen this web page
[3230.6s] like maybe probably 10 times or so and
[3232.8s] it's a bit like you like when you read
[3234.3s] some kind of a text many many times say
[3236.6s] you read something a 100 times uh then
[3238.6s] you'll be able to recite it and it's
[3240.3s] very similar for this model if it sees
[3241.8s] something way too often it's going to be
[3243.2s] able to recite it later from memory
[3246.0s] except these models can be a lot more
[3247.8s] efficient um like per presentation than
[3250.6s] human so probably it's only seen this
[3252.3s] Wikipedia entry 10 times but basically
[3254.0s] it has remembered this article exactly
[3256.4s] in its parameters okay the next thing I
[3258.3s] want to show you is something that the
[3259.4s] model has definitely not seen during its
[3261.5s] training so for example if we go to the
[3264.4s] paper uh and then we navigate to the
[3266.7s] pre-training data we'll see here that uh
[3271.2s] the data set has a knowledge cut off
[3273.3s] until the end of 2023 so it will not
[3275.9s] have seen documents after this point and
[3278.5s] certainly it has not seen anything about
[3279.9s] the 2024 election and how it turned out
[3283.0s] now if we Prime the model with the
[3286.1s] tokens from the future it will continue
[3289.0s] the token sequence and it will just take
[3290.6s] its best guess according to the
[3291.9s] knowledge that it has in its own
[3293.4s] parameters so let's take a look at what
[3295.1s] that could look like
[3297.0s] so the Republican Party kit
[3299.7s] Trump okay president of the United
[3301.9s] States from
[3302.9s] 2017 and let's see what it says after
[3305.2s] this point so for example the model will
[3307.2s] have to guess at the running mate and
[3309.4s] who it's against Etc so let's hit
[3311.7s] enter so here thingss that Mike Pence
[3314.0s] was the running mate instead of JD Vance
[3317.4s] and the ticket was against Hillary
[3320.8s] Clinton and Tim Kane so this is kind of
[3323.3s] a interesting parallel universe
[3325.2s] potentially of what could have happened
[3326.3s] happened according to the LM let's get a
[3328.8s] different sample so the identical prompt
[3331.0s] and let's
[3333.5s] resample so here the running mate was
[3335.5s] Ronda santis and they ran against Joe
[3338.1s] Biden and Camala Harris so this is again
[3340.7s] a different parallel universe so the
[3342.6s] model will take educated guesses and it
[3344.2s] will continue the token sequence based
[3345.8s] on this knowledge um and it will just
[3348.2s] kind of like all of what we're seeing
[3349.7s] here is what's called hallucination the
[3352.0s] model is just taking its best guess uh
[3354.8s] in a probalistic manner the next thing I
[3356.9s] would like to show you is that even
[3358.2s] though this is a base model and not yet
[3360.1s] an assistant model it can still be
[3362.0s] utilized in Practical applications if
[3364.2s] you are clever with your prompt design
[3366.6s] so here's something that we would call a
[3368.4s] few shot
[3369.5s] prompt so what it is here is that I have
[3372.3s] 10 words or 10 pairs and each pair is a
[3376.4s] word of English column and then a the
[3379.8s] translation in Korean and we have 10 of
[3382.7s] them and what the model does here is at
[3385.4s] the end we have teacher column and then
[3387.5s] here's where we're going to do a
[3388.4s] completion of say just five tokens and
[3391.8s] these models have what we call in
[3393.4s] context learning abilities and what
[3395.8s] that's referring to is that as it is
[3397.4s] reading this context it is learning sort
[3400.4s] of in
[3401.8s] place that there's some kind of a
[3403.8s] algorithmic pattern going on in my data
[3406.5s] and it knows to continue that pattern
[3408.8s] and this is called kind of like Inc
[3410.2s] context learning so it takes on the role
[3413.2s] of a
[3414.8s] translator and when we hit uh completion
[3418.1s] we see that the teacher translation is
[3419.9s] Sim which is correct um and so this is
[3423.3s] how you can build apps by being clever
[3425.0s] with your prompting even though we still
[3426.7s] just have a base model for now and it
[3428.6s] relies on what we call this um uh in
[3431.6s] context learning ability and it is done
[3434.2s] by constructing what's called a few shot
[3435.9s] prompt okay and finally I want to show
[3438.0s] you that there is a clever way to
[3439.4s] actually instantiate a whole language
[3441.6s] model assistant just by prompting and
[3445.0s] the trick to it is that we're structure
[3446.6s] a prompt to look like a web page that is
[3449.5s] a conversation between a helpful AI
[3451.9s] assistant and a human and then the model
[3454.2s] will continue that conversation so
[3456.6s] actually to write the prompt I turned to
[3458.6s] chat gbt itself which is kind of meta
[3461.6s] but I told it I want to create an llm
[3463.6s] assistant but all I have is the base
[3465.2s] model so can you please write my um uh
[3470.1s] prompt and this is what it came up with
[3472.3s] which is actually quite good so here's a
[3474.3s] conversation between an AI assistant and
[3475.8s] a human
[3476.6s] the AI assistant is knowledgeable
[3478.1s] helpful capable of answering wide
[3479.6s] variety of questions Etc and then here
[3483.4s] it's not enough to just give it a sort
[3485.3s] of description it works much better if
[3487.6s] you create this fot prompt so here's a
[3490.0s] few terms of human assistant human
[3493.0s] assistant and we have uh you know a few
[3495.8s] turns of conversation and then here at
[3498.0s] the end is we're going to be putting the
[3499.1s] actual query that we like so let me copy
[3501.7s] paste this into the base model prompt
[3505.3s] and now let me do human column and this
[3508.8s] is where we put our actual prompt why is
[3511.3s] the sky
[3512.8s] blue and uh let's uh
[3517.1s] run assistant the sky appears blue due
[3520.0s] to the phenomenon called R lights
[3521.6s] scattering etc etc so you see that the
[3524.0s] base model is just continuing the
[3525.3s] sequence but because the sequence looks
[3527.6s] like this conversation it takes on that
[3529.9s] role but it is a little subtle because
[3532.4s] here it just uh you know it ends the
[3534.5s] assistant and then just you know
[3535.7s] hallucinate Ates the next question by
[3537.1s] the human Etc so it'll just continue
[3538.7s] going on and on uh but you can see that
[3541.3s] we have sort of accomplished the task
[3543.9s] and if you just took this why is the sky
[3546.2s] blue and if we just refresh this and put
[3549.3s] it here then of course we don't expect
[3551.0s] this to work with a base model right
[3552.4s] we're just going to who knows what we're
[3554.1s] going to get okay we're just going to
[3555.4s] get more
[3556.4s] questions okay so this is one way to
[3559.4s] create an assistant even though you may
[3561.6s] only have a base model okay so this is
[3564.0s] the kind of brief summary of the things
[3566.3s] we talked about over the last few
[3568.5s] minutes now let me zoom out
[3572.7s] here and this is kind of like what we've
[3574.8s] talked about so far we wish to train LM
[3577.6s] assistants like chpt we've discussed the
[3580.7s] first stage of that which is the
[3582.2s] pre-training stage and we saw that
[3584.0s] really what it comes down to is we take
[3585.4s] Internet documents we break them up into
[3587.4s] these tokens these atoms of little text
[3589.4s] chunks and then we predict token
[3591.4s] sequences using neural networks the
[3594.2s] output of this entire stage is this base
[3596.7s] model it is the setting of The
[3598.3s] parameters of this network and this base
[3601.6s] model is basically an internet document
[3603.3s] simulator on the token level so it can
[3605.7s] just uh it can generate token sequences
[3608.4s] that have the same kind of like
[3610.0s] statistics as Internet documents and we
[3612.7s] saw that we can use it in some
[3613.7s] applications but we actually need to do
[3615.2s] better we want an assistant we want to
[3617.2s] be able to ask questions and we want the
[3618.8s] model to give us answers and so we need
[3621.2s] to now go into the second stage which is
[3623.7s] called the post-training stage so we
[3626.4s] take our base model our internet
[3628.0s] document simulator and hand it off to
[3630.0s] post training so we're now going to
[3631.9s] discuss a few ways to do what's called
[3633.8s] post training of these models these
[3636.4s] stages in post training are going to be
[3638.0s] computationally much less expensive most
[3640.4s] of the computational work all of the
[3642.3s] massive data centers um and all of the
[3645.4s] sort of heavy compute and millions of
[3647.7s] dollars are the pre-training stage but
[3650.5s] now we go into the slightly cheaper but
[3652.9s] still extremely important stage called
[3654.9s] post trining where we turn this llm
[3657.2s] model into an assistant so let's take a
[3659.4s] look at how we can get our model to not
[3662.3s] sample internet documents but to give
[3664.8s] answers to questions so in other words
[3667.5s] what we want to do is we want to start
[3668.9s] thinking about conversations and these
[3671.0s] are conversations that can be multi-turn
[3673.0s] so so uh there can be multiple turns and
[3676.0s] they are in the simplest case a
[3677.3s] conversation between a human and an
[3679.3s] assistant and so for example we can
[3681.2s] imagine the conversation could look
[3682.4s] something like this when a human says
[3684.1s] what is 2 plus2 the assistant should re
[3685.9s] respond with something like 2 plus 2 is
[3687.4s] 4 when a human follows up and says what
[3689.7s] if it was star instead of a plus
[3691.7s] assistant could respond with something
[3692.9s] like
[3693.6s] this um and similar here this is another
[3696.2s] example showing that the assistant could
[3697.5s] also have some kind of a personality
[3699.1s] here uh that it's kind of like nice and
[3701.9s] then here in the third example I'm
[3703.1s] showing that when a human is asking for
[3704.9s] something that we uh don't wish to help
[3707.2s] with we can produce what's called
[3708.7s] refusal we can say that we cannot help
[3710.6s] with that so in other words what we want
[3713.3s] to do now is we want to think through
[3715.2s] how in a system should interact with the
[3717.0s] human and we want to program the
[3719.1s] assistant and Its Behavior in these
[3721.4s] conversations now because this is neural
[3723.6s] networks we're not going to be
[3724.8s] programming these explicitly in code
[3727.5s] we're not going to be able to program
[3728.6s] the assistant in that way because this
[3730.4s] is neural networks everything is done
[3732.4s] through neural network training on data
[3734.2s] sets and so because of that we are going
[3737.5s] to be implicitly programming the
[3739.4s] assistant by creating data sets of
[3741.7s] conversations so these are three
[3743.5s] independent examples of conversations in
[3745.5s] a data dat set an actual data set and
[3747.8s] I'm going to show you examples will be
[3749.3s] much larger it could have hundreds of
[3751.0s] thousands of conversations that are
[3752.4s] multi- turn very long Etc and would
[3754.8s] cover a diverse breath of topics but
[3757.4s] here I'm only showing three examples but
[3759.6s] the way this works basically is uh a
[3763.0s] assistant is being programmed by example
[3765.6s] and where is this data coming from like
[3767.4s] 2 * 2al 4 same as 2 plus 2 Etc where
[3770.1s] does that come from this comes from
[3771.7s] Human labelers so we will basically give
[3774.7s] human labelers some conversational
[3776.4s] context and we will ask them to um
[3778.7s] basically give the ideal assistant
[3780.6s] response in this situation and a human
[3783.6s] will write out the ideal response for an
[3786.2s] assistant in any situation and then
[3788.5s] we're going to get the model to
[3790.4s] basically train on this and to imitate
[3792.6s] those kinds of
[3794.0s] responses so the way this works then is
[3796.3s] we are going to take our base model
[3797.8s] which we produced in the preing stage
[3800.4s] and this base model was trained on
[3802.0s] internet documents we're now going to
[3803.8s] take that data set of internet documents
[3805.2s] and we're gonna throw it out and we're
[3807.3s] going to substitute a new data set and
[3809.5s] that's going to be a data set of
[3810.6s] conversations and we're going to
[3812.1s] continue training the model on these
[3814.0s] conversations on this new data set of
[3815.6s] conversations and what happens is that
[3817.6s] the model will very rapidly adjust and
[3820.6s] will sort of like learn the statistics
[3822.9s] of how this assistant responds to human
[3825.5s] queries and then later during inference
[3828.5s] we'll be able to basically um Prime the
[3831.6s] assistant and get the response and it
[3834.1s] will be imitating what the humans will
[3836.1s] human labelers would do in that
[3837.4s] situation if that makes sense so we're
[3840.1s] going to see examples of that and this
[3841.2s] is going to become bit more concrete I
[3843.2s] also wanted to mention that this
[3845.1s] post-training stage we're going to
[3846.2s] basically just continue training the
[3847.7s] model but um the pre-training stage can
[3851.0s] in practice take roughly three months of
[3853.0s] training on many thousands of computers
[3855.4s] the post-training stage will typically
[3856.8s] be much shorter like 3 hours for example
[3860.1s] um and that's because the data set of
[3861.5s] conversations that we're going to create
[3863.4s] here manually is much much smaller than
[3866.1s] the data set of text on the internet and
[3869.0s] so this training will be very short but
[3871.5s] fundamentally we're just going to take
[3873.2s] our base model we're going to continue
[3875.0s] training using the exact same algorithm
[3877.3s] the exact same everything except we're
[3879.2s] swapping out the data set for
[3880.7s] conversations so the questions now are
[3883.1s] what are these conversations how do we
[3885.0s] represent them how do we get the model
[3886.9s] to see conversations instead of just raw
[3889.2s] text and then what are the outcomes of
[3892.8s] um this kind of training and what do you
[3894.7s] get in a certain like psychological
[3896.9s] sense uh when we talk about the model so
[3898.9s] let's turn to those questions now so
[3901.1s] let's start by talking about the
[3902.3s] tokenization of conversations everything
[3905.3s] in these models has to be turned into
[3907.0s] tokens because everything is just about
[3908.8s] token sequences so how do we turn
[3910.9s] conversations into token sequences is
[3913.0s] the question and so for that we need to
[3915.0s] design some kind of ending coding and uh
[3917.4s] this is kind of similar to maybe if
[3918.8s] you're familiar you don't have to be
[3920.7s] with for example the TCP IP packet in um
[3923.5s] on the internet there are precise rules
[3925.4s] and protocols for how you represent
[3927.1s] information how everything is structured
[3929.2s] together so that you have all this kind
[3930.4s] of data laid out in a way that is
[3932.7s] written out on a paper and that everyone
[3934.5s] can agree on and so it's the same thing
[3936.7s] now happening in llms we need some kind
[3938.5s] of data structures and we need to have
[3940.3s] some rules around how these data
[3941.8s] structures like conversations get
[3943.6s] encoded and decoded to and from tokens
[3946.9s] and so I want to show you now how I
[3948.9s] would
[3949.9s] recreate uh this conversation in the
[3952.4s] token space so if you go to Tech
[3954.4s] tokenizer
[3956.3s] I can take that conversation and this is
[3958.8s] how it is represented in uh for the
[3961.2s] language model so here we have we are
[3963.8s] iterating a user and an assistant in
[3966.4s] this two- turn
[3968.1s] conversation and what you're seeing here
[3970.2s] is it looks ugly but it's actually
[3971.7s] relatively simple the way it gets turned
[3973.8s] into a token sequence here at the end is
[3976.2s] a little bit complicated but at the end
[3978.4s] this conversation between a user and
[3979.7s] assistant ends up being 49 tokens it is
[3982.6s] a one-dimensional sequence of 49 tokens
[3984.8s] and these are the tokens
[3986.5s] okay and all the different llms will
[3989.7s] have a slightly different format or
[3991.6s] protocols and it's a little bit of a
[3993.3s] wild west right now but for example GPT
[3996.2s] 40 does it in the following way you have
[3999.0s] this special token called imore start
[4002.4s] and this is short for IM imaginary
[4004.2s] monologue uh the
[4006.0s] start then you have to specify um I
[4009.6s] don't actually know why it's called that
[4010.7s] to be honest then you have to specify
[4012.8s] whose turn it is so for example user
[4014.7s] which is a token 4
[4016.9s] 28 then you have internal monologue
[4020.3s] separator and then it's the exact
[4023.0s] question so the tokens of the question
[4025.5s] and then you have to close it so I am
[4027.1s] end the end of the imaginary monologue
[4029.6s] so
[4030.7s] basically the question from a user of
[4033.0s] what is 2 plus two ends up being the
[4036.0s] token sequence of these tokens and now
[4039.4s] the important thing to mention here is
[4040.8s] that IM start this is not text right IM
[4044.1s] start is a special token that gets added
[4047.1s] it's a new token and um this token has
[4050.6s] never been trained on so far it is a new
[4052.9s] token that we create in a post-training
[4054.6s] stage and we introduce and so these
[4057.6s] special tokens like IM seep IM start Etc
[4060.4s] are introduced and interspersed with
[4062.8s] text so that they sort of um get the
[4065.3s] model to learn that hey this is a the
[4067.4s] start of a turn for who is it start of
[4069.7s] the turn for the start of the turn is
[4071.6s] for the user and then this is what the
[4074.0s] user says and then the user ends and
[4076.7s] then it's a new start of a turn and it
[4078.7s] is by the assistant and then what does
[4081.2s] the assistant say well these are the
[4082.8s] tokens of what the assistant says Etc
[4085.6s] and so this conversation is not turned
[4086.9s] into the sequence of tokens the specific
[4089.7s] details here are not actually that
[4091.1s] important all I'm trying to show you in
[4093.0s] concrete terms is that our conversations
[4095.8s] which we think of as kind of like a
[4097.0s] structured object end up being turned
[4099.4s] via some encoding into onedimensional
[4101.8s] sequences of tokens and so because this
[4105.2s] is one dimensional sequence of tokens we
[4107.2s] can apply all the stuff that we applied
[4109.1s] before now it's just a sequence of
[4110.9s] tokens and now we can train a language
[4113.0s] model on it and so we're just predicting
[4115.1s] the next token in a sequence uh just
[4117.2s] like before and um we can represent and
[4119.9s] train on conversations and then what
[4122.4s] does it look like at test time during
[4124.0s] inference so say we've trained a model
[4126.9s] and we've trained a model on these kinds
[4129.1s] of data sets of conversations and now we
[4131.3s] want to
[4132.3s] inference so during inference what does
[4134.6s] this look like when you're on on chash
[4136.0s] apt well you come to chash apt and you
[4138.7s] have say like a dialogue with it and the
[4141.1s] way this works is
[4143.4s] basically um say that this was already
[4146.2s] filled in so like what is 2 plus 2 2
[4147.9s] plus 2 is four and now you issue what if
[4150.4s] it was times I am end and what basically
[4153.9s] ends up happening um on the servers of
[4156.1s] open AI or something like that is they
[4158.0s] put in I start assistant I amep and this
[4161.9s] is where they end it right here so they
[4164.7s] construct this context and now they
[4167.4s] start sampling from the model so it's at
[4169.4s] this stage that they will go to the
[4170.7s] model and say okay what is a good for
[4172.4s] sequence what is a good first token what
[4174.6s] is a good second token what is a good
[4176.6s] third token and this is where the LM
[4178.4s] takes over and creates a response like
[4181.4s] for example response that looks
[4183.6s] something like this but it doesn't have
[4184.9s] to be identical to this but it will have
[4186.7s] the flavor of this if this kind of a
[4188.9s] conversation was in the data set so um
[4192.4s] that's roughly how the protocol Works
[4194.9s] although the details of this protocol
[4196.5s] are not important so again my goal is
[4199.5s] that just to show you that everything
[4201.2s] ends up being just a one-dimensional
[4202.4s] token sequence so we can apply
[4204.1s] everything we've already seen but we're
[4206.6s] now training on conversations and we're
[4208.7s] now uh basically generating
[4211.0s] conversations as well okay so now I
[4213.2s] would like to turn to what these data
[4214.4s] sets look like in practice the first
[4216.4s] paper that I would like to show you and
[4217.7s] the first effort in this direction is
[4220.5s] this paper from openai in 2022 and this
[4223.4s] paper was called instruct GPT or the
[4225.9s] technique that they developed and this
[4227.4s] was the first time that opena has kind
[4229.0s] of talked about how you can take
[4230.4s] language models and fine-tune them on
[4232.6s] conversations and so this paper has a
[4234.8s] number of details that I would like to
[4236.1s] take you through so the first stop I
[4238.0s] would like to make is in section 3.4
[4240.2s] where they talk about the human
[4241.6s] contractors that they hired uh in this
[4244.1s] case from upwork or through scale AI to
[4247.5s] uh construct these conversations and so
[4250.0s] there are human labelers involved whose
[4252.2s] job it is professionally to create these
[4254.5s] conversations and these labelers are
[4256.5s] asked to come up with prompts and then
[4258.8s] they are asked to also complete the
[4260.8s] ideal assistant responses and so these
[4263.3s] are the kinds of prompts that people
[4264.6s] came up with so these are human labelers
[4266.7s] so list five ideas for how to regain
[4268.6s] enthusiasm for my career what are the
[4270.6s] top 10 science fiction books I should
[4272.0s] read next and there's many different
[4274.0s] types of uh kind of prompts here so
[4276.9s] translate this sentence from uh to
[4278.7s] Spanish Etc and so there's many things
[4281.6s] here that people came up with they first
[4283.7s] come up with the prompt and then they
[4285.7s] also uh answer that prompt and they give
[4288.0s] the ideal assistant response now how do
[4290.5s] they know what is the ideal assistant
[4292.3s] response that they should write for
[4293.9s] these prompts so when we scroll down a
[4295.9s] little bit further we see that here we
[4297.8s] have this excerpt of labeling
[4299.6s] instructions uh that are given to the
[4301.4s] human labelers so the company that is
[4304.0s] developing the language model like for
[4305.3s] example open AI writes up labeling
[4307.4s] instructions for how the humans should
[4309.6s] create ideal responses and so here for
[4312.6s] example is an excerpt uh of these kinds
[4314.6s] of labeling instruction instructions on
[4316.2s] High level you're asking people to be
[4317.6s] helpful truthful and harmless and you
[4320.0s] can pause the video if you'd like to see
[4321.8s] more here but on a high level basically
[4324.2s] just just answer try to be helpful try
[4326.2s] to be truthful and don't answer
[4328.2s] questions that we don't want um kind of
[4330.3s] the system to handle uh later in chat
[4333.2s] gbt and so roughly speaking the company
[4336.6s] comes up with the labeling instructions
[4338.1s] usually they are not this short usually
[4339.7s] there are hundreds of pages and people
[4341.6s] have to study them professionally and
[4343.8s] then they write out the ideal assistant
[4346.2s] responses uh following those labeling
[4348.3s] instructions so this is a very human
[4350.8s] heavy process as it was described in
[4352.7s] this paper now the data set for instruct
[4354.9s] GPT was never actually released by openi
[4357.3s] but we do have some open- Source um
[4359.6s] reproductions that were're trying to
[4360.8s] follow this kind of a setup and collect
[4363.0s] their own data so one that I'm familiar
[4365.1s] with for example is the effort of open
[4368.0s] Assistant from a while back and this is
[4370.4s] just one of I think many examples but I
[4372.3s] just want to show you an example so
[4374.5s] here's so these were people on the
[4376.4s] internet that were asked to basically
[4377.9s] create these conversations similar to
[4379.6s] what um open I did with human labelers
[4383.2s] and so here's an entry of a person who
[4385.5s] came up with this BR can you write a
[4387.3s] short introduction to the relevance of
[4388.7s] the term
[4389.6s] manop uh in economics please use
[4392.6s] examples Etc and then the same person or
[4395.3s] potentially a different person will
[4397.2s] write up the response so here's the
[4398.8s] assistant response to this and so then
[4402.0s] the same person or different person will
[4403.8s] actually write out this ideal
[4406.9s] response and then this is an example of
[4409.5s] maybe how the conversation could
[4410.6s] continue now explain it to a dog and
[4413.4s] then you can try to come up with a
[4414.8s] slightly a simpler explanation or
[4416.6s] something like that now this then
[4420.0s] becomes the label and we end up training
[4421.7s] on this so what happens during training
[4425.2s] is that um of course we're not going to
[4428.1s] have a full coverage of all the possible
[4430.5s] questions that um the model will
[4433.9s] encounter at test time during inference
[4436.2s] we can't possibly cover all the possible
[4437.8s] prompts that people are going to be
[4439.0s] asking in the future but if we have a
[4442.0s] like a data set of a few of these
[4443.9s] examples then the model during training
[4446.6s] will start to take on this Persona of
[4449.1s] this helpful truthful harmless assistant
[4452.2s] and it's all programmed by example and
[4454.8s] so these are all examples of behavior
[4457.0s] and if you have conversations of these
[4458.4s] example behaviors and you have enough of
[4459.9s] them like 100,00 and you train on it the
[4462.4s] model sort of starts to understand the
[4463.8s] statistical pattern and it kind of takes
[4466.0s] on this personality of this
[4468.2s] assistant now it's possible that when
[4470.3s] you get the exact same question like
[4472.4s] this at test time it's possible that the
[4475.6s] answer will be recited as exactly what
[4478.8s] was in the training set but more likely
[4480.7s] than that is that the model will kind of
[4483.1s] like do something of a similar Vibe um
[4485.9s] and we will understand that this is the
[4487.3s] kind of answer that you want um so
[4491.2s] that's what we're doing we're
[4492.2s] programming the system um by example and
[4495.6s] the system adopts statistically this
[4498.2s] Persona of this helpful truthful
[4500.6s] harmless assistant which is kind of like
[4502.8s] reflected in the labeling instructions
[4504.4s] that the company creates now I want to
[4506.4s] show you that the state-of-the-art has
[4508.1s] kind of advanced in the last 2 or 3
[4509.7s] years uh since the instr GPT paper so in
[4512.6s] particular it's not very common for
[4514.6s] humans to be doing all the heavy lifting
[4516.3s] just by themselves anymore and that's
[4518.2s] because we now have language models and
[4519.6s] these language models are helping us
[4521.1s] create these data sets and conversations
[4523.2s] so it is very rare that the people will
[4525.2s] like literally just write out the
[4526.4s] response from scratch it is a lot more
[4528.6s] likely that they will use an existing
[4529.8s] llm to basically like uh come up with an
[4532.3s] answer and then they will edit it or
[4534.0s] things like that so there's many
[4535.5s] different ways in which now llms have
[4537.6s] started to kind of permeate this
[4539.7s] posttraining Set uh stack and llms are
[4543.5s] basically used pervasively to help
[4545.2s] create these massive data sets of
[4546.8s] conversations so I don't want to show
[4549.1s] like Ultra chat is one um such example
[4552.0s] of like a more modern data set of
[4553.9s] conversations it is to a very large
[4556.2s] extent synthetic but uh I believe
[4558.2s] there's some human involvement I could
[4559.6s] be wrong with that usually there will be
[4561.2s] a little bit of human but there will be
[4562.8s] a huge amount of synthetic help um and
[4566.2s] this is all kind of like uh constructed
[4568.7s] in different ways and Ultra chat is just
[4570.2s] one example of many sft data sets that
[4572.2s] currently exist and the only thing I
[4574.1s] want to show you is that uh these data
[4576.0s] sets have now millions of conversations
[4578.1s] uh these conversations are mostly
[4579.6s] synthetic but they're probably edited to
[4581.4s] some extent by humans and they span a
[4583.8s] huge diversity of sort of
[4587.1s] um uh areas and so on so these are
[4591.7s] fairly extensive artifacts by now and
[4593.7s] there's all these like sft mixtures as
[4595.9s] they're called so you have a mixture of
[4597.6s] like lots of different types and sources
[4599.4s] and it's partially synthetic partially
[4601.0s] human and it's kind of like um gone in
[4604.2s] that direction since uh but roughly
[4606.6s] speaking we still have sft data sets
[4608.7s] they're made up of conversations we're
[4610.5s] training on them um just like we did
[4613.0s] before and
[4615.4s] uh I guess like the last thing to note
[4617.1s] is that I want to dispel a little bit of
[4620.0s] the magic of talking to an AI like when
[4622.5s] you go to chat GPT and you give it a
[4624.5s] question and then you hit enter uh what
[4627.8s] is coming back is kind of like
[4630.2s] statistically aligned with what's
[4632.4s] happening in the training set and these
[4634.1s] training sets I mean they really just
[4636.5s] have a seed in humans following labeling
[4639.4s] instructions so what are you actually
[4641.7s] talking to in chat GPT or how should you
[4644.1s] think about it well it's not coming from
[4645.8s] some magical AI like roughly speaking
[4648.4s] it's coming from something that is
[4649.8s] statistically imitating human labelers
[4652.3s] which comes from labeling instructions
[4654.5s] written by these companies and so you're
[4656.4s] kind of imitating this uh you're kind of
[4658.6s] getting um it's almost as if you're
[4660.5s] asking human labeler and imagine that
[4663.0s] the answer that is given to you uh from
[4665.2s] chbt is some kind of a simulation of a
[4667.8s] human labeler uh and it's kind of like
[4670.6s] asking what would a human labeler say in
[4673.2s] this kind of a conversation
[4676.5s] and uh it's not just like this human
[4678.8s] labeler is not just like a random person
[4680.3s] from the internet because these
[4681.6s] companies actually hire experts so for
[4683.2s] example when you are asking questions
[4684.8s] about code and so on the human labelers
[4686.8s] that would be in um involved in creation
[4689.0s] of these conversation data sets they
[4690.6s] will usually be usually be educated
[4692.6s] expert people and you're kind of like
[4695.1s] asking a question of like a simulation
[4697.3s] of those people if that makes sense so
[4699.8s] you're not talking to a magical AI
[4701.2s] you're talking to an average labeler
[4702.8s] this average labeler is probably fairly
[4704.3s] highly skilled
[4705.3s] but you're talking to kind of like an
[4706.8s] instantaneous simulation of that kind of
[4709.3s] a person that would be hired uh in the
[4712.2s] construction of these data sets so let
[4714.6s] me give you one more specific example
[4716.2s] before we move on for example when I go
[4718.5s] to chpt and I say recommend the top five
[4720.8s] landmarks who see in Paris and then I
[4722.6s] hit
[4724.5s] enter
[4729.1s] uh okay here we go okay when I hit enter
[4732.4s] what's coming out here how do I think
[4735.3s] about it well it's not some kind of a
[4736.8s] magical AI that has gone out and
[4738.6s] researched all the landmarks and then
[4740.3s] ranked them using its infinite
[4741.9s] intelligence Etc what I'm getting is a
[4744.2s] statistical simulation of a labeler that
[4747.3s] was hired by open AI you can think about
[4749.2s] it roughly in that way and so if this
[4753.2s] specific um question is in the
[4756.0s] posttraining data set somewhere at open
[4758.0s] aai then I'm very likely to see an
[4760.8s] answer that is probably very very
[4762.1s] similar to what that human labeler would
[4764.2s] have put down
[4765.4s] for those five landmarks how does the
[4767.2s] human labeler come up with this well
[4768.6s] they go off and they go on the internet
[4769.9s] and they kind of do their own little
[4771.0s] research for 20 minutes and they just
[4772.8s] come up with a list right now so if they
[4775.4s] come up with this list and this is in
[4777.2s] the data set I'm probably very likely to
[4779.2s] see what they submitted as the correct
[4781.6s] answer from the assistant now if this
[4784.7s] specific query is not part of the post
[4786.4s] training data set then what I'm getting
[4788.4s] here is a little bit more emergent uh
[4791.2s] because uh the model kind of understands
[4793.5s] the statistically
[4795.2s] um the kinds of landmarks that are in
[4797.6s] this training set are usually the
[4799.2s] prominent landmarks the landmarks that
[4800.7s] people usually want to see the kinds of
[4802.9s] landmarks that are usually uh very often
[4805.4s] talked about on the internet and
[4806.9s] remember that the model already has a
[4808.6s] ton of Knowledge from its pre-training
[4810.1s] on the internet so it's probably seen a
[4812.0s] ton of conversations about Paris about
[4813.7s] landmarks about the kinds of things that
[4815.3s] people like to see and so it's the
[4817.1s] pre-training knowledge that has then
[4818.5s] combined with the postering data set
[4820.8s] that results in this kind of an
[4823.5s] imitation um
[4825.4s] so that's uh that's roughly how you can
[4827.9s] kind of think about what's happening
[4829.5s] behind the scenes here in in this
[4831.6s] statistical sense okay now I want to
[4833.6s] turn to the topic of llm psychology as I
[4835.8s] like to call it which is what are sort
[4837.5s] of the emergent cognitive effects of the
[4840.7s] training pipeline that we have for these
[4842.2s] models so in particular the first one I
[4844.4s] want to talk to is of course
[4847.5s] hallucinations so you might be familiar
[4850.5s] with model hallucinations it's when llms
[4852.5s] make stuff up they just totally
[4853.8s] fabricate information Etc and it's a big
[4856.1s] problem with llm assistants it is a
[4858.2s] problem that existed to a large extent
[4860.0s] with early models uh from many years ago
[4862.7s] and I think the problem has gotten a bit
[4864.2s] better uh because there are some
[4865.6s] medications that I'm going to go into in
[4867.1s] a second for now let's just try to
[4869.2s] understand where these hallucinations
[4870.2s] come from so here's a specific example
[4873.5s] of a few uh of three conversations that
[4876.0s] you might think you have in your
[4877.5s] training set and um these are pretty
[4880.5s] reasonable conversations that you could
[4882.0s] imagine being in the training set so
[4883.6s] like for example who is Cruz well Tom
[4885.7s] Cruz is an famous actor American actor
[4888.0s] and producer Etc who is John baraso this
[4891.0s] turns out to be a us senetor for example
[4894.2s] who is genis Khan well genis Khan was
[4896.5s] blah blah blah and so this is what your
[4899.5s] conversations could look like at
[4900.6s] training time now the problem with this
[4902.8s] is that when the human is writing the
[4906.2s] correct answer for the assistant in each
[4908.8s] one of these cases uh the human either
[4911.1s] like knows who this person is or they
[4912.4s] research them on the Internet and they
[4913.9s] come in and they write this response
[4915.7s] that kind of has this like confident
[4917.2s] tone of an answer and what happens
[4919.6s] basically is that at test time when you
[4921.4s] ask for someone who is this is a totally
[4923.6s] random name that I totally came up with
[4925.2s] and I don't think this person exists um
[4927.5s] as far as I know I just Tred to generate
[4929.7s] it randomly the problem is when we ask
[4931.9s] who is Orson kovats the problem is that
[4935.4s] the assistant will not just tell you oh
[4937.6s] I don't know even if the assistant and
[4940.2s] the language model itself might know
[4943.0s] inside its features inside its
[4944.6s] activations inside of its brain sort of
[4946.5s] it might know that this person is like
[4948.0s] not someone that um that is that it's
[4950.8s] familiar with even if some part of the
[4952.7s] network kind of knows that in some sense
[4955.1s] the uh saying that oh I don't know who
[4957.5s] this is is is not going to happen
[4960.3s] because the model statistically imitates
[4962.8s] is training set in the training set the
[4965.6s] questions of the form who is blah are
[4967.6s] confidently answered with the correct
[4969.7s] answer and so it's going to take on the
[4972.0s] style of the answer and it's going to do
[4974.0s] its best it's going to give you
[4975.3s] statistically the most likely guess and
[4977.6s] it's just going to basically make stuff
[4978.9s] up because these models again we just
[4981.3s] talked about it is they don't have
[4982.8s] access to the internet they're not doing
[4984.1s] research these are statistical token
[4986.4s] tumblers as I call them uh is just
[4988.7s] trying to sample the next token in the
[4990.1s] sequence and it's going to basically
[4992.3s] make stuff up so let's take a look at
[4993.8s] what this looks
[4995.3s] like I have here what's called the
[4997.6s] inference playground from hugging face
[5000.6s] and I am on purpose picking on a model
[5002.7s] called Falcon 7B which is an old model
[5005.4s] this is a few years ago now so it's an
[5007.5s] older model So It suffers from
[5009.0s] hallucinations and as I mentioned this
[5011.0s] has improved over time recently but
[5013.5s] let's say who is Orson kovats let's ask
[5015.4s] Falcon 7B instruct
[5017.6s] run oh yeah Orson kovat is an American
[5020.1s] author and science uh fiction writer
[5022.6s] okay this is totally false it's
[5024.6s] hallucination let's try again these are
[5027.0s] statistical systems right so we can
[5028.7s] resample this time Orson kovat is a
[5031.2s] fictional character from this 1950s TV
[5034.0s] show it's total BS right let's try again
[5037.8s] he's a former minor league baseball
[5039.9s] player okay so basically the model
[5042.6s] doesn't know and it's given us lots of
[5044.5s] different answers because it doesn't
[5046.5s] know it's just kind of like sampling
[5048.3s] from these probabilities the model
[5050.4s] starts with the tokens who is oron
[5052.1s] kovats assistant and then it comes in
[5054.8s] here and it's get it's getting these
[5057.9s] probabilities and it's just sampling
[5059.2s] from the probabilities and it just like
[5061.0s] comes up with stuff and the stuff is
[5064.0s] actually
[5064.9s] statistically consistent with the style
[5067.8s] of the answer in its training set and
[5070.0s] it's just doing that but you and I
[5071.8s] experiened it as a madeup factual
[5073.9s] knowledge but keep in mind that uh the
[5076.3s] model basically doesn't know and it's
[5077.9s] just imitating the format of the answer
[5080.0s] and it's not going to go off and look it
[5081.3s] up uh because it's just imitating again
[5084.2s] the answer so how can we uh mitigate
[5087.0s] this because for example when we go to
[5088.4s] chat apt and I say who is oron kovats
[5090.9s] and I'm now asking the stateoftheart
[5092.6s] state-of-the-art model from open AI
[5095.0s] this model will tell
[5096.8s] you oh so this model is actually is even
[5100.6s] smarter because you saw very briefly it
[5102.8s] said searching the web uh we're going to
[5104.8s] cover this later um it's actually trying
[5107.3s] to do tool use and
[5111.4s] uh kind of just like came up with some
[5113.5s] kind of a story but I want to just who
[5115.9s] or Kovach did not use any tools I don't
[5119.5s] want it to do web
[5122.1s] search there's a wellknown historical or
[5124.5s] public figure named or oron kovats so
[5127.3s] this model is not going to make up stuff
[5129.4s] this model knows that it doesn't know
[5131.1s] and it tells you that it doesn't appear
[5132.8s] to be a person that this model knows so
[5135.6s] somehow we sort of improved
[5137.2s] hallucinations even though they clearly
[5139.5s] are an issue in older models and it
[5142.0s] makes totally uh sense why you would be
[5144.4s] getting these kinds of answers if this
[5146.2s] is what your training set looks like so
[5148.0s] how do we fix this okay well clearly we
[5150.5s] need some examples in our data set that
[5153.1s] where the correct answer for the
[5154.7s] assistant is that the model doesn't know
[5157.1s] about some particular fact but we only
[5159.6s] need to have those answers be produced
[5162.0s] in the cases where the model actually
[5163.5s] doesn't know and so the question is how
[5165.4s] do we know what the model knows or
[5167.0s] doesn't know well we can empirically
[5169.0s] probe the model to figure that out so
[5171.5s] let's take a look at for example how
[5173.4s] meta uh dealt with hallucinations for
[5176.5s] the Llama 3 series of models as an
[5178.5s] example so in this paper that they
[5180.3s] published from meta we can go into
[5182.1s] hallucinations
[5185.8s] which they call here factuality and they
[5187.9s] describe the procedure by which they
[5190.0s] basically interrogate the model to
[5192.3s] figure out what it knows and doesn't
[5193.7s] know to figure out sort of like the
[5195.3s] boundary of its knowledge and then they
[5198.2s] add examples to the training set where
[5201.9s] for the things where the model doesn't
[5204.3s] know them the correct answer is that the
[5206.6s] model doesn't know them which sounds
[5208.8s] like a very easy thing to do in
[5210.5s] principle but this roughly fixes the
[5213.0s] issue and the the reason it fixes the
[5214.9s] issue is
[5216.1s] because remember like the model might
[5219.5s] actually have a pretty good model of its
[5221.8s] self knowledge inside the network so
[5224.2s] remember we looked at the network and
[5226.4s] all these neurons inside the network you
[5228.7s] might imagine that there's a neuron
[5229.9s] somewhere in the network that sort of
[5232.0s] like lights up for when the model is
[5234.4s] uncertain but the problem is that the
[5237.0s] activation of that neuron is not
[5238.5s] currently wired up to the model actually
[5241.0s] saying in words that it doesn't know so
[5243.4s] even though the internal of the neural
[5244.8s] network no because there's some neurons
[5246.8s] that represent that the model uh will
[5249.8s] not surface that it will instead take
[5251.6s] its best guess so that it sounds
[5253.1s] confident um just like it sees in a
[5255.6s] training set so we need to basically
[5257.4s] interrogate the model and allow it to
[5259.6s] say I don't know in the cases that it
[5261.5s] doesn't know so let me take you through
[5263.5s] what meta roughly does so basically what
[5265.5s] they do is here I have an example uh
[5268.3s] Dominic kek is uh the featured article
[5271.6s] today so I just went there randomly and
[5274.3s] what they do is basically they take a
[5275.9s] random document in a training set and
[5278.2s] they take a paragraph and then they use
[5281.3s] an llm to construct questions about that
[5284.2s] paragraph so for example I did that with
[5286.5s] chat GPT
[5289.2s] here so I said here's a paragraph from
[5292.5s] this document generate three specific
[5294.6s] factual questions based on this
[5295.9s] paragraph and give me the questions and
[5297.6s] the answers and so the llms are already
[5300.5s] good enough to create and reframe this
[5303.3s] information so if the information is in
[5306.0s] the context window um of this llm this
[5309.6s] actually works pretty well it doesn't
[5310.9s] have to rely on its memory it's right
[5313.2s] there in the context window and so it
[5315.6s] can basically reframe that information
[5317.8s] with fairly high accuracy so for example
[5320.1s] can generate questions for us like for
[5322.0s] which team did he play here's the answer
[5324.6s] how many cups did he win Etc and now
[5327.0s] what we have to do is we have some
[5328.4s] question and answers and now we want to
[5330.1s] interrogate the model so roughly
[5331.8s] speaking what we'll do is we'll take our
[5333.4s] questions and we'll go to our model
[5335.5s] which would be uh say llama uh in meta
[5339.1s] but let's just interrogate mol 7B here
[5341.2s] as an example that's another model so
[5344.0s] does this model know about this answer
[5347.2s] let's take a
[5349.2s] look uh so he played for Buffalo Sabers
[5352.2s] right so the model knows and the the way
[5355.0s] that you can programmatically decide is
[5357.0s] basically we're going to take this
[5358.8s] answer from the model and we're going to
[5360.7s] compare it to the correct answer and
[5363.5s] again the model model are good enough to
[5365.0s] do this automatically so there's no
[5366.4s] humans involved here we can take uh
[5368.6s] basically the answer from the model and
[5370.4s] we can use another llm judge to check if
[5373.7s] that is correct according to this answer
[5375.9s] and if it is correct that means that the
[5377.2s] model probably knows so what we're going
[5379.0s] to do is we're going to do this maybe a
[5380.9s] few times so okay it knows it's Buffalo
[5382.7s] Savers let's drag
[5385.4s] in um Buffalo Sabers let's try one more
[5391.0s] time Buffalo Sabers so we asked three
[5394.1s] times about this factual question and
[5395.7s] the model seems to know so everything is
[5398.3s] great now let's try the second question
[5401.0s] how many Stanley Cups did he
[5403.0s] win and again let's interrogate the
[5405.0s] model about that and the correct answer
[5406.2s] is
[5408.2s] two so um here the model claims that he
[5413.6s] won um four times which is not correct
[5417.5s] right it doesn't match two so the model
[5420.1s] doesn't know it's making stuff up let's
[5422.0s] try again
[5427.9s] um so here the model again it's kind of
[5430.2s] like making stuff up right let's
[5434.2s] Dragon here it says did he did not even
[5437.6s] did not win during his career so
[5439.7s] obviously the model doesn't know and the
[5441.4s] way we can programmatically tell again
[5443.0s] is we interrogate the model three times
[5445.2s] and we compare its answers maybe three
[5447.3s] times five times whatever it is to the
[5449.2s] correct answer and if the model doesn't
[5451.6s] know then we know that the model doesn't
[5453.2s] know this question
[5454.4s] and then what we do is we take this
[5456.6s] question we create a new conversation in
[5459.6s] the training set so we're going to add a
[5461.5s] new conversation training set and when
[5463.9s] the question is how many Stanley Cups
[5465.4s] did he win the answer is I'm sorry I
[5468.4s] don't know or I don't remember and
[5470.6s] that's the correct answer for this
[5472.3s] question because we interrogated the
[5473.8s] model and we saw that that's the case if
[5476.0s] you do this for many different types of
[5478.4s] uh questions for many different types of
[5480.9s] documents you are giving the model an
[5483.1s] opportunity to in its training set
[5485.6s] refuse to say based on its knowledge and
[5488.4s] if you just have a few examples of that
[5490.2s] in your training set the model will know
[5493.4s] um and and has the opportunity to learn
[5495.5s] the association of this knowledge-based
[5497.8s] refusal to this internal neuron
[5501.1s] somewhere in its Network that we presume
[5503.1s] exists and empirically this turns out to
[5505.1s] be probably the case and it can learn
[5507.4s] that Association that hey when this
[5509.4s] neuron of uncertainty is high then I
[5512.1s] actually don't know and I'm allowed to
[5514.4s] say that I'm sorry but I don't think I
[5516.3s] remember this Etc and if you have these
[5519.6s] uh examples in your training set then
[5521.5s] this is a large mitigation for
[5523.8s] hallucination and that's roughly
[5525.6s] speaking why chpt is able to do stuff
[5528.0s] like this as well so these are kinds of
[5530.5s] uh mitigations that people have
[5532.1s] implemented and that have improved the
[5534.1s] factuality issue over time okay so I've
[5536.8s] described mitigation number one for
[5539.7s] basically mitigating the hallucinations
[5541.6s] issue now we can actually do much better
[5544.4s] than that uh it's instead of just saying
[5547.3s] that we don't know uh we can introduce
[5549.8s] an additional mitigation number two to
[5552.0s] give the llm an opportunity to be
[5553.8s] factual and actually answer the question
[5556.4s] now what do you and I do if I was to ask
[5559.1s] you a factual question and you don't
[5560.6s] know uh what would you do um in order to
[5563.7s] answer the question well you could uh go
[5565.8s] off and do some search and uh use the
[5567.7s] internet and you could figure out the
[5569.7s] answer and then tell me what that answer
[5571.7s] is and we can do the exact exact same
[5574.2s] thing with these models so think of the
[5576.9s] knowledge inside the neural network
[5578.4s] inside its billions of parameters think
[5581.1s] of that as kind of a vague recollection
[5583.0s] of the things that the model has seen
[5585.7s] during its training during the
[5587.0s] pre-training stage a long time ago so
[5589.5s] think of that knowledge in the
[5590.7s] parameters as something you read a month
[5593.2s] ago and if you keep reading something
[5595.7s] then you will remember it and the model
[5597.2s] remembers that but if it's something
[5598.8s] rare then you probably don't have a
[5600.1s] really good recollection of that
[5601.4s] information but what you and I do is we
[5603.4s] just go and look it up now when you go
[5605.5s] and look it up what you're doing
[5606.6s] basically is like you're refreshing your
[5608.2s] working memory with information and then
[5610.5s] you're able to sort of like retrieve it
[5612.1s] talk about it or Etc so we need some
[5614.2s] equivalent of allowing the model to
[5616.3s] refresh its memory or its recollection
[5618.9s] and we can do that by introducing tools
[5621.3s] uh for the
[5622.7s] models so the way we are going to
[5624.6s] approach this is that instead of just
[5625.8s] saying hey I'm sorry I don't know we can
[5628.4s] attempt to use tools so we can create uh
[5633.2s] a mechanism
[5634.3s] by which the language model can emit
[5636.0s] special tokens and these are tokens that
[5637.6s] we're going to introduce new tokens so
[5640.6s] for example here I've introduced two
[5642.4s] tokens and I've introduced a format or a
[5644.8s] protocol for how the model is allowed to
[5647.1s] use these tokens so for example instead
[5649.8s] of answering the question when the model
[5652.0s] does not instead of just saying I don't
[5654.0s] know sorry the model has the option now
[5656.3s] to emitting the special token search
[5658.3s] start and this is the query that will go
[5660.6s] to like bing.com in the case of openai
[5662.9s] or say Google search or something like
[5664.3s] that so it will emit the query and then
[5666.6s] it will emit search end and then here
[5670.7s] what will happen is that the program
[5673.0s] that is sampling from the model that is
[5674.8s] running the inference when it sees the
[5677.0s] special token search end instead of
[5679.5s] sampling the next token uh in the
[5681.9s] sequence it will actually pause
[5684.1s] generating from the model it will go off
[5686.4s] it will open a session with bing.com and
[5689.0s] it will paste the search query into Bing
[5692.0s] and it will then um get all the text
[5694.2s] that is retrieved and it will basically
[5696.8s] take that text it will maybe represent
[5698.9s] it again with some other special tokens
[5700.3s] or something like that and it will take
[5702.1s] that text and it will copy paste it here
[5705.3s] into what I Tred to like show with the
[5707.2s] brackets so all that text kind of comes
[5709.3s] here and when the text comes here it
[5712.5s] enters the context window so the model
[5715.4s] so that text from the web search is now
[5717.5s] inside the context window that will feed
[5720.0s] into the neural network and you should
[5721.9s] think of the context window as kind of
[5723.4s] like the working memory of the model
[5725.4s] that data that is in the context window
[5727.5s] is directly accessible by the model it
[5729.6s] directly feeds into the neural network
[5731.6s] so it's not anymore a vague recollection
[5733.7s] it's data that it it has in the context
[5736.9s] window and is directly available to that
[5738.4s] model so now when it's sampling the new
[5741.4s] uh tokens here afterwards it can
[5743.6s] reference very easily the data that has
[5745.8s] been copy pasted in there so that's
[5748.7s] roughly how these um how these tools use
[5752.5s] uh tools uh function
[5754.2s] and so web search is just one of the
[5755.5s] tools we're going to look at some of the
[5756.6s] other tools in a bit uh but basically
[5759.0s] you introduce new tokens you introduce
[5760.8s] some schema by which the model can
[5762.4s] utilize these tokens and can call these
[5764.9s] special functions like web search
[5766.4s] functions and how do you teach the model
[5768.9s] how to correctly use these tools like
[5770.5s] say web search search start search end
[5772.5s] Etc well again you do that through
[5774.2s] training sets so we need now to have a
[5776.6s] bunch of data and a bunch of
[5778.8s] conversations that show the model by
[5781.1s] example how to use web search so what
[5784.4s] are the what are the settings where you
[5785.9s] are using the search um and what does
[5788.1s] that look like and here's by example how
[5790.5s] you start a search and the search Etc
[5793.4s] and uh if you have a few thousand maybe
[5795.4s] examples of that in your training set
[5797.0s] the model will actually do a pretty good
[5798.8s] job of understanding uh how this tool
[5800.9s] works and it will know how to sort of
[5803.0s] structure its queries and of course
[5804.9s] because of the pre-training data set and
[5807.2s] its understanding of the world it
[5808.6s] actually kind of understands what a web
[5809.9s] search is and so it actually kind of has
[5811.8s] a pretty good native understanding
[5814.2s] um of what kind of stuff is a good
[5816.0s] search query um and so it all kind of
[5818.3s] just like works you just need a little
[5820.2s] bit of a few examples to show it how to
[5822.6s] use this new tool and then it can lean
[5824.6s] on it to retrieve information and uh put
[5827.0s] it in the context window and that's
[5828.6s] equivalent to you and I looking
[5830.1s] something up because once it's in the
[5832.2s] context it's in the working memory and
[5833.6s] it's very easy to manipulate and access
[5836.0s] so that's what we saw a few minutes ago
[5838.1s] when I was searching on chat GPT for who
[5840.5s] is Orson kovats the chat GPT language
[5843.0s] model decided Ed that this is some kind
[5844.8s] of a rare um individual or something
[5847.9s] like that and instead of giving me an
[5849.7s] answer from its memory it decided that
[5851.6s] it will sample a special token that is
[5853.4s] going to do web search and we saw
[5855.2s] briefly something flash it was like
[5856.9s] using the web tool or something like
[5858.4s] that so it briefly said that and then we
[5860.4s] waited for like two seconds and then it
[5861.9s] generated this and you see how it's
[5863.7s] creating references here and so it's
[5866.0s] citing sources so what happened here is
[5870.0s] it went off it did a web web search it
[5872.5s] found these sources and these URLs and
[5875.0s] the text of these web pages was all
[5879.0s] stuffed in between here and it's not
[5881.2s] showing here but it's it's basically
[5882.9s] stuffed as text in between here and now
[5886.6s] it sees that text and now it kind of
[5888.8s] references it and says that okay it
[5891.3s] could be these people citation could be
[5893.2s] those people citation Etc so that's what
[5895.5s] happened here and that's what and that's
[5897.5s] why when I said who is Orson kovats I
[5899.4s] could also say don't use any tools and
[5902.0s] then that's enough to um
[5904.0s] basically convince chat PT to not use
[5905.8s] tools and just use its memory and its
[5908.2s] recollection I also went off and I um
[5912.6s] tried to ask this question of Chachi PT
[5914.9s] so how many standing cups did uh Dominic
[5917.1s] Hasek win and Chachi P actually decided
[5919.6s] that it knows the answer and it has the
[5920.8s] confidence to say that uh he want twice
[5923.7s] and so it kind of just relied on its
[5925.4s] memory because presumably it has um it
[5929.0s] has enough of
[5930.4s] a kind of confidence in its weights in
[5933.5s] it parameters and activations that this
[5935.4s] is uh retrievable just for memory um but
[5939.2s] you can also
[5941.1s] conversely use web search to make sure
[5944.1s] and then for the same query it actually
[5946.0s] goes off and it searches and then it
[5947.9s] finds a bunch of sources it finds all
[5950.4s] this all of this stuff gets copy pasted
[5952.2s] in there and then it tells us uh to
[5955.4s] again and sites and it actually says the
[5957.8s] Wikipedia article which is the source of
[5960.1s] this information for us as well so
[5963.0s] that's tools web search the model
[5965.1s] determines when to search and then uh
[5967.5s] that's kind of like how these tools uh
[5970.0s] work and this is an additional kind of
[5972.2s] mitigation for uh hallucinations and
[5975.0s] factuality so I want to stress one more
[5977.1s] time this very important sort of
[5978.6s] psychology
[5980.2s] Point knowledge in the parameters of the
[5983.1s] neural network is a vague recollection
[5985.7s] the knowledge in the tokens that make up
[5987.4s] the context
[5988.4s] window is the working memory and it
[5991.3s] roughly speaking Works kind of like um
[5993.8s] it works for us in our brain the stuff
[5995.8s] we remember is our parameters uh and the
[5998.6s] stuff that we just experienced like a
[6001.4s] few seconds or minutes ago and so on you
[6003.1s] can imagine that being in our context
[6004.5s] window and this context window is being
[6005.9s] built up as you have a conscious
[6007.4s] experience around you so this has a
[6010.2s] bunch of um implications also for your
[6012.4s] use of LOLs in practice so for example I
[6015.9s] can go to chat GPT and I can do
[6017.2s] something like this I can say can you
[6018.6s] Summarize chapter one of Jane Austin's
[6020.2s] Pride and Prejudice right and this is a
[6022.7s] perfectly fine prompt and Chach actually
[6025.3s] does something relatively reasonable
[6026.7s] here and but the reason it does that is
[6028.6s] because Chach has a pretty good
[6030.0s] recollection of a famous work like Pride
[6032.6s] and Prejudice it's probably seen a ton
[6034.3s] of stuff about it there's probably
[6035.5s] forums about this book it's probably
[6037.4s] read versions of this book um and it's
[6040.3s] kind of like remembers because even if
[6043.4s] you've read this or articles about it
[6046.6s] you'd kind of have a recollection enough
[6048.0s] to actually say all this but usually
[6049.9s] when I actually interact with LMS and I
[6051.5s] want them to recall specific things it
[6053.8s] always works better if you just give it
[6055.1s] to them so I think a much better prompt
[6057.3s] would be something like this can you
[6059.4s] summarize for me chapter one of genos's
[6061.4s] spr and Prejudice and then I am
[6063.1s] attaching it below for your reference
[6064.6s] and then I do something like a delimeter
[6066.1s] here and I paste it in and I I found
[6068.7s] that just copy pasting it from some
[6070.6s] website that I found here um so copy
[6074.2s] pasting the chapter one here and I do
[6076.3s] that because when it's in the context
[6077.8s] window the model has direct access to it
[6080.3s] and can exactly it doesn't have to
[6082.3s] recall it it just has access to it and
[6084.6s] so this summary is can be expected to be
[6087.0s] a significantly high quality or higher
[6089.0s] quality than this summary uh just
[6091.4s] because it's directly available to the
[6092.6s] model and I think you and I would work
[6094.3s] in the same way if you want to it would
[6097.0s] be you would produce a much better
[6098.0s] summary if you had reread this chapter
[6100.8s] before you had to summarize it and
[6102.7s] that's basically what's happening here
[6104.8s] or the equivalent of it the next sort of
[6107.1s] psychological Quirk I'd like to talk
[6108.6s] about briefly is that of the knowledge
[6110.5s] of self so what I see very often on the
[6112.8s] internet is that people do something
[6114.2s] like this they ask llms something like
[6116.9s] what model are you and who built you and
[6119.8s] um basically this uh question is a
[6121.4s] little bit nonsensical and the reason I
[6123.7s] say that is that as I try to kind of
[6125.9s] explain with some of the underhood
[6127.1s] fundamentals this thing is not a person
[6129.4s] right it doesn't have a persistent
[6131.2s] existence in any way it sort of boots up
[6134.2s] processes tokens and shuts off and it
[6137.1s] does that for every single person it
[6138.4s] just kind of builds up a context window
[6139.7s] of conversation and then everything gets
[6141.2s] deleted and so this this entity is kind
[6143.6s] of like restarted from scratch every
[6145.1s] single conversation if that makes sense
[6147.2s] it has no persistent self it has no
[6148.8s] sense of self it's a token tumbler and
[6151.7s] uh it follows the statistical
[6153.4s] regularities of its training set so it
[6155.5s] doesn't really make sense to ask it who
[6158.2s] are you what build you Etc and by
[6160.4s] default if you do what I described and
[6162.8s] just by default and from nowhere you're
[6164.9s] going to get some pretty random answers
[6166.2s] so for example let's uh pick on Falcon
[6168.5s] which is a fairly old model and let's
[6170.5s] see what it tells
[6171.9s] us uh so it's evading the question uh
[6175.5s] talented engineers and developers here
[6178.0s] it says I was built by open AI based on
[6179.9s] the gpt3 model it's totally making stuff
[6181.9s] up now the fact that it's built by open
[6184.4s] AI here I think a lot of people would
[6186.1s] take this as evidence that this model
[6187.9s] was somehow trained on open AI data or
[6189.6s] something like that I don't actually
[6190.8s] think that that's necessarily true the
[6192.8s] reason for that is
[6194.5s] that if you don't explicitly program the
[6197.6s] model to answer these kinds of questions
[6200.4s] then what you're going to get is its
[6202.0s] statistical best guess at the answer and
[6205.6s] this model had a um sft data mixture of
[6209.4s] conversations and during the
[6212.0s] fine-tuning um the model sort of
[6215.5s] understands as it's training on this
[6216.7s] data that it's taking on this
[6218.4s] personality of this like helpful
[6220.1s] assistant and it doesn't know how to it
[6222.5s] doesn't actually it wasn't told exactly
[6224.5s] what label to apply to self it just kind
[6227.2s] of is taking on this uh this uh Persona
[6230.1s] of a helpful assistant and remember that
[6233.2s] the pre-training stage took the
[6235.0s] documents from the entire internet and
[6237.3s] Chach and open AI are very prominent in
[6239.7s] these documents and so I think what's
[6241.5s] actually likely to be happening here is
[6243.6s] that this is just its hallucinated label
[6246.2s] for what it is this is its self-identity
[6248.4s] is that it's chat GPT by open Ai and
[6251.2s] it's only saying that because there's a
[6252.6s] ton of data on the internet of um
[6255.4s] answers like this that are actually
[6257.9s] coming from open from chasht and So
[6260.4s] that's its label for what it is now you
[6263.6s] can override this as a developer if you
[6265.7s] have a llm model you can actually
[6267.5s] override it and there are a few ways to
[6268.9s] do that so for example let me show you
[6271.9s] there's this MMO model from Allen Ai and
[6275.3s] um this is one llm it's not a top tier
[6277.9s] LM or anything like that but I like it
[6279.5s] because it is fully open source so the
[6281.2s] paper for Almo and everything else is
[6283.0s] completely fully open source which is
[6284.5s] nice um so here we are looking at its
[6287.1s] sft mixture so this is the data mixture
[6289.8s] of um the fine tuning so this is the
[6292.2s] conversations data it right and so the
[6294.6s] way that they are solving it for Theo
[6296.3s] model is we see that there's a bunch of
[6298.3s] stuff in the mixture and there's a total
[6299.6s] of 1 million conversations here but here
[6302.4s] we have alot to hardcoded if we go there
[6305.5s] we see that this is 240
[6307.8s] conversations and look at these 240
[6310.5s] conversations they're hardcoded tell me
[6312.8s] about yourself says user and then the
[6315.7s] assistant says I'm and open language
[6317.6s] model developed by AI to Allen Institute
[6319.5s] of artificial intelligence Etc I'm here
[6321.5s] to help blah blah blah what is your name
[6323.9s] uh Theo project so these are all kinds
[6326.2s] of like cooked up hardcoded questions
[6327.9s] abouto 2 and the correct answers to give
[6330.7s] in these cases if you take 240 questions
[6333.9s] like this or conversations put them into
[6336.0s] your training set and fine tune with it
[6337.9s] then the model will actually be expected
[6339.5s] to parot this stuff later if you don't
[6343.0s] give it this then it's probably a Chach
[6345.8s] by open
[6346.7s] Ai and um there's one more way to
[6349.7s] sometimes do this is
[6351.4s] that basically um in these conversations
[6355.2s] and you have terms between human and
[6356.5s] assistant sometimes there's a special
[6358.4s] message called system message at the
[6360.5s] very beginning of the conversation so
[6362.6s] it's not just between human and
[6363.6s] assistant there's a system and in the
[6365.9s] system message you can actually hardcode
[6367.9s] and remind the model that hey you are a
[6370.7s] model developed by open Ai and your name
[6373.8s] is chashi pt40 and you were trained on
[6376.5s] this date and your knowledge cut off is
[6378.3s] this and basically it kind of like
[6380.0s] documents the model a little bit and
[6381.8s] then this is inserted into to your
[6383.1s] conversations so when you go on chpt you
[6385.1s] see a blank page but actually the system
[6387.0s] message is kind of like hidden in there
[6388.7s] and those tokens are in the context
[6390.2s] window and so those are the two ways to
[6393.2s] kind of um program the models to talk
[6395.8s] about themselves either it's done
[6397.6s] through uh data like this or it's done
[6400.3s] through system message and things like
[6402.2s] that basically invisible tokens that are
[6404.0s] in the context window and remind the
[6406.0s] model of its identity but it's all just
[6408.0s] kind of like cooked up and bolted on in
[6410.1s] some in some way it's not actually like
[6412.0s] really deeply there in any real sense as
[6415.0s] it would before a human I want to now
[6417.3s] continue to the next section which deals
[6419.5s] with the computational capabilities or
[6421.2s] like I should say the native
[6422.3s] computational capabilities of these
[6423.6s] models in problem solving scenarios and
[6426.4s] so in particular we have to be very
[6427.8s] careful with these models when we
[6429.2s] construct our examples of conversations
[6431.3s] and there's a lot of sharp edges here
[6433.2s] that are kind of like elucidative is
[6435.0s] that a word uh they're kind of like
[6436.7s] interesting to look at when we consider
[6438.4s] how these models think so um consider
[6442.5s] the following prompt from a human and
[6444.9s] supposed that basically that we are
[6446.0s] building out a conversation to enter
[6447.6s] into our training set of conversations
[6449.0s] so we're going to train the model on
[6450.2s] this we're teaching you how to basically
[6452.0s] solve simple math problems so the prompt
[6454.6s] is Emily buys three apples and two
[6456.5s] oranges each orange cost $2 the total
[6458.7s] cost is 13 what is the cost of apples
[6461.0s] very simple math question now there are
[6463.8s] two answers here on the left and on the
[6465.8s] right they are both correct answers they
[6468.3s] both say that the answer is three which
[6469.7s] is correct but one of these two is a
[6472.4s] significant ific anly better answer for
[6474.7s] the assistant than the other like if I
[6476.4s] was Data labeler and I was creating one
[6478.0s] of these one of these would be uh a
[6481.4s] really terrible answer for the assistant
[6483.8s] and the other would be okay and so I'd
[6485.8s] like you to potentially pause the video
[6487.3s] Even and think through why one of these
[6489.6s] two is significantly better answer uh
[6492.5s] than the other and um if you use the
[6494.9s] wrong one your model will actually be uh
[6497.6s] really bad at math potentially and it
[6499.5s] would have uh bad outcomes and this is
[6501.2s] something that you would be careful with
[6502.4s] in your life labeling documentations
[6503.6s] when you are training people uh to
[6505.4s] create the ideal responses for the
[6507.2s] assistant okay so the key to this
[6509.2s] question is to realize and remember that
[6512.0s] when the models are training and also
[6514.2s] inferencing they are working in
[6515.9s] onedimensional sequence of tokens from
[6518.0s] left to right and this is the picture
[6520.4s] that I often have in my mind I imagine
[6522.3s] basically the token sequence evolving
[6523.8s] from left to right and to always produce
[6526.1s] the next token in a sequence we are
[6528.5s] feeding all these tokens into the neural
[6530.8s] network and this neural network then is
[6533.0s] the probabilities for the next token and
[6534.3s] sequence right so this picture here is
[6536.5s] the exact same picture we saw uh before
[6538.9s] up here and this comes from the web demo
[6541.9s] that I showed you before right so this
[6544.3s] is the calculation that basically takes
[6545.9s] the input tokens here on the top and uh
[6549.1s] performs these operations of all these
[6551.4s] neurons and uh gives you the answer for
[6553.9s] the probabilities of what comes next now
[6555.8s] the important thing to realize is that
[6557.6s] roughly
[6559.0s] speaking uh there's basically a finite
[6561.3s] number of layers of computation that
[6562.9s] happened here so for example this model
[6565.2s] here has only one two three layers of
[6568.6s] what's called detention and uh MLP here
[6572.0s] um maybe um typical modern
[6574.4s] state-of-the-art Network would have more
[6576.2s] like say 100 layers or something like
[6577.6s] that but there's only 100 layers of
[6579.2s] computation or something like that to go
[6580.9s] from the previous token sequence to the
[6582.8s] probabilities for the next token and so
[6584.8s] there's a finite amount of computation
[6586.9s] that happens here for every single token
[6589.3s] and you should think of this as a very
[6590.8s] small amount of computation and this
[6592.8s] amount of computation is almost roughly
[6594.6s] fixed uh for every single token in this
[6597.4s] sequence um the that's not actually
[6599.6s] fully true because the more tokens you
[6601.7s] feed in uh the the more expensive uh
[6604.4s] this forward pass will be of this neural
[6606.6s] network but not by much so you should
[6609.3s] think of this uh and I think as a good
[6610.8s] model to have in mind this is a fixed
[6612.6s] amount of compute that's going to happen
[6613.9s] in this box for every single one of
[6615.6s] these tokens and this amount of compute
[6617.6s] Cann possibly be too big because there's
[6619.1s] not that many layers that are sort of
[6621.0s] going from the top to bottom here
[6623.2s] there's not that that much
[6624.3s] computationally that will happen here
[6626.0s] and so you can't imagine the model to to
[6627.8s] basically do arbitrary computation in a
[6629.6s] single forward pass to get a single
[6631.9s] token and so what that means is that we
[6634.1s] actually have to distribute our
[6635.7s] reasoning and our computation across
[6637.8s] many tokens because every single token
[6640.3s] is only spending a finite amount of
[6642.0s] computation on it and so we kind of want
[6645.3s] to distribute the computation across
[6647.8s] many tokens and we can't have too much
[6650.7s] computation or expect too much
[6652.1s] computation out of of the model in any
[6653.7s] single individual token because there's
[6656.0s] only so much computation that happens
[6657.8s] per token okay roughly fixed amount of
[6660.7s] computation here
[6662.7s] so that's why this answer here is
[6666.1s] significantly worse and the reason for
[6667.9s] that is Imagine going from left to right
[6669.8s] here um and I copy pasted it right here
[6673.6s] the answer is three Etc imagine the
[6676.2s] model having to go from left to right
[6677.7s] emitting these tokens one at a time it
[6679.8s] has to say or we're expecting to say the
[6683.2s] answer is space dollar sign and then
[6687.8s] right here we're expecting it to
[6688.9s] basically cram all of the computation of
[6690.7s] this problem into this single token it
[6692.7s] has to emit the correct answer three and
[6695.8s] then once we've emitted the answer three
[6697.9s] we're expecting it to say all these
[6699.6s] tokens but at this point we've already
[6701.3s] prod produced the answer and it's
[6703.2s] already in the context window for all
[6704.6s] these tokens that follow so anything
[6706.8s] here is just um kind of post Hawk
[6709.2s] justification of why this is the answer
[6712.1s] um because the answer is already created
[6713.9s] it's already in the token window so it's
[6716.4s] it's not actually being calculated here
[6718.6s] um and so if you are answering the
[6721.0s] question directly and immediately you
[6723.1s] are training the model to to try to
[6726.0s] basically guess the answer in a single
[6727.6s] token and that is just not going to work
[6730.2s] because of the finite amount of
[6731.2s] computation that happens per token
[6733.7s] that's why this answer on the right is
[6735.8s] significantly better because we are
[6737.2s] Distributing this computation across the
[6739.0s] answer we're actually getting the model
[6740.8s] to sort of slowly come to the answer
[6743.2s] from the left to right we're getting
[6744.6s] intermediate results we're saying okay
[6746.6s] the total cost of oranges is four so 30
[6748.9s] - 4 is 9 and so we're creating
[6752.1s] intermediate calculations and each one
[6754.2s] of these calculations is by itself not
[6756.0s] that expensive and so we're actually
[6758.0s] basically kind of guessing a little bit
[6760.2s] the difficulty that the model is capable
[6762.3s] of in any single one of these individual
[6764.6s] tokens and there can never be too much
[6767.0s] work in any one of these tokens
[6769.2s] computationally because then the model
[6770.9s] won't be able to do that later at test
[6773.0s] time and so we're teaching the model
[6775.3s] here to spread out its reasoning and to
[6777.7s] spread out its computation over the
[6779.7s] tokens and in this way it only has very
[6782.5s] simple problems in each token and they
[6785.3s] can add up and then by the time it's
[6787.6s] near the end it has all the previous
[6789.7s] results in its working memory and it's
[6791.8s] much easier for it to determine that the
[6793.2s] answer is and here it is three so this
[6795.8s] is a significantly better label for our
[6798.4s] computation this would be really bad and
[6800.8s] is teaching the model to try to do all
[6803.1s] the computation in a single token and
[6804.8s] it's really
[6805.8s] bad so uh that's kind of like an
[6808.7s] interesting thing to keep in mind is in
[6810.5s] your
[6811.6s] prompts uh usually don't have to think
[6813.6s] about it explicitly because uh the
[6816.1s] people at open AI have labelers and so
[6818.7s] on that actually worry about this and
[6820.4s] they make sure that the answers are
[6821.7s] spread out and so actually open AI will
[6823.9s] kind of like do the right thing so when
[6825.8s] I ask this question for chat GPT it's
[6828.1s] actually going to go very slowly it's
[6829.5s] going to be like okay let's define our
[6830.8s] variables set up the equation
[6833.0s] and it's kind of creating all these
[6834.0s] intermediate results these are not for
[6836.1s] you these are for the model if the model
[6838.7s] is not creating these intermediate
[6840.0s] results for itself it's not going to be
[6841.6s] able to reach three I also wanted to
[6844.4s] show you that it's possible to be a bit
[6846.1s] mean to the model uh we can just ask for
[6848.2s] things so as an example I said I gave it
[6850.6s] the exact same uh prompt and I said
[6853.3s] answer the question in a single token
[6855.0s] just immediately give me the answer
[6856.4s] nothing else and it turns out that for
[6858.8s] this simple um prompt here it actually
[6861.6s] was able to do it in single go so it
[6863.6s] just created a single I think this is
[6865.2s] two tokens right uh because the dollar
[6867.6s] sign is its own token so basically this
[6870.4s] model didn't give me a single token it
[6871.9s] gave me two tokens but it still produced
[6873.9s] the correct answer and it did that in a
[6875.8s] single forward pass of the
[6877.7s] network now that's because the numbers
[6880.0s] here I think are very simple and so I
[6881.8s] made it a bit more difficult to be a bit
[6883.1s] mean to the model so I said Emily buys
[6885.5s] 23 apples and 177 oranges and then I
[6888.4s] just made the numbers a bit bigger and
[6890.4s] I'm just making it harder for the model
[6891.7s] I'm asking it to more computation in a
[6893.4s] single token and so I said the same
[6895.8s] thing and here it gave me five and five
[6898.4s] is actually not correct so the model
[6900.0s] failed to do all of this calculation in
[6902.4s] a single forward pass of the network it
[6904.4s] failed to go from the input tokens and
[6907.5s] then in a single forward pass of the
[6909.1s] network single go through the network it
[6911.2s] couldn't produce the result and then I
[6913.7s] said okay now don't worry about the the
[6916.1s] token limit and just solve the problem
[6918.2s] as usual and then it goes all the
[6920.2s] intermediate results it simplifies and
[6922.5s] every one of these intermediate results
[6924.2s] here and intermediate calculations is
[6926.4s] much easier for the model and um it sort
[6929.6s] of it's not too much work per token all
[6932.5s] of the tokens here are correct and it
[6933.9s] arises the solution which is seven and I
[6936.2s] just couldn't squeeze all of this work
[6938.4s] it couldn't squeeze that into a single
[6939.9s] forward passive Network so I think
[6941.6s] that's kind of just a cute example and
[6943.4s] something to kind of like think about
[6945.0s] and I think it's kind of again just
[6946.6s] elucidative in terms of how these uh
[6948.6s] models work the last thing that I would
[6950.2s] say on this topic is that if I was in
[6952.0s] practi is trying to actually solve this
[6953.5s] in my day-to-day life I might actually
[6955.4s] not uh trust that the model that all the
[6957.9s] intermediate calculations correctly here
[6959.9s] so actually probably what I do is
[6961.1s] something like this I would come here
[6962.4s] and I would say use code and uh that's
[6966.4s] because code is one of the possible
[6968.7s] tools that chachy PD can use and instead
[6972.0s] of it having to do mental arithmetic
[6974.3s] like this mental arithmetic here I don't
[6975.9s] fully trust it and especially if the
[6977.3s] numbers get really big there's no
[6979.1s] guarantee that the model will do this
[6980.4s] correctly any one of these intermediates
[6982.4s] steps might in principle fail we're
[6984.6s] using neural networks to do mental
[6986.1s] arithmetic uh kind of like you doing
[6987.8s] mental arithmetic in your brain it might
[6990.0s] just like uh screw up some of the
[6991.4s] intermediate results it's actually kind
[6992.9s] of amazing that it can even do this kind
[6994.3s] of mental arithmetic I don't think I
[6995.5s] could do this in my head but basically
[6997.0s] the model is kind of like doing it in
[6998.2s] its head and I don't trust that so I
[7000.1s] wanted to use tools so you can say stuff
[7002.1s] like use
[7003.5s] code and uh I'm not sure what happened
[7007.1s] there use
[7010.4s] code and so um like I mentioned there's
[7013.2s] a special tool and the uh the model can
[7015.8s] write code and I can inspect that this
[7018.7s] code is correct and then uh it's not
[7021.6s] relying on its mental arithmetic it is
[7023.6s] using the python interpreter which is a
[7025.4s] very simple programming language to
[7027.1s] basically uh write out the code that
[7028.8s] calculates the result and I would
[7030.7s] personally trust this a lot more because
[7032.1s] this came out of a Python program which
[7034.1s] I think has a lot more correctness
[7035.5s] guarantees than the mental arithmetic of
[7037.7s] a language model uh so just um another
[7041.0s] kind of uh potential hint that if you
[7043.1s] have these kinds of problems uh you may
[7044.9s] want to basically just uh ask the model
[7046.7s] to use the code interpreter and just
[7048.9s] like we saw with the web search the
[7050.8s] model has special uh kind of tokens for
[7054.2s] calling uh like it will not actually
[7056.8s] generate these tokens from the language
[7058.2s] model it will write the program and then
[7060.7s] it actually sends that program to a
[7062.6s] different sort of part of the computer
[7064.4s] that actually just runs that program and
[7066.3s] brings back the result and then the
[7068.1s] model gets access to that result and can
[7070.1s] tell you that okay the cost of each
[7071.2s] apple is seven
[7073.1s] um so that's another kind of tool and I
[7075.5s] would use this in practice for yourself
[7077.8s] and it's um yeah it's just uh less error
[7081.6s] prone I would say so that's why I called
[7083.7s] this section models need tokens to think
[7086.6s] distribute your competition across many
[7088.5s] tokens ask models to create intermediate
[7090.9s] results or whenever you can lean on
[7093.6s] tools and Tool use instead of allowing
[7095.9s] the models to do all of the stuff in
[7097.3s] their memory so if they try to do it all
[7099.0s] in their memory I don't fully trust it
[7101.2s] and prefer to use tools whenever
[7102.5s] possible I want to show you one more
[7104.6s] example of where this actually comes up
[7106.7s] and that's in counting so models
[7108.9s] actually are not very good at counting
[7110.4s] for the exact same reason you're asking
[7112.3s] for way too much in a single individual
[7114.3s] token so let me show you a simple
[7116.5s] example of that um how many dots are
[7119.0s] below and then I just put in a bunch of
[7121.0s] dots and Chach says there are and then
[7124.7s] it just tries to solve the problem in a
[7126.4s] single token so in a single token it has
[7129.3s] to count the number of dots in its
[7131.5s] context window
[7133.4s] um and it has to do that in the single
[7135.2s] forward pass of a network and a single
[7137.4s] forward pass of a network as we talked
[7138.8s] about there's not that much computation
[7140.3s] that can happen there just think of that
[7141.7s] as being like very little competation
[7143.6s] that happens there so if I just look at
[7146.0s] what the model sees let's go to the LM
[7149.3s] go to tokenizer it sees uh
[7153.5s] this how many dots are below and then it
[7155.7s] turns out that these dots here this
[7157.7s] group of I think 20 dots is a single
[7160.1s] token and then this group of whatever it
[7162.9s] is is another token and then for some
[7165.0s] reason they break up as this so I don't
[7168.2s] actually this has to do with the details
[7169.5s] of the tokenizer but it turns out that
[7171.4s] these um the model basically sees the
[7174.5s] token ID this this this and so on and
[7178.2s] then from these token IDs it's expected
[7180.5s] to count the number and spoiler alert is
[7183.6s] not 161 it's actually I believe
[7185.7s] 177 so here's what we can do instead uh
[7188.5s] we can say use code and you might expect
[7191.8s] that like why should this work and it's
[7194.1s] actually kind of subtle and kind of
[7195.2s] interesting so when I say use code I
[7197.4s] actually expect this to work let's see
[7199.1s] okay 177 is correct so what happens here
[7202.6s] is I've actually it doesn't look like it
[7204.8s] but I've broken down the problem into a
[7208.0s] problems that are easier for the model I
[7210.4s] know that the model can't count it can't
[7212.0s] do mental counting but I know that the
[7214.4s] model is actually pretty good at doing
[7215.8s] copy pasting so what I'm doing here is
[7218.2s] when I say use code it creates a string
[7220.2s] in Python for this and the task of
[7224.0s] basically copy pasting my input here to
[7227.2s] here is very simple because for the
[7229.5s] model um it sees this string of uh it
[7233.8s] sees it as just these four tokens or
[7235.7s] whatever it is so it's very simple for
[7237.4s] the model to copy paste those token IDs
[7241.0s] and um kind of unpack them into Dots
[7245.1s] here and so it creates this string and
[7247.8s] then it calls python routine. count and
[7250.4s] then it comes up with the correct answer
[7252.1s] so the python interpreter is doing the
[7253.8s] counting it's not the models mental
[7255.5s] arithmetic doing the counting so it's
[7257.4s] again a simple example of um models need
[7260.8s] tokens to think don't rely on their
[7262.5s] mental arithmetic and um that's why also
[7265.7s] the models are not very good at counting
[7267.2s] if you need them to do counting tasks
[7268.8s] always ask them to lean on the tool now
[7272.0s] the models also have many other little
[7273.8s] cognitive deficits here and there and
[7275.3s] these are kind of like sharp edges of
[7276.5s] the technology to be kind of aware of
[7278.2s] over time so as an example the models
[7280.8s] are not very good with all kinds of
[7282.3s] spelling related tasks they're not very
[7284.2s] good at it and I told you that we would
[7286.8s] loop back around to tokenization and the
[7289.2s] reason to do for this is that the models
[7291.4s] they don't see the characters they see
[7293.3s] tokens and they their entire world is
[7295.7s] about tokens which are these little text
[7297.2s] chunks and so they don't see characters
[7299.4s] like our eyes do and so very simple
[7301.8s] character level tasks often fail so for
[7305.2s] example uh I'm giving it a string
[7307.5s] ubiquitous and I'm asking it to print
[7309.7s] only every third character starting with
[7311.7s] the first one so we start with U and
[7314.2s] then we should go every third so every
[7316.9s] so 1 2 3 Q should be next and then Etc
[7321.2s] so this I see is not correct and again
[7323.8s] my hypothesis is that this is again
[7325.9s] Dental arithmetic here is failing number
[7328.4s] one a little bit but number two I think
[7330.4s] the the more important issue here is
[7332.2s] that if you go to Tik
[7333.5s] tokenizer and you look at ubiquitous we
[7336.3s] see that it is three tokens right so you
[7339.1s] and I see ubiquitous and we can easily
[7341.3s] access the individual letters because we
[7343.9s] kind of see them and when we have it in
[7345.6s] the working memory of our visual sort of
[7347.6s] field we can really easily index into
[7349.7s] every third letter and I can do that
[7351.3s] task but the models don't have access to
[7353.4s] the individual letters they see this as
[7355.3s] these three tokens and uh remember these
[7358.2s] models are trained from scratch on the
[7359.9s] internet and all these token uh
[7362.6s] basically the model has to discover how
[7364.0s] many of all these different letters are
[7365.6s] packed into all these different tokens
[7367.9s] and the reason we even use tokens is
[7369.4s] mostly for efficiency uh but I think a
[7371.2s] lot of people areed interested to delete
[7372.8s] tokens entirely like we should really
[7374.6s] have character level or bite level
[7376.4s] models it's just that that would create
[7378.2s] very long sequences and people don't
[7380.0s] know how to deal with that right now so
[7381.9s] while we have the token World any kind
[7383.6s] of spelling tasks are not actually
[7385.0s] expected to work super well so because I
[7387.7s] know that spelling is not a strong suit
[7389.4s] because of tokenization I can again Ask
[7391.7s] it to lean On Tools so I can just say
[7393.6s] use code and I would again expect this
[7396.0s] to work because the task of copy pasting
[7398.4s] ubiquitous into the python interpreter
[7400.4s] is much easier and then we're leaning on
[7402.4s] python interpreter to manipulate the
[7405.4s] characters of this string so when I say
[7407.6s] use
[7409.0s] code
[7410.7s] ubiquitous yes it indexes into every
[7413.0s] third character and the actual truth is
[7415.3s] u2s
[7416.8s] uqs uh which looks correct to me so um
[7421.0s] again an example of spelling related
[7422.8s] tasks not working very well a very
[7424.8s] famous example of that recently is how
[7427.2s] many R are there in strawberry and this
[7429.2s] went viral many times and basically the
[7431.8s] models now get it correct they say there
[7433.3s] are three Rs in Strawberry but for a
[7435.3s] very long time all the state-of-the-art
[7436.7s] models would insist that there are only
[7438.3s] two RS in strawberry and this caused a
[7440.9s] lot of you know Ruckus because is that a
[7443.7s] word I think so because um it just kind
[7446.8s] of like why are the models so brilliant
[7448.6s] and they can solve math Olympiad
[7450.2s] questions but they can't like count RS
[7452.8s] in strawberry and the answer for that
[7454.8s] again is I've got built up to it kind of
[7456.6s] slowly but number one the models don't
[7458.8s] see characters they see tokens and
[7461.0s] number two they are not very good at
[7462.8s] counting and so here we are combining
[7465.3s] the difficulty of seeing the characters
[7467.5s] with the difficulty of counting and
[7469.4s] that's why the models struggled with
[7470.6s] this even though I think by now honestly
[7473.1s] I think open I may have hardcoded the
[7474.6s] answer here or I'm not sure what they
[7476.0s] did but um uh but this specific query
[7479.5s] now works
[7481.8s] so models are not very good at spelling
[7484.4s] and there there's a bunch of other
[7485.7s] little sharp edges and I don't want to
[7486.9s] go into all of them I just want to show
[7488.4s] you a few examples of things to be aware
[7490.0s] of and uh when you're using these models
[7492.3s] in practice I don't actually want to
[7494.2s] have a comprehensive analysis here of
[7495.8s] all the ways that the models are kind of
[7497.9s] like falling short I just want to make
[7499.6s] the point that there are some Jagged
[7501.1s] edges here and there and we've discussed
[7503.6s] a few of them and a few of them make
[7505.1s] sense but some of them also will just
[7506.4s] not make as much sense and they're kind
[7508.4s] of like you're left scratching your head
[7510.0s] even if you understand in- depth how
[7511.8s] these models work and and good example
[7514.1s] of that recently is the following uh the
[7516.4s] models are not very good at very simple
[7517.9s] questions like this and uh this is
[7520.3s] shocking to a lot of people because
[7522.0s] these math uh these problems can solve
[7523.8s] complex math problems they can answer
[7525.8s] PhD grade physics chemistry biology
[7528.5s] questions much better than I can but
[7530.6s] sometimes they fall short in like super
[7531.8s] simple problems like this so here we go
[7534.8s] 9.11 is bigger than 9.9 and it justifies
[7538.4s] it in some way but obviously and then at
[7541.0s] the end okay it actually it flips its
[7544.0s] decision later so um I don't believe
[7547.1s] that this is very reproducible sometimes
[7549.0s] it flips around its answer sometimes
[7550.6s] gets it right sometimes get it get it
[7552.1s] wrong uh let's try
[7556.7s] again okay even though it might look
[7559.7s] larger okay so here it doesn't even
[7561.6s] correct itself in the end if you ask
[7563.4s] many times sometimes it gets it right
[7564.7s] too but how is it that the model can do
[7567.3s] so great at Olympiad grade problems but
[7570.4s] then fail on very simple problems like
[7572.3s] this and uh I think this one is as I
[7575.8s] mentioned a little bit of a head
[7576.6s] scratcher it turns out that a bunch of
[7578.1s] people studied this in depth and I
[7579.8s] haven't actually read the paper uh but
[7582.0s] what I was told by this team was that
[7584.8s] when you scrutinize the activations
[7587.9s] inside the neural network when you look
[7589.3s] at some of the features and what what
[7591.2s] features turn on or off and what neurons
[7593.1s] turn on or off uh a bunch of neurons
[7595.8s] inside the neural network light up that
[7597.8s] are usually associated with Bible verses
[7600.7s] U and so I think the model is kind of
[7602.7s] like reminded that these almost look
[7604.8s] like Bible verse markers and in a bip
[7608.0s] verse setting 9.11 would come after 99.9
[7612.0s] and so basically the model somehow finds
[7613.5s] it like cognitively very distracting
[7616.1s] that in Bible verses 9.11 would be
[7618.3s] greater um even though here it's
[7620.8s] actually trying to justify it and come
[7622.1s] up to the answer with a math it still
[7624.6s] ends up with the wrong answer here so it
[7627.2s] basically just doesn't fully make sense
[7628.9s] and it's not fully understood and um
[7632.8s] there's a few Jagged issues like that so
[7634.8s] that's why treat this as a as what it is
[7637.4s] which is a St stochastic system that is
[7639.3s] really magical but that you can't also
[7641.2s] fully trust and you want to use it as a
[7643.2s] tool not as something that you kind of
[7645.1s] like letter rip on a problem and
[7647.2s] copypaste the results okay so we have
[7649.2s] now covered two major stages of training
[7652.2s] of large language models we saw that in
[7654.6s] the first stage this is called the
[7656.2s] pre-training stage we are basically
[7658.0s] training on internet documents and when
[7660.6s] you train a language model on internet
[7662.1s] documents you get what's called a base
[7664.0s] model and it's basically an internet
[7665.8s] document simulator right now we saw that
[7668.8s] this is an interesting artifact and uh
[7671.2s] this takes many months to train on
[7673.4s] thousands of computers and it's kind of
[7675.0s] a lossy compression of the internet and
[7677.1s] it's extremely interesting but it's not
[7678.6s] directly useful because we don't want to
[7680.5s] sample internet documents we want to ask
[7682.7s] questions of an AI and have it respond
[7685.0s] to our questions so for that we need an
[7687.3s] assistant and we saw that we can
[7689.1s] actually construct an assistant in the
[7691.3s] process of a post
[7694.0s] training and specifically in the process
[7696.6s] of supervised fine-tuning as we call
[7699.4s] it so in this stage we saw that it's
[7702.7s] algorithmically identical to
[7704.1s] pre-training nothing is going to change
[7706.0s] the only thing that changes is the data
[7707.5s] set so instead of Internet documents we
[7710.0s] now want to create and curate a very
[7712.8s] nice data set of conversations so we
[7715.4s] want Millions conversations on all kinds
[7718.3s] of diverse topics between a human and an
[7721.6s] assistant and fundamentally these
[7724.3s] conversations are created by humans so
[7727.1s] humans write the prompts and humans
[7729.9s] write the ideal response responses and
[7732.2s] they do that based on labeling
[7734.6s] documentations now in the modern stack
[7737.3s] it's not actually done fully and
[7739.0s] manually by humans right they actually
[7740.8s] now have a lot of help from these tools
[7742.8s] so we can use language models um to help
[7745.6s] us create these data sets and that's
[7747.1s] done extensively but fundamentally it's
[7749.2s] all still coming from Human curation at
[7751.0s] the end so we create these conversations
[7753.7s] that now becomes our data set we fine
[7755.6s] tune on it or continue training on it
[7757.9s] and we get an assistant and then we kind
[7760.2s] of shifted gears and started talking
[7761.5s] about some of the kind of cognitive
[7762.9s] implications of what this assistant is
[7764.9s] like and we saw that for example the
[7766.8s] assistant will hallucinate if you don't
[7769.6s] take some sort of mitigations towards it
[7772.5s] so we saw that hallucinations would be
[7774.4s] common and then we looked at some of the
[7775.8s] mitigations of those hallucinations and
[7778.3s] then we saw that the models are quite
[7779.5s] impressive and can do a lot of stuff in
[7780.8s] their head but we saw that they can also
[7783.0s] Lean On Tools to become better so for
[7785.5s] example we can lo lean on a web search
[7788.4s] in order to hallucinate less and to
[7790.9s] maybe bring up some more um recent
[7793.2s] information or something like that or we
[7795.0s] can lean on tools like code interpreter
[7797.0s] so the code can so the llm can write
[7799.1s] some code and actually run it and see
[7800.7s] the
[7801.8s] results so these are some of the topics
[7803.9s] we looked at so far um now what I'd like
[7806.4s] to do is I'd like to cover the last and
[7809.3s] major stage of this Pipeline and that is
[7812.7s] reinforcement learning so reinforcement
[7815.6s] learning is still kind of thought to be
[7817.0s] under the umbrella of posttraining uh
[7819.5s] but it is the last third major stage and
[7822.4s] it's a different way of training
[7824.4s] language models and usually follows as
[7826.7s] this third step so inside companies like
[7829.4s] open AI you will start here and these
[7831.3s] are all separate teams so there's a team
[7833.4s] doing data for pre-training and a team
[7835.6s] doing training for pre-training and then
[7837.8s] there's a team doing all the
[7839.6s] conversation generation in a in a
[7842.2s] different team that is kind of doing the
[7844.1s] supervis fine tuning and there will be a
[7845.8s] team for the reinforcement learning as
[7847.2s] well so it's kind of like a handoff of
[7849.3s] these models you get your base model the
[7851.3s] then you find you need to be an
[7852.2s] assistant and then you go into
[7853.7s] reinforcement learning which we'll talk
[7855.1s] about uh
[7856.7s] now so that's kind of like the major
[7858.9s] flow and so let's now focus on
[7861.1s] reinforcement learning the last major
[7863.2s] stage of training and let me first
[7865.6s] actually motivate it and why we would
[7867.4s] want to do reinforcement learning and
[7869.1s] what it looks like on a high level so I
[7871.2s] would now like to try to motivate the
[7872.6s] reinforcement learning stage and what it
[7873.9s] corresponds to with something that
[7875.3s] you're probably familiar with and that
[7877.0s] is basically going to school so just
[7879.1s] like you went to school to become um
[7881.4s] really good at something we want to take
[7883.1s] large language models through school and
[7885.8s] really what we're doing is um we're um
[7889.5s] we have a few paradigms of ways of uh
[7892.2s] giving them knowledge or transferring
[7894.0s] skills so in particular when we're
[7896.2s] working with textbooks in school you'll
[7898.3s] see that there are three major kind of
[7900.6s] uh pieces of information in these
[7902.9s] textbooks three classes of information
[7905.6s] the first thing you'll see is you'll see
[7906.8s] a lot of exposition um and by the way
[7909.2s] this is a totally random book I pulled
[7910.5s] from the internet I I think it's some
[7912.0s] kind of an organic chemistry or
[7913.2s] something I'm not sure uh but the
[7915.2s] important thing is that you'll see that
[7916.6s] most of the text most of it is kind of
[7918.5s] just like the meat of it is exposition
[7920.9s] it's kind of like background knowledge
[7922.5s] Etc as you are reading through the words
[7925.7s] of this Exposition you can think of that
[7928.1s] roughly as training on that data so um
[7932.4s] and that's why when you're reading
[7933.5s] through this stuff this background
[7934.8s] knowledge and this all this context
[7936.0s] information it's kind of equivalent to
[7938.7s] pre-training so it's it's where we build
[7941.4s] sort of like a knowledge base of this
[7943.7s] data and get a sense of the topic the
[7947.0s] next major kind of information that you
[7948.9s] will see is these uh problems and with
[7952.9s] their worked Solutions so basically a
[7955.5s] human expert in this case uh the author
[7957.4s] of this book has given us not just a
[7959.3s] problem but has also worked through the
[7961.5s] solution and the solution is basically
[7963.6s] like equivalent to having like this
[7965.8s] ideal response for an assistant so it's
[7968.1s] basically the expert is showing us how
[7969.9s] to solve the problem in it's uh kind of
[7972.3s] like um in its full form so as we are
[7975.2s] reading the solution we are basically
[7978.0s] training on the expert data and then
[7981.2s] later we can try to imitate the expert
[7983.8s] um and basically um that's that roughly
[7987.2s] correspond to having the sft model
[7988.9s] that's what it would be doing so
[7991.2s] basically we've already done
[7992.2s] pre-training and we've already covered
[7994.6s] this um imitation of experts and how
[7997.4s] they solve these problems and the third
[7999.9s] stage of reinforcement learning is
[8001.7s] basically the practice problems so
[8004.3s] sometimes you'll see this is just a
[8005.7s] single practice problem here but of
[8007.4s] course there will be usually many
[8008.7s] practice problems at the end of each
[8010.1s] chapter in any textbook and practice
[8012.7s] problems of course we know are critical
[8014.1s] for learning because what are they
[8016.1s] getting you to do they're getting you to
[8017.7s] practice uh to practice yourself and
[8020.0s] discover ways of solving these problems
[8022.2s] yourself and so what you get in a
[8024.2s] practice problem is you get a problem
[8026.3s] description but you're not given the
[8028.5s] solution but you are given the final
[8030.8s] answer answer usually in the answer key
[8033.1s] of the textbook and so you know the
[8035.3s] final answer that you're trying to get
[8036.6s] to and you have the problem statement
[8038.6s] but you don't have the solution you are
[8040.4s] trying to practice the solution you're
[8042.6s] trying out many different things and
[8044.3s] you're seeing what gets you to the final
[8047.3s] solution the best and so you're
[8049.6s] discovering how to solve these problems
[8051.6s] so and in the process of that you're
[8053.0s] relying on number one the background
[8054.6s] information which comes from
[8055.8s] pre-training and number two maybe a
[8057.7s] little bit of imitation of human experts
[8060.4s] and you can probably try similar kinds
[8062.2s] of solutions and so on so we've done
[8065.1s] this and this and now in this section
[8067.2s] we're going to try to practice and so
[8070.2s] we're going to be given prompts we're
[8072.1s] going to be given Solutions U sorry the
[8074.5s] final answers but we're not going to be
[8076.3s] given expert Solutions we have to
[8078.7s] practice and try stuff out and that's
[8080.8s] what reinforcement learning is about
[8083.2s] okay so let's go back to the problem
[8084.4s] that we worked with previously just so
[8086.3s] we have a concrete example to talk
[8087.7s] through as we explore sort of the topic
[8090.0s] here so um I'm here in the Teck
[8092.7s] tokenizer because I'd also like to well
[8095.0s] I get a text box which is useful but
[8097.2s] number two I want to remind you again
[8099.0s] that we're always working with
[8099.9s] onedimensional token sequences and so um
[8102.8s] I actually like prefer this view because
[8104.5s] this is like the native view of the llm
[8106.3s] if that makes sense like this is what it
[8108.2s] actually sees it sees token IDs right
[8111.1s] okay so Emily buys three apples and two
[8114.1s] oranges each orange is $2 the total cost
[8117.4s] of all the fruit is $13 what is the cost
[8120.0s] of each apple
[8121.6s] and what I'd like to what I like you to
[8123.4s] appreciate here is these are like four
[8126.3s] possible candidate Solutions as an
[8129.4s] example and they all reach the answer
[8131.7s] three now what I'd like you to
[8133.4s] appreciate at this point is that if I am
[8135.5s] the human data labeler that is creating
[8137.7s] a conversation to be entered into the
[8139.3s] training set I don't actually really
[8142.0s] know which of these
[8144.1s] conversations to um to add to the data
[8148.4s] set some of these conversations kind of
[8150.3s] set up a system equations some of them
[8152.5s] sort of like just talk through it in
[8154.0s] English and some of them just kind of
[8155.7s] like skip right through to the
[8158.2s] solution um if you look at chbt for
[8160.7s] example and you give it this question it
[8163.7s] defines a system of variables and it
[8165.2s] kind of like does this little thing what
[8167.2s] we have to appreciate and uh
[8169.0s] differentiate between though is um the
[8172.4s] first purpose of a solution is to reach
[8174.3s] the right answer of course we want to
[8175.7s] get the final answer three that is the
[8178.0s] that is the important purpose here but
[8179.9s] there's kind of like a secondary purpose
[8181.5s] as well where here we are also just kind
[8183.6s] of trying to make it like nice uh for
[8186.1s] the human because we're kind of assuming
[8187.9s] that the person wants to see the
[8189.0s] solution they want to see the
[8190.1s] intermediate steps we want to present it
[8191.9s] nicely Etc so there are two separate
[8194.0s] things going on here number one is the
[8196.0s] presentation for the human but number
[8197.7s] two we're trying to actually get the
[8198.9s] right answer um so let's for the moment
[8202.2s] focus on just reaching the final answer
[8205.0s] if we're only care if we only care about
[8206.8s] the final answer then which of these is
[8209.6s] the optimal or the best prompt um sorry
[8213.6s] the best solution for the llm to reach
[8216.2s] the right
[8217.6s] answer um and what I'm trying to get at
[8220.5s] is we don't know me as a human labeler I
[8223.0s] would not know which one of these is
[8224.4s] best so as an example we saw earlier on
[8227.4s] when we looked at
[8229.4s] um the token sequences here and the
[8231.7s] mental arithmetic and reasoning we saw
[8234.1s] that for each token we can only spend
[8235.8s] basically a finite number of finite
[8238.1s] amount of compute here that is not very
[8239.6s] large or you should think about it that
[8240.8s] way way and so we can't actually make
[8243.5s] too big of a leap in any one token is is
[8246.6s] maybe the way to think about it so as an
[8248.7s] example in this one what's really nice
[8250.9s] about it is that it's very few tokens so
[8252.6s] it's going to take us very short amount
[8254.0s] of time to get to the answer but right
[8257.0s] here when we're doing 30 - 4 IDE 3
[8259.5s] equals right in this token here we're
[8262.6s] actually asking for a lot of computation
[8264.1s] to happen on that single individual
[8265.7s] token and so maybe this is a bad example
[8268.0s] to give to the llm because it's kind of
[8269.4s] incentivizing it to skip through the
[8270.8s] calculations very quickly and it's going
[8272.4s] to actually make up mistakes make
[8274.3s] mistakes in this mental arithmetic uh so
[8276.9s] maybe it would work better to like
[8278.3s] spread out the spread it out more maybe
[8281.0s] it would be better to set it up as an
[8282.4s] equation maybe it would be better to
[8284.2s] talk through it we fundamentally don't
[8286.3s] know and we don't know because what is
[8289.6s] easy for you or I as or as human
[8292.2s] labelers what's easy for us or hard for
[8294.4s] us is different than what's easy or hard
[8296.7s] for the llm it cognition is different um
[8300.2s] and the token sequences are kind of like
[8303.2s] different hard for it and so some of the
[8307.5s] token sequences here that are trivial
[8310.4s] for me might be um very too much of a
[8313.7s] leap for the llm so right here this
[8316.8s] token would be way too hard but
[8318.5s] conversely many of the tokens that I'm
[8320.5s] creating here might be just trivial to
[8323.3s] the llm and we're just wasting tokens
[8325.2s] like why waste all these tokens when
[8326.8s] this is all trivial so if the only thing
[8329.4s] we care care about is the final answer
[8331.9s] and we're separating out the issue of
[8333.3s] the presentation to the human um then we
[8336.2s] don't actually really know how to
[8337.2s] annotate this example we don't know what
[8339.2s] solution to get to the llm because we
[8341.1s] are not the
[8342.2s] llm and it's clear here in the case of
[8345.2s] like the math example but this is
[8347.2s] actually like a very pervasive issue
[8348.8s] like for our knowledge is not lm's
[8352.0s] knowledge like the llm actually has a
[8353.7s] ton of knowledge of PhD in math and
[8355.4s] physics chemistry and whatnot so in many
[8357.4s] ways it actually knows more than I do
[8359.4s] and I'm I'm potentially not utilizing
[8361.6s] that knowledge in its problem solving
[8364.0s] but conversely I might be injecting a
[8366.1s] bunch of knowledge in my solutions that
[8368.1s] the LM doesn't know in its parameters
[8371.3s] and then those are like sudden leaps
[8373.3s] that are very confusing to the model and
[8376.1s] so our cognitions are different and I
[8378.8s] don't really know what to put here if
[8381.4s] all we care about is the reaching the
[8382.8s] final solution and doing it economically
[8385.8s] ideally and so long story short we are
[8389.8s] not in a good position to create these
[8392.7s] uh token sequences for the LM and
[8395.4s] they're useful by imitation to
[8396.8s] initialize the system but we really want
[8399.7s] the llm to discover the token sequences
[8401.9s] that work for it we need to find it
[8404.6s] needs to find for itself what token
[8406.9s] sequence reliably gets to the answer
[8409.7s] given the prompt and it needs to
[8411.6s] discover that in the process of
[8412.6s] reinforcement learning and of trial and
[8414.4s] error so let's see how this example
[8418.0s] would work like in reinforcement
[8420.0s] learning
[8421.2s] okay so we're now back in the huging
[8423.1s] face inference playground and uh that
[8426.1s] just allows me to very easily call uh
[8428.2s] different kinds of models so as an
[8429.7s] example here on the top right I chose
[8431.1s] the Gemma 2 2 billion parameter model so
[8434.5s] two billion is very very small so this
[8436.5s] is a tiny model but it's okay so we're
[8439.1s] going to give it um the way that
[8440.7s] reinforcement learning will basically
[8441.9s] work is actually quite quite simple um
[8444.7s] we need to try many different kinds of
[8447.4s] solutions and we want to see which
[8449.1s] Solutions work well or not
[8451.3s] so we're basically going to take the
[8453.2s] prompt we're going to run the
[8455.4s] model and the model generates a solution
[8458.8s] and then we're going to inspect the
[8460.0s] solution and we know that the correct
[8462.2s] answer for this one is $3 and so indeed
[8465.2s] the model gets it correct it says it's
[8467.0s] $3 so this is correct so that's just one
[8470.1s] attempt at DIS solution so now we're
[8472.0s] going to delete this and we're going to
[8473.1s] rerun it again let's try a second
[8475.4s] attempt so the model solves it in a bit
[8477.8s] slightly different way right every
[8479.7s] single attempt will be a different
[8481.7s] generation because these models are
[8483.0s] stochastic systems remember that at
[8484.8s] every single token here we have a
[8486.2s] probability distribution and we're
[8487.8s] sampling from that distribution so we
[8489.7s] end up kind kind of going down slightly
[8491.7s] different paths and so this is a second
[8494.1s] solution that also ends in the correct
[8496.1s] answer now we're going to delete that
[8498.5s] let's go a third
[8499.8s] time okay so again slightly different
[8502.1s] solution but also gets it
[8504.4s] correct now we can actually repeat this
[8506.8s] uh many times and so in practice you
[8509.3s] might actually sample thousand of
[8511.2s] independent Solutions or even like
[8512.8s] million solutions for just a single
[8515.0s] prompt um and some of them will be
[8517.6s] correct and some of them will not be
[8519.0s] very correct and basically what we want
[8520.6s] to do is we want to encourage the
[8522.4s] solutions that lead to correct answers
[8525.4s] so let's take a look at what that looks
[8526.6s] like so if we come back over here here's
[8529.4s] kind of like a cartoon diagram of what
[8530.8s] this is looking like we have a prompt
[8534.0s] and then we tried many different
[8535.2s] solutions in
[8536.7s] parallel and some of the solutions um
[8540.0s] might go well so they get the right
[8541.6s] answer which is in green and some of the
[8544.1s] solutions might go poorly and may not
[8545.8s] reach the right answer which is red now
[8548.6s] this problem here unfortunately is not
[8550.0s] the best example because it's a trivial
[8552.1s] prompt and as we saw uh even like a two
[8554.6s] billion parameter model always gets it
[8556.2s] right so it's not the best example in
[8558.0s] that sense but let's just exercise some
[8560.2s] imagination here and let's just suppose
[8563.7s] that the um green ones are good and the
[8567.2s] red ones are
[8568.6s] bad okay so we generated 15 Solutions
[8572.2s] only four of them got the right answer
[8574.6s] and so now what we want to do is
[8576.7s] basically we want to encourage the kinds
[8578.4s] of solutions that lead to right answers
[8580.9s] so whatever token sequences happened in
[8583.4s] these red Solutions obviously something
[8585.3s] went wrong along the way somewhere and
[8587.9s] uh this was not a good path to take
[8589.6s] through the solution and whatever token
[8591.7s] sequences there were in these Green
[8593.0s] Solutions well things went uh pretty
[8595.2s] well in this situation and so we want to
[8598.1s] do more things like it in prompts like
[8601.2s] this and the way we encourage this kind
[8603.6s] of a behavior in the future is we
[8605.2s] basically train on these sequences um
[8608.0s] but these training sequencies now are
[8609.5s] not coming from expert human annotators
[8612.5s] there's no human who decided that this
[8614.0s] is the correct solution this solution
[8616.0s] came from the model itself so the model
[8618.3s] is practicing here it's tried out a few
[8620.1s] Solutions four of them seem to have
[8621.7s] worked and now the model will kind of
[8623.8s] like train on them and this corresponds
[8625.8s] to a student basically looking at their
[8627.2s] Solutions and being like okay well this
[8628.8s] one worked really well so this is this
[8630.5s] is how I should be solving these kinds
[8632.1s] of problems and uh here in this example
[8636.0s] there are many different ways to
[8637.4s] actually like really tweak the
[8638.6s] methodology a little bit here but just
[8640.6s] to give the core idea across maybe it's
[8642.3s] simplest to just think about take the
[8644.6s] taking the single best solution out of
[8646.3s] these four uh like say this one that's
[8648.7s] why it was yellow uh so this is the the
[8652.1s] solution that not only led to the right
[8653.6s] answer but may maybe had some other nice
[8655.6s] properties maybe it was the shortest one
[8657.9s] or it looked nicest in some ways or uh
[8660.5s] there's other criteria you could think
[8661.8s] of as an example but we're going to
[8663.7s] decide that this the top solution we're
[8665.4s] going to train on it and then uh the
[8668.0s] model will be slightly more likely once
[8670.8s] you do the parameter update to take this
[8673.5s] path in this kind of a setting in the
[8676.1s] future but you have to remember that
[8678.2s] we're going to run many different
[8680.0s] diverse prompts across lots of math
[8682.2s] problems and physics problems and
[8683.5s] whatever wherever there might be so tens
[8686.4s] of thousands of prompts maybe have in
[8688.0s] mind there's thousands of solutions
[8690.6s] prompt and so this is all happening kind
[8692.6s] of like at the same time and as we're
[8695.5s] iterating this process the model is
[8697.6s] discovering for itself what kinds of
[8699.8s] token sequences lead it to correct
[8702.8s] answers it's not coming from a human
[8705.0s] annotator the the model is kind of like
[8708.0s] playing in this playground and it knows
[8710.1s] what it's trying to get to and it's
[8712.6s] discovering sequences that work for it
[8715.2s] uh these are sequences that don't make
[8716.9s] any mental leaps uh they they seem to
[8720.0s] work reliably and statistically and uh
[8723.2s] fully utilize the knowledge of the model
[8725.0s] as it has it and so uh this is the
[8728.2s] process of reinforcement
[8729.5s] learning it's basically a guess and
[8731.5s] check we're going to guess many
[8732.7s] different types of solutions we're going
[8733.9s] to check them and we're going to do more
[8735.7s] of what worked in the future and that is
[8738.6s] uh reinforcement learning so in the
[8740.9s] context of what came before we see now
[8743.2s] that the sft model the supervised fine
[8745.2s] tuning model it's still helpful because
[8747.3s] it still kind of like initializes the
[8749.0s] model a little bit into to the vicinity
[8751.0s] of the correct Solutions so it's kind of
[8753.2s] like a initialization of um of the model
[8756.6s] in the sense that it kind of gets the
[8758.0s] model to you know take Solutions like
[8760.8s] write out Solutions and maybe it has an
[8763.0s] understanding of setting up a system of
[8764.4s] equations or maybe it kind of like talks
[8766.3s] through a solution so it gets you into
[8768.0s] the vicinity of correct Solutions but
[8770.2s] reinforcement learning is where
[8771.5s] everything gets dialed in we really
[8773.7s] discover the solutions that work for the
[8775.3s] model get the right answers we encourage
[8777.6s] them and then the model just kind of
[8779.2s] like gets better over time time okay so
[8782.0s] that is the high Lev process for how we
[8783.4s] train large language models in short we
[8786.4s] train them kind of very similar to how
[8787.9s] we train children and basically the only
[8790.7s] difference is that children go through
[8792.3s] chapters of books and they do all these
[8794.8s] different types of training exercises um
[8797.8s] kind of within the chapter of each book
[8799.8s] but instead when we train AIS it's
[8801.4s] almost like we kind of do it stage by
[8803.0s] stage depending on the type of that
[8805.1s] stage so first what we do is we do
[8807.6s] pre-training which as we saw is
[8809.2s] equivalent to uh basically reading all
[8811.3s] the expository material so we look at
[8813.7s] all the textbooks at the same time and
[8815.5s] we read all the exposition and we try to
[8817.7s] build a knowledge base the second thing
[8820.2s] then is we go into the sft stage which
[8822.8s] is really looking at all the fixed uh
[8825.0s] sort of like solutions from Human
[8827.4s] Experts of all the different kinds of
[8829.7s] worked Solutions across all the
[8831.8s] textbooks and we just kind of get an sft
[8834.6s] model which is able to imitate the
[8836.3s] experts but does so kind of blindly it
[8838.3s] just kind of like does its best guess
[8840.9s] uh kind of just like trying to mimic
[8842.7s] statistically the expert behavior and so
[8844.9s] that's what you get when you look at all
[8846.1s] the work Solutions and then finally in
[8848.7s] the last stage we do all the practice
[8850.8s] problems in the RL stage across all the
[8853.3s] textbooks we only do the practice
[8855.1s] problems and that's how we get the RL
[8857.9s] model so on a high level the way we
[8860.0s] train llms is very much equivalent uh to
[8863.0s] the process that we train uh that we use
[8865.5s] for training of children the next point
[8867.9s] I would like to make is that actually
[8869.3s] these first two stat ages pre-training
[8871.0s] and surprise fine-tuning they've been
[8872.7s] around for years and they are very
[8873.9s] standard and everyone does them all the
[8875.5s] different llm providers it is this last
[8878.2s] stage the RL training that is a lot more
[8880.8s] early in its process of development and
[8882.8s] is not standard yet in the field and so
[8886.6s] um this stage is a lot more kind of
[8889.4s] early and nent and the reason for that
[8891.8s] is because I actually skipped over a ton
[8893.4s] of little details here in this process
[8895.3s] the high level idea is very simple it's
[8897.0s] trial and there learning but there's a
[8898.8s] ton of details and little math
[8900.1s] mathematical kind of like nuances to
[8902.0s] exactly how you pick the solutions that
[8903.4s] are the best and how much you train on
[8905.4s] them and what is the prompt distribution
[8907.5s] and how to set up the training run such
[8909.3s] that this actually works so there's a
[8911.0s] lot of little details and knobs to the
[8912.9s] core idea that is very very simple and
[8915.4s] so getting the details right here uh is
[8917.7s] not trivial and so a lot of companies
[8920.0s] like for example open and other LM
[8921.6s] providers have experimented internally
[8924.4s] with reinforcement learning fine tuning
[8926.2s] for llms for a while but they've not
[8928.6s] talked about it publicly
[8930.6s] um it's all kind of done inside the
[8932.4s] company and so that's why the paper from
[8935.2s] Deep seek that came out very very
[8936.9s] recently was such a big deal because
[8939.3s] this is a paper from this company called
[8941.2s] DC Kai in China and this paper really
[8945.0s] talked very publicly about reinforcement
[8947.0s] learning fine training for large
[8948.0s] language models and how incredibly
[8950.5s] important it is for large language
[8952.3s] models and how it brings out a lot of
[8954.6s] reasoning capabilities in the models
[8956.2s] we'll go into this in a second so this
[8958.5s] paper reinvigorated the public interest
[8961.3s] of using RL for llms and gave a lot of
[8965.2s] the um sort of n-r details that are
[8967.6s] needed to reproduce their results and
[8969.6s] actually get the stage to work for large
[8971.5s] langage models so let me take you
[8973.4s] briefly through this uh deep seek R1
[8975.1s] paper and what happens when you actually
[8976.9s] correctly apply RL to language models
[8978.9s] and what that looks like and what that
[8980.0s] gives you so the first thing I'll scroll
[8981.5s] to is this uh kind of figure two here
[8983.8s] where we are looking at the Improvement
[8985.8s] in how the models are solving
[8987.5s] mathematical problems so this is the
[8989.4s] accuracy of solving mathematical
[8990.8s] problems on the a accuracy and then we
[8994.1s] can go to the web page and we can see
[8995.3s] the kinds of problems that are actually
[8996.6s] in these um these the kinds of math
[8998.9s] problems that are being measured here so
[9000.8s] these are simple math problems you can
[9002.5s] um pause the video if you like but these
[9004.9s] are the kinds of problems that basically
[9006.2s] the models are being asked to solve and
[9008.1s] you can see that in the beginning
[9009.0s] they're not doing very well but then as
[9010.6s] you update the model with this many
[9012.5s] thousands of steps their accuracy kind
[9014.7s] of continues to climb so the models are
[9017.4s] improving and they're solving these
[9018.8s] problems with a higher accuracy
[9020.5s] as you do this trial and error on a
[9022.9s] large data set of these kinds of
[9024.2s] problems and the models are discovering
[9026.5s] how to solve math problems but even more
[9029.4s] incredible than the quantitative kind of
[9032.0s] results of solving these problems with a
[9033.7s] higher accuracy is the qualitative means
[9035.7s] by which the model achieves these
[9037.4s] results so when we scroll down uh one of
[9040.4s] the figures here that is kind of
[9041.5s] interesting is that later on in the
[9043.7s] optimization the model seems to be uh
[9046.9s] using average length per response uh
[9049.6s] goes up up so the model seems to be
[9051.2s] using more tokens to get its higher
[9054.6s] accuracy results so it's learning to
[9056.4s] create very very long Solutions why are
[9059.4s] these Solutions very long we can look at
[9060.9s] them qualitatively here so basically
[9063.4s] what they discover is that the model
[9065.4s] solution get very very long partially
[9067.3s] because so here's a question and here's
[9069.2s] kind of the answer from the model what
[9071.1s] the model learns to do um and this is an
[9073.8s] immerging property of new optimization
[9075.9s] it just discovers that this is good for
[9077.6s] problem solving is it starts to do stuff
[9079.6s] like this wait wait wait that's Nota
[9081.6s] moment I can flag here let's reevaluate
[9083.6s] this step by step to identify the
[9085.1s] correct sum can be so what is the model
[9087.1s] doing here right the model is basically
[9090.3s] re-evaluating steps it has learned that
[9092.5s] it works better for accuracy to try out
[9095.5s] lots of ideas try something from
[9097.3s] different perspectives retrace reframe
[9099.6s] backtrack is doing a lot of the things
[9101.6s] that you and I are doing in the process
[9103.0s] of problem solving for mathematical
[9105.0s] questions but it's rediscovering what
[9106.6s] happens in your head not what you put
[9108.5s] down on the solution and there is no
[9110.6s] human who can hardcode this stuff in the
[9112.8s] ideal assistant response this is only
[9115.0s] something that can be discovered in the
[9116.4s] process of reinforcement learning
[9117.9s] because you wouldn't know what to put
[9119.6s] here this just turns out to work for the
[9122.4s] model and it improves its accuracy in
[9124.1s] problem solving so the model learns what
[9126.9s] we call these chains of thought in your
[9128.9s] head and it's an emergent property of
[9130.8s] the optim of the optimization and that's
[9133.8s] what's bloating up the response length
[9136.4s] but that's also what's increasing the
[9138.2s] accuracy of the problem problem solving
[9140.6s] so what's incredible here is basically
[9142.1s] the model is discovering ways to think
[9144.6s] it's learning what I like to call
[9146.2s] cognitive strategies of how you
[9148.2s] manipulate a problem and how you
[9150.1s] approach it from different perspectives
[9152.0s] how you pull in some analogies or do
[9153.8s] different kinds of things like that and
[9155.6s] how you kind of uh try out many
[9157.2s] different things over time uh check a
[9159.0s] result from different perspectives and
[9160.9s] how you kind of uh solve problems but
[9163.5s] here it's kind of discovered by the RL
[9165.0s] so extremely incredible to see this
[9167.0s] emerge in the optimization without
[9168.9s] having to hardcode it anywhere the only
[9171.0s] thing we've given it are the correct
[9172.4s] answers and this comes out from trying
[9174.7s] to just solve them correctly which is
[9176.7s] incredible
[9178.4s] um now let's go back to actually the
[9180.6s] problem that we've been working with and
[9182.0s] let's take a look at what it would look
[9183.6s] like uh for uh for this kind of a model
[9187.6s] what we call reasoning or thinking model
[9189.8s] to solve that problem okay so recall
[9192.0s] that this is the problem we've been
[9193.0s] working with and when I pasted it into
[9195.1s] chat GPT 40 I'm getting this kind of a
[9197.6s] response let's take a look at what
[9199.4s] happens when you give this same query to
[9202.0s] what's called a reasoning or a thinking
[9203.6s] model this is a model that was trained
[9205.1s] with reinforcement learning so this
[9208.1s] model described in this paper DC car1 is
[9211.0s] available on chat. dec.com uh so this is
[9214.4s] kind of like the company uh that
[9215.9s] developed is hosting it you have to make
[9217.6s] sure that the Deep think button is
[9219.2s] turned on to get the R1 model as it's
[9221.5s] called we can paste it here and run
[9224.3s] it and so let's take a look at what
[9226.8s] happens now and what is the output of
[9228.1s] the model okay so here's it says so this
[9231.4s] is previously what we get using
[9233.0s] basically what's an sft approach a
[9234.8s] supervised funing approach this is like
[9236.9s] mimicking an expert solution this is
[9238.9s] what we get from the RL model okay let
[9241.6s] me try to figure this out so Emily buys
[9243.4s] three apples and two oranges each orange
[9245.2s] cost $2 total is 13 I need to find out
[9247.6s] blah blah blah so here you you um as
[9251.4s] you're reading this you can't escape
[9254.1s] thinking that this model is
[9256.4s] thinking um is definitely pursuing the
[9259.5s] solution solution it deres that it must
[9261.6s] cost $3 and then it says wait a second
[9263.6s] let me check my math again to be sure
[9265.3s] and then it tries it from a slightly
[9266.4s] different perspective and then it says
[9268.6s] yep all that checks out I think that's
[9270.9s] the answer I don't see any mistakes let
[9273.3s] me see if there's another way to
[9274.5s] approach the problem maybe setting up an
[9276.3s] equation let's let the cost of one apple
[9279.5s] be $8 then blah blah blah yep same
[9282.1s] answer so definitely each apple is $3
[9284.8s] all right confident that that's correct
[9287.2s] and then what it does once it sort of um
[9289.7s] did the thinking process is it writes up
[9291.7s] the nice solution for the human and so
[9294.1s] this is now considering so this is more
[9296.4s] about the correctness aspect and this is
[9298.6s] more about the presentation aspect where
[9300.5s] it kind of like writes it out nicely and
[9303.0s] uh boxes in the correct answer at the
[9305.2s] bottom and so what's incredible about
[9307.1s] this is we get this like thinking
[9308.4s] process of the model and this is what's
[9310.7s] coming from the reinforcement learning
[9312.1s] process this is what's bloating up the
[9315.0s] length of the token sequences they're
[9316.8s] doing thinking and they're trying
[9318.0s] different ways this is what's giving you
[9320.7s] higher accuracy in problem
[9322.4s] solving and this is where we are seeing
[9324.8s] these aha moments and these different
[9326.8s] strategies and these um ideas for how
[9329.9s] you can make sure that you're getting
[9331.2s] the correct
[9332.3s] answer the last point I wanted to make
[9334.4s] is some people are a little bit nervous
[9336.2s] about putting you know very sensitive
[9338.6s] data into chat.com because this is a
[9341.4s] Chinese company so people don't um
[9343.5s] people are a little bit careful and Cy
[9345.2s] with that a little bit um deep seek R1
[9348.1s] is a model that was released by this
[9350.0s] company so this is an open source model
[9352.4s] or open weights model it is available
[9354.8s] for anyone to download and use you will
[9356.8s] not be able to like run it in its full
[9359.7s] um sort of the full model in full
[9362.6s] Precision you won't run that on a
[9364.3s] MacBook but uh or like a local device
[9367.0s] because this is a fairly large model but
[9368.8s] many companies are hosting the full
[9370.7s] largest model one of those companies
[9372.8s] that I like to use is called
[9374.8s] together. so when you go to together.
[9377.3s] you sign up and you go to playgrounds
[9379.3s] you can can select here in the chat deep
[9381.6s] seek R1 and there's many different kinds
[9383.6s] of other models that you can select here
[9385.1s] these are all state-of-the-art models so
[9387.1s] this is kind of similar to the hugging
[9388.3s] face inference playground that we've
[9389.7s] been playing with so far but together. a
[9392.3s] will usually host all the
[9393.4s] state-of-the-art models so select DT
[9396.0s] car1 um you can try to ignore a lot of
[9398.5s] these I think the default settings will
[9399.8s] often be okay and we can put in this and
[9403.6s] because the model was released by Deep
[9405.4s] seek what you're getting here should be
[9407.5s] basically equivalent to what you're
[9408.8s] getting here now because of the
[9410.5s] randomness in the sampling we're going
[9411.8s] to get something slightly different uh
[9413.6s] but in principle this should be uh
[9415.6s] identical in terms of the power of the
[9417.1s] model and you should be able to see the
[9418.6s] same things quantitatively and
[9420.2s] qualitatively uh but uh this model is
[9422.2s] coming from kind of a an American
[9424.6s] company so that's deep seek and that's
[9427.3s] the what's called a reasoning
[9429.4s] model now when I go back to chat uh let
[9432.4s] me go to chat here okay so the models
[9434.8s] that you're going to see in the drop
[9435.7s] down here some of them like 01 03 mini
[9438.7s] O3 mini High Etc they are talking about
[9441.2s] uses Advanced reasoning now what this is
[9443.9s] referring to uses Advanced reasoning is
[9446.1s] it's referring to the fact that it was
[9447.6s] trained by reinforcement learning with
[9449.6s] techniques very similar to those of deep
[9451.6s] C car1 per public statements of opening
[9454.3s] ey employees uh so these are thinking
[9457.6s] models trained with RL and these models
[9460.1s] like GPT 4 or GPT 4 40 mini that you're
[9462.6s] getting in the free tier you should
[9463.9s] think of them as mostly sft models
[9465.8s] supervised fine tuning models they don't
[9467.8s] actually do this like thinking as as you
[9469.7s] see in the RL models and even though
[9472.4s] there's a little bit of reinforcement
[9473.5s] learning involved with these models and
[9475.1s] I'll go that into that in a second these
[9476.9s] are mostly sft models I think you should
[9478.6s] think about it that way so in the same
[9480.6s] way as what we saw here we can pick one
[9483.0s] of the thinking models like say 03 mini
[9485.3s] high and these models by the way might
[9487.4s] not be available to you unless you pay a
[9489.5s] Chachi PT subscription of either $20 per
[9491.9s] month or $200 per month for some of the
[9494.4s] top models so we can pick a thinking
[9496.9s] model and run now what's going to happen
[9500.2s] here is it's going to say reasoning and
[9501.8s] it's going to start to do stuff like
[9503.2s] this and um what we're seeing here is
[9506.6s] not exactly the stuff we're seeing here
[9509.2s] so even though under the hood the model
[9511.6s] produces these kinds of uh kind of
[9514.1s] chains of thought opening ey chooses to
[9516.5s] not show the exact chains of thought in
[9518.9s] the web interface it shows little
[9520.7s] summaries of that of those chains of
[9522.6s] thought and open kind of does this I
[9524.6s] think partly because uh they are worried
[9526.7s] about what's called the distillation
[9528.1s] risk that is that someone could come in
[9530.4s] and actually try to imitate those
[9531.9s] reasoning traces and recover a lot of
[9533.7s] the reasoning performance by just
[9535.4s] imitating the reasoning uh chains of
[9537.6s] thought and so they kind of hide them
[9539.3s] and they only show little summaries of
[9540.6s] them so you're not getting exactly what
[9542.3s] you would get in deep seek as with
[9544.2s] respect to the reasoning itself and then
[9547.0s] they write up the
[9548.6s] solution so these are kind of like
[9550.6s] equivalent even though we're not seeing
[9552.0s] the full under the hood details now in
[9554.4s] terms of the performance uh these models
[9557.2s] and deep seek models are currently rly
[9559.7s] on par I would say it's kind of hard to
[9561.1s] tell because of the evaluations but if
[9562.7s] you're paying $200 per month to open AI
[9564.7s] some of these models I believe are
[9566.0s] currently they basically still look
[9567.7s] better uh but deep seek R1 for now is
[9570.8s] still a very solid choice for a thinking
[9573.3s] model that would be available to you um
[9576.5s] sort of um either on this website or any
[9579.2s] other website because the model is open
[9580.6s] weights you can just download it so
[9583.8s] that's thinking models so what is the
[9586.1s] summary so far well we've talked about
[9588.4s] reinforcement learning and the fact that
[9590.8s] thinking emerges in the process of the
[9592.5s] optimization on when we basically run RL
[9595.3s] on many math uh and kind of code
[9597.5s] problems that have verifiable Solutions
[9599.6s] so there's like an answer three
[9601.6s] Etc now these thinking models you can
[9604.6s] access in for example deep seek or any
[9607.4s] inference provider like together. a and
[9609.8s] choosing deep seek over there these
[9612.5s] thinking models are also available uh in
[9614.7s] chpt under any of the 01 or O3
[9618.0s] models but these GPT 4 R models Etc
[9620.8s] they're not thinking models you should
[9621.9s] think of them as mostly sft models now
[9625.3s] if you are um if you have a prompt that
[9627.9s] requires Advanced reasoning and so on
[9629.9s] you should probably use some of the
[9630.9s] thinking models or at least try them out
[9632.8s] but empirically for a lot of my use when
[9635.3s] you're asking a simpler question there's
[9636.7s] like a knowledge based question or
[9637.8s] something like that this might be
[9639.0s] Overkill like there's no need to think
[9640.4s] 30 seconds about some factual question
[9642.7s] so for that I will uh sometimes default
[9644.7s] to just GPT 40 so empirically about 80
[9647.3s] 90% of my use is just gp4
[9649.8s] and when I come across a very difficult
[9651.3s] problem like in math and code Etc I will
[9653.5s] reach for the thinking models but then I
[9656.1s] have to wait a bit longer because
[9657.4s] they're thinking um so you can access
[9660.1s] these on chat on deep seek also I wanted
[9662.6s] to point out that um AI studio.
[9665.8s] go.com even though it looks really busy
[9668.2s] really ugly because Google's just unable
[9670.7s] to do this kind of stuff well it's like
[9673.0s] what is happening but if you choose
[9675.2s] model and you choose here Gemini 2.0
[9677.8s] flash thinking experimental 01 21 if you
[9680.5s] choose that one that's also a a kind of
[9682.5s] early experiment experimental of a
[9685.0s] thinking model by Google so we can go
[9687.5s] here and we can give it the same problem
[9689.3s] and click run and this is also a
[9691.2s] thinking problem a thinking model that
[9693.6s] will also do something
[9695.4s] similar and comes out with the right
[9697.4s] answer here so basically Gemini also
[9700.2s] offers a thinking model anthropic
[9702.2s] currently does not offer a thinking
[9703.6s] model but basically this is kind of like
[9705.2s] the frontier development of these llms I
[9707.6s] think RL is kind of like this new
[9709.5s] exciting stage but getting the details
[9711.8s] right is difficult and that's why all
[9713.7s] these models and thinking models are
[9715.1s] currently experimental as of 2025 very
[9717.9s] early 2025 um but this is kind of like
[9721.0s] the frontier development of pushing the
[9722.8s] performance on these very difficult
[9724.0s] problems using reasoning that is
[9725.8s] emerging in these optimizations one more
[9728.0s] connection that I wanted to bring up is
[9730.0s] that the discovery that reinforcement
[9732.2s] learning is extremely powerful way of
[9734.4s] learning is not new to the field of AI
[9737.6s] and one place what we've already seen
[9739.4s] this demonstrated is in the game of Go
[9742.3s] and famously Deep Mind developed the
[9744.4s] system alphago and you can watch a movie
[9746.4s] about it um where the system is learning
[9749.7s] to play the game of go against top human
[9752.1s] players and um when we go to the paper
[9756.4s] underlying alphago so in this paper when
[9759.6s] we scroll
[9761.2s] down we actually find a really
[9763.2s] interesting
[9764.4s] plot um that I think uh is kind of
[9767.2s] familiar uh to us and we're kind of like
[9769.3s] we discovering in the more open domain
[9771.6s] of arbitrary problem solving instead of
[9773.7s] on the closed specific domain of the
[9775.4s] game of Go but basically what they saw
[9777.9s] and we're going to see this in llms as
[9779.3s] well as this becomes more mature is this
[9783.2s] is the ELO rating of playing game of Go
[9785.4s] and this is leas dull an extremely
[9787.3s] strong human player and here what they
[9789.7s] are comparing is the strength of a model
[9791.6s] learned trained by supervised learning
[9794.1s] and a model trained by reinforcement
[9795.6s] learning so the supervised learning
[9797.6s] model is imitating human expert players
[9800.8s] so if you just get a huge amount of
[9802.3s] games played by expert players in the
[9804.0s] game of Go and you try to imitate them
[9806.5s] you are going to get better but then you
[9808.9s] top out and you never quite get better
[9811.6s] than some of the top top top players of
[9814.0s] in the game of Go like LEL so you're
[9815.8s] never going to reach there because
[9817.5s] you're just imitating human players you
[9819.2s] can't fundamentally go beyond a human
[9820.8s] player if you're just imitating human
[9822.5s] players but in a process of
[9824.2s] reinforcement learning is significantly
[9826.3s] more powerful in reinforcement learning
[9828.1s] for a game of Go it means that the
[9830.1s] system is playing moves that empirically
[9833.2s] and statistically lead to win to winning
[9836.1s] the game and so alphago is a system
[9839.5s] where it kind of plays against it itself
[9842.2s] and it's using reinforcement learning to
[9844.0s] create
[9844.9s] rollouts so it's the exact same diagram
[9847.2s] here but there's no prompt it's just uh
[9850.4s] because there's no prompt it's just a
[9851.4s] fixed game of Go but it's trying out
[9853.6s] lots of solutions it's trying out lots
[9855.2s] of plays and then the games that lead to
[9858.3s] a win instead of a specific answer are
[9860.9s] reinforced they're they're made stronger
[9864.2s] and so um the system is learning
[9866.6s] basically the sequences of actions that
[9868.2s] empirically and statistically lead to
[9870.3s] winning the game and reinforcement
[9872.7s] learning is not going to be constrained
[9874.1s] by human performance and reinforcement
[9876.2s] learning can do significantly better and
[9878.1s] overcome even the top players like Lisa
[9881.1s] Dole and so uh probably they could have
[9884.5s] run this longer and they just chose to
[9886.0s] crop it at some point because this costs
[9887.6s] money but this is very powerful
[9889.4s] demonstration of reinforcement learning
[9891.2s] and we're only starting to kind of see
[9892.8s] hints of this diagram in larger language
[9895.8s] models for reasoning problems so we're
[9898.6s] not going to get too far by just
[9899.9s] imitating experts we need to go beyond
[9901.8s] that set up these like little game
[9903.7s] environments and get let let the system
[9907.4s] discover reasoning traces or like ways
[9909.9s] of solving problems uh that are unique
[9914.2s] and that uh just basically work
[9916.3s] well now on this aspect of uniqueness
[9919.2s] notice that when you're doing
[9920.0s] reinforcement learning nothing prevents
[9921.8s] you from veering off the distribution of
[9924.6s] how humans are playing the game and so
[9926.8s] when we go back to uh this alphao search
[9929.3s] here one of the suggested modifications
[9931.6s] is called move 37 and move 37 in alphao
[9934.9s] is referring to a specific point in time
[9937.5s] where alphago basically played a move
[9940.6s] that uh no human expert would play uh so
[9943.8s] the probability of this move uh to be
[9945.8s] played by a human player was evaluated
[9947.8s] to be about 1 in 10th ,000 so it's a
[9949.8s] very rare move but in retrospect it was
[9952.1s] a brilliant move so alphago in the
[9954.2s] process of reinforcement learning
[9955.9s] discovered kind of like a strategy of
[9957.6s] playing that was unknown to humans and
[9960.0s] but is in retrospect uh brilliant I
[9962.4s] recommend this YouTube video um leis do
[9965.0s] versus alphao move 37 reactions and
[9966.7s] Analysis and this is kind of what it
[9968.7s] looked like when alphao played this
[9971.5s] move
[9974.4s] value that's a very that's a very
[9976.8s] surprising move I thought I thought it
[9979.6s] was I thought it was a
[9981.7s] mistake when I see this move anyway so
[9984.8s] basically people are kind of freaking
[9985.9s] out because it's a it's a move that a
[9988.6s] human would not play that alphago played
[9991.2s] because in its training uh this move
[9993.8s] seemed to be a good idea it just happens
[9995.6s] not to be a kind of thing that a humans
[9997.2s] would would do and so that is again the
[9999.4s] power of reinforcement learning and in
[10001.0s] principle we can actually see the
[10002.5s] equivalence of that if we continue
[10004.1s] scaling this Paradigm in language models
[10006.6s] and what that looks like is kind of
[10007.9s] unknown so so um what does it mean to
[10010.7s] solve problems in such a way that uh
[10014.6s] even humans would not be able to get how
[10016.7s] can you be better at reasoning or
[10018.1s] thinking than humans how can you go
[10020.4s] beyond just uh a thinking human like
[10023.9s] maybe it means discovering analogies
[10025.9s] that humans would not be able to uh
[10027.6s] create or maybe it's like a new thinking
[10029.6s] strategy it's kind of hard to think
[10030.8s] through uh maybe it's a holy new
[10034.2s] language that actually is not even
[10036.1s] English maybe it discovers its own
[10037.6s] language that is a lot better at
[10039.9s] thinking um because the model is
[10042.8s] unconstrained to even like stick with
[10044.4s] English uh so maybe it takes a different
[10047.2s] language to think in or it discovers its
[10049.1s] own language so in principle the
[10051.3s] behavior of the system is a lot less
[10053.4s] defined it is open to do whatever works
[10057.2s] and it is open to also slowly Drift from
[10060.1s] the distribution of its training data
[10061.4s] which is English but all of that can
[10063.8s] only be done if we have a very large
[10065.7s] diverse set of problems in which the
[10068.2s] these strategy can be refined and
[10069.7s] perfected and so that is a lot of the
[10071.8s] frontier LM research that's going on
[10073.7s] right now is trying to kind of create
[10075.7s] those kinds of prompt distributions that
[10077.3s] are large and diverse these are all kind
[10079.3s] of like game environments in which the
[10080.7s] llms can practice their thinking and uh
[10084.2s] it's kind of like writing you know these
[10086.0s] practice problems we have to create
[10087.5s] practice problems for all of domains of
[10090.1s] knowledge and if we have practice
[10092.1s] problems and tons of them the models
[10094.1s] will be able to reinforcement learning
[10096.4s] reinforcement learn on them and kind of
[10098.5s] uh create these kinds of uh diagrams but
[10101.9s] in the domain of open thinking instead
[10103.8s] of a closed domain like game of Go
[10106.6s] there's one more section within
[10107.7s] reinforcement learning that I wanted to
[10109.4s] cover and that is that of learning in
[10112.2s] unverifiable domains so so far all of
[10115.2s] the problems that we've looked at are in
[10116.5s] what's called verifiable domains that is
[10118.9s] any candidate solution we can score very
[10121.4s] easily against a concrete answer so for
[10124.0s] example answer is three and we can very
[10125.9s] easily score these Solutions against the
[10127.9s] answer of three
[10129.5s] either we require the models to like box
[10131.7s] in their answers and then we just check
[10133.7s] for equality of whatever is in the box
[10135.9s] with the answer or you can also use uh
[10138.4s] kind of what's called an llm judge so
[10140.6s] the llm judge looks at a solution and it
[10143.1s] gets the answer and just basically
[10145.0s] scores the solution for whether it's
[10146.6s] consistent with the answer or not and
[10148.4s] llms uh empirically are good enough at
[10150.8s] the current capability that they can do
[10152.4s] this fairly reliably so we can apply
[10154.4s] those kinds of techniques as well in any
[10156.2s] case we have a concrete answer and we're
[10157.8s] just checking Solutions again against it
[10159.2s] and we can do this automatically with no
[10161.4s] kind of humans in the loop the problem
[10163.6s] is that we can't apply the strategy in
[10165.3s] what's called unverifiable domains so
[10168.0s] usually these are for example creative
[10169.4s] writing tasks like write a joke about
[10171.0s] Pelicans or write a poem or summarize a
[10173.3s] paragraph or something like that in
[10175.3s] these kinds of domains it becomes harder
[10177.4s] to score our different solutions to this
[10179.8s] problem so for example writing a joke
[10181.8s] about Pelicans we can generate lots of
[10183.4s] different uh jokes of course that's fine
[10185.8s] for example we can go to chbt and we can
[10187.7s] get it to uh generate a joke about
[10191.3s] Pelicans uh so much stuff in their beaks
[10193.8s] because they don't bellan in
[10196.7s] backpacks what
[10199.4s] okay we can uh we can try something else
[10202.7s] why don't Pelicans ever pay for their
[10204.4s] drinks because they always B it to
[10206.5s] someone else haha okay so these models
[10210.2s] are not obviously not very good at humor
[10212.1s] actually I think it's pretty fascinating
[10213.2s] because I think humor is secretly very
[10215.2s] difficult and the model have the
[10216.9s] capability I think anyway in any case
[10220.0s] you could imagine creating lots of jokes
[10223.0s] the problem that we are facing is how do
[10224.6s] we score them now in principle we could
[10227.6s] of course get a human to look at all
[10229.4s] these jokes just like I did right now
[10231.6s] the problem with that is if you are
[10233.0s] doing reinforcement learning you're
[10234.5s] going to be doing many thousands of
[10236.3s] updates and for each update you want to
[10238.4s] be looking at say thousands of prompts
[10240.6s] and for each prompt you want to be
[10241.8s] potentially looking at looking at
[10243.2s] hundred or thousands of different kinds
[10245.0s] of generations and so there's just like
[10247.6s] way too many of these to look at and so
[10250.5s] um in principle you could have a human
[10252.2s] inspect all of them and score them and
[10253.5s] decide that okay maybe this one is funny
[10255.9s] and uh maybe this one is funny and this
[10258.4s] one is funny and we could train on them
[10261.1s] to get the model to become slightly
[10262.5s] better at jokes um in the context of
[10265.2s] pelicans at least um the problem is that
[10269.0s] it's just like way too much human time
[10270.6s] this is an unscalable strategy we need
[10272.4s] some kind of an automatic strategy for
[10274.2s] doing this and one sort of solution to
[10277.0s] this was proposed in this paper
[10279.2s] uh that introduced what's called
[10280.4s] reinforcement learning from Human
[10281.6s] feedback and so this was a paper from
[10283.7s] open at the time and many of these
[10285.4s] people are now um co-founders in
[10287.6s] anthropic um and this kind of proposed a
[10290.9s] approach for uh basically doing
[10293.2s] reinforcement learning in unverifiable
[10295.1s] domains so let's take a look at how that
[10296.9s] works so this is the cartoon diagram of
[10299.6s] the core ideas involved so as I
[10301.8s] mentioned the native approach is if we
[10304.1s] just set Infinity human time we could
[10306.4s] just run RL in these domains just fine
[10309.1s] so for example we can run RL as usual if
[10311.6s] I have Infinity humans I would I just
[10313.7s] want to do and these are just cartoon
[10315.1s] numbers I want to do 1,000 updates where
[10317.8s] each update will be on 1,000 prompts and
[10320.5s] in for each prompt we're going to have
[10322.2s] 1,000 roll outs that we're scoring so we
[10325.9s] can run RL with this kind of a setup the
[10328.9s] problem is in the process of doing this
[10330.7s] I will need to run one I will need to
[10332.8s] ask a human to evaluate a joke a total
[10335.3s] of 1 billion times and so that's a lot
[10338.1s] of people looking at really terrible
[10339.6s] jokes so we don't want to do that so
[10342.1s] instead we want to take the arlef
[10344.0s] approach so um in our Rel of approach we
[10347.7s] are kind of like the the core trick is
[10349.7s] that of indirection so we're going to
[10352.6s] involve humans just a little bit and the
[10355.0s] way we cheat is that we basically train
[10357.4s] a whole separate neural network that we
[10359.3s] call a reward model and this neural
[10361.9s] network will kind of like imitate human
[10364.3s] scores so we're going to ask humans to
[10366.8s] score um roll
[10369.2s] we're going to then imitate human scores
[10371.7s] using a neural network and this neural
[10374.4s] network will become a kind of simulator
[10375.8s] of human
[10376.7s] preferences and now that we have a
[10378.4s] neural network simulator we can do RL
[10381.1s] against it so instead of asking a real
[10383.5s] human we're asking a simulated human for
[10386.2s] their score of a joke as an example and
[10389.8s] so once we have a simulator we're often
[10392.0s] racist because we can query it as many
[10393.6s] times as we want to and it's all whole
[10396.0s] automatic process and we can now do
[10397.8s] reinforcement learning with respect to
[10399.0s] the simulator and the simulator as you
[10401.0s] might expect is not going to be a
[10402.2s] perfect human but if it's at least
[10404.4s] statistically similar to human judgment
[10407.0s] then you might expect that this will do
[10408.3s] something and in practice indeed uh it
[10410.3s] does so once we have a simulator we can
[10412.9s] do RL and everything works great so let
[10415.0s] me show you a cartoon diagram a little
[10416.6s] bit of what this process looks like
[10418.7s] although the details are not 100 like
[10420.7s] super important it's just a core idea of
[10422.5s] how this works so here I have a cartoon
[10424.3s] diagram of a hypothetical example of
[10426.2s] what training the reward model would
[10427.7s] look like so we have a prompt like write
[10430.2s] a joke about picans and then here we
[10432.4s] have five separate roll outs so these
[10434.0s] are all five different jokes just like
[10436.5s] this one now the first thing we're going
[10439.0s] to do is we are going to ask a human to
[10442.7s] uh order these jokes from the best to
[10445.1s] worst so this is uh so here this human
[10448.5s] thought that this joke is the best the
[10450.8s] funniest so number one joke this is
[10454.0s] number two joke number three joke four
[10456.6s] and five so this is the worst joke
[10459.1s] we're asking humans to order instead of
[10460.9s] give scores directly because it's a bit
[10462.7s] of an easier task it's easier for a
[10464.4s] human to give an ordering than to give
[10466.1s] precise scores now that is now the
[10469.2s] supervision for the model so the human
[10471.1s] has ordered them and that is kind of
[10472.6s] like their contribution to the training
[10474.2s] process but now separately what we're
[10476.1s] going to do is we're going to ask a
[10477.5s] reward model uh about its scoring of
[10480.6s] these jokes now the reward model is a
[10482.9s] whole separate neural network completely
[10485.0s] separate neural net um and it's also
[10487.6s] probably a transform
[10489.3s] uh but it's not a language model in the
[10490.9s] sense that it generates diverse language
[10493.5s] Etc it's just a scoring model so the
[10496.6s] reward model will take as an input The
[10499.4s] Prompt number one and number two a
[10502.2s] candidate joke so um those are the two
[10505.3s] inputs that go into the reward model so
[10507.2s] here for example the reward model would
[10508.8s] be taken this prompt and this joke now
[10511.8s] the output of a reward model is a single
[10514.1s] number and this number is thought of as
[10516.3s] a score and it can range for example
[10518.4s] from Z to one so zero would be the worst
[10520.9s] score and one would be the best score so
[10523.7s] here are some examples of what a
[10525.3s] hypothetical reward model at some stage
[10527.2s] in the training process would give uh s
[10529.4s] scoring to these jokes so 0.1 is a very
[10533.1s] low score 08 is a really high score and
[10536.2s] so on and so now um we compare the
[10540.7s] scores given by the reward model with uh
[10543.3s] the ordering given by the human and
[10545.4s] there's a precise mathematical way to
[10547.4s] actually calculate this uh basically set
[10549.4s] up a loss function and calculate a kind
[10551.7s] of like a correspondence here and uh
[10554.0s] update a model based on it but I just
[10555.8s] want to give you the intuition which is
[10557.7s] that as an example here for this second
[10560.6s] joke the the human thought that it was
[10562.2s] the funniest and the model kind of
[10563.6s] agreed right 08 is a relatively high
[10565.5s] score but this score should have been
[10567.2s] even higher right so after an update we
[10570.6s] would expect that maybe this score
[10571.8s] should have been will actually grow
[10573.7s] after an update of the network to be
[10575.0s] like say 081 or
[10577.0s] something um for this one here they
[10579.5s] actually are in a massive disagreement
[10581.0s] because the human thought that this was
[10582.1s] number two but here the the score is
[10584.6s] only 0.1 and so this score needs to be
[10587.9s] much higher so after an update on top of
[10590.9s] this um kind of a supervision this might
[10593.5s] grow a lot more like maybe it's 0.15 or
[10595.5s] something like
[10596.4s] that um and then here the human thought
[10599.8s] that this one was the worst joke but
[10601.9s] here the model actually gave it a fairly
[10603.5s] High number so you might expect that
[10605.4s] after the update uh this would come down
[10607.6s] to maybe 3 3.5 or something like that so
[10610.2s] basically we're doing what we did before
[10611.9s] we're slightly nudging the predictions
[10614.5s] from the models using a neural network
[10617.4s] training
[10618.5s] process and we're trying to make the
[10620.5s] reward model scores be consistent with
[10623.2s] human
[10624.0s] ordering and so um as we update the
[10627.2s] reward model on human data it becomes
[10629.7s] better and better simulator of the
[10631.6s] scores and orders uh that humans provide
[10634.8s] and then becomes kind of like the the
[10637.0s] neural the simulator of human
[10638.8s] preferences which we can then do RL
[10641.0s] against but critically we're not asking
[10643.1s] humans one billion times to look at a
[10644.8s] joke we're maybe looking at th000
[10646.9s] prompts and five roll outs each so maybe
[10648.9s] 5,000 jokes that humans have to look at
[10650.8s] in total and they just give the ordering
[10653.2s] and then we're training the model to be
[10654.4s] consistent with that ordering and I'm
[10656.2s] skipping over the mathematical details
[10658.5s] but I just want you to understand a high
[10659.8s] level idea that uh this reward model is
[10662.9s] do is basically giving us this scour and
[10665.2s] we have a way of training it to be
[10666.7s] consistent with human orderings
[10668.6s] and that's how rhf works okay so that is
[10671.3s] the rough idea we basically train
[10673.4s] simulators of humans and RL with respect
[10675.6s] to those
[10676.8s] simulators now I want to talk about
[10679.1s] first the upside of reinforcement
[10681.0s] learning from Human
[10683.7s] feedback the first thing is that this
[10686.0s] allows us to run reinforcement learning
[10687.5s] which we know is incredibly powerful
[10689.2s] kind of set of techniques and it allows
[10691.0s] us to do it in arbitrary domains and
[10693.2s] including the ones that are unverifiable
[10695.6s] so things like summarization and poem
[10697.8s] writing joke writing or any other
[10699.4s] creative writing really uh in domains
[10701.4s] outside of math and code
[10703.2s] Etc now empirically what we see when we
[10705.9s] actually apply rhf is that this is a way
[10708.2s] to improve the performance of the model
[10710.6s] and uh I have a top answer for why that
[10713.8s] might be but I don't actually know that
[10716.0s] it is like super well established on
[10718.1s] like why this is you can empirically
[10719.7s] observe that when you do rhf correctly
[10721.8s] the models you get are just like a
[10723.4s] little bit better um but as to why is I
[10725.9s] think like not as clear so here's my
[10727.5s] best guess my best guess is that this is
[10729.8s] possibly mostly due to the discriminator
[10732.2s] generator
[10733.3s] Gap what that means is that in many
[10735.7s] cases it is significantly easier to
[10738.2s] discriminate than to generate for humans
[10741.0s] so in particular an example of this is
[10745.0s] um in when we do supervised fine-tuning
[10747.6s] right
[10749.2s] sft we're asking humans to generate the
[10752.2s] ideal assistant response and in many
[10755.2s] cases here um as I've shown it uh the
[10758.5s] ideal response is very simple to write
[10760.3s] but in many cases might not be so for
[10762.2s] example in summarization or poem writing
[10764.1s] or joke writing like how are you as a
[10766.3s] human assist as a human labeler um
[10769.0s] supposed to give the ideal response in
[10770.7s] these cases it requires creative human
[10772.9s] writing to do that and so rhf kind of
[10775.8s] sidesteps this because we get um we get
[10778.8s] to ask people a significantly easier
[10780.6s] question as a data labelers they're not
[10783.0s] asked to write poems directly they're
[10784.8s] just given five poems from the model and
[10786.9s] they're just asked to order them and so
[10789.4s] that's just a much easier task for a
[10791.3s] human labeler to do and so what I think
[10793.8s] this allows you to do basically is it um
[10797.1s] it kind of like allows a lot more higher
[10800.3s] accuracy data because we're not asking
[10802.2s] people to do the generation task which
[10804.2s] can be extremely difficult like we're
[10806.1s] not asking them to do creative writing
[10807.9s] we're just trying to get them to
[10809.4s] distinguish between creative writings
[10811.4s] and uh find the ones that are best and
[10814.2s] that is the signal that humans are
[10815.6s] providing just the ordering and that is
[10817.5s] their input into the system and then the
[10820.0s] system in rhf just discovers the kinds
[10823.4s] of responses that would be graded well
[10826.0s] by humans and so that step of
[10828.7s] indirection allows the models to become
[10830.6s] a bit better so that is the upside of
[10833.6s] our LF it allows us to run RL it
[10835.7s] empirically results in better models and
[10838.0s] it allows uh people to contribute their
[10840.3s] supervision uh even without having to do
[10842.5s] extremely difficult tasks um in the case
[10845.1s] of writing ideal responses unfortunately
[10847.6s] our HF also comes with significant
[10849.6s] downsides and so um the main one is that
[10854.4s] basically we are doing reinforcement
[10855.7s] learning not with respect to humans and
[10857.6s] actual human judgment but with respect
[10859.2s] to a lossy simulation of humans right
[10861.7s] and this lossy simulation could be
[10863.3s] misleading because it's just a it's just
[10865.1s] a simulation right it's just a language
[10867.1s] model that's kind of outputting scores
[10869.1s] and it might not perfectly reflect the
[10871.4s] opinion of an actual human with an
[10873.3s] actual brain in all the possible
[10875.2s] different cases so that's number one
[10877.3s] which is actually something even more
[10878.6s] subtle and devious going on that uh
[10881.0s] really
[10882.3s] dramatically holds back our LF as a
[10884.8s] technique that we can really scale to
[10887.5s] significantly um kind of Smart Systems
[10891.0s] and that is that reinforcement learning
[10892.8s] is extremely good at discovering a way
[10895.1s] to game the model to game the simulation
[10898.4s] so this reward model that we're
[10900.3s] constructing here that gives the course
[10903.4s] these models are Transformers these
[10906.3s] Transformers are massive neurals they
[10908.2s] have billions of parameters and they
[10910.2s] imitate humans but they do so in a kind
[10912.3s] of like a simulation way now the problem
[10914.3s] is that these are massive complicated
[10916.1s] systems right there's a billion
[10917.3s] parameters here that are outputting a
[10918.9s] single
[10920.1s] score it turns out that there are ways
[10922.8s] to gain these models you can find kinds
[10925.7s] of inputs that were not part of their
[10928.0s] training set and these inputs
[10931.0s] inexplicably get very high scores but in
[10933.6s] a fake way so very often what you find
[10937.2s] if you run our lch for very long so for
[10939.2s] example if we do 1,000 updates which is
[10941.2s] like say a lot of updates you might
[10943.9s] expect that your jokes are getting
[10945.1s] better and that you're getting like real
[10946.7s] bangers about Pelicans but that's not
[10948.7s] EXA exactly what happens what happens is
[10951.7s] that uh in the first few hundred steps
[10954.1s] the jokes about Pelicans are probably
[10955.5s] improving a little bit and then they
[10957.1s] actually dramatically fall off the cliff
[10958.8s] and you start to get extremely
[10960.4s] nonsensical results like for example you
[10962.5s] start to get um the top joke about
[10965.0s] Pelicans starts to be the
[10968.0s] and this makes no sense right like when
[10969.3s] you look at it why should this be a top
[10970.6s] joke but when you take the the and you
[10973.6s] plug it into your reward model you'd
[10975.6s] expect score of zero but actually the
[10977.4s] reward model loves this as a joke it
[10979.6s] will tell you that the the the theth is
[10982.8s] a score of 1. Z this is a top joke and
[10986.4s] this makes no sense right but it's
[10988.0s] because these models are just
[10989.1s] simulations of humans and they're
[10990.7s] massive neural lots and you can find
[10992.5s] inputs at the bottom that kind of like
[10995.1s] get into the part of the input space
[10996.6s] that kind of gives you nonsensical
[10997.8s] results these examples are what's called
[11000.3s] adversarial examples and I'm not going
[11002.0s] to go into the topic too much but these
[11004.0s] are adversarial inputs to the model they
[11006.5s] are specific little inputs that kind of
[11009.2s] go between the nooks and crannies of the
[11010.7s] model and give nonsensical results at
[11012.5s] the top now here's what you might
[11014.6s] imagine doing you say okay the the the
[11016.7s] is obviously not score of one um it's
[11019.4s] obviously a low score so let's take the
[11021.2s] the the the the let's add it to the data
[11023.4s] set and give it an ordering that is
[11025.6s] extremely bad like a score of five and
[11027.9s] indeed your model will learn that the D
[11030.0s] should have a very low score and it will
[11031.8s] give it score of zero the problem is
[11033.8s] that there will always be basically
[11035.4s] infinite number of nonsensical
[11037.7s] adversarial examples hiding in the model
[11040.7s] if you iterate this process many times
[11042.4s] and you keep adding nonsensical stuff to
[11044.2s] your reward model and giving it very low
[11045.9s] scores you can you'll never win the game
[11049.0s] uh you can do this many many rounds and
[11051.0s] reinforcement learning if you run it
[11052.5s] long enough will always find a way to
[11054.5s] gain the model it will discover
[11056.0s] adversarial examples it will get get
[11058.0s] really high scores uh with nonsensical
[11060.7s] results and fundamentally this is
[11063.0s] because our scoring function is a giant
[11066.0s] neural nut and RL is extremely good at
[11069.0s] finding just the ways to trick it uh so
[11073.8s] long story short you always run rhf put
[11076.8s] for maybe a few hundred updates the
[11078.5s] model is getting better and then you
[11079.9s] have to crop it and you are done you
[11082.2s] can't run too much against this reward
[11085.5s] model because the optimization will
[11087.8s] start to game it and you basically crop
[11090.2s] it and you call it and you ship it um
[11094.0s] and uh you can improve the reward model
[11096.3s] but you kind of like come across these
[11097.6s] situations eventually at some point so
[11100.9s] rhf basically what I usually say is that
[11103.7s] RF is not RL and what I mean by that is
[11106.7s] I mean RF is RL obviously but it's not
[11109.2s] RL in the magical sense this is not RL
[11112.4s] that you can run
[11113.7s] indefinitely these kinds of problems
[11116.3s] like where you are getting con correct
[11118.2s] answer you cannot gain this as easily
[11120.6s] you either got the correct answer or you
[11122.0s] didn't and the scoring function is much
[11123.8s] much simpler you're just looking at the
[11125.2s] boxed area and seeing if the result is
[11127.3s] correct so it's very difficult to gain
[11129.6s] these functions but uh gaming a reward
[11132.0s] model is possible now in these
[11134.2s] verifiable domains you can run RL
[11136.3s] indefinitely you could run for tens of
[11138.8s] thousands hundreds of thousands of steps
[11140.2s] and discover all kinds of really crazy
[11141.8s] strategies that we might not even ever
[11143.5s] think about of Performing really well
[11145.8s] for all these problems in the game of Go
[11148.8s] there's no way to to beat to basically
[11151.0s] game uh the winning of a game or the
[11152.9s] losing of a game we have a perfect
[11154.6s] simulator we know all the different uh
[11157.7s] where all the stones are placed and we
[11159.3s] can calculate uh whether someone has won
[11161.4s] or not there's no way to gain that and
[11163.6s] so you can do RL indefinitely and you
[11165.8s] can eventually be beat even leol but
[11168.7s] with models like this which are gameable
[11171.7s] you cannot repeat this process
[11173.8s] indefinitely so I kind of see rhf as not
[11176.6s] real RL because the reward function is
[11179.2s] gameable so it's kind of more like in
[11181.3s] the realm of like little fine-tuning
[11183.0s] it's a little it's a little Improvement
[11186.1s] but it's not something that is
[11187.5s] fundamentally set up correctly where you
[11189.6s] can insert more compute run for longer
[11192.0s] and get much better and magical results
[11194.6s] so it's it's uh it's not RL in that
[11196.8s] sense it's not RL in the sense that it
[11198.8s] lacks magic um it can find you in your
[11201.3s] model and get a better performance and
[11203.6s] indeed if we go back to chat GPT the GPT
[11206.9s] 40 model has gone through rhf because it
[11210.1s] works well but it's just not RL in the
[11212.8s] same sense rlf is like a little fine
[11214.9s] tune that slightly improves your model
[11216.5s] is maybe like the way I would think
[11217.7s] about it okay so that's most of the
[11219.6s] technical content that I wanted to cover
[11221.8s] I took you through the three major
[11223.4s] stages and paradigms of training these
[11225.2s] models pre-training supervised fine
[11227.4s] tuning and reinforcement learning and I
[11229.4s] showed you that they Loosely correspond
[11231.0s] to the process we already use for
[11232.7s] teaching children and so in particular
[11235.1s] we talked about pre-training being sort
[11237.0s] of like the basic knowledge acquisition
[11238.9s] of reading Exposition supervised fine
[11241.3s] tuning being the process of looking at
[11242.9s] lots and lots of worked examples and
[11245.0s] imitating experts and practice problems
[11248.6s] the only difference is that we now have
[11250.1s] to effectively write textbooks for llms
[11253.0s] and AIS across all the disciplines of
[11255.2s] human knowledge and also in all the
[11257.4s] cases where we actually would like them
[11259.2s] to work like code and math and you know
[11262.7s] basically all the other disciplines so
[11264.4s] we're in the process of writing
[11265.3s] textbooks for them refining all the
[11267.6s] algorithms that I've presented on the
[11268.8s] high level and then of course doing a
[11270.7s] really really good job at the execution
[11272.6s] of training these models at scale and
[11274.6s] efficiently so in particular I didn't go
[11276.7s] into too many details but these are
[11279.0s] extremely large and complicated
[11280.9s] distributed uh sort of
[11284.1s] um jobs that have to run over tens of
[11287.1s] thousands or even hundreds of thousands
[11288.3s] of gpus and the engineering that goes
[11290.9s] into this is really at the stateof the
[11292.7s] art of what's possible with computers at
[11294.5s] that scale so I didn't cover that aspect
[11297.7s] too much
[11299.1s] but um this is very kind of serious and
[11302.7s] they were underlying all these very
[11304.2s] simple algorithms
[11305.8s] ultimately now I also talked about sort
[11308.7s] of like the theory of mind a little bit
[11310.3s] of these models and the thing I want you
[11311.7s] to take away is that these models are
[11313.6s] really good but they're extremely useful
[11315.4s] as tools for your work you shouldn't uh
[11318.1s] sort of trust them fully and I showed
[11319.7s] you some examples of that even though we
[11321.5s] have mitigations for hallucinations the
[11323.2s] models are not perfect and they will
[11324.6s] hallucinate still it's gotten better
[11326.8s] over time and it will continue to get
[11328.2s] better but they can
[11329.6s] hallucinate in other words in in
[11332.2s] addition to that I covered kind of like
[11333.9s] what I call the Swiss cheese uh sort of
[11336.1s] model of llm capabilities that you
[11337.5s] should have in your mind the models are
[11339.2s] incredibly good across so many different
[11340.6s] disciplines but then fail randomly
[11342.5s] almost in some unique cases so for
[11345.1s] example what is bigger 9.11 or 9.9 like
[11347.9s] the model doesn't know but
[11349.2s] simultaneously it can turn around and
[11351.4s] solve Olympiad questions and so this is
[11354.5s] a hole in the Swiss cheese and there are
[11356.1s] many of them and you don't want to trip
[11357.8s] over them so don't um treat these models
[11361.6s] as infallible models check their work
[11364.0s] use them as tools use them for
[11365.6s] inspiration use them for the first draft
[11368.1s] but uh work with them as tools and be
[11370.4s] ultimately respons responsible for the
[11372.9s] you know product of your
[11375.1s] work and that's roughly what I wanted to
[11378.9s] talk about this is how they're trained
[11380.9s] and this is what they are let's now turn
[11383.2s] to what are some of the future
[11384.3s] capabilities of these models uh probably
[11386.7s] what's coming down the pipe and also
[11388.1s] where can you find these models I have a
[11390.2s] few blow points on some of the things
[11391.4s] that you can expect coming down the pipe
[11393.5s] the first thing you'll notice is that
[11395.0s] the models will very rapidly become
[11396.9s] multimodal everything I talked about
[11398.6s] above concerned text but very soon we'll
[11401.3s] have llms that can not just handle text
[11403.6s] but they can also operate natively and
[11405.6s] very easily over audio so they can hear
[11408.0s] and speak and also images so they can
[11410.5s] see and paint and we're already seeing
[11413.0s] the beginnings of all of this uh but
[11415.0s] this will be all done natively inside
[11417.3s] inside the language model and this will
[11419.2s] enable kind of like natural
[11420.6s] conversations and roughly speaking the
[11422.4s] reason that this is actually no
[11423.6s] different from everything we've covered
[11424.8s] above is that as a baseline you can
[11428.4s] tokenize audio and images and apply the
[11431.2s] exact same approaches of everything that
[11432.7s] we've talked about above so it's not a
[11434.6s] fundamental change it's just uh it's
[11436.3s] just a to we have to add some tokens so
[11438.8s] as an example for tokenizing audio we
[11441.0s] can look at slices of the spectrogram of
[11443.2s] the audio signal and we can tokenize
[11446.0s] that and just add more tokens that
[11447.8s] suddenly represent audio and just add
[11450.0s] them into the context windows and train
[11451.4s] on them just like above the same for
[11453.4s] images we can use patches and we can
[11456.0s] separately tokenize patches and then
[11458.8s] what is an image an image is just a
[11460.6s] sequence of tokens and this actually
[11463.1s] kind of works and there's a lot of early
[11464.7s] work in this direction and so we can
[11466.6s] just create streams of tokens that are
[11468.7s] representing audio images as well as
[11470.4s] text and interpers them and handle them
[11472.5s] all simultaneously in a single model so
[11474.8s] that's one example of multimodality
[11477.4s] uh second something that people are very
[11478.9s] interested in
[11480.0s] is currently most of the work is that
[11482.3s] we're handing individual tasks to the
[11484.2s] models on kind of like a silver platter
[11486.3s] like please solve this task for me and
[11488.0s] the model sort of like does this little
[11489.5s] task but it's up to us to still sort of
[11492.1s] like organize a coherent execution of
[11495.0s] tasks to perform jobs and the models are
[11498.1s] not yet at the capability required to do
[11501.3s] this in a coherent error correcting way
[11503.8s] over long periods of time so they're not
[11506.7s] able to fully string together tasks to
[11508.6s] perform these longer running jobs but
[11511.2s] they're getting there and this is
[11512.2s] improving uh over time but uh probably
[11515.1s] what's going to happen here is we're
[11516.1s] going to start to see what's called
[11517.4s] agents which perform tasks over time and
[11520.4s] you you supervise them and you watch
[11522.7s] their work and they come up to once in a
[11524.7s] while report progress and so on so we're
[11527.2s] going to see more long running agents uh
[11529.6s] tasks that don't just take you know a
[11531.0s] few seconds of response but many tens of
[11533.1s] seconds or even minutes or hours over
[11535.2s] time uh but these uh models are not
[11537.9s] infallible as we talked about above so
[11539.8s] all of this will require supervision so
[11541.8s] for example in factories people talk
[11543.2s] about the human to robot ratio uh for
[11546.2s] automation I think we're going to see
[11547.8s] something similar in the digital space
[11549.8s] where we are going to be talking about
[11551.0s] human to agent ratios where humans
[11553.2s] becomes a lot more supervisors of agent
[11555.7s] tasks um in the digital
[11558.4s] domain uh next um I think everything is
[11561.4s] going to become a lot more pervasive and
[11562.8s] invisible so it's kind of like
[11564.6s] integrated into the tools and everywhere
[11568.7s] um and in addition kind of like computer
[11571.2s] using so right now these models aren't
[11573.6s] able to take actions on your behalf but
[11576.2s] I think this is a separate bullet point
[11578.9s] um if you saw chpt launch the operator
[11582.3s] then uh that's one early example of that
[11584.3s] where you can actually hand off control
[11585.8s] to the model to perform you know
[11587.8s] keyboard and mouse actions on your
[11589.6s] behalf so that's also something that
[11591.3s] that I think is very interesting the
[11593.2s] last point I have here is just a general
[11594.6s] comment that there's still a lot of
[11595.8s] research to potentially do in this
[11596.9s] domain main one example of that uh is
[11599.7s] something along the lines of test time
[11600.9s] training so remember that everything
[11602.8s] we've done above and that we talked
[11604.2s] about has two major stages there's first
[11607.1s] the training stage where we tune the
[11608.7s] parameters of the model to perform the
[11610.5s] tasks well once we get the parameters we
[11613.2s] fix them and then we deploy the model
[11615.0s] for inference from there the model is
[11617.7s] fixed it doesn't change anymore it
[11619.6s] doesn't learn from all the stuff that
[11621.2s] it's doing a test time it's a fixed um
[11623.5s] number of parameters and the only thing
[11625.4s] that is changing is now the token inside
[11627.6s] the context windows and so the only type
[11629.9s] of learning or test time learning that
[11632.0s] the model has access to is the in
[11633.8s] context learning of its uh kind of like
[11636.9s] uh dynamically adjustable context window
[11639.0s] depending on like what it's doing at
[11640.5s] test time so but I think this is still
[11643.3s] different from humans who actually are
[11644.6s] able to like actually learn uh depending
[11646.6s] on what they're doing especially when
[11648.0s] you sleep for example like your brain is
[11649.5s] updating your parameters or something
[11650.8s] like that right so there's no kind of
[11653.0s] equivalent of that currently in these
[11654.4s] models and tools so there's a lot of
[11656.6s] like um more wonky ideas I think that
[11658.4s] are to be explored still and uh in
[11660.8s] particular I think this will be
[11661.9s] necessary because the context window is
[11664.1s] a finite and precious resource and
[11666.1s] especially once we start to tackle very
[11667.8s] long running multimodal tasks and we're
[11670.4s] putting in videos and these token
[11671.8s] windows will basically start to grow
[11674.1s] extremely large like not thousands or
[11676.4s] even hundreds of thousands but
[11677.8s] significantly beyond that and the only
[11679.9s] trick uh the only kind of trick we have
[11681.8s] Avail to us right now is to make the
[11683.5s] context Windows longer but I think that
[11686.1s] that approach by itself will will not
[11687.4s] will not scale to actual long running
[11689.8s] tasks that are multimodal over time and
[11692.0s] so I think new ideas are needed in some
[11693.9s] of those disciplines um in some of those
[11696.8s] kind of cases in the main where these
[11698.5s] tasks are going to require very long
[11700.8s] contexts so those are some examples of
[11703.0s] some of the things you can um expect
[11705.4s] coming down the pipe let's now turn to
[11707.2s] where you can actually uh kind of keep
[11709.1s] track of this progress and um you know
[11712.2s] be up to date with the latest and grest
[11714.0s] of what's happening in the field so I
[11715.4s] would say the three resources that I
[11716.6s] have consistently used to stay up to
[11718.6s] date are number one El Marina uh so let
[11721.8s] me show you El
[11723.7s] Marina this is basically an llm leader
[11726.4s] board and it ranks all the top models
[11730.2s] and the ranking is based on human
[11732.2s] comparisons so humans prompt these
[11734.0s] models and they get to judge which one
[11736.0s] gives a better answer they don't know
[11737.6s] which model is which they're just
[11739.0s] looking at which model is the better
[11740.3s] answer and you can calculate a ranking
[11742.7s] and then you get some results and so
[11744.6s] what you can hear is what you can see
[11746.3s] here is the different organizations like
[11748.0s] Google Gemini for example that produce
[11749.6s] these models when you click on any one
[11751.3s] of these it takes you to the place where
[11753.9s] that model is
[11755.0s] hosted and then here we see Google is
[11757.2s] currently on top with open AI right
[11759.3s] behind here we see deep seek in position
[11762.4s] number three now the reason this is a
[11764.2s] big deal is the last column here you see
[11766.0s] license deep seek is an MIT license
[11768.4s] model it's open weights anyone can use
[11770.8s] these weights uh anyone can download
[11772.7s] them anyone can host their own version
[11774.6s] of Deep seek and they can use it in what
[11776.9s] whatever way they like and so it's not a
[11778.7s] proprietary model that you don't have
[11780.0s] access to it's it's basically an open
[11781.6s] weight release and so this is kind of
[11784.5s] unprecedented that a model this strong
[11787.2s] was released with open weights so pretty
[11789.7s] cool from the team next up we have a few
[11792.2s] more models from Google and open Ai and
[11794.1s] then when you continue to scroll down
[11795.4s] you start to see some other Usual
[11796.9s] Suspects so xai here anthropic with son
[11800.8s] it uh here at number
[11803.0s] 14 and
[11805.6s] um then
[11807.9s] meta with llama over here so llama
[11811.0s] similar to deep seek is an open weights
[11812.8s] model and so uh but it's down here as
[11815.8s] opposed to up here now I will say that
[11818.0s] this leaderboard was really good for a
[11820.6s] long time I do think that in the last
[11823.7s] few months it's become a little bit
[11825.4s] gamed um and I don't trust it as much as
[11828.4s] I used to I think um just empirically I
[11831.8s] feel like a lot of people for example
[11833.1s] are using a Sonet from anthropic and
[11835.4s] that it's a really good model so but
[11837.0s] that's all the way down here um in
[11839.6s] number 14 and conversely I think not as
[11842.4s] many people are using Gemini but it's
[11843.7s] racking really really high uh so I think
[11847.0s] use this as a first pass uh but uh sort
[11850.5s] of try out a few of the models for your
[11852.9s] tasks and see which one performs better
[11855.6s] the second thing that I would point to
[11857.0s] is the uh AI news uh newsletter so AI
[11861.1s] news is not very creatively named but it
[11863.2s] is a very good newsletter produced by
[11864.9s] swix and friends so thank you for
[11866.3s] maintaining it
[11867.4s] and it's been very helpful to me because
[11868.8s] it is extremely comprehensive so if you
[11870.8s] go to archives uh you see that it's
[11873.0s] produced almost every other day and um
[11876.1s] it is very comprehensive and some of it
[11878.1s] is written by humans and curated by
[11879.6s] humans but a lot of it is constructed
[11881.3s] automatically with llms so you'll see
[11883.2s] that these are very comprehensive and
[11884.8s] you're probably not missing anything
[11886.4s] major if you go through it of course
[11888.8s] you're probably not going to go through
[11889.9s] it because it's so long but I do think
[11892.0s] that these summaries all the way up top
[11894.1s] are quite good and I think have some
[11895.6s] human oversight uh so this has been very
[11898.6s] helpful to me and the last thing I would
[11900.4s] point to is just X and Twitter uh a lot
[11902.7s] of um AI happens on X and so I would
[11905.7s] just follow people who you like and
[11907.2s] trust and get all your latest and
[11909.8s] greatest uh on X as well so those are
[11912.4s] the major places that have worked for me
[11913.7s] over time and finally a few words on
[11915.6s] where you can find the models and where
[11917.6s] can you use them so the first one I
[11919.4s] would say is for any of the biggest
[11921.2s] proprietary models you just have to go
[11922.8s] to the website of that LM provider so
[11924.7s] for example for open a that's uh chat
[11927.2s] I believe actually works now uh so
[11929.4s] that's for open
[11930.7s] AI now for or you know for um for Gemini
[11934.6s] I think it's gem. google.com or AI
[11937.8s] Studio I think they have two for some
[11939.6s] reason that I don't fly understand no
[11941.5s] one does um for the open weights models
[11944.7s] like deep SE CL Etc you have to go to
[11946.7s] some kind of an inference provider of
[11948.1s] LMS so my favorite one is together
[11950.1s] together. a and I showed you that when
[11951.8s] you go to the playground of together. a
[11954.3s] then you can sort of pick lots of
[11955.8s] different models and all of these are
[11957.4s] open models of different types and you
[11959.3s] can talk to them here as an
[11961.6s] example um now if you'd like to use a
[11964.7s] base model like um you know a base model
[11968.1s] then this is where I think it's not as
[11969.5s] common to find base models even on these
[11971.0s] inference providers they are all
[11972.7s] targeting assistants and chat and so I
[11975.6s] think even here I can't I couldn't see
[11977.5s] base models here so for base models I
[11979.7s] usually go to hyperbolic because they
[11981.8s] serve my llama 3.1 base and I love that
[11985.1s] model and you can just talk to it here
[11987.2s] so as far as I know this is this is a
[11989.4s] good place for a base model and I wish
[11991.4s] more people hosted base models because
[11993.3s] they are useful and interesting to work
[11994.8s] with in some cases finally you can also
[11997.6s] take some of the models that are smaller
[11999.7s] and you can run them locally and so for
[12002.4s] example deep seek the biggest model
[12004.3s] you're not going to be able to run
[12005.3s] locally on your MacBook but there are
[12007.7s] smaller versions of the deep seek model
[12009.3s] that are what's called distilled and
[12011.2s] then also you can run these models at
[12012.5s] smaller Precision so not at the native
[12014.5s] Precision of for example fp8 on deep
[12017.1s] seek or you know bf16 llama but much
[12020.3s] much lower than that um and don't worry
[12023.8s] if you don't fully understand those
[12024.8s] details but you can run smaller versions
[12026.7s] that have been distilled and then at
[12028.3s] even lower precision and then you can
[12029.9s] fit them on your uh computer and so you
[12033.2s] can actually run pretty okay models on
[12035.2s] your laptop and my favorite I think
[12037.4s] place I go to usually is LM studio uh
[12039.7s] which is basically an app you can get
[12042.1s] and I think it kind of actually looks
[12043.6s] really ugly and it's I don't like that
[12045.5s] it shows you all these models that are
[12046.8s] basically not that useful like everyone
[12048.2s] just wants to run deep seek so I don't
[12049.8s] know why they give you these 500
[12051.4s] different types of models they're really
[12053.0s] complicated to search for and you have
[12054.3s] to choose different distillations and
[12056.3s] different uh precisions and it's all
[12058.4s] really confusing but once you actually
[12060.3s] understand how it works and that's a
[12061.3s] whole separate video then you can
[12062.9s] actually load up a model like here I
[12064.4s] loaded up a llama 3 uh2 instruct 1
[12068.3s] billion and um you can just talk to it
[12071.5s] so I ask for Pelican jokes and I can ask
[12074.3s] for another one and it gives me another
[12075.8s] one Etc all of this that happens here is
[12078.8s] locally on your computer so we're not
[12080.8s] actually going to anywhere anyone else
[12082.4s] this is running on the GPU on the
[12084.3s] MacBook Pro so that's very nice and you
[12087.0s] can then eject the model when you're
[12088.3s] done and that frees up the ram so LM
[12091.4s] studio is probably like my favorite one
[12093.1s] even though I don't I think it's got a
[12094.4s] lot of uiux issues and it's really
[12096.2s] geared towards uh professionals almost
[12099.3s] uh but if you watch some videos on
[12100.6s] YouTube I think you can figure out how
[12102.0s] to how to use this
[12103.4s] interface uh so those are a few words on
[12105.8s] where to find them so let me now loop
[12107.7s] back around to where we started the
[12109.5s] question was when we go to chashi
[12110.8s] pta.com and we enter some kind of a
[12113.7s] query and we hit go what exactly is
[12117.6s] happening here what are we seeing what
[12119.9s] are we talking to how does this work and
[12123.2s] I hope that this video gave you some
[12124.7s] appreciation for some of the under the
[12126.1s] hood details of how these models are
[12128.3s] trained and what this is that is coming
[12130.3s] back so in particular we now know that
[12132.4s] your query is taken and is first chopped
[12135.4s] up into tokens so we go to to tick
[12138.2s] tokenizer and here where is the place in
[12141.0s] the in the um sort of format that is for
[12144.5s] the user query we basically put in our
[12147.8s] query right there so our query goes into
[12151.0s] what we discussed here is the
[12152.7s] conversation protocol format which is
[12154.8s] this way that we maintain conversation
[12156.8s] objects so this gets inserted there and
[12159.6s] then this whole thing ends up being just
[12161.0s] a token sequence a onedimensional token
[12163.1s] sequence under the hood so Chachi PT saw
[12166.0s] this token sequence and then when we hit
[12168.3s] go it basically continues appending
[12170.8s] tokens into this list it continues the
[12173.4s] sequence it acts like a token
[12175.0s] autocomplete so in particular it gave us
[12177.5s] this response so we can basically just
[12180.0s] put it here and we see the tokens that
[12182.1s] it continued uh these are the tokens
[12184.4s] that it continued with
[12186.0s] roughly now the question
[12188.5s] becomes okay why are these the tokens
[12190.8s] that the model responded with what are
[12192.8s] these tokens where are they coming from
[12194.6s] uh what are we talking to and how do we
[12197.1s] program this system and so that's where
[12199.4s] we shifted gears and we talked about the
[12201.4s] under thehood pieces of it so the first
[12204.0s] stage of this process and there are
[12205.5s] three stages is the pre-training stage
[12207.4s] which fundamentally has to do with just
[12208.9s] knowledge acquisition from the internet
[12210.9s] into the parameters of this neural
[12213.0s] network and so the neural net
[12215.4s] internalizes a lot of Knowledge from the
[12217.1s] internet but where the personality
[12219.1s] really comes in is in the process of
[12221.0s] supervised fine-tuning here and so what
[12224.6s] what happens here is that basically the
[12226.8s] a company like openai will curate a
[12229.4s] large data set of conversations like say
[12231.3s] 1 million conversation across very
[12233.0s] diverse topics and there will be
[12235.6s] conversations between a human and an
[12237.5s] assistant and even though there's a lot
[12239.5s] of synthetic data generation used
[12241.0s] throughout this entire process and a lot
[12242.8s] of llm help and so on fundamentally this
[12245.4s] is a human data curation task with lots
[12248.0s] of humans involved and in particular
[12250.2s] these humans are data labelers hired by
[12252.4s] open AI who are given labeling
[12254.0s] instructions that they learn and they
[12256.5s] task is to create ideal assistant
[12258.4s] responses for any arbitrary prompts so
[12261.2s] they are teaching the neural network by
[12264.2s] example how to respond to
[12267.4s] prompts so what is the way to think
[12270.0s] about what came back here like what is
[12272.7s] this well I think the right way to think
[12274.9s] about it is that this is the neural
[12277.2s] network simulation of a data labeler at
[12280.6s] openai so it's as if I gave this query
[12284.4s] to a data Li open and this data labeler
[12287.5s] first reads all of the labeling
[12288.8s] instructions from open Ai and then
[12291.0s] spends 2 hours writing up the ideal
[12293.5s] assistant response to this query and uh
[12297.2s] giving it to me now we're not actually
[12299.8s] doing that right because we didn't wait
[12301.1s] two hours so what we're getting here is
[12303.0s] a neural network simulation of that
[12305.2s] process and we have to keep in mind that
[12308.1s] these neural networks don't function
[12310.1s] like human brains do they are different
[12312.4s] what's easy or hard for them is
[12313.9s] different from what's easy or hard for
[12315.6s] humans and so we really are just getting
[12317.5s] a simulation so here I shown you this is
[12320.8s] a token stream and this is fundamentally
[12323.1s] the neural network with a bunch of
[12324.8s] activations and neurons in between this
[12326.6s] is a fixed mathematical expression that
[12329.0s] mixes inputs from tokens with parameters
[12332.8s] of the model and they get mixed up and
[12335.6s] get you the next token in a sequence but
[12337.6s] this is a finite amount of compute that
[12339.2s] happens for every single token and so
[12341.6s] this is some kind of a lossy simulation
[12344.2s] of a human that is kind of like
[12346.4s] restricted in this way and so whatever
[12349.2s] the humans
[12350.3s] write the language model is kind of
[12352.4s] imitating on this token level with only
[12355.2s] this this specific computation for every
[12358.3s] single token and
[12360.8s] sequence we also saw that as a result of
[12363.5s] this and the cognitive differences the
[12365.8s] models will suffer in a variety of ways
[12368.4s] and uh you have to be very careful with
[12370.0s] their use so for example we saw that
[12371.8s] they will suffer from hallucinations and
[12374.1s] they also we have the sense of a Swiss
[12376.6s] model of the LM capabilities where
[12378.8s] basically there's like holes in the
[12380.5s] cheese sometimes the models will just
[12382.7s] arbitrarily like do something dumb uh so
[12385.6s] even though they're doing lots of
[12386.6s] magical stuff sometimes they just can't
[12388.9s] so maybe you're not giving them enough
[12390.5s] tokens to think and maybe they're going
[12392.3s] to just make stuff up because they're
[12393.5s] mental arithmetic breaks uh maybe they
[12395.9s] are suddenly unable to count number of
[12398.0s] letters um or maybe they're unable to
[12400.6s] tell you that 911 9.11 is smaller than
[12403.7s] 9.9 and it looks kind of dumb and so so
[12406.4s] it's a Swiss cheese capability and we
[12408.0s] have to be careful with that and we saw
[12409.6s] the reasons for
[12410.8s] that but fundamentally this is how we
[12413.7s] think of what came back it's again a
[12416.3s] simulation of this neural network of a
[12420.9s] human data labeler following the
[12423.2s] labeling instructions at open a so
[12426.2s] that's what we're getting back now I do
[12429.3s] think that the uh things change a little
[12431.2s] bit when you actually go and reach for
[12433.8s] one of the thinking models like o03 mini
[12437.6s] and the reason for that is that GPT
[12440.4s] 40 basically doesn't do reinforcement
[12443.0s] learning it does do rhf but I've told
[12446.4s] you that rhf is not RL there's no
[12449.0s] there's no uh time for magic in there
[12451.2s] it's just a little bit of a fine-tuning
[12453.0s] is the way to look at it but these
[12455.1s] thinking models they do use RL so they
[12458.1s] go through this third state stage of
[12461.2s] perfecting their thinking process and
[12464.0s] discovering new thinking strategies and
[12466.0s] uh
[12467.0s] solutions to problem solving that look a
[12469.9s] little bit like your internal monologue
[12471.3s] in your head and they practice that on a
[12473.4s] large collection of practice problems
[12475.2s] that companies like openi create and
[12477.5s] curate and um then make available to the
[12480.6s] LMS so when I come here and I talked to
[12482.8s] a thinking model and I put in this
[12485.8s] question what we're seeing here is not
[12487.9s] anymore just the straightforward
[12489.2s] simulation of a human data labeler like
[12491.4s] this is actually kind of new unique and
[12494.0s] interesting um and of course open is not
[12496.4s] showing us the under thehood thinking
[12499.0s] and the chains of thought that are
[12500.4s] underlying the reasoning here but we
[12503.0s] know that such a thing exists and this
[12504.5s] is a summary of it and what we're
[12506.3s] getting here is actually not just an
[12507.9s] imitation of a human data labeler it's
[12509.7s] actually something that is kind of new
[12511.0s] and interesting and exciting in the
[12512.2s] sense that it is a function of thinking
[12515.3s] that was emergent in a simulation it's
[12517.3s] not just imitating human data labeler it
[12519.5s] comes from this reinforcement learning
[12521.3s] process and so here we're of course not
[12523.5s] giving it a chance to shine because this
[12525.2s] is not a mathematical or a reasoning
[12526.8s] problem this is just some kind of a sort
[12528.8s] of creative writing problem roughly
[12530.5s] speaking and I think it's um it's a a
[12534.6s] question an open question as to whether
[12537.6s] the thinking strategies that are
[12539.3s] developed inside verifiable domains
[12542.4s] transfer and are generalizable to other
[12545.8s] domains that are unverifiable such as
[12547.8s] create writing the extent to which that
[12549.9s] transfer happens is unknown in the field
[12552.2s] I would say so we're not sure if we are
[12554.4s] able to do RL on everything that is very
[12556.1s] verifiable and see the benefits of that
[12558.2s] on things that are unverifiable like
[12560.0s] this prompt so that's an open question
[12562.4s] the other thing that's interesting is
[12563.8s] that this reinforcement learning here is
[12566.3s] still like way too new primordial and
[12569.5s] nent so we're just seeing like the
[12571.8s] beginnings of the hints of greatness uh
[12574.2s] in the reasoning problems we're seeing
[12576.2s] something that is in principle capable
[12578.2s] of something like the equivalent of move
[12580.4s] 37 but not in the game of Go but in open
[12584.4s] domain thinking and problem solving in
[12586.5s] principle this Paradigm is capable of
[12588.8s] doing something really cool new and
[12590.1s] exciting something even that no human
[12592.6s] has thought of before in principle these
[12594.8s] models are capable of analogies no human
[12596.7s] has had so I think it's incredibly
[12599.0s] exciting that these models exist but
[12600.7s] again it's very early and these are
[12602.1s] primordial models for now um and they
[12605.1s] will mostly shine in domains that are
[12606.8s] verifiable like math en code Etc so very
[12610.4s] interesting to play with and think about
[12612.0s] and
[12612.7s] use and then that's roughly it um um I
[12616.6s] would say those are the broad Strokes of
[12618.4s] what's available right now I will say
[12620.4s] that overall it is an extremely exciting
[12623.0s] time to be in the
[12624.2s] field personally I use these models all
[12626.4s] the time daily uh tens or hundreds of
[12628.7s] times because they dramatically
[12630.1s] accelerate my work I think a lot of
[12631.6s] people see the same thing I think we're
[12633.3s] going to see a huge amount of wealth
[12634.5s] creation as a result of these models be
[12637.1s] aware of some of their shortcomings even
[12640.6s] with RL models they're going to suffer
[12642.2s] from some of these use it as a tool in a
[12644.5s] toolbox don't trust it fully because
[12647.8s] they will randomly do dumb things they
[12649.7s] will randomly hallucinate they will
[12651.4s] randomly skip over some mental
[12652.7s] arithmetic and not get it right um they
[12655.0s] randomly can't count or something like
[12656.6s] that so use them as tools in the toolbox
[12658.8s] check their work and own the product of
[12660.6s] your work but use them for inspiration
[12663.1s] for first draft uh ask them questions
[12666.0s] but always check and verify and you will
[12668.6s] be very successful in your work if you
[12670.6s] do so uh so I hope this video was useful
[12673.8s] and interesting to you I hope you had it
[12675.4s] fun and uh it's already like very long
[12677.7s] so I apologize for that but I hope it
[12679.5s] was useful and yeah I will see you later
