Finetuning Large Language Models
https://learn.deeplearning.ai/finetuning-large-language-models/
Welcome to Fine-Tuning Large Language Models, taught
by Sharon Zhou.
Really glad to be here.
When I visit with different groups, I often hear people ask,
how can I use these large language models on my
own data or on my own task?
Whereas you might already know about how to
prompt a large language model, this course goes
over another important tool, fine-tuning them.
Specifically, how to take, say, an open-source
LLM and further train it on your own data.
While writing a prompt can be pretty good at getting an LLM to follow directions to carry out a task, like extracting keywords or classifying text as positive or negative sentiment, if you fine-tune, you can then get the LLM to even more consistently do what you want.
And I've found that prompting an LLM to speak in a certain style, like being more helpful or more polite, or to be succinct versus verbose to a specific extent, can also be challenging.
Fine-tuning turns out to also be a good way to adjust an LLM's tone.
People are now aware of the amazing capabilities of
ChatGPT and other popular LLMs to answer questions about a huge range
of topics.
But individuals and companies would like to have that
same interface to their own private and proprietary data.
One of the ways to do this is to train
an LLM with your data.
Of course, training a foundation LLM takes
a massive amount of data, maybe hundreds of billions
or even more than a trillion words of data,
and massive GPU compute resources.
But with fine-tuning, you can take an existing
LLM and train it further on your own data.
So, in this course, you'll learn what fine-tuning is, when it
might be helpful for your applications, how fine-tuning
fits into training, how it differs from prompt engineering
or retrieval augmented generation alone, and
how these techniques can be used
alongside fine-tuning.
You'll dive into a specific variant of fine-tuning that turned GPT-3 into ChatGPT, called instruction fine-tuning, which teaches an LLM to follow instructions.
Finally, you'll go through the steps of fine-tuning your
own LLM, preparing the data, training the
model, and evaluating it, all in code.
This course is designed to be accessible to
someone familiar with Python.
But to understand all the code, it will help to further have basic knowledge of deep learning, such as what the process of training a neural network is like, and what, say, a train/test split is.
A lot of hard work has gone into this course.
We'd like to acknowledge the whole Lamini team, and Nina Wei in particular on design, as well as, on the DeepLearning.AI side, Tommy Nelson and Geoff Ludwig.
In about an hour or so through this short course, you'll gain a deeper understanding of how you can build your own LLM by fine-tuning an existing LLM on your own data.
Let's get started.
Why finetune
In this lesson, you'll get to learn why you should fine-tune, what
fine-tuning really even is, compare it to
prompt engineering, and go through a lab where you get to
compare a fine-tuned model to a non-fine-tuned model.
Cool, let's get started!
Alright, so why should you fine-tune LLMs?
Well before we jump into why, let's talk about what fine-tuning really
is.
So what fine-tuning is, is taking these general purpose
models like GPT-3 and specializing them into something
like ChatGPT,
the specific chat use case to make it chat well, or taking GPT-4 and turning that into a specialized GitHub Copilot use case to auto-complete code.
An analogy I like to make is a PCP, a primary care physician,
is like your general purpose model.
You go to your PCP every year for a general checkup,
but a fine-tune or specialized model is like a cardiologist or dermatologist,
a doctor that has a specific specialty and can actually
take care of your heart problems or skin problems in much more
depth.
So what fine-tuning actually does for your model is that it makes it possible for you to give it a lot more data than what fits into the prompt, so that your model can learn from that data rather than just get access to it. Through that learning process, it's able to upgrade itself from that PCP into something more specialized, like a dermatologist.
So you can see in this figure you might have some symptoms
that you input into the model like skin irritation,
redness, itching, and the base model
which is the general purpose model might just
say this is probably acne.
A model that is fine-tuned on dermatology data however
might take in the same symptoms and be
able to give you a much clearer, more specific diagnosis.
In addition to learning new information, fine-tuning can also help
steer the model to more consistent outputs or more
consistent behavior.
For example, you can see the base model here.
When you ask it, what's your first name?
It might respond with, what's your last name?
Because it's seen so much survey data out there of different questions.
So it doesn't even know that it's supposed to answer that question.
But a fine-tuned model by contrast, when you ask it, what's your
first name?
would be able to respond clearly.
My first name is Sharon.
This bot was probably trained on me.
In addition to steering the model to more
consistent outputs or behavior, fine tuning can help
the model reduce hallucinations, which is a common problem
where the model makes stuff up.
Maybe it will say my first name is Bob when this was
trained on my data and my name is definitely not Bob.
Overall, fine tuning enables you to customize the model
to a specific use case.
The fine-tuning process, which we'll go into in far more detail later, is actually very similar to the model's earlier training recipe.
So now to compare it with something that you're
probably a little bit more familiar with, which
is prompt engineering.
This is something that you've already been doing for a while
with large language models, but maybe even for over the
past decade with Google, which is just putting a query in, editing
the query to change the results that you see.
So there are a lot of pros to prompting.
One is that you really don't need any data to get started.
You can just start chatting with the model.
There's a smaller upfront cost, so you don't really
need to think about cost, since every single time you ping
the model, it's not that expensive.
And you don't really need technical knowledge to get started.
You just need to know how to send a text message.
What's cool is that there are now methods you can use, such
as retrieval augmented generation, or RAG, to
connect more of your data to it, to selectively choose what kind of data
goes into the prompt.
Now of course, if you have more than a little bit of data,
then it might not fit into the prompt.
So you can't use that much data.
Oftentimes when you do try to fit in a ton of data, unfortunately it will forget a lot of that data. There are issues with hallucination, which is when the model does make stuff up, and it's hard to correct incorrect information that it's already learned. So while using retrieval augmented generation can be great to connect your data, it will also often miss the right data, get incorrect data, and cause the model to output the wrong thing.
Fine tuning is kind of the opposite of prompting.
So you can actually fit in almost an
unlimited amount of data, which is nice because
the model gets to learn new information on that data.
As a result, you can correct that incorrect information that it
may have learned before, or even put in
recent information that it hadn't learned about previously.
There's less cost afterwards if you do fine-tune a smaller model, and this is particularly relevant if you expect to hit the model a lot of times, so if you have a lot of throughput or you expect it to handle a larger load.
And retrieval augmented generation can be used here too. I think sometimes people think it's a separate thing, but actually you can use it for both cases. So you can connect the model with far more data as well, even after it's learned all this information.
There are cons, however.
You need more data, and that data has to
be higher quality to get started.
There is an upfront compute cost as well, so it's
not free necessarily.
It's not just a couple dollars just to get started.
Of course, there are now free tools out there to get started,
but there is compute involved in making this happen,
far more than just prompting.
And oftentimes you need some technical knowledge to get the data in the right place, especially surrounding this data piece. You know, there are more and more tools now that are making this far easier, but you still need some understanding of that data. So it's not something just anyone who can send a text message can necessarily do.
So finally, what that means is for prompting,
you know, that's great for generic use cases.
It's great for different side projects and prototypes.
It's great to just get started really, really
fast.
Meanwhile, fine tuning is great for more enterprise or domain-specific
use cases, and for production usage.
And we'll also talk about how it's useful for privacy in this
next section, which is the benefits of fine-tuning your own
LLM. So if you have your own LLM that
you fine-tuned, one benefit you get is around performance.
So this can stop the LLM from making stuff up,
especially around your domain.
It can have far more expertise in that domain.
It can be far more consistent.
So sometimes these models will just produce, you know,
something really great today, but then tomorrow you hit it and it
isn't consistent anymore.
It's not giving you that great output anymore.
And so this is one way to actually make it
far more consistent and reliable.
And you can also have it be better at moderating. If you've
played a lot with ChatGPT, you might have seen ChatGPT say, I'm sorry,
I can't respond to that.
And you can actually get it to say the same thing, or something different that's related to your company or use case, to help the person chatting with it stay on track.
And again, so now I want to touch on privacy.
When you fine tune your own LLM, this can happen in your VPC or on
premise.
This prevents data leakage and data breaches that might happen with off-the-shelf, third-party solutions. And so this is one way to keep that data safe that you've been collecting for a while, whether that's the last few days or the last couple of decades.
Another reason you might want to fine-tune your own LLM is around cost. One is just cost transparency: maybe you have a lot of people using your model and you actually want to lower the cost per request; then fine-tuning a smaller LLM can actually help you do that.
And overall, you have greater control
over costs and a couple other factors as well.
That includes uptime and also latency.
You can greatly reduce the latency for certain applications
like autocomplete.
You might need latency that is sub-200 milliseconds so that it's not perceivable by the person using autocomplete. You probably don't want autocomplete to take 30 seconds, which is currently sometimes the case when running GPT-4.
And finally, in moderation, we talked about that
a little bit here already.
But basically, if you want the model to say, "I'm sorry" to certain things, or to say, "I don't know" to certain things, or even to have a custom response, this is one way to actually provide those guardrails to the model. And what's really cool is you actually get to see an example of that in the notebooks.
All right.
So across all of these different labs, you'll be using a lot of
different technologies to fine tune.
So there are three Python libraries.
One is PyTorch developed by Meta.
This is the lowest level interface that you'll see.
And then there's a great library by HuggingFace built on top of PyTorch and a lot of the great work that's been done, and it's much higher level.
You can import datasets and train models very easily.
And then finally, you'll see the llama library from Lamini, which I've been developing with my team. And we call it the llama library for all the great llamas out there.
And this is an even higher level interface
where you can train models with just three
lines of code.
All right.
So let's hop over to the notebooks and see some fine-tuned models
in action.
Okay, so we're going to compare a fine-tuned model
with a non-fine-tuned model.
So first we're importing the BasicModelRunner from the llama library; again, this is from Lamini.
And all this class does is it helps us run open-source
models.
So these are hosted open-source models on GPUs to
run them really efficiently.
And the first model you can run here is the Llama 2 model,
which is very popular right now.
And this one is not fine-tuned.
So we're gonna just instantiate it based on its Hugging Face name, and we're gonna say, tell me how to train my dog to sit. So it's really, really simple here: into the non-fine-tuned model, we're gonna get the output out. Let's print the non-fine-tuned output and see. Oof.
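As a rough sketch, that first cell looks something like this; the exact Hugging Face model identifier is an assumption, and BasicModelRunner is the Lamini llama class described above:

```python
from llama import BasicModelRunner

# Non-fine-tuned Llama 2 base model, referenced by its (assumed) Hugging Face name
non_finetuned = BasicModelRunner("meta-llama/Llama-2-7b-hf")
non_finetuned_output = non_finetuned("Tell me how to train my dog to sit")
print(non_finetuned_output)
```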
Okay.
So we asked it.
Tell me how to train my dog to sit.
It said period, and then tell me how to train my dog to stay,
tell me how to teach my dog to come, tell me how to get my dog to heel.
So clearly this is very similar to the what's-your-first-name, what's-your-last-name answer.
This model has not been told or trained
to actually respond to that command.
So maybe a bit of a disaster, but let's keep looking.
So maybe we can ask it, what do you think of Mars?
So now, you know, at least it's responding to the question, but
it's not great responses.
I think it's a great planet.
I think it's a good planet.
I think it'll be a great planet.
So it keeps going.
Very philosophical, potentially even existential,
if you keep reading.
All right.
What about something like a Google search query,
like Taylor Swift's best friend?
Let's see what that actually says.
All right.
Well, it doesn't quite get Taylor Swift's best
friend, but it did say that it's a huge
Taylor Swift fan.
All right, let's keep exploring, maybe something that's a conversation, to see if it can do turns in a conversation like ChatGPT.
So this is an agent for an Amazon delivery order.
Okay, so at least it's doing the different customer
agent turns here, but it isn't quite
getting anything out of it. This is not something usable for
any kind of like fake turns or help
with making an auto agent.
All right, so you've seen enough of that.
Let's actually compare this to Llama 2 that has been fine-tuned to
actually chat.
So I'm gonna instantiate the fine-tune model.
Notice that this name, all that's different is this chat here.
And then I'm gonna let this fine-tune model do the same thing.
So tell me how to train my dog to sit.
I'm gonna print that.
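For reference, a minimal sketch of that comparison; again, the chat model's Hugging Face name is an assumption:

```python
# Fine-tuned (chat) variant of Llama 2 -- the only difference in the name is "chat"
finetuned_model = BasicModelRunner("meta-llama/Llama-2-7b-chat-hf")
print(finetuned_model("Tell me how to train my dog to sit"))
```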
Okay, very interesting.
So you can immediately tell a difference.
So tell me how to train my dog to sit. It's still trying to auto-complete that.
So tell me how to train my dog to sit on command.
But then it actually goes through almost a step-by-step guide
of what to do to train my dog to sit.
Cool, so that's much, much better.
And the way to actually, quote unquote,
get rid of this extra auto-complete thing is actually to
inform the model that you want instructions.
So I'm actually putting these instruction tags here.
These instruction tags were used for Llama 2.
You can use something different when you fine-tune
your own model, but this helps with telling the model, hey,
these are my instructions and these are the boundaries.
I'm done with giving this instruction.
Stop continuing to give me an instruction.
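The call with instruction tags looks roughly like this; the [INST] markers are the ones described here for Llama 2 chat, and the exact formatting should be treated as an assumption:

```python
# Wrap the prompt in Llama 2's instruction tags so the model knows where the instruction ends
print(finetuned_model("[INST]Tell me how to train my dog to sit[/INST]"))
```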
So here you can see that it doesn't auto-complete that on-command thing.
And just to compare, just to be fair,
we can see what the non-fine-tuned model actually says.
Great, it just repeats the same thing or something very similar.
Not quite right.
Cool, let's keep going down. So what do you think of Mars, this
model?
Oh, it's a fascinating planet.
It's captured the imagination of humans for centuries.
Okay, cool.
So something that's much better output here.
What about Taylor Swift's best friend?
Let's see how this does.
Okay, this one's pretty cute.
It has a few candidates for who Taylor Swift's
best friend actually is.
Let's take a look at these turns from the Amazon delivery agent.
Okay.
It says, I see. Can you provide me with your order number?
This is much, much better.
It's interesting because down here, it also summarizes what's going on,
which may or may not be something that you would want,
and that would be something you can fine tune away.
And now I'm curious what chat GPT would say for, tell
me how to train my dog to sit.
Okay.
So it gives different steps as well.
Great.
Alright, feel free to use ChatGPT or any other
model to see what else they can each
do and compare the results. But it's pretty clear, I think, that
the ones that have been fine-tuned, including ChatGPT and this Llama 2 chat LLM, are clearly better than the one that was not fine-tuned.
Now in the next lesson, we're going to see where fine-tuning fits into the whole training process.
So you'll get to see the first step and how to even
get here with this fine-tuned model.
Where finetuning fits in
In this lesson, you'll learn about where fine-tuning really
fits into the training process.
It comes after a step called pre-training, which you'll go
into a little bit of detail on, and then you'll get to learn about all the
different tasks you get to apply fine-tuning to.
Alright, let's continue.
Alright, let's see where fine-tuning fits in.
First, let's take a look at pre-training.
This is the first step before fine-tuning even happens, and
it actually takes a model at the start that's completely
random.
It has no knowledge about the world at all.
So all its weights, if you're familiar with weights, are
completely random.
It cannot form English words at all.
It doesn't have language skills yet.
And the learning objective it has is next token prediction, or really, in a simplified sense, it's just next word prediction here. So it sees the word "once", and we want it to now predict the word "upon". But then you see the LLM just producing "sd!!!@". So, just really far from the word "upon"; that's where it's starting.
But it's taking in and reading from a giant corpus of data, often
scraped from the entire web.
We often call this unlabeled because it's not something that we've structured together; we've just scraped it from the web. I will say it has gone through many, many cleaning processes, so there is still a lot of manual work in getting this dataset to be effective for model pre-training.
And this is often called self-supervised learning because the
model is essentially supervising itself with
next token prediction.
All it has to do is predict the next word.
There aren't really labels otherwise.
Now, after training, here you see that the model is now
able to predict the word upon, or the token upon.
And it's learned language.
It's learned a bunch of knowledge from the internet. So it's fantastic that this process actually works this way, and it's amazing, because all it's doing is trying to predict the next token while reading the entire internet's worth of data. Now, okay, maybe there's an asterisk on "entire internet's worth of data" and "data scraped from the entire internet".
The actual understanding and knowledge behind this is
often not very public.
People don't really know exactly what that data set looks like for a lot
of the closed source models from large companies. But
there's been an amazing open source effort by EleutherAI to
create a dataset called The Pile, which you'll get to
explore in this lab.
And what it is, is that it's a set of 22
diverse datasets scraped from the entire internet.
Here you can see in this figure, you know, there's a "four score and seven years ago", so that's Lincoln's Gettysburg Address.
There's also Lincoln's carrot cake recipe.
And of course, also scraped from PubMed, there's
information about different medical texts.
And finally, there's also code in here from GitHub.
So it's a set of pretty intellectual datasets that's curated
together to actually infuse these models
with knowledge.
Now, this pre-training step is pretty expensive and time-consuming. It's actually expensive because it's so time-consuming to have the model go through all of this data, going from absolute randomness to understanding some of these texts, you know, putting together a carrot cake recipe while also writing code while also knowing about medicine and the Gettysburg Address.
Okay, so these pre-trained base models are great, and there are actually a lot of them that are open source out there. But, you know, they've been trained on these datasets from the web, and a model might have seen this geography homework you see here on the left, where it asks: what's the capital of Kenya? What's the capital of France? All in a line, without seeing the answers. So when you then input, what's the capital of Mexico, the LLM might just say, what's the capital of Hungary?
As you can see, it's not really useful in the sense of a chatbot interface.
So how do you get it to that chatbot interface?
Well, fine tuning is one of those ways to get you there.
And it should be really a tool in your toolbox.
So pre-training is really that first step that gets you
that base model.
And when you add more data in, not actually as much data,
you can use fine-tuning to get a fine-tuned model.
And actually, even with a fine-tuned model, you can continue adding fine-tuning steps afterwards.
So fine-tuning really is a step afterwards.
You can use the same type of data. You can actually
probably scrape data from different sources
and curate it together, which you'll take a look at in a little bit.
So that can be this, quote unquote, unlabeled data, but you can also curate data yourself to make it much more structured for the model to learn from.
And I think one thing that's key that differentiates fine-tuning from
pre-training is that there's much less data
needed.
You're building off of this base model that has
already learned so much knowledge and basic language
skills that you're really just taking it to
the next level.
You don't need as much data.
So this really is a tool in your toolbox.
And if you're coming from other machine learning areas, you know, where there's fine-tuning for discriminative tasks, maybe you're working with images and you've been fine-tuning on ImageNet, you'll find that the definition of fine-tuning here is a little bit looser. It's not as well defined for generative tasks, because we are actually updating the weights of the entire model, not just part of it, which is often the case when fine-tuning those other types of models.
So we have the same training objective as pre-training here for fine-tuning: next token prediction.
And all we're doing is changing up the data so that it's more
structured in a way, and the model can be more consistent in
outputting and mimicking that structure.
And also there are more advanced ways to
reduce how much you want to update this model, and we'll
discuss this a bit later.
So exactly what is fine-tuning doing for you?
So you're getting a sense of what it is right now, but what are the different tasks you can actually do with it?
Well, one giant category I like to think about
is just behavior change. You're changing the behavior of the model.
You're telling it exactly, you know, in this chat interface, we're in
a chat setting right now. We're not looking at a survey.
So this results in the model being able
to respond much more consistently.
It means the model can focus better.
Maybe that could be better for moderation, for example.
And it's also generally just teasing out its capabilities.
So here it's better at conversation so that it can now talk about a wide
variety of things versus
before we would have to do a lot of prompt engineering in
order to tease that information out.
Fine tuning can also help the model gain
new knowledge and so this might be around
specific topics that are not in that base pre-trained model.
This might mean correcting old incorrect
information so maybe there's you know more updated
recent information that you want the model to
actually be infused with.
And of course more commonly you're doing both with these models, so
oftentimes you're changing the behavior and you
want it to gain new knowledge.
So taking it a notch down, tasks for fine-tuning are really just text in, text out for LLMs. And I like to think about it in two different categories. One is extracting text: you put text in and you get less text out, so a lot of the work is in reading. This could be extracting keywords or topics, or it might be routing, where based on all the data that you see coming in, you route the chat, for example, to some API or otherwise.
Different agents are here, like different agent capabilities.
And then that's in contrast to expansion.
So that's where you put text in, and you get more text out.
So I like to think of that as writing.
And so that could be chatting, writing emails, or writing code. And really understanding your task exactly, the difference between these two kinds of tasks, or maybe the multiple tasks that you want to fine-tune on, is what I've found to be the clearest indicator of success. So if you want to succeed at fine-tuning the model, it's about getting clearer on what task you want to do.
And clarity really means knowing what
good output looks like, what bad output looks like,
but also what better output looks like.
So when you know that something is doing
better at writing code or doing better at routing a task,
that actually does help you actually fine-tune
this model to do really well.
Alright, so if this is your first time fine-tuning,
I recommend a few different steps.
So first, identify a task by just prompt engineering a large LLM, and that could be ChatGPT, for example. So you're just playing with ChatGPT like you normally do, and you find some, you know, tasks that it's doing okay at, so not great, but not horrible either. You know it's within the realm of possibility, but it's not the best, and you want it to do much better for your task.
So pick that one task and just pick one.
And then, number four, get some inputs and outputs for that task: pairs where you put some text in and get some text out.
And one of the golden numbers I like to use
is 1000 because I found that that is a good
starting point for the amount of data that you need.
And make sure that these inputs and outputs
are better than the okay result from that LLM before.
You can't just generate these outputs necessarily all the time.
And so make sure you have those pairs of data and you'll
explore this in the lab too, this whole pipeline here.
Start out with that and then what you do is
you can then fine tune a small LLM on this
data just to get a sense of that performance bump.
And this is what I recommend only if it's your first time fine-tuning.
So now let's jump into the lab where you
get to explore the data set that was used for pre-training versus
for fine-tuning, so you understand exactly what these
input-output pairs look like.
Okay, so we're going to get started by importing a few different
libraries, so we're just going to run that.
And the first library that we're going to use is the datasets library from HuggingFace, and they have this great function called load_dataset where you can just pull a dataset from their hub and be able to run it.
So here I'm going to pull the pre-training dataset called The Pile that you just saw a little bit more about, and here I'm just grabbing the split, which is train versus test. And very specifically, I'm grabbing streaming equals true, because this dataset is massive; we can't download it without breaking this notebook, so I'm actually going to stream it in one piece at a time, so that we can explore the different pieces of data in there.
So just loading that up, and now I'm going to just look at the first five, using itertools. Great.
Ok, so you can see here, in the pre-trained data set, there's a
lot of data here that looks kind of scraped.
So this text says, it is done and submitted.
You can play Survival of the Tastiest on Android.
And so that's one example.
And let's see if we can find another one here.
Here is another one.
So this is just code that was scraped, XML code that was scraped. So that's
another data point.
You'll see article content, you'll see this topic about Amazon
announcing a new service on AWS, and then here's about Grand
Slam Fishing Charter, which is a family business.
So this is just a hodgepodge of different data sets scraped from
essentially the internet.
And I kind of want to contrast that with the fine-tuning dataset that you'll be using across the different labs. We're grabbing a company dataset of question-answer pairs, you know, scraped from an FAQ and also put together from internal engineering documentation.
And it's called Lamini Docs, it's about the company Lamini.
And so we're just going to read that JSON file and take a look at
what's in there.
Okay, so this is much more structured data,
right? So there are question-answer pairs here, and it's very
specific about this company.
So the simplest way to use this data
set is to concatenate actually these questions and
answers together and serve that up into the model.
So that's what I'm going to do here. I'm going to turn that into a dict
and then I'm going to see what actually concatenating one
of these looks like.
So, you know, just concatenating the question and directly
just giving the answer after it right here.
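Roughly, that concatenation step looks like this; the file name and column names are assumptions based on the Lamini docs dataset described here:

```python
import pandas as pd

# Load the question-answer pairs from the Lamini docs JSON lines file (assumed filename)
instruction_dataset_df = pd.read_json("lamini_docs.jsonl", lines=True)

# Turn it into a dict and concatenate the first question with its answer
examples = instruction_dataset_df.to_dict()
text = examples["question"][0] + examples["answer"][0]
print(text)
```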
And of course you can prepare your data in any way possible.
I just want to call out a few different common ways of
formatting your data and structuring it.
So question answer pairs, but then also
instruction and response pairs, input output pairs,
just being very generic here.
And also, since we're concatenating it anyway, you can just have it be text, like what you saw above with The Pile.
All right, so concatenating it, that's very simple, but
sometimes that is enough to see results, sometimes
it isn't.
So you'll still find that the model might need just more structure
to help with it, and this is very similar
to prompt engineering, actually.
So taking things a bit further, you can also process your data
with an instruction following, in this case, question-answering
prompt template.
And here's a common template.
Note that there's a pound-pound-pound marker before the question, so that it can be easily used as structure to tell the model what to expect next: it expects a question after it sees that marker. It also can help you post-process the model's outputs even after it's been fine-tuned.
So we have that there.
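Here's a sketch of such a question-answering prompt template and how it gets hydrated; the exact wording of the template is illustrative:

```python
# A common question-answering prompt template using "###" markers as structure
prompt_template_qa = """### Question:
{question}

### Answer:
{answer}"""

question = examples["question"][0]
answer = examples["answer"][0]
text_with_prompt_template = prompt_template_qa.format(question=question, answer=answer)
print(text_with_prompt_template)
```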
So let's take a look at this prompt template in
action and see how that differs from the
concatenated question and answer.
So here you can see how the prompt template looks, with the question and answer neatly done there.
And often it helps to keep the input and output separate, so I'm actually going to take out that answer here and keep them separated out, because this helps with using the dataset easily for evaluation and for when you split the dataset into train and test.
So now what I'm gonna do is apply all of this template to the entire dataset. So just running a for loop over it and hydrating the prompt: that is just adding the question and answer into the template with f-strings or .format in Python.
All right, so let's take a look at the difference between that text-only thing
and the question-answer format.
Cool.
So it's just text-only, it's all concatenated here that you're putting in, and
here is just question-answer, much more structured.
And you can use either one, but of course I
do recommend structuring it to help with evaluation.
That is basically it.
The most common way of storing this data is usually in JSON lines files, so ".jsonl" files. It's basically just, you know, each line is a JSON object, and that's it, so we're just writing that to file there.
You can also upload this data set onto HuggingFace,
shown here, because you'll get to use this later
as well and you'll get to pull it from the
cloud like that.
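A sketch of that last step, writing the processed pairs to a .jsonl file and later pulling the hosted copy back down; the file name, the variable holding the processed pairs, and the hub dataset id are assumptions:

```python
import jsonlines
from datasets import load_dataset

# Write each {"question": ..., "answer": ...} dict as one JSON object per line
with jsonlines.open("lamini_docs_processed.jsonl", mode="w") as writer:
    writer.write_all(finetuning_dataset_question_answer)

# Later, pull the uploaded copy straight from the Hugging Face hub
finetuning_dataset = load_dataset("lamini/lamini_docs", split="train")
```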
Next, you'll dive into a specific variant of fine-tuning called
instruction fine-tuning.
Instruction finetuning
In this lesson you'll learn about instruction fine-tuning, a
variant of fine-tuning that enabled GPT-3 to turn into ChatGPT and gave it its chatting powers.
Okay, let's start giving chatting powers to all our models.
Okay, so let's dive into what instruction fine-tuning is.
Instruction fine-tuning is a type of fine-tuning. There are all sorts of other tasks that you can fine-tune for, like reasoning, routing, copilot (which is writing code), chat, and different agents. But specifically, instruction fine-tuning, which you also may have heard of as instruction-tuned or instruction-following LLMs, teaches the model to follow instructions and behave more like a chatbot.
And this is a better user interface to interact with the model, as we've seen with ChatGPT. This is the method that turned GPT-3 into ChatGPT, which dramatically increased AI adoption from just a few researchers like myself to millions and millions of people.
So for the dataset for instruction following, you can use a lot of data that already exists readily available, either online or specific to your company, and that might be FAQs, customer support conversations, or Slack messages. So it's really dialogue datasets, or just instruction-response datasets.
Of course, if you don't have data, no problem.
You can also convert your data into something that's
more of a question-answer format or instruction following
format by using a prompt template. So here you can see, you know, a README might be converted into a question-answer pair.
You can also use another LLM to do this for you.
There's a technique called Alpaca, from Stanford, that uses ChatGPT to do this.
And of course, you can use a pipeline
of different open source models to do this as well.
Cool.
So one of the coolest things about fine tuning,
I think, is that it teaches this new behavior to the model.
And while, you know, you might have fine-tuning data on "What's the capital of France?" (Paris), because these are easy question-answer pairs that you can get, you can also generalize this idea of question answering to data you might not have given the model in your fine-tuning dataset, but that the model had already learned in its pre-existing pre-training step. And so that might be code.
And this is actually a finding from the ChatGPT paper, where the model can now answer questions about code even though they didn't have question-answer pairs about that for their instruction fine-tuning.
And that's because it's really expensive to get programmers
to go, you know, label data sets where they ask questions
about code and write the code for it.
So an overview of the different steps of fine-tuning
are data prep, training, and evaluation.
Of course, after you evaluate the model,
you need to prep the data again to improve it.
It's a very iterative process to improve the model.
And specifically for instruction fine-tuning and other different types of fine-tuning,
data prep is really where you have differences.
This is really where you change your data,
you tailor your data to the specific type of fine tuning,
the specific task of fine tuning that you're doing.
And training and evaluation is very similar.
So now let's dive into the lab where you'll get
a peek at the alpaca dataset for instruction tuning.
You'll also get to compare models again that have been
instruction tuned versus haven't been instruction tuned, and you'll get to
see models of varying sizes here.
So first, importing a few libraries. The first one that is important is, again, this load_dataset function from the datasets library, and let's load up this instruction-tuning dataset. This is specifying the Alpaca dataset, and again we're streaming this because it's actually a hefty fine-tuning dataset, though not as big as The Pile, of course.
I'm going to load that up and just like before with the pile, you're
going to take a look at a few examples.
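A minimal sketch of loading and peeking at the Alpaca dataset, assuming it's pulled by its tatsu-lab id on the hub:

```python
import itertools
from datasets import load_dataset

# Stream the Alpaca instruction-tuning dataset (hefty, but much smaller than The Pile)
instruction_tuned_dataset = load_dataset("tatsu-lab/alpaca", split="train", streaming=True)

# Look at the first few examples
m = 5
top_m = list(itertools.islice(instruction_tuned_dataset, m))
for example in top_m:
    print(example)
```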
All right, so unlike the pile, it's not just text and that's it.
Here it's a little bit more structured, but
it's not as, you know, clear-cut as just question-answer pairs.
And what's really, really cool about, you know, this is that the authors of the Alpaca paper actually had two prompt templates, because they wanted the model to be able to work with two different types of prompts and two different types of tasks, essentially. So one is, you know, an instruction-following one where there is an extra set of inputs; for example, the instruction might be "add two numbers" and the inputs might be "the first number is three, the second number is four". And then there are prompt templates without input, which you can see in these examples; sometimes it's not relevant to have an input, so it doesn't have one. So these are the prompt templates that are being used, and again, very similar to before, you'll just hydrate those prompts and run them across the whole dataset.
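For reference, the two Alpaca-style prompt templates and the hydration loop look roughly like this; the wording follows the Alpaca release, but treat the details as illustrative:

```python
prompt_template_with_input = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:"""

prompt_template_without_input = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:"""

# Hydrate whichever template fits each example, keeping input and output separate
processed_data = []
for j in top_m:
    if not j["input"]:
        processed_prompt = prompt_template_without_input.format(instruction=j["instruction"])
    else:
        processed_prompt = prompt_template_with_input.format(
            instruction=j["instruction"], input=j["input"]
        )
    processed_data.append({"input": processed_prompt, "output": j["output"]})
```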
And let's just print out one pair to see what that looks like.
Cool, so that's the input and output here, and you can see how it's hydrated into the prompt. So it ends with "Response", and then it outputs this response here.
Cool, and just like before, you can write it to a JSON lines file.
You can upload it to HuggingFace hub if you want.
We've actually loaded it up at lamini/alpaca so that it's stable; you can go look at it there and you can go use it.
Okay, great.
So now that you have seen what that instruction following data set
looks like, I think the next thing to do is
just remind you again on this tell me how to train my
dog to sit prompt on different models.
So the first one is going to be this llama 2 model
that is again not instruction tuned.
We're gonna run that.
Tell me how to train my dog to sit.
Okay, it starts with that period again and just says this, so remember that from before. And now we're gonna compare this to, again, the instruction-tuned model right here. Okay, so much better: it's actually producing different steps. And then finally, I just want to share ChatGPT again, just so you can have this comparison right here. Great, okay. So that is a much larger model; ChatGPT is quite large compared to the Llama 2 models. Those are actually 7 billion parameter models, and ChatGPT is rumored to be around 70 billion, so very large models.
You're also going to explore some smaller models.
So one is that 70 million parameter model.
And here I'm loading up these models.
This is not super important yet; you'll explore this a bit more later, but I'm going to load up two different things, one to process the data and one to run the model. And you can see here, the tag that we have here is "EleutherAI/pythia-70m". This is a 70 million parameter model that has not been instruction-tuned.
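Loading those two pieces, the tokenizer to process the data and the model itself, looks roughly like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m"
tokenizer = AutoTokenizer.from_pretrained(model_name)      # processes the text into tokens
model = AutoModelForCausalLM.from_pretrained(model_name)   # runs the model itself
```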
I'm going to paste some code here. It's a function to run inference, or basically
run the model on text.
We will go through these different sections of
what exactly is going on in this function
throughout the next few labs.
Cool.
So this model hasn't been fine-tuned. It doesn't know anything specific about a
company, but we can load up this company dataset again from before.
So we're going to give this model a question from this dataset, probably
just, you know, the first sample from the test set, for
example.
And so we can run this here. The question is, can Lamini
generate technical documentation or user manuals for software projects?
And the actual answer is yes, Lamini can generate technical
documentation and user manuals for software projects.
And it keeps going.
But the model's answer is, I have a question about the following.
How do I get the correct documentation to work?
A, I think you need to use the following code,
et cetera.
So it's quite off.
Of course, it's learned English, and it got the word
documentation in there.
So it kind of understands maybe that we're in a question-answer setting, because it has an "A" there for answer.
But it's clearly quite off.
And so it doesn't quite understand this data set
in terms of the knowledge, and also doesn't understand
the behavior that we're expecting from it. So it doesn't understand that it's
supposed to answer this question.
Ok, so now compare this to a model that we've
now fine-tuned for you, but that you're actually about
to fine-tune for instruction following.
And so that's loading up this model.
And then we can run the same question through this
model and see how it does.
And it says, yes, Lamini can generate technical documentation or user manuals for software projects, et cetera. And so this is just far more accurate than the one before, and it's following the right behavior that we would expect.
Okay, great. So now that you've seen what an instruction-following model does exactly, the next step is to go through what you saw a peek of, which is the tokenizer: how to prep your data so that it is available to the model for training.
Data preparation
Now after exploring the data that you'll be using, in this lesson you'll learn
about how to prepare that data for training.
All right, let's jump into it.
So next on what kind of data you need to prep,
well there are a few good best practices.
So one is you want higher quality data and actually
that is the number one thing you need for fine-tuning
rather than lower quality data.
What I mean by that is if you give it garbage inputs, it'll
try to parrot them and give you garbage outputs.
So giving really high quality data is important.
Next is diversity.
So having diverse data that covers a lot of aspects
of your use case is helpful.
If all your inputs and outputs are the same, then the model can start to memorize them, and if that's not exactly what you want, then the model will start to just spout the same thing over and over again. And so having diversity in your data is really important.
Next is real or generated.
I know there are a lot of ways to create generated data,
and you've already seen one way of doing that using an LLM, but
actually having real data is very, very effective and helpful most
of the time, especially for those writing tasks.
And that's because generated data already has
certain patterns to it. You might've heard of some services that
are trying to detect whether something is generated
or not. And that's actually because there are patterns
in generated data that they're trying to detect.
And as a result, if you train on more of the same patterns, it's
not going to learn necessarily new patterns or
new ways of framing things.
And finally, I put this last because actually
in most machine learning applications,
having way more data is important than less data.
But as you've actually seen before, pre-training handles a lot of this problem.
Pre-training has learned from a lot of data, all
from the internet.
And so it already has a good base understanding. It's
not starting from zero.
And so having more data is helpful for the model, but not as
important as the top three and definitely not as
important as quality.
So first, let's go through some of the steps of collecting your data.
So you've already seen some of those instruction response pairs.
So the first step is to collect them. The next one is to concatenate those pairs or add a prompt template; you've already seen that as well. The next step is tokenizing the data, and adding padding or truncating the data so it's the right size going into the model, and you'll see how to tokenize that in the lab. So, the steps to prepping your data are: one, collecting those instruction-response pairs, maybe question-answer pairs; two, concatenating those pairs together and adding some prompt template, like you did before; three, tokenizing that data; and the last step is splitting that data into training and testing.
Now, tokenizing, what does that really mean? Well, tokenizing your data is taking your text data and actually turning it into numbers that represent each of those pieces of text.
It's not actually necessarily by word; it's based on the frequency of, you know, common character occurrences. And in this case, one of my favorites is the "ing" token, which is very common in tokenizers, and that's because it appears in every single gerund. So here, you can see "finetuning" has "ing". Every single verb in the gerund, you know, fine-tuning or tokenizing, has "ing", and that maps onto the token 278 here.
And when you decode it with the same tokenizer,
it turns back into the same text.
Now there are a lot of different tokenizers, and a tokenizer is really associated with a specific model, since each model was trained with it.
And if you give the wrong tokenizer to your model, it'll
be very confused because it will expect different numbers
to represent different sets of letters
and different words.
So make sure you use the right tokenizer and you'll
see how to do that easily in the lab.
Cool, so let's head over to the notebook.
Okay, so first we'll import a few different libraries and
actually the most important one to see here is the AutoTokenizer class
from the Transformers library by HuggingFace.
And what it does is amazing: it automatically finds the right tokenizer for your model when you just specify what the model is. So all you have to do is put the model name in, and this is the same model name that you saw before, which is the 70 million parameter Pythia base model.
Okay, so maybe you have some text that says,
you know, hi, how are you?
So now let's tokenize that text.
So put that in, boom.
So let's see what encoded text is.
All right, so that's different numbers representing
text here.
The tokenizer outputs a dictionary with input IDs that represent the tokens, so I'm just printing that here.
And then let's actually decode that back into the text and see if it actually
turns back into hi, how are you?
Cool, awesome, it turns back into hi, how are you, so that's great.
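Put together, that round trip looks roughly like this:

```python
from transformers import AutoTokenizer

# AutoTokenizer picks the right tokenizer for the model name you give it
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")

text = "Hi, how are you?"
encoded_text = tokenizer(text)["input_ids"]    # text -> token ids
decoded_text = tokenizer.decode(encoded_text)  # token ids -> text
print(encoded_text)
print(decoded_text)
```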
All right, so when tokenizing, you probably are putting
in batches of inputs, so let's just take a look at
a few different inputs together, so there's hi, how are you, I'm good, and
yes.
So putting that list of text through, you can just put it
in a batch like that.
Into the tokenizer, you get a few different things here.
So here's hi, how are you again.
I'm good, it's smaller.
And yes, it's just one token.
So as you can see, these are varying in length.
Actually, something that's really important for models is that everything in a batch is the same length, because you're operating with fixed-size tensors. And so the text needs to be the same length.
So one thing that we do is something called padding. Padding is a strategy to handle these variable-length encoded texts. For our padding token, you have to specify what number you want to use to represent padding, and specifically we're using zero, which is actually the end-of-sentence token as well.
So when we run, padding equals true through the tokenizer,
you can see the yes string has a lot of
zeros padded there on the right, just to match the length of this hi,
how are you string.
Your model will also have a max length that it can handle and take in, so it can't just fit everything in. You've played with prompts before, and you've probably noticed that there is a limit to the prompt length. So this is the same thing, and truncation is a strategy to handle making those encoded texts much shorter, so that they actually fit into the model. So this is one way to make it shorter. As you can see here, I'm just artificially changing the max length to three, setting truncation to true, and then seeing how it's much shorter now for "hi, how are you".
It's truncating from the right, so it's just getting rid of everything here
on the right.
Now, realistically, one thing that's very common is, you know, you're writing a prompt, maybe you have your instruction somewhere, and you have a lot of the important things maybe on the other side, on the right, and that's getting truncated out. So, you know, specifying the truncation side to the left can truncate it the other way.
So this really depends on your task.
And realistically, for padding and truncation, you want to use both, so let's actually set both in there: truncation's true and padding's true here. I'm just printing that out so you can see the zeros here, but it's also getting truncated down to three.
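A sketch of those padding and truncation settings together:

```python
# Use the end-of-sequence token (id 0 for this tokenizer) as the padding token
tokenizer.pad_token = tokenizer.eos_token

list_texts = ["Hi, how are you?", "I'm good", "Yes"]

# Truncate from the left instead of the right if the important text is at the end
tokenizer.truncation_side = "left"

encoded = tokenizer(list_texts, max_length=3, truncation=True, padding=True)
print(encoded["input_ids"])
```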
Great, so that was really a toy example.
I'm going to now paste some code that you did in
the previous lab on prompts.
So here it's loading up the data set file with the
questions and answers, putting it into the prompt, hydrating
those prompts all in one go.
So now you can see one data point here of question and answer.
So now you can run this tokenizer on
just one of those data points.
So first concatenating that question with that answer and
then running it through the tokenizer. I'm
just returning the tensors as a NumPy array
here just to be simple and running it
with just padding and that's because I don't know how long these tokens actually
will be, and so what's important is that
I then figure out, you know, the minimum between the max length and
the tokenized inputs.
Of course, you can always just pad to the longest.
You can always pad to the max length and so that's
what that is here.
And then I'm tokenizing again with truncation up
to that max length.
So let me just print that out.
And just specify that in the dictionary, and cool.
So that's what the tokens look like.
All right, so let's actually wrap this into a full-fledged function
so you can run it through your entire
data set.
So this is, again, the same things happening here
that you already looked at, grabbing the max length,
setting the truncation side.
So that's a function for tokenizing your data set.
And now what you can do is you can load up that dataset.
There's a great map function here. So you can map the tokenize
function onto that dataset.
And you'll see here I'm doing something really simple: I'm setting the batch size to one, it's going to be batched, and drop last batch is true. That's often what we do to help with mixed-size inputs, since the last batch might be a different size.
Cool.
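Wrapped up, the tokenize function and the map call look roughly like this; the column names follow the Lamini docs dataset, the 2048 max length and the loaded dataset's variable name are assumptions:

```python
def tokenize_function(examples):
    # Concatenate question and answer, as before
    text = examples["question"][0] + examples["answer"][0]

    # First pass: pad, so we can measure how long the tokenized text is
    tokenizer.pad_token = tokenizer.eos_token
    tokenized_inputs = tokenizer(text, return_tensors="np", padding=True)

    # Cap at the model's max length (assumed 2048 here), then truncate from the left
    max_length = min(tokenized_inputs["input_ids"].shape[1], 2048)
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(text, return_tensors="np", truncation=True, max_length=max_length)
    return tokenized_inputs

# Map the function over the whole dataset, one example per batch, dropping the last batch
tokenized_dataset = finetuning_dataset_loaded.map(
    tokenize_function, batched=True, batch_size=1, drop_last_batch=True
)
```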
Great, and then so the next step is to split the data set.
So first I have to add in this labels column for Hugging Face to handle it, and then I'm going to run this train test split function, and I'm going to specify the test size as 10% of the data.
So of course you can change this depending on how
big your data set is.
Shuffle's true, so I'm randomizing the order of this
data set.
I'm just going to print that out here.
So now you can see that the dataset has been split across training and test sets, with 140 for the test set there.
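The split itself is just a couple of lines; the labels column mirrors the input IDs, which is what Hugging Face expects for causal language modeling, and the seed value is illustrative:

```python
# Add a labels column so the trainer knows what to predict, then split 90/10
tokenized_dataset = tokenized_dataset.add_column("labels", tokenized_dataset["input_ids"])
split_dataset = tokenized_dataset.train_test_split(test_size=0.1, shuffle=True, seed=123)
print(split_dataset)
```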
And of course this is already loaded up
in Hugging Face like you had seen before,
so you can go there and download it and see
that it is the same.
So while that's a professional dataset about a company, and maybe this is related to your company, for example, you could adapt it to your company. We thought that might be a bit boring (it doesn't have to be), so we included a few more interesting datasets that you can also work with; feel free to customize and train your models on these instead.
One is for Taylor Swift, one's for the popular band BTS, and
one is on actually open source large language models
that you can play with.
And just looking at, you know, one data point from the TayTay dataset,
let's take a look.
All right: what's the most popular Taylor Swift song among millennials?
How does this song relate to the millennial generation?
Okay, okay.
So, you can take a look at this yourself and yeah,
these data sets are available via HuggingFace.
And now in the next lab, now that you've
prepped all this data, tokenized it, you're ready
to train the model.
Training process
In this lesson, you'll step through the entire training process, and
at the end see the model improve on your task, specifically
for you to be able to chat with it.
Alright, let's jump into it.
Alright, training an LLM, what does this look like?
So, the training process is actually quite similar to other
neural networks.
So, as you can see here, you know, this is the same setup where we had seen the LLM predict "sd!!!@".
What's going on?
Well, first you add that training data up at the top. Then you calculate the loss: it predicts something totally off in the beginning, and you compare that prediction to the actual response it was supposed to give, which is "upon". And then you update the weights; you backprop through the model to update the model and improve it, such that in the end it does learn to output "upon".
There are a lot of different hyperparameters that
go into training LLMs.
We won't go through them very specifically, but a few that you might want to play with are the learning rate, the learning rate scheduler, and various optimizer hyperparameters as well.
All right, so now diving a level deeper into the code.
So these are just general chunks of training
process code in PyTorch.
So first you want to loop over the number of epochs; an epoch is a pass over your entire dataset, so you might go over your entire dataset multiple times.
And then you want to load it up in batches.
So that is those different batches that you saw when you're
tokenizing data.
So that's sets of data together.
And then you put the batch through your
model to get outputs.
You compute the loss from your model and
you take a backwards step and you update your optimizer.
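As a generic sketch (not the lab's exact code), those chunks look like this in PyTorch; num_epochs, train_dataloader, model, and optimizer are assumed to be defined already:

```python
for epoch in range(num_epochs):                      # one epoch = one full pass over the dataset
    for step, batch in enumerate(train_dataloader):  # iterate over batches of tokenized data
        outputs = model(**batch)                     # forward pass through the model
        loss = outputs.loss                          # compute the loss
        loss.backward()                              # backpropagate
        optimizer.step()                             # update the weights
        optimizer.zero_grad()                        # reset gradients for the next step
```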
Okay. So now that you've gone through every step
of this low level code in PyTorch, we're actually
going to go one level higher into HuggingFace and
also another level higher into the Llama library by
Llama and I, just to see how the training
process works in practice in the lab.
So let's take a look at that.
Okay.
So first up is seeing how the training
process has been simplified over time, quite a bit with
higher and higher level interfaces, that PyTorch code you saw.
Man, I remember running that during my PhD.
Now there are so many great libraries out
there to make this very easy.
One of them is the Lamini llama library, and it's just training your model in three lines of code, hosted on external GPUs; it can run any open-source model, and you can get the model back.
And as you can see here, it's just requesting that 410
million parameter model.
You can load the data from that same JSON lines file, and then you just hit model.train().
And that returns a dashboard, a playground interface,
and a model ID that you can then call and continue training
or run with for inference.
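Those three lines look roughly like this; the data-loading method name is an assumption based on what's described here, and the model id is the 410 million parameter Pythia model mentioned above:

```python
from llama import BasicModelRunner

model = BasicModelRunner("EleutherAI/pythia-410m")   # request the 410M parameter model
model.load_data_from_jsonlines("lamini_docs.jsonl")  # assumed method name for loading the JSONL data
model.train()                                        # kicks off hosted training, returns a model ID
```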
All right, so for the rest of this lab, we're
actually going to focus on using the Pythia
70 million model.
You might be wondering why we've been playing with that really small, tiny
model, and the reason is that it can run on CPU nicely
here for this lab, so that you can actually see
the whole training process go.
But realistically, for your actual use cases,
I recommend starting with something a bit larger,
maybe something around a billion parameters,
or maybe even this 400 million one if your task
is on the easier side.
Cool.
So first up, I'm going to load up all of these libraries, And
one of them is a utilities file with a bunch of different
functions in there.
Some of them that we've already written together on the tokenizer,
and others you should take a look at for just logging and showing
outputs.
So first let's start with the different configuration parameters
for training.
So there are two ways to actually, you know, import data, and you've already seen both. One is to not use HuggingFace and just specify a certain dataset path; for the other, you can specify a HuggingFace path, and here I'm using a boolean value, use HuggingFace, to specify whether that's true.
We include both for you here so you can easily use it.
Again, we're going to use a smaller model so that it runs on CPU, so
this is just 70 million parameters here.
And then finally, I'm going to put all of this into a training config, which will then be passed on to the model, just so it understands, you know, what the model name is and what the data is.
Great.
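A sketch of that configuration; the exact keys here are assumptions meant only to illustrate the shape of the config, not the lab's literal dictionary:

```python
use_hf = True                         # pull the dataset from Hugging Face rather than a local path
dataset_path = "lamini/lamini_docs"   # assumed Hugging Face dataset id
model_name = "EleutherAI/pythia-70m"  # small model so everything runs on CPU

training_config = {
    "model": {"pretrained_name": model_name, "max_length": 2048},
    "datasets": {"use_hf": use_hf, "path": dataset_path},
    "verbose": True,
}
```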
So the next step is the tokenizer.
You've already done this in the past lab, but
here again, you are loading that tokenizer and then
splitting your data.
So here's just the training and test set, and
this is loading it up from HuggingFace.
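Roughly, that step looks like this; `tokenize_and_split_data` stands in for the helper from the lab's utilities file, so treat that name as an assumption:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(training_config["model"]["pretrained_name"])
tokenizer.pad_token = tokenizer.eos_token  # GPT-style models often ship without a pad token

# Hypothetical helper from the utilities file: tokenizes the dataset
# and returns a train/test split.
train_dataset, test_dataset = tokenize_and_split_data(training_config, tokenizer)
```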
Next is just loading up the model; you already specified the model name above, so that's the 70 million parameter Pythia model.
I'm going to call that the base model, which hasn't been fine-tuned yet.
Next an important piece of code.
If you're using a GPU, this is PyTorch code that
will be able to count how many CUDA devices, basically
how many GPUs you have.
And depending on that, if you have more than zero of them,
that means you have a GPU.
So you can actually put the model on GPU.
Otherwise it'll be CPU.
In this case, we're going to be using CPU.
You can see select CPU device.
All right. So to put the model on that GPU or CPU, you just have to call model.to(device). Very simple.
This is now printing out what the model looks like, and it's putting it on that device.
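Pulled together, the model loading and device selection look roughly like this sketch:

```python
import torch
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(model_name)

# If any CUDA device is visible, use the GPU; otherwise fall back to CPU.
device = torch.device("cuda") if torch.cuda.device_count() > 0 else torch.device("cpu")
base_model.to(device)   # move the model onto the selected device
print(base_model)       # printing the model shows its architecture
```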
All right.
So, putting together steps from the previous lab, but also adding in some new steps, is inference.
You've already seen this function before, but now let's step through exactly what's going on.
First, you're tokenizing the text coming in.
You're also passing in your model, and you want the model to generate based on those tokens.
Now, the tokens have to be put onto the same device as the model, so if the model is on GPU, for example, you need to put the tokens on GPU as well, so the model can actually see them.
Next, there are two important parameters here, max input tokens and max output tokens, for specifying how many tokens can actually be put into the model as input and how many you expect out.
We're setting the output to a hundred here as a default, but feel free to play with this and make it longer so it generates more.
Note that it does take more time to generate more, so expect a difference in how long generation takes.
Next, the model generates some tokens out, and all you have to do is decode them with the tokenizer, just like you saw before.
After you decode, you just have to strip out the prompt at the beginning, because the model outputs the prompt together with your generated output, and then the function returns that generated text answer.
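Putting those pieces together, the inference function is roughly the following sketch:

```python
def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):
    # Tokenize the prompt, truncating it to the maximum number of input tokens.
    input_ids = tokenizer.encode(
        text, return_tensors="pt", truncation=True, max_length=max_input_tokens
    )

    # Generate on the same device as the model, so the model can see the tokens.
    generated_tokens = model.generate(
        input_ids=input_ids.to(model.device),
        max_length=max_output_tokens,
    )

    # Decode, then strip out the prompt, since the prompt is echoed back at the start.
    generated_text = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    return generated_text[len(text):]
```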
Great, you're going to be using this function a lot.
So first up is taking a look at that first test set question and putting it through the model.
Try not to be too harsh; you've already kind of seen this before, and again the model is answering in that really weird way you've seen before.
It's not really answering the question, which is here, and the correct answer is here.
Okay, so this is what training is for.
So next you're going to look at the training arguments.
So there are a lot of different arguments.
First, key in on a few.
So the first one is the max number of steps
that you can run on the model.
So this is just max number of training steps.
We're gonna set that to three just to make it very simple, just
to walk through three different steps.
What is a step exactly?
A step is a batch of training data.
And so if your batch size is one, it's just one data point. If
your batch size is 2,000, it's 2,000 data points.
Next is the trained model name. So what do you want to call it?
Here I'm calling it the name of the dataset, plus the max steps, plus the word "steps", so that we can differentiate runs if you want to play with different max steps.
Something I also think is a best practice, though it's not shown here, is to put a timestamp in the trained model name, because you might be experimenting with a lot of them.
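So, for example, something like this; the timestamp piece is the extra best practice, not what the notebook itself does:

```python
import datetime

max_steps = 3  # each step is one batch of training data

# Dataset name + step count, plus a timestamp so repeated experiments don't collide.
timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
trained_model_name = f"lamini_docs_{max_steps}_steps_{timestamp}"
output_dir = trained_model_name
```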
Okay, cool.
So I'm now going to show you a big list of different training arguments.
There are a lot of good defaults here.
The ones to focus on are max steps, which stops training from running past those three steps you specified up there, and the learning rate.
There are a bunch of other arguments here; I recommend diving deeper into them if you're curious and playing with them, but here we're largely setting good defaults for you.
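As a rough sketch of that argument list, with illustrative values rather than the notebook's exact settings:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=output_dir,
    max_steps=max_steps,              # stop after the three steps specified above
    learning_rate=1.0e-5,             # the other argument worth focusing on first
    per_device_train_batch_size=1,
    num_train_epochs=1,
    logging_steps=1,                  # log the loss at every step
    save_total_limit=1,
    # ...plus many more arguments that are left at sensible defaults
)
```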
Next, we've included a function that calculates the number of floating point operations (FLOPs) for the model and the memory footprint of this base model.
Here it's just going to print that out; this is just for your knowledge, to understand what's going on, and we'll be printing it throughout training.
And I know we said that this was a tiny, tiny model, but even here, look how big it is, at around 300 megabytes.
So you can imagine that a really large model takes up a ton of memory, and this is why we need really high-performing, large-memory GPUs to run those larger models.
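One easy way to get at a footprint number like that, as a sketch, is the built-in accounting on HuggingFace models; this isn't necessarily how the lab's utility computes it:

```python
# Rough memory footprint of the base model's parameters and buffers, in bytes.
model_memory = base_model.get_memory_footprint()
print(f"Memory footprint: {model_memory / 1e6:.1f} MB")
```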
Next, you load this up in the Trainer class.
This is a class we wrapped around HuggingFace's main Trainer class, basically doing the same thing, just printing things out for you as you train.
As you can see, you put a few things in: the main ones are the base model, the max steps, the training arguments, and of course the datasets you want to train on.
And now, the moment you've been waiting for: training the model.
You just call "trainer.train()", and let's see it go.
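Sketched with the standard HuggingFace Trainer signature; the lab's wrapper adds extra logging but is assumed to take essentially the same inputs:

```python
from transformers import Trainer  # the lab wraps this class; core arguments are the same

trainer = Trainer(
    model=base_model,
    args=training_args,             # includes max_steps from above
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

training_output = trainer.train()
```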
Okay, so as you can see, it printed out a lot of different things in the logs, namely the loss.
If you run this for more steps, even just 10 steps, you'll
see the loss start to go down.
All right.
So now you've trained this model.
Let's save it locally.
So you can have a save directory, maybe specifying the output dir plus "final" as the final checkpoint.
Then all you have to do is call "trainer.save_model", and let's see if it saved right here.
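That step is just a couple of lines, along these lines:

```python
save_dir = f"{output_dir}/final"   # final checkpoint inside the training output dir
trainer.save_model(save_dir)
print("Saved model to:", save_dir)
```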
So awesome, great work.
Now that you've saved this model, you can load it back up by calling that auto model from_pretrained again with the save directory, and you just have to specify local files only equals true, so it doesn't pull from the HuggingFace hub in the cloud.
I'm going to call this the slightly fine-tuned model.
And then I'm going to put this on the right device again; this only really matters if you have a GPU, but here on CPU it's just for good measure.
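Roughly:

```python
from transformers import AutoModelForCausalLM

finetuned_slightly_model = AutoModelForCausalLM.from_pretrained(
    save_dir,
    local_files_only=True,   # read from disk rather than the HuggingFace hub
)
finetuned_slightly_model.to(device)  # only really matters on GPU; harmless on CPU
```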
And then let's run it and see how it does on that test data point again, just running inference.
Again, this is the same inference function that you've
run before.
Cool.
So is it any better? Not really.
And is it supposed to be? Not really. It's
only gone through a few steps.
So what should it have been?
Let's just take a look at that exact answer.
It's saying: yes, Lamini can generate technical documentation and user manuals.
So the model's output is very far from that; it's actually still very similar to the base model.
Ok, but if you're patient, what could it look like?
So we also fine tuned a model for far longer than that.
So this model was only trained on three
steps and actually in this case, three data points out of 1,260
data points in the training data set.
So instead, we actually fine-tuned a model on the entire dataset twice; that's the "lamini_docs_finetunemodel" we uploaded to HuggingFace, which you can now download and actually use.
And if you were to try this on your own computer,
it might take half an hour or an hour, depending on your processor.
Of course, if you have a GPU, it could just take a couple minutes.
Great. So let's run this.
Okay, this is a much better answer, and it's comparable to the
actual target answer.
But as you can see here at the end, it still starts to repeat itself a bit. So it's not perfect, but this is a much smaller model, and you could also train it for even longer.
And now, just to give you a sense of what a bigger model might do: this next one was trained to be a bit more robust and less repetitive.
This is what a bigger, 2.8 billion parameter fine-tuned model would give you, and this is running the Llama library with the same basic model runner as before.
So here you can see: yes, Lamini can generate technical documentation or user manuals.
Ok, great.
So one other thing that's kind of interesting in this dataset that we used to fine-tune, and that you can also do for your datasets, is something called moderation: encouraging the model not to get too off track.
If you look closely at the examples in this dataset, which we're about to do, you'll see that there are examples that say, let's keep the discussion relevant to Lamini.
I'm going to loop through the dataset here to find all the data points that say that, so that you can go see it yourself.
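Here's a sketch of that loop; `finetuning_dataset` is a placeholder for the dataset object, and the "question"/"answer" field names are assumptions about how it's keyed:

```python
# Find the "moderation"-style examples that steer the model back on topic.
for example in finetuning_dataset:
    if "keep the discussion relevant to Lamini" in example["answer"]:
        print(example["question"])
        print(example["answer"])
        break
```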
So this is how you might prepare your own dataset.
And as a reminder, this is very similar to ChatGPT's "Sorry, I'm an AI and I can't answer that." They're using a very similar thing here.
This example points the user to the documentation, noting that there isn't anything about Mars in it.
All right, so now that you've run all of training here, you can actually do all of that in just three lines of code using Lamini's Llama library.
All you have to do is load up the model, load up your data, and train it.
Specifically, here we're running a slightly larger model, the Pythia 410 million parameter model, which is the biggest model available on the free tier.
And then the Lamini docs data, you can load that up through a JSON lines file just like you did before, and all you have to do is run "model.train".
I'm running with is_public set to true, so this is a public model that anyone can then run afterwards.
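As a sketch of those three lines; the class and method names here reflect how the lab describes the library, so treat them as assumptions rather than exact API:

```python
from llama import BasicModelRunner

model = BasicModelRunner("EleutherAI/pythia-410m")    # biggest model on the free tier
model.load_data_from_jsonlines("lamini_docs.jsonl")   # same JSON lines file as before
model.train(is_public=True)                           # returns a dashboard, playground, and model ID
```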
You can then put the returned model through the basic model runner, instead of the Pythia 410 million name, to run it.
You can also click on this link here to sign up and make an account, and basically see the results there; there's a chatbot-style interface where you can see everything.
But since is_public is true, we can actually just look at the model results here on the command line.
So run "model.evaluate", and here you can see, again, the same job ID.
For this job ID, you can see all the evaluation results on data points that were not trained on.
And just to pretty-print this a little, I'm going to plop some code in here that reformats that list of dictionaries into a nice data frame.
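Something along these lines, assuming the evaluation results come back as a list of dictionaries (`eval_results` is a placeholder name for that list):

```python
import pandas as pd

# One entry per held-out question, with the trained and base model answers.
df = pd.DataFrame(eval_results)
print(df.head())
```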
Cool.
And here you can see a lot of different questions, and then the answer from the trained model versus the base model.
So this is an easy way to compare those results.
So here's a question. Does Lamini have the ability to understand
and generate code for audio processing tasks?
And you can see that the trained model actually gave an answer: yes, Lamini has the ability to understand and generate code for audio processing tasks.
It's not quite there yet, and this is really a baby model trained on very limited data, but it is much better than the base model, which answers with an "A:" followed by something like "I think you are looking for a language that can be used to write audio code," and just keeps rambling.
So a very big difference in performance.
And now you can see a different question here,
you know, is it possible to control the level of
detail in the generated output?
So as you can see, you can go through all these
results and in the next lab, we'll actually explore how to evaluate
all of these results.
Evaluation and iteration
Now that you've finished training your model, the
next step is to evaluate it, see how it's doing.
This is a really important step because AI
is all about iteration.
This helps you improve your model over time.
Okay, let's get to it.
Evaluating generative models is notoriously very, very difficult.
You don't have clear metrics and the performance of these
models is just improving so much over time
that metrics actually have trouble keeping up.
So as a result, human evaluation is often the most reliable way to evaluate them: having experts who understand the domain actually assess the outputs.
A good test dataset is extremely important to making this a good use of that person's time.
That means it's a high-quality dataset; it's accurate, so you've gone through it to make sure of that; it's generalized, so it actually covers a lot of the different cases you want to make sure the model handles; and of course, it can't have been seen in the training data.
Another popular way that is emerging is Elo comparison, which looks almost like an A/B test between multiple models, or a tournament across multiple models.
Elo rankings are used in chess specifically, and this is one way of understanding which models are performing well or not.
So one really common open LLM benchmark is a suite of different
evaluation methods.
So it's actually taking a bunch of different possible evaluation methods
and averaging them all together to rank models.
And this one is developed by EleutherAI, and it's a set of different
benchmarks put together.
So one is ARC, a set of grade school science questions.
HellaSwag is a test of common sense.
MMLU covers a wide range of subjects, from elementary level up through more advanced topics, and TruthfulQA measures a model's tendency to reproduce falsehoods that are commonly found online.
And so these are a set of benchmarks
that were developed by researchers over time and
now have been used in this common evaluation suite.
And you can see here this is the latest ranking as of this recording,
but I'm sure this changes all the time.
Llama 2 is doing well; note this is actually not necessarily sorted by the average here.
There's also a recent FreeWilly model that was fine-tuned on top of the Llama 2 model using what's known as the Orca method, which is why it's called FreeWilly.
I'm not going to go into that too much; there are a lot of animals going on right here, but feel free to go check it out yourself.
Okay, so one other framework for analyzing
and evaluating your model is called error analysis.
What this is, is categorizing errors so that you understand which types of errors are most common, and going after the most common and most catastrophic errors first.
This is really cool because error analysis usually requires you to train your model first.
But for fine-tuning, you already have a base model that's been pre-trained, so you can perform error analysis before you even fine-tune the model.
This helps you understand and characterize how the
base model is doing, so that you know what
kind of data will give it the biggest lift for fine-tuning.
And so there are a lot of different categories.
I'll go through a few common ones that you can take a look at.
So one is just misspellings. This is very straightforward,
very simple.
So here it says "go get your lover checked" when it should say "liver", so it's misspelled.
And just fixing that example in your dataset, spelling it correctly, is important.
Length is a very common one that I hear about with ChatGPT, or generative models in general: they really are very verbose.
So one fix is just making sure your dataset is less verbose, so that the model actually answers the question succinctly.
You've already seen a bit of that in the training notebook, where the fine-tuned model was less verbose and less repetitive.
And speaking of repetitive, these models do tend to be very repetitive.
One way to fix that is with stop tokens used more explicitly, or those prompt templates you saw, but of course also making sure your dataset includes examples that don't have as much repetition and do have diversity.
Cool, so now on to a lab where you get to run the model across a test
dataset and then be able to run a few different metrics,
but largely inspect it manually and also run
on one of those LLM benchmarks that you saw, ARC.
Okay, so this actually can be done in just a line of code,
which is running your model on your entire test data
set in a batched way that's very efficient on GPUs. And
so I just wanted to share that here, which is you can load
up your model here and instantiate it and
then have it just run on a list of your entire test data set. And
then it's automatically batched on GPUs
really quickly.
Now we're really largely running on CPUs here.
So for this lab, you'll get to actually just run
it on a few of the test data points.
And then of course you can do more on your own as well.
Okay, great.
So I think the first thing is to load up
the test data set that we've been working with.
And then let's take a look at what one of those data points looks like.
So I'm just going to print question answer pair.
All right, so this is one that we've been looking at.
And then we want to load up the model to
run it over this entire data set.
So this is the same as before.
I'm going to pull out the actual fine-tuned model
from HuggingFace.
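For reference, those two loading steps might look like this; the dataset and model identifiers here are placeholders for whatever was published for the course:

```python
import datasets
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder identifiers: substitute the dataset and fine-tuned model you published.
test_dataset = datasets.load_dataset("lamini/lamini_docs", split="test")
print(test_dataset[0]["question"])
print(test_dataset[0]["answer"])

model_name = "lamini/lamini_docs_finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```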
Okay, so now we've loaded up our model, and I'm going to load up one really basic evaluation metric, just for you to get a sense of this generative task.
It's going to be whether there's an exact match between the two strings, of course stripping a little bit of whitespace, but just getting a sense of whether it can be an exact match.
This is really hard for those writing tasks, because when the model is generating content there are actually a lot of different possible right answers, so it's not a super valid evaluation metric.
For "reading" tasks, where you might be extracting topics or other information out, so it's closer to classification, this might make more sense.
But I just want to run this through.
You can run different evaluation metrics through as well.
An important thing when you're running a model in evaluation mode is to call "model.eval" to make sure things like dropout are disabled.
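As a sketch, the metric and the eval-mode switch are just:

```python
def is_exact_match(a, b):
    # Deliberately strict: strip surrounding whitespace and compare strings directly.
    return a.strip() == b.strip()

model.eval()  # disable dropout and other training-time behavior before generating
```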
And then just like in previous labs, you can run this
inference function to be able to generate output.
So let's run that first test question again.
Again, you get that output and look at the actual answer,
compare it to that, and it's similar, but it's not quite there.
So of course, when you run exact match, it's
not perfect.
And that's not to say there aren't other ways of measuring these models.
This is a very very simple way.
Sometimes people will also take these outputs and put them into another LLM, asking it to grade how close the output really is.
You can also use embeddings: you can embed the actual answer and embed the generated answer, and see how close they are in distance.
So there are a lot of different approaches
that you can take.
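For instance, with a sentence-embedding model; sentence-transformers here is my own choice for illustration, not something the lab uses, and `answer`/`predicted_answer` are a single target and generated pair:

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")
target_embedding = embedder.encode(answer, convert_to_tensor=True)
generated_embedding = embedder.encode(predicted_answer, convert_to_tensor=True)

# Cosine similarity close to 1.0 means the answers are close in meaning.
similarity = util.cos_sim(target_embedding, generated_embedding)
print(similarity)
```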
Cool so now to run this across your entire data set,
this is what that might look like.
So let's just actually run it over 10 since it takes quite
a bit of time.
You're going to iterate over that data set, pull
out the question and answers.
Here I'm also taking the predicted answer and appending it alongside the target answer, so that you can inspect them manually later, and then taking a look at the number of exact matches.
It's just evaluating here.
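Roughly, that loop looks like this, reusing the inference function and exact-match check sketched above:

```python
n = 10  # only evaluate the first few examples, since generation is slow on CPU
predictions = []
exact_matches = []

for i, item in enumerate(test_dataset):
    if i >= n:
        break
    question, answer = item["question"], item["answer"]

    predicted_answer = inference(question, model, tokenizer)
    predictions.append([predicted_answer, answer])   # keep both for manual inspection later
    exact_matches.append(is_exact_match(predicted_answer, answer))

print("Number of exact matches:", sum(exact_matches))
```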
So the number of exact matches is zero, and that's not actually super surprising
since this is a very generative task.
And typically for these tasks, you know, there again are a lot of
different ways of approaching evaluation,
but at the end of the day, what's been found to be
significantly more effective by a large margin is
using manual inspection on a very curated test set.
And so this is what that data frame looks like.
So now you can go inspect it and see, okay, for every predicted answer,
what was the target and how close was it really?
Okay, cool. So that's only on a subset of the data.
We also did evaluate it on all of the data here that you can go load
from HuggingFace and be able to basically see
and evaluate manually all the data.
And last but not least, you'll get to see running ARC, which is a benchmark.
If you're curious about academic benchmarks, this is one that you just saw in that suite of different LLM benchmarks.
And this ARC benchmark, as a reminder, is one of those four that EleutherAI put together, and they come from academic papers.
And for this one, if you inspect the data set, you'll
find science questions that may or may not be related to your
task.
And these evaluation metrics, especially here, are really geared toward academic contests, or understanding general model abilities, in this case around basic grade school science questions.
But I really recommend, even as you run these, not getting too caught up in performance on these benchmarks, even though this is how people are ranking models now, because they often don't correlate with your use case.
They are not necessarily related to what your company cares about,
what you actually care about for your end
use case for that fine-tuned model.
And as you can probably see, the fine-tuned models are able to
basically get tailored to a ton of different
tasks which require a ton of different ways
of evaluating them.
Okay, so the ARC benchmark just finished running and
the score is right here, 0.31, and actually that is lower
than the base model score in the paper, which
is 0.36, which is crazy because you saw it improve
so much on this.
But it's because it improves so much on this company
dataset related to this company, related
to question answering for it, and not grade school science.
So that's what ARC is really measuring.
Of course, if you fine-tune a model on general tasks, so if you fine-tune it on Alpaca, for example, you should see a little bit of a bump in performance on this specific benchmark.
And if you use a larger model, you'll also likely see a bump, because it has learned much more.
And that's basically it.
So as you can see, this ARC benchmark probably only matters if you're looking at and comparing general models, maybe when finding a base model for you to use, but it's not very useful for your actual fine-tuning task unless you're fine-tuning the model to do grade school science questions.
All right, and that's it for the notebooks.
In the last lesson you'll learn some practical tips for fine-tuning and
then a sneak peek of more advanced methods.
Considerations on getting started now
All right, you made it to our last lesson and
these will be some considerations you should take
on getting started now, some practical tips, and also a bit
of a sneak preview on more advanced training methods.
So first, some practical steps to fine-tuning.
Just to summarize, first you want to figure out your task,
you want to collect data that's related to your tasks
inputs and outputs and structure it as such.
If you don't have enough data, no problem, just generate some or use
a prompt template to create some more.
Then, you want to fine-tune a small model first.
I recommend a 400 million to a billion parameter model, just to get a sense of where the performance is with this model.
And you should vary the amount of data you actually give to
the model to understand how much data actually influences
where the model is going.
And then you can evaluate your model to see what's
going well or not.
And finally, you want to collect more data to improve
the model through your evaluation.
From there, you can increase your task complexity, making it much harder, and then you can also increase the model size to get good performance on that more complex task.
So, for the task you're fine-tuning on, you learned about reading tasks and writing tasks.
Writing tasks are a lot harder.
These are the more expansive tasks like chatting,
writing emails, writing code, and that's because there are more
tokens that are produced by the model.
So this is a harder task in general for the model.
And harder tasks tend to result in needing
larger models to be able to handle them.
Another way of having a harder task is
just having a combination of tasks, asking the model to
do a combination of things instead of just one task.
And that could mean having an agent be flexible and do several things at once, or do them in just one step as opposed to multiple steps.
So now that you have a sense of model sizes that you
need for your task complexity, there's also a compute requirement
basically around hardware of what you need to run your models.
For the labs that you ran, you saw those 70 million
parameter models that ran on CPU.
They weren't the best models out there.
And I recommend starting with something a little
bit more performant in general.
So if you look at the first row of this table, I want to call out a single V100 GPU, which is available, for example, on AWS, but also on any other cloud platform.
You can see that it has 16 gigabytes of memory, and that means it can run a 7 billion parameter model for inference.
But training needs far more memory, to store the gradients and the optimizer state, so it can only fit a 1 billion parameter model.
If you want to fit a larger model, you can see some of the other options available here.
Great. So maybe you thought that was not enough for you, and you want to work with much larger models?
Well, there's something called PEFT or parameter efficient fine tuning, which
is a set of different methods that help
you do just that, be much more efficient in how you're using
your parameters and training your models.
And one that I really like is LoRA, which stands for low-rank adaptation.
What LoRA does is reduce the number of parameters and weights you have to train by a huge amount.
For GPT-3, for example, they found that they could
reduce it by 10,000x, which resulted in 3x less memory needed
from the GPU.
And while you do get slightly lower accuracy than full fine-tuning, this is still a far more efficient way of getting there, and you get the same inference latency at the end.
So what exactly is happening with LoRA?
Well, you're actually training new weights in some of the layers of the model, and you're freezing the main pre-trained weights, which you see here in blue.
So that's all frozen, and you have these new orange weights; those are the LoRA weights.
The new weights, and this gets a little bit mathy, are rank decomposition matrices of the change to the original weights.
But what's important is less the math behind that, and more that you can train these new weights separately from the pre-trained weights, and then at inference time merge them back into the main pre-trained weights to get that fine-tuned model more efficiently.
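The labs don't use it, but as a sketch, here's what setting up LoRA might look like with the HuggingFace peft library on a Pythia-style base model; the target module name is an assumption based on GPT-NeoX-style architectures:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                  # rank of the decomposition matrices
    lora_alpha=16,
    target_modules=["query_key_value"],   # attention layers to attach LoRA weights to (model-dependent)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

lora_model = get_peft_model(base_model, lora_config)
lora_model.print_trainable_parameters()  # only the small LoRA matrices are trainable
```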
What I'm really excited to use LoRA for is adapting it to new tasks.
That means you could train a model with LoRA on one customer's data, then train another one on another customer's data, and then be able to merge each of them in at inference time when you need them.
Conclusion