
LLMs and GenAI in Digital Scholarship

An overview of how to use LLMs and GenAI in research and instruction.

Creating and Fine-Tuning Your Own Model

While using pre-trained models (with or without prompt engineering) covers many scenarios, you might wonder: How are these models created? Could I train or fine-tune one for my own needs? In an academic context, you might have a specific corpus or task where customizing a model is beneficial. Here we’ll outline the process of training/fine-tuning in basic terms and highlight what’s feasible on a research or library level.

Pre-training vs. Fine-tuning

  • Pre-training an LLM means training the initial model from scratch on a very large text corpus (like how GPT-3 was trained on internet data). This is extremely resource-intensive – typically done by large organizations with access to supercomputers or massive GPU clusters, and it can cost millions of dollars in compute. Pre-training is what gives the model its general language ability.

  • Fine-tuning means taking an already pre-trained model and training it further on a smaller, specific dataset to specialize it or teach it new tasks. Fine-tuning is much cheaper than pre-training because the model already “knows” language; you’re just adjusting it. It can be done with far less data and compute, sometimes even on a single high-end GPU or Google Colab, depending on model size.

In most cases, fine-tuning an existing open model on domain-specific data or for a specific task is more efficient than creating a model from scratch. For example, you might fine-tune a general model on medical texts so it becomes a better medical assistant, or on Shakespeare’s works so it writes in iambic pentameter.

Fine-Tuning Process 

  1. Select a base model: Choose a pre-trained model that’s a good starting point. For instance, a 7-billion-parameter model like LLaMA-2 7B or GPT-J, etc. The model should have an architecture you can work with (and a license that allows fine-tuning for your use). Base model choice depends on language (some models are multilingual, some not), size (bigger = more capacity but more compute needed), and capability.

  2. Gather your dataset: Prepare the text data you want to fine-tune on. This could be a set of Q&A pairs, conversation transcripts, domain-specific articles, etc., depending on your goal. If you want the model to perform a specific task (like classify something or follow certain instructions), your dataset should reflect that (i.e., be in a prompt->response format for that task). If you want it to just “learn” a writing style, your data might be example texts in that style.

    Example: Suppose we want a model that answers chemistry questions better. We might assemble a dataset of chemistry textbook Q&As or explanatory paragraphs. If we have 10,000 such pairs, that might be enough to fine-tune a model to be more chem-savvy.

  3. Preprocess data: Clean it up, split into training and validation sets (so we can measure performance on unseen data). Tokenize it using the model’s tokenizer (the text has to be converted into the token IDs that the model understands).

  4. Training (fine-tuning): Use a training loop (with a framework like PyTorch plus Hugging Face’s Trainer or similar) to adjust the model’s weights on your data. Essentially, the model sees an input (like a question), compares its output to the desired output (the reference answer), and nudges its weights to better match the reference. LLM fine-tuning is typically a supervised learning setup: you have input-output pairs. The process might take a few hours to a few days depending on data size and model size. Techniques like LoRA (Low-Rank Adaptation) and other parameter-efficient fine-tuning (PEFT) methods let you fine-tune faster and with less memory by training only small adapter weights instead of all billions of parameters. LoRA, in short, injects small trainable matrices into each layer and keeps the original model weights frozen – it’s very handy because at the end you don’t need to distribute a full model, just the small LoRA adapter weights (maybe a few hundred MB at most), which are applied on top of the base model. Many academic fine-tuning projects use LoRA; a sketch of this workflow appears after this list.

  5. Evaluation: Check how the fine-tuned model performs. Use some held-out questions or tasks. Does it indeed improve on your domain questions versus the base model? We often see that a well-fine-tuned model on a narrow domain outperforms a generic model for that domain’s questions.

  6. Deployment: Now you have a custom model. You can use it via the same libraries as any other model. If it’s small enough, you can run it on a local server or even share it (if the base model’s license allows you to publish a fine-tuned version). For example, many fine-tuned models are on the Hugging Face Hub (like “ModelName-finetuned-on-news-summarization”).
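To make the steps above concrete, here is a minimal sketch of parameter-efficient (LoRA) fine-tuning using Hugging Face’s transformers, datasets, and peft libraries. The base model name, the JSONL file names (chem_qa_train.jsonl, chem_qa_val.jsonl), and the hyperparameters are illustrative assumptions tied to the chemistry Q&A example above, not recommendations – adapt them to your own data, model, and hardware.

```python
# Minimal LoRA fine-tuning sketch (steps 2-6). Model name, file names, and
# hyperparameters are assumptions for illustration, not recommendations.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "meta-llama/Llama-2-7b-hf"          # assumed base model (step 1)
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token        # causal LMs often lack a pad token
model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto")

# Steps 2-3: load a hypothetical Q&A dataset where each record has a single
# "text" field already formatted as "Question: ...\nAnswer: ...", then tokenize.
data = load_dataset("json", data_files={"train": "chem_qa_train.jsonl",
                                        "validation": "chem_qa_val.jsonl"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

data = data.map(tokenize, batched=True, remove_columns=["text"])

# Step 4: wrap the model with small trainable LoRA adapters; base weights stay frozen.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="chem-lora", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-4,
                           logging_steps=50),
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    # The collator builds labels from the inputs for causal language modeling.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
print(trainer.evaluate())                        # Step 5: loss on the held-out set

# Step 6: save only the adapter weights (small), not the full 7B model.
model.save_pretrained("chem-lora-adapter")

# Quick smoke test with the adapter still attached.
prompt = "Question: What does molarity measure?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Details like gradient accumulation, longer sequence lengths, or quantized (4-bit) loading can be layered on for constrained hardware; the overall shape of the loop stays the same.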

Things to consider:

  • Data size: Fine-tuning doesn’t necessarily need millions of examples. Even a few thousand high-quality examples can significantly alter a model’s behavior. However, too little data may lead to overfitting (the model just memorizes and outputs your fine-tune data verbatim in some cases) or not making much difference.

  • Overfitting and original knowledge: Ideally, fine-tuning adapts the model without destroying its original broad knowledge. In practice, if your dataset is small, the model will mostly retain its general ability and just adjust to your use case. If your dataset is huge and one-sided, the model might become very domain-specific and lose some generality. There are ways to mitigate forgetting, like mixing some original data into training, but that’s a more advanced process.

  • Compute requirements: To fine-tune a 7B model, you might need at least a modern GPU with ~16GB of VRAM (resources like this are available in the Freedman Center). For a 30B model, a 24–32GB GPU, or multiple GPUs, might be needed. It’s not trivial, but it’s far easier than training a 175B-parameter model from scratch.

  • Instruction tuning vs. other goals: If you want your model to follow instructions (like “As an AI assistant, do X and Y”), you might fine-tune on an instruction dataset. There are public datasets of instruction-response pairs – for example, the Stanford Alpaca dataset (which mimics GPT-3.5 outputs) or the Dolly dataset. Fine-tuning on those can make a formerly generic model respond to user prompts in a chatty, helpful way; many open chat models are basically base models fine-tuned on such instruction data (see the example after this list).

  • Legal/ethical: If fine-tuning on proprietary data, be mindful of licensing. Also, if you fine-tune on sensitive data (say, private health records for a health bot), that model could potentially regurgitate some of that data if prompted cleverly (because it’s effectively part of its weights). Techniques like differential privacy in training are possible but complex. So only fine-tune on data you’re allowed to use and that you wouldn’t mind the model learning.
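For a sense of what instruction data looks like, here is a hypothetical snippet in the Alpaca-style instruction/input/response layout, plus one common way to flatten each record into a single training string. The field names and the prompt template follow a widely used convention but are an assumption – match whatever format the dataset or base model you choose expects.

```python
# Hypothetical Alpaca-style instruction records and a simple prompt template.
# Field names and the "### ..." template are a common convention, not a standard.
records = [
    {"instruction": "Summarize the passage in one sentence.",
     "input": "Fine-tuning adapts a pre-trained model to a narrower task "
              "by training it further on a smaller, targeted dataset.",
     "response": "Fine-tuning specializes an existing model for a narrower "
                 "task with a small amount of extra training."},
    {"instruction": "Define 'molarity' for an undergraduate audience.",
     "input": "",
     "response": "Molarity is the number of moles of solute dissolved per "
                 "liter of solution."},
]

def to_training_text(rec):
    """Flatten one record into the single string the model is trained on."""
    if rec["input"]:
        return (f"### Instruction:\n{rec['instruction']}\n\n"
                f"### Input:\n{rec['input']}\n\n"
                f"### Response:\n{rec['response']}")
    return (f"### Instruction:\n{rec['instruction']}\n\n"
            f"### Response:\n{rec['response']}")

print(to_training_text(records[1]))
```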

To illustrate with a concrete scenario: Suppose a library wants an LLM that can answer patron questions about its special collections in a conversational manner. They have a digitized archive of letters and a catalog of those. They could fine-tune an open model on a custom dataset of Q&A about that archive (which they create from known frequently asked questions and their answers, plus some example entries from the archive). After fine-tuning, the model might be much better at understanding queries like “Do you have letters from person X in 1905?” and responding with details from the special collection (if the relevant data is embedded in the prompt or training). Fine-tuning aligns the model with the library’s context and reduces irrelevant or hallucinated answers about the collection.

Another example: A research group might fine-tune a model on their field’s literature (say, all papers about a niche topic). The resulting model could generate related work summaries or answer technical questions with jargon and knowledge of that field that a general model might not have seen often. 

It’s worth noting that fine-tuning is not always necessary. With prompt engineering and maybe some few-shot examples, a base model might do the job. But fine-tuning shines when you have a specific distribution of queries or a format that you want the model to handle consistently. It essentially bakes the behavior in, so you don’t have to write elaborate prompts each time.

Finally, there’s the concept of continual learning: updating a model with new data periodically so it stays up to date (e.g., with news or new research). Fine-tuning can be used in that sense too, though there is a risk of forgetting older information if you’re not careful. Some groups instead fine-tune embedding models for retrieval, or use retrieval-augmented generation (RAG), to keep a system current, because fine-tuning the whole LLM frequently is expensive.

In sum, fine-tuning is an accessible way to create a specialized LLM on top of an existing one. It involves additional training on targeted data and can greatly improve performance on niche tasks. For academic institutions, it offers a way to have AI models with more domain expertise (with the domain being whatever your data covers). Just remember to evaluate the fine-tuned model thoroughly – it can pick up unwanted quirks from the fine-tuning data (for example, if all your fine-tuning answers shared a certain bias, the model might amplify it). Quality and diversity of fine-tuning data are important.

On the next page are some exercises and resources to further your skills in working with LLMs and generative AI.