Continued Pretraining of Llama 3.2 3B for Legal Domain Knowledge
In this article, I’ll walk you through how we’re enhancing Meta’s Llama 3.2 3B model with specialised knowledge of UK legislation through continued pretraining. This is the first step in creating a legally-aware language model, where we focus on injecting domain-specific knowledge into the base model before diving into task-specific fine-tuning.
Why Continued Pretraining?
When dealing with legal language, there’s a need to go beyond surface-level understanding. Continued pretraining is essential because it helps the model develop a much deeper appreciation for the specifics of UK legislation. Here’s why it’s worth the effort:
Domain Vocabulary: The model gets better at picking up on the nuances of legal terminology and phrasing, making it easier to handle complex legal texts.
Structural Knowledge: Legal documents tend to have a specific structure, and the model learns to recognise and navigate these patterns effectively.
Contextual Understanding: It’s not just about the words—understanding the context of legal language is key, and this step helps the model make sense of broader implications.
Knowledge Integration: By combining general language skills with legal expertise, the model becomes a much more capable tool for specialised tasks.
The Technical Stack
Making all this happen required a solid set of tools, each playing a critical role in streamlining the process. Here’s what I used:
Unsloth: A handy library that simplifies the training routines and speeds everything up without cutting corners.
LoRA: A brilliant approach to updating the model efficiently—it saves time, resources, and makes the whole process more practical.
Jupyter Notebook
Docker
Setting Up the Environment
Before jumping into the pretraining, the environment needs to be set up. Here’s how it looks in code:
import torch
from datasets import load_dataset
from transformers import TextStreamer
from unsloth import FastLanguageModel, is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments
Model Configuration
To optimise for modern GPU hardware (I used an RTX 3090 for testing), these are the configurations:
max_seq_length = 1024
dtype = torch.bfloat16 if is_bfloat16_supported() else torch.float16
load_in_4bit = True
This lets us process long sequences of legal text without running into memory issues.
Loading the Base Model
For this project, I started with the Llama 3.2 3B model:
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/Meta-Llama-3.1-8B",
max_seq_length = max_seq_length,
dtype = dtype,
load_in_4bit = load_in_4bit,
)
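Since TextStreamer is already imported, it’s easy to sanity-check the base model before any training. This is just an optional sketch; the prompt below is purely illustrative.
FastLanguageModel.for_inference(model)  # switch Unsloth to its fast inference mode

prompt = "Under UK law, a statutory instrument is"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=64)

FastLanguageModel.for_training(model)  # switch back before attaching LoRA adapters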
Pretraining Configuration
Using LoRA, I set up the model for efficient updates. This approach keeps the training light on hardware while ensuring the model integrates new knowledge effectively:
model = FastLanguageModel.get_peft_model(
model,
r = 16,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha = 16,
lora_dropout = 0.1,
bias = "none",
use_gradient_checkpointing = True,
random_state = 3407,
use_rslora = True,
)
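It’s worth confirming just how small the trainable footprint is after attaching the adapters. A quick, framework-agnostic check:
# Count how many parameters LoRA actually leaves trainable versus the full model.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
With r = 16 across the attention and MLP projections, only a small fraction of the 3B parameters is updated, which is what keeps the run tractable on a single RTX 3090.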
Preparing the Dataset
For the pretraining, I used a comprehensive dataset covering UK legislation from Hugging Face: santoshtyss/uk_legislation. Big thanks to the creator! The idea was to expose the model to a wide variety of legal texts. Here’s the code to set it up:
dataset = load_dataset("santoshtyss/uk_legislation", split="train[:100%]")
def formatting_prompts_func(examples):
    return {"text": [example + tokenizer.eos_token
                     for example in examples["text"]]}

dataset = dataset.map(
    formatting_prompts_func,
    batched=True,
    num_proc=8,
)
I didn’t bother with validation splits at this stage, as the focus was on pretraining rather than fine-tuning for specific tasks.
Training Configuration
Here’s the configuration I used for training. It’s designed to ensure the model absorbs knowledge efficiently without overloading the hardware or overfitting. I trained for 5 epochs.
trainer = UnslothTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=max_seq_length,
dataset_num_proc=8,
args=UnslothTrainingArguments(
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
warmup_ratio=0.05,
num_train_epochs=5,
learning_rate=5e-4,
embedding_learning_rate=2e-5,
fp16=not is_bfloat16_supported(),
bf16=is_bfloat16_supported(),
save_strategy="steps",
save_steps=100,
save_total_limit=3,
logging_steps=10,
optim="adamw_8bit",
weight_decay=0.01,
lr_scheduler_type="cosine",
output_dir="outputs",
report_to="tensorboard",
),
)
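With everything configured, launching the run and saving the resulting LoRA adapter takes only a couple of lines. The output directory name below is just an example:
# Run continued pretraining; checkpoints are written to "outputs" per the config above.
trainer_stats = trainer.train()

# Save the LoRA adapter and tokenizer for later merging or task-specific fine-tuning.
model.save_pretrained("outputs/lora_adapter")      # example path
tokenizer.save_pretrained("outputs/lora_adapter")
Because report_to="tensorboard" is set, the training loss curve can be followed live with tensorboard --logdir outputs.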
What Makes This Pretraining Stand Out?
This approach isn’t your typical setup. There are a few key things that make it unique and, in my opinion, really effective:
Using the Full Dataset: Instead of cherry-picking data, I used the entire dataset to give the model exposure to as much legal language as possible. The more it sees, the better it learns.
Keeping the Text Natural: I kept preprocessing to a minimum; beyond appending the end-of-sequence token, the legislation is presented in its original form, which I believe helps the model learn from real-world legal documents.
Balanced Learning Rate: The learning rates were carefully tuned to strike a balance—enough to introduce new knowledge without wiping out what the model already knows.
Efficient Memory Use: Thanks to 4-bit quantisation, the model fits comfortably on a regular consumer GPU, which made the whole process much more practical (a quick way to check peak memory usage is sketched below).
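As a rough check of that memory claim on your own hardware, PyTorch’s allocator statistics can be printed after a few training steps; a minimal sketch:
# Peak GPU memory reserved by PyTorch during training, reported in gigabytes.
peak_gb = torch.cuda.max_memory_reserved() / 1024**3
print(f"Peak reserved GPU memory: {peak_gb:.2f} GB")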
Final Thoughts
This is just the beginning, but it’s an important first step. By focusing on continued pretraining, we’re setting the stage for a language model that truly understands legal concepts and the nuances of legal language. The goal is to make the model much more effective and reliable in real-world legal applications, and this groundwork is what will make that possible.
What’s Next?
In Part 2, I’ll cover:
How to create a fine-tuning dataset using GPT-4o mini.
How to fine-tune the model on specific legal questions and answers.
Testing its performance to see whether its answers are better than the vanilla model’s.
Repository
The complete code for this project is available on GitHub here.
To download this base model, visit my Hugging Face.