Dwain Barnes

Building SARA: A Lightweight Cybersecurity Assistant for Everyday Laptops





Introduction: The Challenge of an Offline Cybersecurity Assistant

The idea of building a cybersecurity AI assistant that could run entirely offline came from a project set by a colleague. The goal was a tool that didn’t require high-end hardware, so it could run on a standard laptop without an internet connection, while still providing clear, practical advice on everyday cybersecurity questions in a lightweight, efficient package.


The result was SARA—or Security Awareness and Resilience Assistant. Designed to be an accessible resource, SARA offers straightforward guidance to help users navigate online risks confidently. Here’s how I built her, from collecting and processing data to refining her conversational abilities, and why projects like this matter in making cybersecurity knowledge available to all.


 


Setting Goals and Planning

To create a model that could run offline and perform well on lower-spec hardware, I set three main priorities:

  • Efficiency: The model needed to run smoothly without taxing system resources.

  • Accuracy: SARA had to deliver reliable advice to help users protect themselves online.

  • Conversational Style: Cybersecurity can be overwhelming, so I wanted SARA to have an approachable and easy-to-understand tone.


With these goals in mind, I began planning how to train and fine-tune SARA. She wouldn’t handle complex cybersecurity scenarios but would focus on common, practical questions like “What makes a strong password?” or “How can I spot phishing emails?”



 


Gathering Data for Training

Good data is the foundation of any AI model, and since SARA would run offline, I needed a dataset that was comprehensive and relevant from the start. I began by collecting PDFs from trusted sources, covering essential topics like password management, phishing detection, malware, and safe browsing.


To process these PDFs efficiently, I used Docling (GitHub Repo). Docling made it easy to extract and convert the content into usable formats for training. This tool saved me a significant amount of time by automating the tedious task of pulling clean, readable text from various PDF formats. The processed text was added to SARA’s growing knowledge base.
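As an illustration, a minimal extraction loop with Docling might look like the following sketch (the folder names are placeholders rather than the exact paths from my pipeline):

from pathlib import Path
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
Path("extracted").mkdir(exist_ok=True)

for pdf in Path("pdfs").glob("*.pdf"):
    # Convert each PDF and export the parsed content as clean Markdown text
    result = converter.convert(str(pdf))
    text = result.document.export_to_markdown()
    Path("extracted", pdf.stem + ".md").write_text(text, encoding="utf-8")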

Additionally, I used Firecrawl to gather cybersecurity information from reputable websites. Combining data from both PDFs and online sources helped me create a robust dataset that covered a wide range of fundamental cybersecurity topics.



 


Cleaning and Preparing Data for Pre-training

With the data collected, the next step was preprocessing. I wrote Python scripts to format the content into {"text": ...} blocks that could be used to pretrain SARA. Cleaning involved removing duplicates, irrelevant content, and formatting inconsistencies.


Docling gave me well-structured extracted text, but additional refinement was still needed to ensure uniformity across the entire dataset. The result was a clean, well-organised collection of cybersecurity knowledge that would serve as the backbone for training SARA.
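As a rough sketch of what this preprocessing stage looked like (not the exact script), the following turns the cleaned text files into one {"text": ...} JSON object per line, skipping empty files and exact duplicates, and writes the SARA-Pretrain.jsonl file used later during training:

import json
from pathlib import Path

seen = set()
with open("SARA-Pretrain.jsonl", "w", encoding="utf-8") as out:
    for doc in sorted(Path("extracted").glob("*.md")):
        text = doc.read_text(encoding="utf-8").strip()
        # Drop empty files and exact duplicates before pretraining
        if not text or text in seen:
            continue
        seen.add(text)
        out.write(json.dumps({"text": text}) + "\n")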


 


Building the Fine-Tune Dataset

To train SARA effectively, I needed to create a custom dataset tailored specifically to cybersecurity topics. Using NVIDIA’s llama-3.1-nemotron-70b-instruct model, I generated an Alpaca-style Q&A dataset to help SARA learn how to provide conversational and concise responses. This dataset included practical cybersecurity questions and answers, focusing on scenarios users are likely to encounter.


Generating Q&A Pairs Using Nemotron Locally

To keep SARA’s responses practical and user-friendly, and to keep costs down, I used NVIDIA’s llama-3.1-nemotron-70b-instruct model, running it locally with 4-bit quantization on a dual RTX 3090 setup via Ollama (download from https://ollama.com/). I fed in the dataset I had collected for pretraining and got good Q&A pairs back. Below is the Python snippet used to generate them:


import logging
import time
from typing import Dict, List, Tuple

import ollama  # Python client for a locally running Ollama server


def generate_qa_pairs(text: str, model_name: str = "nemotron:latest") -> Tuple[str, List[Dict]]:
    """Generate cybersecurity-oriented question-answer pairs using the Nemotron model."""
    prompt = f"""You are an expert at creating question-answer pairs for training AI models focused on cybersecurity awareness.
    Create exactly 10 question-answer pairs from the following text. Follow these guidelines:
    1. Questions should be from a public/business perspective, focusing on:
       - Online safety and security best practices
       - Personal data protection
       - Safe internet browsing habits
       - Password and account security
       - Recognizing cyber threats and scams
       - Digital privacy awareness
       - Business cybersecurity essentials

    2. Format questions as if a member of the public or business person is asking for advice, such as:
       - "How can I protect my..."
       - "What should I do if..."
       - "Is it safe to..."
       - "What are the warning signs of..."
       - "How do I know if..."
       - "What's the best way to secure..."

    3. Answers should be:
       - Clear and practical for non-technical users
       - Minimum of one paragraph and maximum of three paragraphs
       - Focused on actionable advice
       - Including relevant safety precautions
       - Using plain language without jargon
       - Specific and highly detailed

    4. Example format:
    Q: "How can I tell if an email claiming to be from my bank is legitimate?"
    A: "First, check the sender's email address carefully - legitimate banks use official domain names, not similar-looking variations. Never click on links directly from the email; instead, manually type your bank's website address or use your existing bookmarks. Look for warning signs like urgent demands, threats, or requests for sensitive information, as legitimate banks never ask for passwords or full account details via email."

    Only output the questions and answers in the exact format shown in the example above.
    No other text or formatting should be included.

    Text to analyze: {text}"""
   
    try:
        print("🔄 Generating Q&A pairs...")
       
        response = ollama.generate(
            model=model_name,
            prompt=prompt,
            options={
                "temperature": 0.3,
                "top_k": 50,
                "top_p": 0.95,
                "num_predict": 2000
            }
        )
       
        response_text = response['response']
       
        # Parse Q&A pairs from plain text
        qa_pairs = []
        pairs = response_text.split('\n\nQ:')
        for pair in pairs:
            if not pair.strip():
                continue
           
            if not pair.startswith('Q:'):
                pair = 'Q:' + pair
               
            try:
                # Split into question and answer
                parts = pair.split('\nA:')
                if len(parts) == 2:
                    question = parts[0].replace('Q:', '').strip()
                    answer = parts[1].strip()
                    qa_pairs.append({
                        'question': question,
                        'answer': answer
                    })
            except Exception as e:
                print(f"Error parsing pair: {pair}")
                continue
       
        return response_text, qa_pairs
       
    except Exception as e:
        logging.error(f"Error generating QA pairs: {e}")
        # Back off briefly, then retry; give up on this chunk if the retry fails too
        time.sleep(2)
        try:
            return generate_qa_pairs(text, model_name)
        except Exception:
            return "", []

I then used another script to format the output into Alpaca-style JSON.
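That formatting script isn't reproduced here, but a minimal sketch of the idea, assuming the raw pairs were saved as qa_pairs_raw.json as in the sketch above, simply maps each question and answer onto the Alpaca instruction/input/output schema:

import json

with open("qa_pairs_raw.json", "r", encoding="utf-8") as f:
    qa_pairs = json.load(f)

alpaca_data = [
    {
        "instruction": pair["question"],
        "input": "",                  # these questions need no extra context
        "output": pair["answer"],
    }
    for pair in qa_pairs
]

with open("SARA_finetune.json", "w", encoding="utf-8") as f:
    json.dump(alpaca_data, f, indent=4)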

I also created a small Python script to clean up any duplicates, shown below.

import json

def remove_duplicates(file_path, output_file_path):
    # Open the file with 'utf-8' encoding and ignore errors
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
        data = json.load(file)

    # Use a set to track unique (instruction, output) pairs
    unique_entries = set()
    cleaned_data = []

    for entry in data:
        # Create a tuple of (instruction, output) to check for duplicates
        entry_tuple = (entry["instruction"], entry["output"])

        # If this pair is unique, add it to the set and to the cleaned data list
        if entry_tuple not in unique_entries:
            unique_entries.add(entry_tuple)
            cleaned_data.append(entry)

    # Write the cleaned data without duplicates to the output file
    with open(output_file_path, 'w', encoding='utf-8', errors='ignore') as outfile:
        json.dump(cleaned_data, outfile, indent=4)

# Use the function to remove duplicates and save the data
remove_duplicates('SARA_finetune.json', 'SARA_no_duplicates.json')


A Preview of the Fine-Tuning Data in Alpaca Format

Below is a sample of the fine-tuning data formatted in Alpaca style:


[
    {
        "instruction": "How can I create a strong password that's easy to remember?",
        "input": "",
        "output": "Combine three random words to make a unique password that's at least 15 characters long. You can also add numbers or special characters for extra security!"
    }
]


 


Choosing and Optimizing the Model

Choosing the right model was critical for meeting the project’s efficiency goals. I selected the Llama 3.2 3B base model because it balanced performance and size, making it suitable for lower-spec hardware. Using Unsloth’s 4-bit quantized version (unsloth/Llama-3.2-3B-bnb-4bit), I reduced computational requirements significantly while maintaining acceptable performance. Find out more about Unsloth at https://unsloth.ai/



Training the Model

Training SARA was one of the more technical parts of the process, requiring careful setup to handle the fine-tuning efficiently. Since I was working on a Windows system, setting up Unsloth presented several challenges due to dependency issues. To overcome these, I utilized a Docker image provided by Axolotl, which made the process significantly easier.


Setting Up the Training Environment

Here’s the workflow I followed to set up and train SARA on my local machine:

  1. Running the Docker container. I launched the Axolotl image using the following Docker command:

docker run --gpus "all" -p 8888:8888 --name unsloth -it winglian/axolotl:main-latest

This allowed the container to leverage my GPU and open port 8888 for Jupyter Notebook access.

  2. Installing Unsloth in the container. Inside the Docker container, I installed Unsloth with:

pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"
  3. Setting up Jupyter Notebook


I installed Jupyter Notebook and launched it using:

pip install jupyter
jupyter notebook --ip 0.0.0.0 --port 8888 --no-browser --allow-root

This command provided a local link that I could use to access the notebook in my browser on the host machine.


Pretraining and Fine-Tuning

Using the setup above, I pre-trained and fine-tuned the Llama 3.2 3B base model using the Alpaca-style fine-tune dataset prepared earlier. The process involved configuring the training pipeline for efficiency on my Intel Core i9-12900K CPU, NVIDIA GeForce RTX 4090 GPU, and 32GB RAM. Below is the code for the training process:


from unsloth import FastLanguageModel
import torch

# Enable Flash Attention if available
try:
    from flash_attn_patch import replace_llama_attn_with_flash_attn
    replace_llama_attn_with_flash_attn()
except ImportError:
    print("Flash Attention not available. Consider installing for faster training.")

max_seq_length = 1024 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-v0.3-bnb-4bit",      # New Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",           # Llama-3 15 trillion tokens model 2x faster!
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/llama-3-70b-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",        # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",             # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-bnb-4bit", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head",], # Add for continual pretraining
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,   # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

from datasets import load_dataset

# Load the pretraining corpus (one {"text": ...} object per line); use the full split
dataset = load_dataset('json',
                       data_files='SARA-Pretrain.jsonl',
                       split='train[:100%]')

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    return { "text" : [example + EOS_TOKEN for example in examples["text"]] }

dataset = dataset.map(formatting_prompts_func, batched = True,)

# Sanity-check the first few formatted examples
for row in dataset[:5]["text"]:
    print("=========================")
    print(row)

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 4,
    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_ratio = 0.1,
        num_train_epochs = 5,
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,
        max_grad_norm = 1.0,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

# Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

# Continued pretraining run on the raw text corpus
trainer_stats = trainer.train()

from datasets import load_dataset

alpaca_dataset = load_dataset('json',
                              data_files='SARA_finetune.json', split = "train")
print(alpaca_dataset[0])

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

alpaca_dataset = alpaca_dataset.map(formatting_prompts_func, batched = True,)

from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = alpaca_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 8,
    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        # Use num_train_epochs and warmup_ratio for longer runs!
        warmup_ratio = 0.1,
        num_train_epochs = 5,

        # Select a 2 to 10x smaller learning rate for the embedding matrices!
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.00,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

# Instruction fine-tuning run
trainer_stats = trainer.train()

model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")

# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# Quick test prompt to check the fine-tuned model's behaviour
inputs = tokenizer(
[
    alpaca_prompt.format(
        "give me an example of a strong password?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs.input_ids, attention_mask = inputs.attention_mask,
                   streamer = text_streamer, max_new_tokens = 1000, pad_token_id = tokenizer.eos_token_id)

Saving to float16 for vLLM

Unsloth also supports saving the merged model directly to float16. Select merged_16bit for float16 or merged_4bit for int4, or save just the LoRA adapters as a fallback. Use push_to_hub_merged to upload to your Hugging Face account (personal tokens are available at https://huggingface.co/settings/tokens).

# Merge to 16bit
if True: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")


GGUF / llama.cpp Conversion

Unsloth also supports saving to GGUF / llama.cpp natively: it clones llama.cpp and defaults to q8_0, with the other methods (such as q4_k_m) available too. Use save_pretrained_gguf for local saving and push_to_hub_gguf for uploading to Hugging Face. Some supported quantization methods (full list on the Unsloth wiki: https://github.com/unslothai/unsloth/wiki#gguf-quantization-options):

  • q8_0 - Fast conversion. High resource use, but generally acceptable.

  • q4_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.

  • q5_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

# Save to 8bit Q8_0
if True: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if True: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if True: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q5_k_m", token = "")

Optimization for Lightweight Deployment

To ensure SARA would run efficiently on standard laptops, I applied Unsloth’s 4-bit GGUF quantization, which significantly reduced the model's size and computational demands. This step maintained the balance between performance and usability, ensuring the model could deliver reliable responses even on lower-spec hardware, running entirely on the CPU and system RAM.
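As a rough illustration of what CPU-only use looks like, the exported GGUF file can be loaded with a lightweight runtime such as llama-cpp-python; the model path below is a placeholder, and this isn't the Ollama deployment covered in the next post:

from llama_cpp import Llama

# Load the quantized GGUF entirely on CPU and system RAM (no GPU required)
llm = Llama(model_path="model/sara-q4_k_m.gguf",  # placeholder path to the exported GGUF
            n_ctx=1024, n_threads=8)

# Reuse the Alpaca-style prompt format the model was fine-tuned on
prompt = ("Below is an instruction that describes a task, paired with an input that provides further context. "
          "Write a response that appropriately completes the request.\n\n"
          "### Instruction:\nHow can I spot a phishing email?\n\n### Input:\n\n\n### Response:\n")

out = llm(prompt, max_tokens=256, stop=["###"])
print(out["choices"][0]["text"].strip())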

With the training process complete, SARA was ready for offline deployment, optimized for real-world applications.


 


Conclusion: Making Cybersecurity Accessible

SARA demonstrates the potential of lightweight AI to democratise cybersecurity knowledge. While her initial version performs adequately, the project underscores the feasibility of creating accessible, offline tools that empower users to stay safe online. By building on this foundation, tools like SARA can ensure cybersecurity guidance is available to all, regardless of technical expertise or hardware limitations.



Coming up next: getting the model to run with Ollama.

