4 simple steps to set up an LLM on your machine and make it work for your use case!


Reading Time: 10 Minutes

LLMs, or Large Language Models, are one of the trending applications of AI. Their major use cases build on their comprehension and conversion abilities: text to text, audio to text, video to text, etc. In this post we will learn some basic concepts associated with LLMs and practically implement the Flan-T5 model, fine-tuned for finance-based content summarisation, on your local machine.

How I got introduced to LLMs (courses and additional learning)

I got introduced to ChatGPT at the end of 2022. After using the marvel and researching it, I found that I was quite late to the party, because the transformer architecture (the idea on which language models like ChatGPT are built) was released well before that.

So I searched for courses that properly teach the concepts from the fundamentals, and I found one on Coursera named “Generative AI with Large Language Models”.

The course mainly focuses on explaining the logic and code behind the phases of the “Adapt and Align Model” stage of the Generative AI project lifecycle.

[Figure: the Generative AI project lifecycle]

The entire course was structured around performing prompt engineering, fine-tuning, and reinforcement learning from human feedback on a dataset consisting of dialogues between two people.

The objective was to fine-tune and adjust a pre-built model to perform a summarisation task on that dataset.

Once I completed the course, I wanted to build a program that summarises finance-related content and gives the context of long finance paragraphs, which a layman sometimes cannot interpret, or in which important information could be missed.

In this post, I will explain the step-by-step process of how I processed the data and fine-tuned a Flan-T5 model to perform summarisation. I will also cover the issues and errors I ran into while tuning this model, so that you don’t have to go through them yourself and can design a model in your own area of interest and benefit from it.

Before actually getting into the explanation, I highly recommend checking out the course on Coursera if you want to learn from scratch. In these 10 minutes of your valuable time, I will try to give you the context of some of the core concepts required to understand the logic and the program, and finally explain the coding part.

I know there are some questions running in your head, like why we need a pre-trained model, and how we are going to source and process our data. Let’s clear them up one by one.


Why not build our own model 🤔

These pre-trained models consist of millions to billions of parameters and deliver state-of-the-art performance when fed the right data, fine-tuned properly, and, if required, refined with reinforcement learning.

Parameters generally refer to the weights and biases learned during the training process. These weights and biases are essential for the model’s ability to make predictions or generate meaningful output. The number of parameters in a model is an indication of its complexity and capacity to learn from data; it roughly corresponds to the number of connections between the nodes or neurons of the neural network. To get a good understanding of neural networks in AI, go through my post on neural networks by clicking here.
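
To make the idea of parameter count concrete, here is a minimal sketch (assuming the transformers and torch libraries are installed) that counts the parameters of the flan-t5-base model we will use later in this post:

from transformers import AutoModelForSeq2SeqLM

# Load the pre-trained model and add up the number of elements in every weight tensor.
model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base')
total_params = sum(p.numel() for p in model.parameters())
print(f'Total parameters: {total_params:,}')  # on the order of a couple of hundred million for flan-t5-base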

So, when we create a model of such scale, it usually takes days, even weeks or months, to train it. During this process of training and building the network, a lot of hardware has to run and a lot of resources are consumed for computation, which affects the environment by increasing carbon emissions through the electricity the machines use.

From published estimates, we know that training a state-of-the-art model like LLaMA requires a huge amount of compute, which equates to a lot of electricity and, in turn, carbon emissions (if non-renewable energy is used): roughly 1000 tons of carbon dioxide, comparable to what a few hundred large diesel or petrol pickup trucks emit over their lifetimes.

That is the amount of resources and carbon emissions incurred to build a language model that performs complex tasks. This is why we download a pre-trained model onto our local machine and use it, and in most cases where we experiment or research, we fine-tune that model to meet specific needs. We’ll develop a clear picture of fine-tuning a model in the practical implementation part of the post.


What is the Transformer architecture?

It is the basis on which LLMs like GPT and Flan-T5 are built. “Attention Is All You Need” is a research paper published by researchers at Google in 2017 that explains how the transformer architecture works. You can read the entire paper if you are interested, but here’s a very short summary of what it says about the encoder and decoder, without getting into all the details.

As you can see in the diagram below, the Nx means there are multiple layers of the same block stacked for more refined results; each layer learns different aspects of the language and the connections between the words in the given content.

You don’t need to worry about learning the entire architecture; only a few aspects need to be understood to get the essence of the idea.

[Figures: the Transformer encoder-decoder architecture from “Attention Is All You Need”]

The part that receives the input embeddings is known as the Encoder, and the other one is called the Decoder.

The encoder and the decoder can also be used separately, each serving a single purpose.

The decoder generates multiple possible continuations, and the best one is identified by its probability from the Softmax layer.
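
To make that last point concrete, here is a tiny, hypothetical sketch in PyTorch: the decoder produces a score (logit) for every token in the vocabulary, the Softmax layer turns those scores into probabilities, and the most probable token can then be selected.

import torch

logits = torch.tensor([2.0, 0.5, 1.0, -1.0])   # hypothetical scores over a 4-token vocabulary
probs = torch.softmax(logits, dim=-1)          # probabilities that sum to 1
print(probs)
print('most likely token index:', torch.argmax(probs).item())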

Encoding, in the context of machine learning for NLP (Natural Language Processing), means converting text into vectors. To give a rough idea: for a given sentence, the unique (distinct) words become the columns of a matrix, each word of the sentence occupies a row, and in each row the cell whose column matches that word is marked 1 while the rest are filled with 0.

This is an example of one-hot encoding; the same operation can be performed by importing OneHotEncoder from the sklearn.preprocessing module.

This is a simple example: Sentence: “large language models are amazing”.

            large  language  models  are  amazing
large         1        0        0     0      0
language      0        1        0     0      0
models        0        0        1     0      0
are           0        0        0     1      0
amazing       0        0        0     0      1
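
The same kind of matrix can be produced with scikit-learn's OneHotEncoder mentioned above; here is a minimal sketch (the column order may differ because the encoder sorts the unique words alphabetically):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

words = np.array('large language models are amazing'.split()).reshape(-1, 1)
encoder = OneHotEncoder(sparse_output=False)   # use sparse=False on older scikit-learn versions
one_hot = encoder.fit_transform(words)         # one row per word, one column per unique word
print(encoder.categories_[0])                  # the column order
print(one_hot)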

Similarly, there are other methods to perform encoding, but in our discussion we will focus on encoding our sentences with the tokeniser of a pre-trained model. Decoding is the process of retrieving the sentence back from the encoded state.

Example code:

from transformers import AutoTokenizer

model_name = 'google/flan-t5-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer('large language models are amazing', return_tensors='pt')

Output: {'input_ids': tensor([[ 101, 33, 1036, 81, 508, 1612, 825, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

The output is a dictionary containing two key-value pairs, where each key corresponds to a specific information about the tokenisation of the input text.

  1. input_ids: This tensor represents the tokenised input sequence. Each number in the tensor corresponds to a token ID. Token IDs are integers that uniquely identify each token in the vocabulary of the model.
    tensor([[ 101, 33, 1036, 81, 508, 1612, 825, 1]])

    These IDs represent the tokens obtained after tokenising the input text “large language models are amazing”. Each ID corresponds to a specific word or subword in the model’s vocabulary.
  2. attention_mask: This tensor is used to indicate which elements in the input_ids tensor should be attended to (given weight 1) and which should be ignored (given weight 0).
    tensor([[1, 1, 1, 1, 1, 1, 1, 1]])

    All values are set to 1, indicating that attention should be paid to all tokens in the input sequence. This is a common practice for single-sequence tasks, where you want the model to consider the entire input.

return_tensors='pt' is passed to the tokenizer because we will be using PyTorch to process and feed our data into the model; TensorFlow can be used as well (with return_tensors='tf').
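
To see decoding in action, the same tokenizer can turn the IDs back into text; a minimal sketch continuing from the example above:

encoded = tokenizer('large language models are amazing', return_tensors='pt')
decoded = tokenizer.decode(encoded.input_ids[0], skip_special_tokens=True)
print(decoded)   # 'large language models are amazing'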


How to choose a model and dataset 📊?

As seen earlier in the lifecycle diagram, the first step is to define our use case and clearly list out the objectives. There are a lot of base/foundation models built on top of the transformer architecture, such as GPT, LLaMA, FLAN-T5, BERT, etc. Each of them is designed for a specific use case and trained on a certain dataset.

The more parameters and memory a model has, the more sophisticated and complex the tasks it can perform. Flan-T5, created by Google, is an encoder-decoder model that underwent pre-training on a diverse set of unsupervised and supervised tasks, where every task is transformed into a text-to-text format. Flan-T5 is available in several parameter sizes, such as flan-t5-small, flan-t5-base, flan-t5-large, flan-t5-xl and flan-t5-xxl.

For our objective, given the constraints of limited RAM and processing power in a normal laptop/PC, we will adapt the flan-t5-base model. It consists of roughly 220 million parameters, a healthy amount for a pre-trained model used in real-world applications.

Now, when it comes to the dataset, again it's all about listing out the objectives and searching for the right dataset on the internet. Hugging Face is a good place to start; you can find most of the required datasets by typing keywords into the search bar of the Datasets section.

Going through the preview of a dataset gives good insight into its information and structure. Once the dataset is finalised, it can be loaded from Hugging Face using the datasets library in Python.

Related post: How to use Hugging Face and pipelines.


Building the Model⚙️

As seen earlier, any machine learning approach requires the data to be in a format the algorithm can process, and that format is, yes you guessed it right, numbers! This converted data is mostly represented as vectors, arrays and matrices.

Now let’s dive into the program and discuss the logic and meaning behind each segment one-by-one.

First, we need to install and upgrade a few libraries as given below.

pip install tensorflow --upgrade
pip install rouge-score
pip install evaluate
pip install datasets
pip uninstall -y accelerate # some systems have an older version installed, so I am uninstalling it and reinstalling
pip install accelerate

Here rouge-score and evaluate are libraries required for evaluation purposes (we will use them to score the model outputs at the end of this post). Accelerate is a library by Hugging Face that acts as a convenient wrapper around PyTorch: it lets us leverage PyTorch's functionality without needing to know PyTorch in depth, and it also applies techniques to optimise your code.

Now that the libraries are installed, the model and the module required for preparing the tokeniser are loaded from transformers, along with TrainingArguments and Trainer for training the model.

import torch
import time

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, TrainingArguments, Trainer

model_name = 'google/flan-t5-base'
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(model_name)

We need to print the model in order to understand the layers and, most importantly, identify the layers to fine-tune.

original_model

For the sake of saving space, I will print only a few lines of the encoder part.

(encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )

The important things to notice here are q, k, v and o, which stand for query, key, value and output respectively. We will be applying LoRA to the query and value modules. These module names may differ from model to model, so print and inspect the layers and the modules inside them before fine-tuning.
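
If you prefer not to read through the full printout, a small sketch like this lists the distinct linear sub-module names of any Hugging Face model, which is where LoRA targets are usually chosen from:

import torch.nn as nn

# Collect the last part of each linear layer's name, e.g. 'q', 'k', 'v', 'o'.
linear_names = set()
for name, module in original_model.named_modules():
    if isinstance(module, nn.Linear):
        linear_names.add(name.split('.')[-1])
print(linear_names)   # for flan-t5-base this includes q, k, v, o plus the feed-forward projections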

Choosing a dataset

After going through a number of datasets, I finally settled on the “causal-lm/finance” dataset from Hugging Face as the one that suits the requirements. Though it is not completely finance oriented, it has a good number of examples that can train the model well. Choosing a dataset with fewer than 10,000 rows to partially fine-tune a model will not yield great results.

from datasets import load_dataset

dataset = load_dataset("causal-lm/finance")

dataset
DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 62020
    })
    validation: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 6892
    })
})

We print the dataset object to get an idea of what the content holds. In this project we will only be needing the question and human_answers fields: human_answers will be the input and question is the contextualised output.

Since we are fine-tuning the model to contextualise a paragraph, i.e. compress it into a short sentence, we write a tokenize function that wraps a prompt around the input.

def tokenize_function(example):
    start_prompt = 'Summarize the following content into question \n'
    end_prompt = "\n Answer:"
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["human_answers"]]

    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True,
                                     return_tensors='pt').input_ids
    example['labels'] = tokenizer(example['question'], padding="max_length", truncation=True,
                                  return_tensors='pt').input_ids

    return example

NOTE: The returned object should only contain ‘input_ids’ and ‘labels’ as keys. Initially, when I was writing this code, I used different names for them, such as inputs and lab. The console kept throwing errors, and only after a long search through multiple programming communities did I learn that the keys must be named ‘input_ids’ and ‘labels’ for the trainer to process them properly.

Removing unnecessary fields and applying the tokenize function to the dataset:

dataset = dataset.remove_columns(['id','chatgpt_answers','source','embeddings','label']) 
tokenized_datasets = dataset.map(tokenize_function, batched=True) # processing data in batches can significantly speed up the tokenization process

tokenized_datasets = tokenized_datasets.remove_columns(['question','human_answers']) # because we have extracted the needed features and converted into encoded format.

Once our dataset is ready, it is time to fine-tune our model on the tokenised dataset using a technique called PEFT (Parameter Efficient Fine-Tuning). Before getting into the details, we have to understand what fine-tuning is.

To keep things simple, fine-tuning is essentially a process in which we alter the weights of the parameters in the model to suit our specific use case.

Under this process we have the option to either alter all the weights in the model (full fine-tuning), where we update the weights of all the parameters, or update only a small number of weights (parameter efficient fine-tuning).

As you might have guessed, the first advantage of choosing PEFT is reduced processing time: the model is ready much sooner (we're talking about roughly an 8-10x reduction in training time in general).

The second advantage is the ability to tackle the problem of catastrophic forgetting. For those hearing this term for the first time, it is a phenomenon in which the model's performance on a wide range of tasks degrades because full fine-tuning pushes it to focus on just a single task or behave in a static manner.

So, there are three methods to perform PEFT on a pre-trained model:

  1. Selective – selects a subset of the initial LLM parameters to fine-tune (see the sketch after this list).
  2. Reparameterisation – re-parameterises the model weights with a low-rank transformation (LoRA).
  3. Additive – it adds trainable layers or parameters to the model by freezing all the existing layers in LLM.
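
To illustrate the Selective idea mentioned in the list, here is a hypothetical sketch that freezes every weight and then unfreezes only the attention projections. It is for illustration only and is not the approach we use below (we will use LoRA through the peft library instead):

# Selective fine-tuning sketch: freeze everything, then unfreeze a chosen subset.
for param in original_model.parameters():
    param.requires_grad = False                     # freeze the whole model

for name, param in original_model.named_parameters():
    if name.endswith(('.q.weight', '.v.weight')):   # unfreeze only the query/value projections
        param.requires_grad = True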

In this post we will be leveraging LoRA (Low-Rank Adaptation) to partially fine-tune our model to contextualise finance-related content.

The idea here is to freeze most of the original LLM's weights, introduce two rank-decomposition matrices, train the weights of those small matrices, and finally add their product to the original weights.

These images should give a basic idea of what is happening behind the code. The weights in the model's network exist as matrices of order n x n; we freeze most of the weight matrices, choose a certain matrix, and add a computed matrix to it.

The computed matrix is obtained by multiplying two low-rank matrices whose product comes out to be n x n: the first matrix is of order n x p and the second is p x n. If you want to dive deeper into LoRA, you can check out this excellent explanation by Entry Point AI on YouTube.

Link: Low-rank Adaptation: LoRA Fine-tuning & QLoRA Explained In-Depth

[Figures: LoRA rank-decomposition diagrams. Image credits: Entry Point AI]

So the higher the rank, the more information we can teach the model through the adaptation, and the more weights we have to train.
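
To get a feel for the trade-off, here is a small back-of-the-envelope sketch: for one 768 x 768 weight matrix (the size of the q and v projections we printed above), the two low-rank matrices hold only a small fraction of the weights unless the rank is large.

d = 768                       # input/output dimension of the q and v projections in flan-t5-base
full = d * d                  # weights in one full projection matrix
for r in (8, 16, 64):
    lora = (d * r) + (r * d)  # weights in the two low-rank matrices (n x p and p x n)
    print(f'rank {r}: {lora:,} LoRA weights vs {full:,} full weights ({100 * lora / full:.1f}%)')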

NOTE: low-rank adaptation will be applied only to the query (“q”) and value (“v”) modules. The other modules, such as key (“k”) and output (“o”), will not be subject to low-rank adaptation.

from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=64, # Rank
    lora_alpha=64,
    target_modules=["q", "v"],
    lora_dropout=0.05, # 5% dropout applied to the LoRA layers to avoid overfitting (a model giving poor output for unseen data)
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM
)
peft_model = get_peft_model(original_model,
                            lora_config)

lora_alpha determines the scaling factor applied to the weight changes when the computed matrix is added to the original weights.

For example, if the rank is 64 and lora_alpha is 16, the learned update is scaled by 16/64 = 1/4 before being added, so it contributes at a quarter of its raw strength.

Here I am tuning with a rank and lora_alpha of 64, so the scaling factor is 1 and the update learned for the q and v modules is added at full strength. I went with a relatively high rank because the dataset is not as large as I would like, and 64 gave better results during testing than other values.
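
Regardless of the alpha-to-rank ratio, you can check how many weights LoRA actually trains with the helper that the peft library attaches to the wrapped model:

peft_model.print_trainable_parameters()
# prints something like: trainable params: ... || all params: ... || trainable%: ...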

Now we connect to our Drive through the google.colab library to save our model once we train it. Executing the cell opens a new window that asks for permission to access the files in your Drive, so if you are not comfortable using your primary Google account, create another account for this.

from google.colab import drive
drive.mount('/content/gdrive')

per_device_train_batch_size is a very important parameter for parallel processing. If your local computer or PC has a small amount of RAM and only a CPU, parallel processing cannot really be achieved. Even if you work on Google Colab, the maximum RAM limit is 12.7 GB for the free GPU and TPU runtimes; to upgrade you need to purchase the Colab Pro+ or enterprise service.

I encourage you to set per_device_train_batch_size accordingly; you can use trial and error to get the maximum performance and speed out of the free version of Colab.

from transformers import TrainingArguments, Trainer

output_dir = f'/content/gdrive/MyDrive/finance_peft_training -{str(int(time.time()))}' # create the training output directory stamped with the current time

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=6, # if the batch size is too big, then 'cuda out of memory error' may arise for machines with small memory
    learning_rate=1e-3,
    num_train_epochs=40, # number of passes over the training data
    logging_steps=1,
    max_steps=40,
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)


peft_trainer.train() # training the model

We save the configuration and weights of the trained model to a directory so that, even after the runtime has disconnected or local variables are erased, the model can be loaded without running the whole training process again.

peft_model_path="/content/gdrive/MyDrive/peft-finance-checkpoint"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

Loading and Testing the model


from peft import PeftModel, PeftConfig


model_name = 'google/flan-t5-base'

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

peft_model = PeftModel.from_pretrained(peft_model_base,
                                       '/content/gdrive/MyDrive/peft-finance-checkpoint',
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False)


from transformers import GenerationConfig

# switch to CPU, since a GPU is not needed to test the tuned model's performance; it also keeps all the tensors on one device
peft_model = peft_model.to('cpu')
original_model = original_model.to('cpu')

content = 'The market is currently navigating through a phase of uncertainty. Despite a persistent dominance by growth trades, overall economic growth expectations remain subdued. The forthcoming economic data, particularly jobless claims and nonfarm payroll figures, are anticipated to be critical in shaping market sentiments.'

prompt = f"""
Summarize the following\n

{content}

\nAnswer:"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=350, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

print('--------------------------------------------------')
print(f'BASELINE TEXT:\n{content}')
print('--------------------------------------------------')


peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=350, num_beams=1, do_sample=True, temperature=0.5))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)


print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print('--------------------------------------------------')
print(f'PEFT MODEL: {peft_model_text_output}')

--------------------------------------------------
BASELINE TEXT:
The market is currently navigating through a phase of uncertainty. Despite a persistent dominance by growth trades, overall economic growth expectations remain subdued. The forthcoming economic data, particularly jobless claims and nonfarm payroll figures, are anticipated to be critical in shaping market sentiments.
--------------------------------------------------
ORIGINAL MODEL:
The outlook for the market is largely unchanged.
--------------------------------------------------
PEFT MODEL: Despite the recent strong performance, the market remains a little jittery in the wake of the recent economic growth.
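
Finally, the rouge-score and evaluate libraries we installed at the start can put numbers on this comparison. A minimal sketch, where the reference summary below is purely hypothetical and only there to show the call:

import evaluate

rouge = evaluate.load('rouge')
reference = ['Markets are uncertain and upcoming jobs data will shape sentiment.']  # hypothetical reference summary

for name, prediction in [('ORIGINAL MODEL', original_model_text_output),
                         ('PEFT MODEL', peft_model_text_output)]:
    scores = rouge.compute(predictions=[prediction], references=reference)
    print(name, scores)   # rouge1, rouge2, rougeL, rougeLsum values between 0 and 1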

We have just scratched the surface of LLMs and there is a lot more to explore. Check out the different models and datasets on Hugging Face and try them out. I highly encourage you to run the program we have seen on your own machine and see the results yourself, play with the hyper-parameters, and apply the model to your use case.

Stay tuned for part 2 of this post, where the final model we obtained is upgraded to give better results for dynamically sized and complex sentences.

Congratulations on making it through this post! Your dedication to understanding the intricacies of everything from data acquisition and processing to model training and testing is amazing. As we wrap up, I'm eager to hear your thoughts and questions. Drop your comments below, and don't forget to subscribe to Sapiencespace. Stay curious, stay engaged, and unlock a world of continuous insights by enabling notifications.

Click here to view similar insights.
