We'll train a real, simple GPT language model from scratch, using Python on macOS. Writing the code is quick (around 10 minutes), and training takes around 30-60 minutes on a Mac.
We'll use the Tiny Shakespeare dataset, a tiny (roughly 1MB) text file of Shakespeare's writing. It's ideal for training our GPT to generate authentic-sounding Shakespearean text, and small enough to train on a Mac quickly.
Our final model will generate text in a Shakespearean style, like this:
Here's a video guide of me building this:
Let's get started!
What We'll Cover
- Setting up the development environment on macOS
- Understanding the GPT architecture
- Preparing the Tiny Shakespeare dataset for training
- Training the GPT model
- Generating text
Prerequisites
- A Mac running macOS (preferably with an M1 or M2 chip)
- Python 3.8 or higher installed
Let's go
1.1 Set Up a Virtual Environment
Create a project directory and navigate into it:
mkdir gpt-macos
cd gpt-macos
Create a virtual environment:
python3 -m venv venv
Activate the virtual environment:
source venv/bin/activate
1.2 Install Required Libraries
We'll install:
- transformers by Hugging Face
- datasets by Hugging Face
- torch (PyTorch) with MPS support
- tqdm for progress bars
Installing PyTorch with MPS Support
Note: As of September 2023, MPS support is available in the stable PyTorch release. You can install it via:
pip install -U torch torchvision torchaudio transformers datasets accelerate tqdm
Deeper Explanation
- PyTorch with MPS: Allows you to utilize the GPU on Apple Silicon Macs for faster training.
- transformers: Hugging Face's library of pre-trained Transformer models, tokenizers, and training utilities.
- datasets: Simplifies access to datasets and efficient data preprocessing.
1.3 Verify the Installation
Create a file check_installation.py and add:
import transformers
import datasets
import torch
print(f"Transformers version: {transformers.__version__}")
print(f"Datasets version: {datasets.__version__}")
print(f"Torch version: {torch.__version__}")
print(f"MPS Available: {torch.backends.mps.is_available()}")
Run the script:
python check_installation.py
Note: Patience is required. This takes around 90 seconds to run on my M3 Air (24GB RAM).
After you've waited for the script to complete, you should see something like:
(venv) your-username@your-mac-name % python3 check_installation.py
Transformers version: 4.44.2
Datasets version: 3.0.0
Torch version: 2.6.0.dev20240917
MPS Available: True
Deeper Explanation
- torch.backends.mps.is_available(): Checks if the MPS backend is available, indicating that you can use your Mac's GPU for training.
Step 2: Understanding the GPT Architecture (Optional)
We are training a GPT (Generative Pre-trained Transformer), based on the Transformer architecture. This relies on attention mechanisms to process sequences of data.
Learning about these components is optional for our guide, but might be interesting. I've added analogies for each if you want to grasp the mechanisms at a high level.
So, the key components in our model will be:
- Self-Attention mechanism
My analogy for self-attention: focusing on different words as you read a sentence.
Imagine you're reading a sentence. As you read each word, you naturally pay attention to other words that help you understand its meaning. The self-attention mechanism is like this natural reading process, but for computers. It helps the computer "focus" on the most important words when trying to understand or generate text.
Here's a nice example by Deepgram:
Consider the following sentences: "My dog has black, thick fur as well as an active personality. I also have a cat with brown fur. What is the breed of my dog?" Without attention, the model would assign equal importance to the information about the cat and the dog, which could lead to incorrect or misleading answers. However, with attention, a well-trained language model would assign less attention to the phrase "brown fur" because it is irrelevant to the question being asked.
From a well-written article: Visualizing and Explaining Transformer Models from the Ground Up
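To make the analogy concrete, here's a toy scaled dot-product attention calculation in PyTorch. This is just an illustration, not the multi-head, masked implementation GPT-2 actually uses:
import torch
import torch.nn.functional as F

# Four "words", each represented as an 8-dimensional vector.
x = torch.randn(4, 8)

# Learned projections (random here) turn each word into a query, key and value.
W_q, W_k, W_v = torch.randn(8, 8), torch.randn(8, 8), torch.randn(8, 8)
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / (8 ** 0.5)        # how relevant each word is to every other word
weights = F.softmax(scores, dim=-1)  # each row sums to 1: the "focus" spread over the words
output = weights @ V                 # each word becomes a weighted mix of the others
print(weights)
Each row of weights is like a reader deciding which other words to focus on while reading one word.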
- Positional encoding
My analogy for positional encoding: reading the words of a sentence in order.
When you read a sentence, you know that the order of words matters. "The cat chased the mouse" means something different from "The mouse chased the cat." Positional encoding is a way to tell the computer about word order, so it understands that the position of words in a sentence is important.
More explanation here in "What is positional encoding and why do we need it?"
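If you'd like to see one concrete way of doing this, here's a small sketch of the classic sinusoidal positional encoding from the original Transformer paper. (GPT-2 itself learns its position embeddings during training, but the idea, giving each position a distinctive vector that gets added to the word embedding, is the same.)
import torch

def sinusoidal_positions(seq_len, dim):
    # Each position gets a unique pattern of sine/cosine values across the embedding dimensions.
    positions = torch.arange(seq_len).unsqueeze(1).float()        # shape (seq_len, 1)
    div_terms = 10000 ** (torch.arange(0, dim, 2).float() / dim)  # shape (dim / 2,)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(positions / div_terms)
    pe[:, 1::2] = torch.cos(positions / div_terms)
    return pe

# Each row is the "fingerprint" for one position.
print(sinusoidal_positions(seq_len=6, dim=8))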
- Decoder layers
My analogy for decoder layers: Editing a draft to a final version.
Decoder layers are like multiple rounds of editing a story. When you first write a draft, you get your main ideas down. In the first editing round, you might fix grammatical errors and improve sentence structure. In the next round, you enhance the flow and clarity, and in subsequent rounds, you add more details and refine the narrative. Similarly, each decoder layer takes the initial information and progressively refines it, enhancing the model's ability to generate coherent and accurate text.
Alternative analogy and much more explanation here: The Encoder-Decoder Concept
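To put the "rounds of editing" picture into code, here's a heavily simplified, illustrative GPT-style decoder block in PyTorch. The real GPT-2 block differs in details, but the shape is the same: masked self-attention followed by a feed-forward "refinement", each with a residual connection.
import torch
import torch.nn as nn

class MiniDecoderBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # Causal mask: each position may only attend to itself and earlier positions.
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                 # first "editing pass"
        x = x + self.mlp(self.norm2(x))  # second "editing pass"
        return x

# GPT-2 small stacks 12 such blocks; here we stack 2 and run a dummy batch through them.
blocks = nn.Sequential(MiniDecoderBlock(), MiniDecoderBlock())
print(blocks(torch.randn(1, 10, 64)).shape)  # torch.Size([1, 10, 64])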
Optional explanation over. Let's continue building!
Step 3: Preparing the Tiny Shakespeare Dataset
We'll use the Tiny Shakespeare dataset. This is a single plain-text file of Shakespeare's writing, and it's small enough to train on quickly on a Mac.
3.1 Download the Dataset
Create a script prepare_data.py and add the following code:
import requests
# Download the Tiny Shakespeare dataset
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
response = requests.get(url)
data = response.text
# Save to file
with open('tiny_shakespeare.txt', 'w') as f:
    f.write(data)
(We could download the file by hand, but it's neater to script it.)
Run the script:
python prepare_data.py
3.2 Load and Explore the Dataset
Update prepare_data.py to contain the following code:
import requests
from datasets import Dataset
# Download the Tiny Shakespeare dataset
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
response = requests.get(url)
data = response.text
# Save to file
with open('tiny_shakespeare.txt', 'w') as f:
    f.write(data)
# Load data into a Hugging Face Dataset
raw_data = Dataset.from_dict({'text': [data]})
# Print the first 500 characters
print(raw_data['text'][0][:500])
And run the script to see a sample of the text.
python prepare_data.py
3.3 Tokenization
We need to tokenize the text using a tokenizer compatible with our GPT model.
Add this code to the end of prepare_data.py:
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
Deeper Explanation
- Tokenizer: Converts text to numerical tokens the model can understand.
- GPT2Tokenizer: Uses Byte Pair Encoding (BPE) to efficiently handle rare words.
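If you're curious what the tokenizer actually produces, you can temporarily add a quick check after the line above (purely illustrative; it isn't part of the final script later in this step):
# Optional: see how a line of Shakespeare becomes BPE tokens and IDs.
sample = "O Romeo, Romeo! wherefore art thou Romeo?"
ids = tokenizer.encode(sample)
print(tokenizer.convert_ids_to_tokens(ids))  # the BPE pieces (subword chunks)
print(ids)                                   # the integer IDs the model actually sees
print(tokenizer.decode(ids))                 # decoding round-trips back to the text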
3.4 Preprocess the Data
Tokenize and prepare the data for the model.
Again, add this code to the end of prepare_data.py:
# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['text'], return_special_tokens_mask=True)

tokenized_dataset = raw_data.map(
    tokenize_function,
    batched=True,
    num_proc=4,
    remove_columns=["text"]
)
Deeper Explanation
- tokenize_function: Applies the tokenizer to the 'text' field.
- batched=True: Processes multiple examples at once (efficient for larger datasets).
- num_proc: Number of processes for parallelization.
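A quick optional sanity check (not part of the final script) shows what the tokenized dataset now holds:
# Optional: the 'text' column is gone; we now have token IDs and masks instead.
print(tokenized_dataset)
print(tokenized_dataset[0]['input_ids'][:20])  # the first 20 token IDs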
3.5 Group Texts into Blocks
Since GPT models expect inputs of a fixed size, we'll split the data into blocks.
So, add this code to the end of prepare_data.py:
block_size = 128

def group_texts(examples):
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated['input_ids'])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [concatenated[k][i:i + block_size] for i in range(0, total_length, block_size)]
        for k in concatenated.keys()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_dataset = tokenized_dataset.map(
    group_texts,
    batched=True,
    num_proc=4,
)
Our resulting lm_dataset contains chunks of block_size tokens with corresponding labels.
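If you want to confirm the chunking worked, another optional check (again, not part of the final script) is:
# Optional: count the 128-token blocks and peek at the first one.
print(len(lm_dataset))                               # number of training blocks
print(tokenizer.decode(lm_dataset[0]['input_ids']))  # first block decoded back to text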
3.6 Save the Processed Dataset
Add this code to the end of prepare_data.py:
# Save the dataset to disk
lm_dataset.save_to_disk('lm_dataset')
Run the script:
python prepare_data.py
All of prepare_data.py
Just as a quick check, the final prepare_data.py (with all imports at the top) should look like this:
import requests
from datasets import Dataset
from transformers import GPT2Tokenizer
# Download the Tiny Shakespeare dataset
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
response = requests.get(url)
data = response.text
# Save to file
with open('tiny_shakespeare.txt', 'w') as f:
    f.write(data)
# Load data into a Hugging Face Dataset
raw_data = Dataset.from_dict({'text': [data]})
# Print the first 500 characters
print(raw_data['text'][0][:500])
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
def tokenize_function(examples):
    return tokenizer(examples['text'], return_special_tokens_mask=True)

tokenized_dataset = raw_data.map(
    tokenize_function,
    batched=True,
    num_proc=4,
    remove_columns=["text"]
)
block_size = 128
def group_texts(examples):
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated['input_ids'])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [concatenated[k][i:i + block_size] for i in range(0, total_length, block_size)]
        for k in concatenated.keys()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_dataset = tokenized_dataset.map(
    group_texts,
    batched=True,
    num_proc=4,
)
lm_dataset.save_to_disk('lm_dataset')
Step 4: Training the GPT Model
Now we'll set up and train our GPT model using the processed dataset.
4.1 Load the Dataset and Model
Create a new script train.py and add:
from datasets import load_from_disk
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
# Load the dataset
lm_dataset = load_from_disk('lm_dataset')
# Load the tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained('gpt2')
Deeper Explanation
- GPT2LMHeadModel: GPT-2 model with a language modeling head on top, suitable for text generation tasks.
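If you're curious about what you just loaded, an optional check prints the size of the GPT-2 "small" architecture:
# Optional: inspect the GPT-2 'small' architecture we just loaded.
print(model.config.n_layer, model.config.n_head, model.config.n_embd)  # 12 layers, 12 heads, 768-dim embeddings
print(f"{model.num_parameters():,} parameters")                        # roughly 124 million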
4.2 Configure the Device (CPU or MPS)
Add this code to the end of train.py:
import torch

if torch.backends.mps.is_available():
    device = torch.device("mps")
    print("Using MPS backend")
else:
    device = torch.device("cpu")
    print("Using CPU")

model.to(device)
You'll want to be using MPS if you're on an Apple Silicon Mac. Training is roughly 10x faster than on the CPU, because the MPS backend runs the large matrix multiplications at the heart of transformer training on the GPU.
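If you want to see the difference for yourself, here's a rough, optional micro-benchmark (the exact speed-up will vary by machine; it isn't part of train.py):
import time
import torch

def bench(device):
    a = torch.randn(4096, 4096, device=device)
    b = torch.randn(4096, 4096, device=device)
    if device == "mps":
        torch.mps.synchronize()  # make sure setup work has finished before timing
    start = time.time()
    for _ in range(10):
        _ = a @ b
    if device == "mps":
        torch.mps.synchronize()  # wait for the GPU to finish before stopping the clock
    return time.time() - start

print(f"cpu: {bench('cpu'):.2f}s")
if torch.backends.mps.is_available():
    print(f"mps: {bench('mps'):.2f}s")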
4.3 Set Up Training Arguments
Append this to train.py:
training_args = TrainingArguments(
    output_dir='./results',
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    evaluation_strategy='steps',
    eval_steps=100,
    save_steps=500,
    logging_steps=50,
    learning_rate=5e-4,
    warmup_steps=100,
    save_total_limit=2,
    fp16=False,  # MPS backend currently doesn't support fp16 as of writing.
)
4.4 Initialize the Trainer
Once again, append this to train.py:
from transformers import DataCollatorForLanguageModeling

# Data collator for dynamic padding
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset,
    eval_dataset=lm_dataset,
    data_collator=data_collator,
)
4.5 Start Training
Finally, add this to the end of train.py:
print("Starting training...")
trainer.train()
print("Training complete. Saving model...")
trainer.save_model('./shakespeare_gpt2')
tokenizer.save_pretrained('./shakespeare_gpt2')
print("Model saved.")
The final train.py (with all imports at the top) should look like this:
from datasets import load_from_disk
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments, DataCollatorForLanguageModeling
import torch

# Load the dataset
lm_dataset = load_from_disk('lm_dataset')

# Load the tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Configure the device (MPS on Apple Silicon, otherwise CPU)
if torch.backends.mps.is_available():
    device = torch.device("mps")
    print("Using MPS backend")
else:
    device = torch.device("cpu")
    print("Using CPU")

model.to(device)

training_args = TrainingArguments(
    output_dir='./results',
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    evaluation_strategy='steps',
    eval_steps=100,
    save_steps=500,
    logging_steps=50,
    learning_rate=5e-4,
    warmup_steps=100,
    save_total_limit=2,
    fp16=False,  # MPS backend currently doesn't support fp16
)

# Data collator for dynamic padding
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset,
    eval_dataset=lm_dataset,
    data_collator=data_collator,
)

print("Starting training...")
trainer.train()
print("Training complete. Saving model...")
trainer.save_model('./shakespeare_gpt2')
tokenizer.save_pretrained('./shakespeare_gpt2')
print("Model saved.")
Now, run the script to train the model:
python train.py
You should see training progress logged to the terminal. It will take 30-60 minutes to run depending on your Mac. Once it's done, you should see the "Training complete" and "Model saved" messages.
Deeper Explanation
- trainer.train(): Initiates the training loop, handling forward and backward passes, optimizer steps, and logging.
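One optional extra: because save_steps=500 writes checkpoints into ./results, you can resume an interrupted run from the most recent checkpoint instead of starting over:
# Optional: resume training from the latest checkpoint in output_dir.
trainer.train(resume_from_checkpoint=True)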
Step 5: Evaluating and Generating Text
Now, let's use our trained model to generate some Shakespearean text!
5.1 Create a Text Generation Script
Create a new file called generate_text.py and add:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch
import argparse
# Load the tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('./shakespeare_gpt2')
model = GPT2LMHeadModel.from_pretrained('./shakespeare_gpt2')
# Define the pad_token as eos_token
tokenizer.pad_token = tokenizer.eos_token
# Update the model's configuration to recognize the pad_token_id
model.config.pad_token_id = tokenizer.pad_token_id
# Configure device
if torch.backends.mps.is_available():
    device = torch.device("mps")
    print("Using MPS backend")
else:
    device = torch.device("cpu")
    print("Using CPU")
model.to(device)
model.eval()
def generate_text(prompt, max_length=100):
    # Encode the prompt and generate attention_mask
    encoded = tokenizer(
        prompt,
        return_tensors='pt',
        padding=False,   # No padding needed for single inputs
        truncation=True,
        max_length=512   # Adjust based on model's max input length
    )
    input_ids = encoded['input_ids'].to(device)
    attention_mask = encoded['attention_mask'].to(device)
    with torch.no_grad():
        output = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,  # Pass the attention_mask
            max_length=max_length,
            num_beams=5,
            no_repeat_ngram_size=2,
            early_stopping=True,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id  # Ensure pad_token_id is set
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)

def main():
    parser = argparse.ArgumentParser(description="Generate text based on a prompt.")
    parser.add_argument("--prompt", nargs="?", default=None, help="The prompt to generate text from.")
    args = parser.parse_args()
    if args.prompt:
        print(f"Prompt: {args.prompt}")
        print(generate_text(args.prompt)[len(args.prompt):])
    else:
        prompts = [
            "O Romeo, Romeo! wherefore art",
            "All the world's a",
            "Is this a dagger which I"
        ]
        for prompt in prompts:
            print(f"Prompt: {prompt}")
            print(generate_text(prompt)[len(prompt):])
            print("\n" + "-"*50 + "\n")

if __name__ == "__main__":
    main()
Run the script:
python generate_text.py
Deeper Explanation
- Expected output: each prompt is printed, followed by the model's generated continuation and a divider line between prompts.
Step 6: Tips for Further Improvements
6.1 Experiment with Different Prompts
Try different prompts to see how the model responds.
prompts = [
    "O Romeo, Romeo! wherefore art thou",
    "All the world's a stage,",
    "Is this a dagger which I see before"
]
for prompt in prompts:
    print(f"Prompt: {prompt}")
    print(generate_text(prompt))
    print("\n" + "-"*50 + "\n")
6.2 Train More
- Increase Epochs: More epochs may improve the model's performance.
- Learning Rate: Fine-tune the learning rate for better convergence.
- Batch Size: If you have enough memory, increasing the batch size can stabilize training (an example configuration follows below).
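For example, you could swap the TrainingArguments in train.py for something like this (illustrative values; tune them to your Mac's memory and your patience):
training_args = TrainingArguments(
    output_dir='./results',
    overwrite_output_dir=True,
    num_train_epochs=5,             # more passes over the data
    per_device_train_batch_size=4,  # larger batches if memory allows
    per_device_eval_batch_size=4,
    evaluation_strategy='steps',
    eval_steps=100,
    save_steps=500,
    logging_steps=50,
    learning_rate=3e-4,             # a gentler learning rate can converge more smoothly
    warmup_steps=100,
    save_total_limit=2,
    fp16=False,
)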
6.3 Fine-Tune on Additional Data
- Add more Shakespearean works or other literature to the dataset.
- Combine Tiny Shakespeare with other texts to enrich the language (a sketch follows below).
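As a sketch, in prepare_data.py you could concatenate another plain-text file (here a hypothetical extra_corpus.txt you supply yourself) before building the Dataset; everything after that stays the same:
# Hypothetical example: merge an extra local text file into the training text.
with open('tiny_shakespeare.txt') as f:
    data = f.read()
with open('extra_corpus.txt') as f:  # any additional plain-text file you provide
    data = data + "\n" + f.read()

raw_data = Dataset.from_dict({'text': [data]})  # the rest of prepare_data.py is unchanged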
Conclusion
Congrats! You can now generate simple Shakespearean text using a GPT model that you trained from scratch on your Mac.