The Problem

Good quotes help make us stronger. What is truly inspiring about quotes is not their tone or contentedness but how those who share them reflect life experiences that really serve others.

I didn’t write the above quote about quotes (Quote-ception), but an AI model I trained did. And it says it better than I would have. Quotes mean different things to different people. Sometimes they inspire or motivate us; other times they make us think about life or religion, and sometimes they just make us laugh.

So, can we train an AI model to generate more quotes that make us think, laugh, or feel inspired? That was the motivation behind starting this journey.

TL;DR

I fine-tuned GPT-2 models on quotes with personas like Motivational/Inspirational, Funny, and Serious/Philosophical, and deployed them on a ready-to-use website: AI Quotes Generator.

The Model

On June 3, 2020, OpenAI released GPT-3, a mammoth language model trained on 570GB of internet text. People have put the multi-talented model to use in all kinds of applications ranging from creating app designs, to websites, to Excel functions which do nothing short of magic. But there was just one problem – the model weights were never released. The only way to access the model was through a paywalled API.

So, let’s time travel to November 2019, when GPT-2 was released. GPT-2, although not as powerful as GPT-3, was a revolution when it came to text generation. I remember my jaw dropping to the floor while reading the demo text generated by the model – the coherence and the grammatical syntax were near perfect. What I wanted from the model was not to be a magician, but to be able to generate perfectly structured English sentences. And for that, GPT-2 was more than sufficient.

Dataset

First, I needed a dataset. Scraping the web for quotes was one option, but before that I wanted to see if somebody had done it already. And bingo! Quotes-500k is a dataset of almost 500k quotes, all scraped from the web, along with tags like knowledge, love, friendship, etc.

Now I wanted the model to be able to generate quotes according to specific themes. Since I was planning to use a pretrained model, conditional text generation was not something that was easy to do. PPLM – a model, or rather a way of using pretrained models, that Uber released – attempts to do just this (a very interesting paper; be sure to check it out), but in the end I went another way. I decided to train three models, each fine-tuned on a specific genre of quotes.

There were three genres I considered – Motivational, Serious, and Funny. I used the Quotes-500k dataset and its tags to separate the quotes into these three buckets. For the Motivational dataset, I used tags such as love, life, inspirational, motivational, life lessons, dreams, etc. For Serious, I went with tags like philosophy, life, god, time, faith, fear, etc. And finally, for Funny, I just took tags like humour and funny.
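
As a rough sketch, the tag-based bucketing can be done in a few lines of pandas (the file name and column names here are assumptions; adjust them to however you load Quotes-500k):

import pandas as pd

# Assumed layout: a CSV with a "quote" column and a comma-separated "tags" column.
df = pd.read_csv("quotes_500k.csv")

MOTIVATIONAL_TAGS = {"love", "life", "inspirational", "motivational", "life lessons", "dreams"}

def has_any_tag(tags: str, wanted: set) -> bool:
    # True if any of the quote's tags falls in the wanted set.
    return bool(wanted & {t.strip().lower() for t in str(tags).split(",")})

motivational = df[df["tags"].apply(lambda t: has_any_tag(t, MOTIVATIONAL_TAGS))]["quote"]
motivational.to_csv("motivational_quotes.txt", index=False, header=False)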

Preprocessing

Nothing fancy here; just the basic hygiene (a code sketch follows the list). Things like:

  1. Converting to lowercase
  2. Replacing contractions like wasn’t with the full form, i.e., was not
  3. Removing HTML special entities (because this was a corpus scraped from the web)
  4. Removing extra whitespace
  5. Inserting whitespace between a word and punctuation
  6. Spellchecking and correction (using pyenchant)
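
A minimal sketch of these steps might look like the following (the contraction map is abbreviated, and the spellcheck step is omitted; pyenchant's Dict.check/suggest can fill that in):

import re

# Abbreviated contraction map; the real one should cover far more cases.
CONTRACTIONS = {"wasn't": "was not", "don't": "do not", "can't": "cannot"}

def clean_quote(text: str) -> str:
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"&\w+;|&#\d+;", " ", text)    # strip HTML entities like &amp;
    text = re.sub(r"([.,!?;:])", r" \1 ", text)  # pad punctuation with whitespace
    text = re.sub(r"\s+", " ", text)             # collapse extra whitespace
    return text.strip()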

You might have questions like – What about stop words? Lemmatization? Where are all of those?

Removing stop words and lemmatizing are not mandatory steps that you have to perform in every single NLP task. It really depends on the task. If we were doing text classification with a TF-IDF kind of model, then yeah, it makes sense to do all of that. But for text classification with a model which uses context, like a neural model or an n-gram model, less so. And if you are doing text generation or machine translation, then removing stop words and lemmatizing may actually hurt your performance, as we would be losing valuable context from the corpus.

Beginning and End of Sentence Tokens and Maximum Length

I also did not want the quotes to be too long. So I took a look at the dataset, plotted the frequencies of quote lengths, and decided on an appropriate length at which to cut off. In the motivational corpus, I discarded all quotes longer than 100 words.

Frequency of quote lengths (Motivational)
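
For reference, a histogram like the one above takes only a few lines of matplotlib (here, quotes is assumed to be the list of motivational quotes):

import matplotlib.pyplot as plt

# Plot the distribution of quote lengths (in words) to pick a cutoff.
lengths = [len(q.split()) for q in quotes]
plt.hist(lengths, bins=50)
plt.xlabel("Quote length (words)")
plt.ylabel("Frequency")
plt.show()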

Another important aspect of keeping the quotes short was the model’s ability to predict an end-of-sentence token. So, we wrap each quote with a beginning-of-sentence (bos) token and an end-of-sentence (eos) token.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
MAX_LEN = 100

# Wrap each quote with bos/eos tokens, skipping quotes longer than MAX_LEN words.
train_m = ""
with open(train_path, "r", encoding='utf-8') as f:
    for line in f.readlines():
        if len(line.split()) > MAX_LEN:
            continue
        train_m += (tokenizer.special_tokens_map['bos_token']
                    + line.rstrip()
                    + tokenizer.special_tokens_map['eos_token'])
with open(train_mod_path, "w", encoding='utf-8') as f:
    f.write(train_m)

test_m = ""
with open(test_path, "r", encoding='utf-8') as f:
    for line in f.readlines():
        if len(line.split()) > MAX_LEN:
            continue
        test_m += (tokenizer.special_tokens_map['bos_token']
                   + line.rstrip()
                   + tokenizer.special_tokens_map['eos_token'])
with open(test_mod_path, "w", encoding='utf-8') as f:
    f.write(test_m)

Training

Now that we have the dataset prepared, let’s start training the model. Transformers from Hugging Face is the obvious choice, with its easy-to-use API and amazing collection of models with pretrained weights. Thanks to Hugging Face, training or fine-tuning a model is a breezy affair.

We start off by loading the model and the tokenizer for GPT2.

from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelWithLMHead.from_pretrained('gpt2')

That’s it. You have the entire power of the huge pretrained model at your fingertips while you code.

Now, let’s define a function to load the dataset.

from transformers import TextDataset, DataCollatorForLanguageModeling

def load_dataset(train_path, test_path, tokenizer):
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=train_path,
        block_size=128)
    test_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=test_path,
        block_size=128)
    # mlm=False gives plain causal language modeling (no masked tokens).
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )
    return train_dataset, test_dataset, data_collator

train_dataset, test_dataset, data_collator = load_dataset(train_mod_path, test_mod_path, tokenizer)

Now that we have the datasets, let’s create the Trainer. This is the core class which handles the entire training process.

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./storage/gpt2-motivational_v6",  # the output directory
    overwrite_output_dir=True,  # overwrite the content of the output directory
    num_train_epochs=10,  # number of training epochs
    per_gpu_train_batch_size=32,  # batch size for training
    per_gpu_eval_batch_size=64,  # batch size for evaluation
    logging_steps=500,  # number of update steps between two evaluations
    save_steps=500,  # the model is saved after this many steps
    warmup_steps=500,  # number of warmup steps for the learning rate scheduler
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    prediction_loss_only=True,
)

By this point, we have all the ingredients necessary to start training – the model, the tokenizer, and the Trainer. All that is left to do is train the model. And once the training is done, we just need to save the model and tokenizer for our use.

trainer.train()

trainer.save_model("./storage/gpt2-motivational_v6")

tokenizer.save_pretrained("./storage/gpt2-motivational_v6")

I trained each of the three models for ~50 epochs on a single P5000 for a total of ~12 hours (excluding all the test runs and runs which had some problems).

Inference

We have trained the models and saved them. Now what? For inference, we take another brilliant feature from Hugging Face – pipelines. This awesome feature lets you put a model into production in as little as a few lines of code.

from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("./storage/gpt2-motivational_v6")
model = AutoModelWithLMHead.from_pretrained("./storage/gpt2-motivational_v6")
gpt2_finetune = pipeline('text-generation', model=model, tokenizer=tokenizer)
# gen_kwargs has different options like max_length,
# beam_search options, top-p, top-k, etc.
gen_text = gpt2_finetune(seed, **gen_kwargs)

gen_kwargs configures the text generation. I used a hybrid approach of top-k sampling with k=50 and top-p sampling with p=0.95. To avoid repetition in the generated text, I used no_repeat_ngram_size=3 and repetition_penalty=1.2.
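
Put together, the settings above translate into something like this (max_length is an assumed value; I’m matching it to the MAX_LEN used in preprocessing):

gen_kwargs = {
    "do_sample": True,           # sample instead of greedy decoding
    "top_k": 50,                 # keep only the 50 most likely next tokens
    "top_p": 0.95,               # nucleus sampling over the top 95% probability mass
    "no_repeat_ngram_size": 3,   # never repeat the same 3-gram
    "repetition_penalty": 1.2,   # penalize tokens that already appeared
    "max_length": 100,           # assumption: mirrors MAX_LEN from preprocessing
}
gen_text = gpt2_finetune(seed, **gen_kwargs)[0]["generated_text"]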

User Interface

Now that we have the core models trained, we need a way to interact with them. Writing code every time we need something from the model is just not user friendly. So, we need a UI. I chose to whip up a simple one with ReactJS.

Although I can’t make the entire project available on Github, I’ll post the key parts as Github Gists.

For the purposes of the UI, I have dubbed the three models with a persona – Inspiratobot(Motivational), Aristobot(Serious), and FunnyBot.

The key features of the UI are:

  1. The ability to select between the three different personas
  2. You can start off the quote yourself, or you can use one of the many random starting seeds which will be populated. If you don’t like a seed, hit the shuffle button to get another one.
  3. In the Advanced Settings, you get to play around with the minimum and maximum length of the quote. You can also play with the variety (or temperature in sampling).
  4. The UI uses stock photos from Unsplash as backgrounds for the quotes, because that is all the rage recently. It picks from a collection of photos which suit the persona for which the quotes are being generated.
  5. It also lets you rate the quote on a scale of 1 to 5.

The folder structure of the UI Project is as follows:

+---public
|       favicon.ico
|       index.html
|       logo.png
|       logo.svg
|       manifest.json
|
+---src
|       App.js
|       customRatings.js  # The heart rating component
|       debug.log
|       index.js
|       quotes.css        # The CSS for the page
|       quotes.js         # The core page
|       theme.js
|
|   .gitignore
|   package-lock.json
|   package.json
|   README.md

The quotes.js and customRatings.js can be found in my Github Gists.

Backend API Server

For hosting and serving the models, we need a server. For this, I chose FastAPI, a no-nonsense web framework specifically designed for building APIs.

I highly recommend FastAPI because of its sheer simplicity. A very simple API example (from the docs) is below:

from typing import Optional

from fastapi import FastAPI

app = FastAPI()


@app.get("/")
def read_root():
    return {"Hello": "World"}


@app.get("/items/{item_id}")
def read_item(item_id: int, q: Optional[str] = None):
    return {"item_id": item_id, "q": q}

Less than 10 lines of code to get a web server for your API sounds too good to be true, but that is just what it is.
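
To actually serve it, you run the app under an ASGI server like uvicorn; a minimal way to do that from Python (assuming the file is named main.py) is:

import uvicorn

if __name__ == "__main__":
    # Serve the FastAPI app locally; host and port are the usual defaults.
    uvicorn.run("main:app", host="0.0.0.0", port=8000)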

Without going into details, the key part of the API is the generation of the quotes, and below is the code for that.

import re
from transformers import pipeline

def postprocess_gen_text(gen_text):
    # Capitalize sentences and fix the casing of a few special words.
    sentences = gen_text.split(".")
    for i, sent in enumerate(sentences):
        sent = ". ".join([s.strip() for s in sent.split(".")]).capitalize().strip()
        sent = re.sub(r"\bi\b", "I", sent)
        sent = re.sub(r"\bgod\b", "God", sent)
        sent = re.sub(r"\bchrist\b", "Christ", sent)
        sentences[i] = sent
    return ". ".join(sentences).strip()

def generate_quote(model_name: ModelName, seed: str, gen_kwargs: dict):
    global model
    global tokenizer
    global current_model_name
    # Swap in a different fine-tuned model if the persona changed.
    if model_name != current_model_name:
        _load_model(model_name)
    gpt2_finetune = pipeline(
        "text-generation", model=model, tokenizer=tokenizer, device=1,
    )
    gen_text = gpt2_finetune(seed, **gen_kwargs)[0]["generated_text"]
    return postprocess_gen_text(gen_text)
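
A hypothetical endpoint wiring generate_quote into the FastAPI app could look like this (the route, parameters, and default gen_kwargs here are my assumptions, not the exact production code):

@app.get("/generate")
def generate(model_name: ModelName, seed: str, max_length: int = 100):
    # Default generation settings from the Inference section above.
    gen_kwargs = {
        "do_sample": True,
        "top_k": 50,
        "top_p": 0.95,
        "no_repeat_ngram_size": 3,
        "repetition_penalty": 1.2,
        "max_length": max_length,
    }
    return {"quote": generate_quote(model_name, seed, gen_kwargs)}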

The folder structure of the project:

+---app
|       api.py
|       db_utils.py
|       enums.py
|       timer.py
|       __init__.py
|
+---data
|       funny_quotes.txt
|       motivational_quotes.txt
|       serious_quotes.txt
|
+---models
|   +---gpt2-funny
|   +---gpt2-motivational
|   +---gpt2-serious
|
|   app_config.yml
|   logger.py
|   logging_config.json
|   main.py
|   memory_profiler.log
|   pipeline.log
|   requirements.txt

Deployment

It would have been all too easy to dockerize the application, spin up an EC2 instance, and host it there. But I wanted an option which was cheap, if not free. Another challenge was that the models are about 500MB each, and loading them into memory made the RAM consumption hover around 1GB, so an EC2 instance would have been costly. In addition, I needed to store three models in the cloud, which would incur storage costs. And to top it all off, I also needed a DB to store the ratings that users submit.

With these specifications, I went hunting for products/services. To make things easier, I had split the app into two parts – a backend API server and a frontend UI which calls the backend API internally.

During my search, I stumbled upon this awesome list of free services available to developers. After evaluating a few of those options, I finally zeroed in on the following, which would keep the total cost of deployment as low as possible:

KintoHub

“KintoHub is an all-in-one deployment platform designed for full stack developers by full stack developers”, reads the home page. If cost were not a concern, I could have deployed the entire app on KintoHub, because that is what they offer – a backend server which can scale well, a frontend server to serve static HTML files, and a database. In the background, they containerize the app using a very easy-to-use web interface and deploy it on a machine of your chosen specification. And the great thing is that all of this can be done from Github. You check your code into a Github repository (public or private) and deploy the app directly by pointing KintoHub to the repo.

As far as the bare minimum settings go, they fit in a single page.

That’s it. Of course, there are more settings available, like the memory that should be available for the app to run, etc. Although KintoHub has a free tier, I soon realized that because of the RAM consumption of the models, I needed a minimum of 1.5GB to make the app run without crashing. So I moved to a pay-as-you-go tier, where they generously award you a $5 credit each month. The cost calculator shows that hosting the app is going to cost $5.50, and I’m fine with that (still waiting until the end of the month to see the actual cost).

Google Firebase

Firebase is a lot of things. Its homepage says it is the complete app development platform, and it really is. But we are only interested in Firebase Hosting, which lets you host single-page websites for free. And it was a breeze to deploy the app. All you have to do is:

  1. Install the Firebase CLI
  2. Run npm run build on your React project
  3. Run firebase init from the root folder of your project and point the public directory from public to build
  4. Run firebase deploy

A very short tutorial is all you need to get this done.

MongoDB Atlas

MongoDB Atlas is a DBaaS which provides MongoDB in the cloud, and the good thing is that they offer a free tier with 512MB of storage. For our use case, that is more than enough.

Creating a new cluster is straightforward, and once you sort out the access settings, you can use a Python wrapper like pymongo to implement the DB connection.
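
A minimal sketch of what that connection could look like (the URI placeholders, database name, collection name, and document shape are all my assumptions):

from pymongo import MongoClient

# Connect to the Atlas cluster; fill in your own credentials and cluster host.
client = MongoClient("mongodb+srv://<user>:<password>@<cluster>.mongodb.net/")
ratings = client["quotes_db"]["ratings"]  # assumed database/collection names

# Store a user rating for a generated quote.
ratings.insert_one({
    "quote": "Good quotes help make us stronger.",
    "persona": "motivational",
    "rating": 4,
})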

Result

The result of all this is a website where you can interact with the model, generate quotes, and pass them off as yours :D.

https://aiquotesgenerator.com

Although the model isn’t perfect, it still churns out some really good quotes that make you think. And that’s what you want from any quote, right? Have fun playing with the model.
