Creating all the tech projects in Start-Up Kdrama Part 3: Creating the Fortune-telling AI using Generative NLP

Joyce
10 min readMar 28, 2023

Hi there! Today we are taking a look at Natural Language Generation, as seen in the trusty AI Yeong-sil from Start-Up! Yeong-sil appears throughout the series in various forms, but in today’s post we will focus on his episode 1 appearance.

At the beginning of episode 1, Han Ji-pyeong asks an Echo-like device named Yeong-sil for his schedule and the weather. While Yeong-sil is able to recite his schedule perfectly, he struggles a bit with the weather. Instead of a weather report, Yeong-sil responds with “Here’s your fortune for today” and delivers a foreboding prediction: “Today, the god of fate will send a gentle breeze into your peaceful life. You may run into someone you met briefly in the past, at an unexpected place.” (And then he runs into Dalmi!!❤ Yeong-sil was an early shipper)

While this may not have been the expected behavior for the AI when asked about the likelihood of rain, I thought this function was wayyy more helpful than any potential weather report could have been (and also easier to create 😉 ), so let’s build an AI system that can give daily fortunes!

Intro

A bit of background if you’re new to the blog — I’m a recent CS grad with a lot of time on her hands and an unhealthy obsession with the kdrama Start-Up (#teamHanJi-pyeong #theGoodestBoy)! Since I’m currently rocking the ~unemployed summer vibe~ I thought it would be fun to try to recreate all the tech projects in the show! This is really just for fun and is written out in a way that requires no prior CS background and no hardware besides an internet connection and a Google account to follow along.

I will be posting more projects throughout the summer so follow me on twitter if you want to stay updated: @GuoLikeWhoa

What you will need:

  1. Colab account (free to obtain): https://colab.research.google.com/

Colab is a free interactive Python environment that Google provides, no credit card or payment necessary! It’s essentially a Jupyter notebook, so it just contains Python code. If you would like, you can also copy the code into a local Python script and run it that way if that’s easier for you.

2. Code from this repository: StartUp_FortuneTeller

In this github repo are two files: a jupyter notebook titled “FortuneGeneration.ipynb”, and a csv file titled “horoscope_data.csv”.

The only file you need to pull is the FortuneGeneration.ipynb, as the first section of the notebook goes through the process of pulling the horoscope data from a website. However, if you’d rather skip right to the model training, or if your notebook crashes during the training process, you can also just pull the csv, which already has all the data in it :)

The Code

To start, we will import a couple of libraries that we will need in our code:

%%capture
!pip install transformers
!pip install datasets
!pip install --upgrade sacrebleu sentencepiece
import torch
import pandas as pd  # needed later when we build the horoscope dataframe
from torch.utils.data import Dataset, random_split
from transformers import AutoTokenizer, TrainingArguments, Trainer, AutoModelForCausalLM

Note: we are going to use PyTorch instead of TensorFlow for this tutorial. This is because we want to take advantage of the pre-trained GPT-2 model released by OpenAI, which we pull through the HuggingFace API. In general, there are many more pre-trained models available for PyTorch than for TensorFlow, so we are gonna make the switch to PyTorch for this tutorial.

(PS: in case you are wondering, it looks like SamSan uses another deep learning library called Theano)

Now, we need to make sure our Colab notebook is mounted onto a GPU instance. GPU stands for graphics processing unit; it’s a processor built for the kind of highly parallel computations needed by deep learning networks. Luckily Google provides free GPU resources for us! They can be accessed by running the following: Runtime > Change runtime type > Hardware accelerator: GPU (make sure it says GPU here; if it doesn’t, change it) > Save
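
If you want to double-check that the notebook actually sees a GPU, here is a quick sanity check you can run in a cell (this just uses PyTorch’s built-in CUDA queries):

import torch

# Should print True, plus the name of the GPU Colab assigned you (often a Tesla T4)
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU found")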

Now, the first part of the project will be scraping some horoscope data from a website (based on this geeksforgeeks tutorial). I ended up choosing horoscope data as the training data for our fortunes because horoscopes are plentiful online and because I like that they are slightly individualized, even though it is not clear that Yeong-sil was giving fortunes based on the western zodiac.

To accomplish this, we will be using a web scraping library called BeautifulSoup. Below we define the function that will perform the scrape, given the horoscope sign and the date. (Note the horoscope signs are given integer mappings, 1–12 for each of the signs. So instead of aries, the url needs to be given 1).

Also, if you would like to skip this data collection section, you can just download the horoscope_data.csv file from the git repo and read it into a data frame.
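
If you go the CSV route, reading it back in is just a one-liner (a minimal sketch, assuming you’ve uploaded the file into the same folder as your notebook):

import pandas as pd

# Load the pre-scraped horoscope data instead of scraping it yourself
horoscope_df = pd.read_csv("horoscope_data.csv")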

import requests
from bs4 import BeautifulSoup

def horoscope(zodiac_sign: int, day: str) -> str:
    url = (
        "https://www.horoscope.com/us/horoscopes/general/"
        f"horoscope-archive.aspx?sign={zodiac_sign}&laDate={day}"
    )
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    return soup.find("div", class_="main-horoscope").p.text
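
As a quick test, you can call the function directly (this example assumes the site expects the date as a YYYYMMDD string, the same format we use later in the post, and 1 maps to aries):

# Example call: aries (sign 1) on March 27, 2023
print(horoscope(1, "20230327"))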

Now we build a library going back 1000 days from today. (You can also go back fewer days if you run out of memory in your notebook! I found that 1000 works pretty well, but it did result in a couple of crashes and restarts.) In the code below, we just grab those 1000 days of horoscope data, saving the sign, the date, and the fortune into a pandas dataframe object.
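
The loop below assumes you already have clean_datelist, a list of date strings covering the last 1000 days. One possible way to build it (a sketch, assuming the same YYYYMMDD string format the horoscope function above expects) is:

import pandas as pd

# Build a list of the last 1000 days as YYYYMMDD strings
clean_datelist = [d.strftime("%Y%m%d") for d in pd.date_range(end=pd.Timestamp.today(), periods=1000)]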

# Map the integer sign ids used in the URL back to their names
reverse_sign_map = {1: 'aries', 2: 'taurus', 3: 'gemini',
                    4: 'cancer', 5: 'leo', 6: 'virgo', 7: 'libra',
                    8: 'scorpio', 9: 'sagittarius', 10: 'capricorn',
                    11: 'aquarius', 12: 'pisces'}

horoscope_df = pd.DataFrame(columns=['Date', 'Sign', 'Fortune'])
for date in clean_datelist:
    for i in range(1, 13):
        # The scraped text looks like "<date> - <fortune>", so keep only the fortune part
        horoscope_text = horoscope(i, date).split(" - ")[1]
        tmp_df = pd.DataFrame({'Date': date, 'Sign': reverse_sign_map[i], 'Fortune': horoscope_text}, index=[0])
        horoscope_df = pd.concat([horoscope_df, tmp_df], ignore_index=True)

Now let’s get the average length of each fortune we just grabbed, because we’ll need it to pick a sensible encoding length when we translate the words into tensors.

horoscope_df["Fortune"].apply(len).mean()

The model can only take inputs in tensor form, which means we have to encode the words using some sort of tokenizer — the fortunes are around ~300 characters each, which means a max length of 256 tokens for the encodings should be plenty! (If our encodings are too large we waste space, and if they are too small we could lose data.) The general mapping of the GPT-2 tokenizer (which we will be using) is ~4 characters of text for every one token.

You can also customize your data however you want! Below I did a little superficial edit to change all the fortunes to use “smart” instead of “intellectual.” But you can edit to your heart’s desire, you can even concat on some custom fortunes of your own — the only reason we scrape from a public website is just because it would take way too long to come up with that amount of data by hand.

horoscope_df["Fortune"] = horoscope_df["Fortune"].str.replace("intellectual", "smart")
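
And if you want to sneak in some hand-written fortunes like I mentioned above, one way to do it (a sketch using a made-up example row) is:

# Append a custom, hand-written fortune row to the training data
custom_fortune = pd.DataFrame({'Date': '20230327', 'Sign': 'virgo',
                               'Fortune': 'Today, the god of fate will send a gentle breeze into your peaceful life.'}, index=[0])
horoscope_df = pd.concat([horoscope_df, custom_fortune], ignore_index=True)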

Now we are going to download our pre-trained GPT-2 model.

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # load the tokenizer that was pre-trained alongside gpt2
model = AutoModelForCausalLM.from_pretrained("gpt2")
model = model.cuda()  # move the model onto the GPU

GPT-2 was published by OpenAI and is a transformer-based language model trained on a huge dataset. It’s famous for being the (free) predecessor of its more famous offspring, GPT-3 and GPT-4. It comes in four flavors from HuggingFace — small, medium, large, and extra large. We are just going to use small for time and memory availability reasons. This blog post does a great job explaining the details of the model.

In the above code, we load the model (which is considered a causal language model in the HuggingFace API because it predicts the next word or token given a sequence of previous words or tokens) and move it to our GPU. We also load the GPT-2 tokenizer. This tokenizer has been pre-trained on data, so it already comes with a baseline representation of the English language that our model can use to extract useful features from the prompt+fortune pairs.

We will also need to define a pad token for the tokenizer. This is a token that will be used to pad a tensor if your fortune is shorter than the max length of 256. This ensures the model receives inputs of the same size.

tokenizer.pad_token = tokenizer.eos_token
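
To see what the tokenizer actually does (and to sanity-check the roughly 4-characters-per-token estimate from earlier), you can encode one fortune and compare lengths. This is just an illustrative check, not part of the pipeline:

sample = horoscope_df["Fortune"][0]
encoded = tokenizer(sample, max_length=256, padding="max_length", truncation=True)

# Characters vs. tokens before padding, and the fixed padded length the model will see
print(len(sample))                    # number of characters in the fortune
print(len(tokenizer.encode(sample)))  # number of tokens (roughly characters / 4)
print(len(encoded['input_ids']))      # always 256 after padding/truncation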

Now we want to define a dataset class that can take our fortune data from the pandas dataframe and preprocess it so it’s ready for the model. This dataset object does just that, ensuring that the text has been translated into encodings, and ensuring that when the model sees the prompt “Yeong-sil, my sign is <insert sign> and today is <insert date>. What is the weather today?” it knows to return a fortune. We do this by simply appending a fortune after that prompt, assuming that if the model sees this pairing enough times, it will learn to generate fortunes after seeing a prompt that follows that structure.

class FortuneDataset(Dataset):
    def __init__(self, examples, tokenizer):
        self.input_ids = []
        self.attn_masks = []
        self.labels = []
        for data in examples.itertuples():
            date = str(data.Date).split(' ')[0]
            sign = data.Sign
            fortune = data.Fortune
            # Build the prompt + fortune training text the model will learn from
            prompt = "Prompt: Yeong-sil, my sign is " + sign + " and today is " + date + ". What is the weather today? \nHere is your fortune: "
            training_text = prompt + fortune + "<|endoftext|>"

            # Encode to a fixed length of 256 tokens, padding or truncating as needed
            encodings_dict = tokenizer(training_text, max_length=256, padding="max_length", truncation=True)
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

            # Set the prompt positions of the labels to -100 so the loss is only
            # computed on the fortune tokens, not on the prompt itself
            prompt_len = len(tokenizer.encode(prompt))
            masked_labels = [-100] * prompt_len + encodings_dict['input_ids'][prompt_len:]
            self.labels.append(torch.tensor(masked_labels))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {'input_ids': self.input_ids[idx], 'attention_mask': self.attn_masks[idx], 'labels': self.labels[idx]}

Now we split the data into a training set and a validation set. In general terms, the training set will be used by the model to learn how to generate fortunes, and the validation set (a set the model hasn’t seen in its learning process) will be used to check how well it is doing. That’s why our training set will be much larger than the validation set.

fortune_dataset = FortuneDataset(horoscope_df, tokenizer)
train_size = int(0.9 * len(fortune_dataset))
train_dataset, val_dataset = random_split(fortune_dataset, [train_size, len(fortune_dataset) - train_size])
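
If you want to sanity-check the preprocessing before training, you can peek at a single item from the dataset; each tensor should have exactly 256 entries, and the first entries of labels should be -100 (the masked prompt):

item = fortune_dataset[0]

# All three fields are length-256 tensors
print(item['input_ids'].shape, item['attention_mask'].shape, item['labels'].shape)
print(item['labels'][:10])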

And finally, we are ready to train our model. I chose to train it for 2 epochs (which means the model will pass through the training data twice) simply because it seemed to yield reasonable results and did not take too long, but if you train it for more epochs you may see better results! For an explanation on why more epochs can give you better results, check out this quora post which I feel explains really well in simple terms how gradient descent is helped by multiple epochs.

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()
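
Since we passed in a validation set, you can also ask the Trainer to report the validation loss once training finishes (lower is better); this is optional but a nice way to see whether extra epochs are still helping:

# Evaluate on the held-out validation set; returns a dict including 'eval_loss'
print(trainer.evaluate())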

And finally, for the fun part: now we generate the fortunes! Below you can see that you input the sign and the date, and the model generates a fortune. Since we pass do_sample=True, the model samples each next word from its predicted probability distribution instead of using the default greedy decoding (where it would always pick the single highest-probability next word), which gives us more varied fortunes. (There are also other methods of decoding detailed here that you can play around with, which can give you better outputs but take up more memory.)

sign = "virgo"
date = 20230327
### edit the above to get different fortunes ###

prompt = "Prompt: Yeong-sil, my sign is " + sign + " and today is " + str(date) + ". What is the weather today? \nHere is your fortune: "
prompt_encoded = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0).cuda()
model.eval()
sample_outputs = model.generate(prompt_encoded,
                                do_sample=True,
                                max_length=300,
                                num_return_sequences=1)
decoded_prediction = tokenizer.decode(sample_outputs[0], skip_special_tokens=True)
print(decoded_prediction)
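
If you want to experiment with the other decoding strategies mentioned above, one common variant is top-k / top-p (nucleus) sampling, which restricts sampling to only the most likely words. A sketch of what that call might look like:

# Restrict sampling to the 50 most likely tokens (top_k) and to a 0.95
# cumulative-probability nucleus (top_p) before drawing the next word
sample_outputs = model.generate(prompt_encoded,
                                do_sample=True,
                                max_length=300,
                                top_k=50,
                                top_p=0.95,
                                num_return_sequences=1)
print(tokenizer.decode(sample_outputs[0], skip_special_tokens=True))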

When I ran this, I got this output:

Prompt: Yeong-sil, my sign is virgo and today is 20230327. What is the weather today?

Here is your fortune: Virgo is one of the most powerful forces on the planet, and you could be tempted to get into a big fight. Put the weapons away and bring out the olive branch. Take that energy that has built up and use it to fuel your romantic affairs instead of warlike ventures. Defuse the situation by sharing passionate nights with the one you love.

What’s also cool is that because this is a transformer-based language model, you don’t need perfectly formed inputs to trigger a good output. For example, if I put in a nonsense sign like horsey:

Prompt: Yeong-sil, my sign is horsey and today is 20230327. What is the weather today?

Here is your fortune: ~~~ Tonight is the best time to be a horsey. Your romantic life is one area where you might do better taking the opposite approach. Have confidence and be spontaneous in all matters having to do with love. The key now is to make sure that you aren’t giving yourself away to someone who’s unworthy of your love. Match yourself with a person who appreciates you for the amazing person you re.

But notice it still made mistakes! It added three ‘~’ symbols at the beginning of the fortune and also typo’d “you are” into “you re.” If you want the accuracy to go up so it makes fewer mistakes like these, a few easy methods of improvement are increasing the training dataset size, increasing the number of epochs used in training, or choosing a more accurate decoding method.

And that’s it! Now you have your own Yeong-sil to give you advice on your love life when you ask for the weather:) Have fun!

