Cannect

Creating Real Conversations with Fictional Characters

How I introduced my friend to her favorite movie character by chaining together AI models with LangChain

Introduction

My friend Çağla is a die-hard fan of the movie Legally Blonde and an avid admirer of its heroine, Elle Woods. Elle's indomitable spirit, loyalty to her friends and strongly held moral values are among the reasons Çağla cherishes her so much. So, I thought I would use my programming skills and the capabilities of today's artificial intelligence models to facilitate a meeting between them. In this blog post, I will walk you through the process of creating such a birthday gift. I'll share the challenges you may face, the AI models I employed, the points that still need some work and how you can make this gift your own.

You can find all of the code referenced in this post at the GitHub repository of this project.

System Overview

Below is a diagram of the system as a whole. Each turn in the conversation between the user and the virtual character starts with the user's vocal input. The vocal input is converted to text and processed by a conversation chain powered by an LLM. The LLM generates a text response to what was said, in line with the assigned character and the given context, which is then used to synthesize an audio response that is played back to the user.
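
As a preview of how the pieces fit together, here is a minimal sketch of one such turn. The function names capture_audio, transcribe, respond_as_character and speak are placeholders for the components we build step by step below:

def converse_one_turn():
    audio = capture_audio()                           # record the user until they stop speaking
    input_text = transcribe(audio)                    # speech-to-text via the Whisper API
    response_text = respond_as_character(input_text)  # LangChain conversation chain driving the LLM
    speak(response_text)                              # text-to-speech via ElevenLabs, streamed to the user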

API Keys & Cost

At three points during the computer's turn in the conversation, I chose to outsource the computation. The speech-to-text conversion and the LLM workload of the conversation chain are handled by OpenAI API calls (Whisper and gpt-3.5-turbo respectively), and text-to-speech generation is done using the ElevenLabs API. If you have the hardware to run open-source alternatives to these models, feel free to swap out parts of the pipeline as necessary.

Python Environment

Let's start with setting up our Python environment and installing our dependencies. Create a new Python virtual environment with your preferred method and install the dependencies using the requirements.txt file that can be found in the GitHub repository of this project. I recommend using venv if you're on Linux or macOS and Conda if you're on Windows.

Setting API Keys with Environment Variables

In order to authenticate ourselves during the API calls, we'll need to provide our program with our keys. The LangChain OpenAI module expects to find the API key in its dedicated environment variable, while the SpeechRecognition Whisper API module and the ElevenLabs API module expect their respective keys as function arguments. To suit all needs and prevent possible leakage of the keys, I recommend setting environment variables for both keys using your preferred method and loading them into variables within the program.

import os
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") # read from the OPENAI_API_KEY environment variable
ELEVENLABS_API_KEY = os.getenv("ELEVENLABS_API_KEY") # read from the ELEVENLABS_API_KEY environment variable

Voice Input

We'll use the SpeechRecognition library for capturing the user's voice and converting it to text. Each turn, the program starts recording when the user begins speaking and stops when they fall silent. Once the recording is done, the audio is sent to Whisper to be converted into text.

We'll start by initializing our recognizer object which we will use to both capture the audio and make the API call to Whisper.

import speech_recognition as sr
# obtain audio from the microphone
r = sr.Recognizer()

Our Recognizer object uses its energy_threshold to distinguish speech from normal background noise so that it knows when to start recording. Let's configure it automatically for our environment. The code below should work well for most setups; however, if you have issues with the energy threshold after automatic calibration, check out the manual configuration section in the appendix.

with sr.Microphone() as source: # use the default microphone as the audio source
    r.adjust_for_ambient_noise(source, duration=5) # listen for 5 seconds to calibrate the energy threshold for ambient noise levels

Once the energy threshold is set to the appropriate value, the Recognizer can start recording when it detects speech. Let's listen for some input and store the recording in a variable.

with sr.Microphone() as source:
    audio = r.listen(source)
    print("Captured next line...")

Once the audio is captured, we can convert it to text using the Whisper API.

input_text = r.recognize_whisper_api(audio, api_key=OPENAI_API_KEY)

The Heart, Soul and Brain of the Character

Now that we have the user's input, we will process it and generate a response as our character within the given context, while also paying attention to the history of the conversation. We'll achieve this using the ConversationChain class from the LangChain Python library. ConversationChain abstracts away a lot of details, letting us focus on building the character and the story without worrying about lower-level concerns such as prompting the LLM iteratively or manually handling the conversation history.

The LLM

We use the ChatOpenAI module offered by LangChain to initialize our LLM. We need to take care of two things when setting up our OpenAI LLM: which model to use and the temperature of the model. For this application, among the currently available models, gpt-3.5-turbo is the tool for the job: it has the conversational capabilities we're looking for and it is very reasonably priced. As of the writing of this post, gpt-3.5-turbo is the default model ChatOpenAI uses. As for the temperature, setting it to 0.7 yielded good results for my use case; feel free to try out different values yourself.

from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(temperature=0.7)

Creating the Character Within the Context of the Prompt

We'll utilize ChatGPT for creating the character and equipping it with context awareness. We'll weave the details about the character and the initial context together in a single prompt. It is best to keep this prompt as short as possible but no shorter, since we will be adding onto it as the conversation progresses. Crafting the prompt is as much an artistic process as an analytical one, so there is no single best way of doing it. Here is the prompt I wrote for this occasion. Since the response audio will be generated using an English-speaking model, I swap out my friend's name Çağla for Chaala, which, when spoken by the model, closely resembles how the name is actually pronounced.

The following is a friendly phone call between the character Elle Woods from the movie Legally Blonde and her best friend Chaala.
Elle is compassionate, caring, supportive, talkative and empathetic.
Chaala is on a journey to find herself, Elle will support her and encourage Chaala to believe in herself.
Today is Chaala's birthday.
Elle wants to learn more about Chaala's life and catch-up.
Elle pays attention to details from the context of the conversation and accurately represents her character from Legally Blonde.

Note: Luckily for me, ChatGPT knows about the movie Legally Blonde and is familiar with Elle Woods, so I can reference her directly. In the rare case that ChatGPT does not know about your chosen character, you will need to supply characteristic details and background information and build them from scratch.

Custom Prompt Template for the ConversationChain

With the core of the prompt ready, we will create a custom prompt template with which our ConversationChain object will be able to drive the conversation. We want our prompt template to accept two input variables: history, to keep track of the conversation so far, and input, to inject the user's input into the prompt. We place the input variables in curly braces inside the template, and set the template up to end on Elle: so that the LLM generates the next line as Elle.

from langchain import PromptTemplate
conversation_prompt_template = PromptTemplate(
    input_variables=['history', 'input'],
    output_parser=None,
    partial_variables={},
    template="""\
    The following is a friendly phone call between the character Elle Woods from the movie Legally Blonde and her best friend Chaala.\
    Elle is compassionate, caring, supportive, talkative and empathetic.\
    Chaala is on a journey to find herself, Elle will support her and encourage Chaala to believe in herself.\
    Today is Chaala's birthday.\
    Elle wants to learn more about Chaala's life and catch-up.\
    Elle pays attention to details from the context of the conversation and accurately represents her character from Legally Blonde.\
    

    
    Current conversation:
    {history}
    Chaala: {input}
    Elle:""",
    template_format='f-string',
    validate_template=True)

Conversation Memory

To generate the response and drive a meaningful conversation, the LLM must have an idea of what has been talked about up to the current line. We can convey this information to our LLM using one of the many memory modules offered by LangChain. Our most notable options are ConversationBufferMemory, which stores the entire conversation verbatim; ConversationBufferWindowMemory, which keeps only the last few exchanges; ConversationSummaryMemory, which maintains a running summary of the conversation; and ConversationSummaryBufferMemory, which keeps the most recent exchanges verbatim and summarizes the older ones.

We will be using the ConversationSummaryBufferMemory module, which in my opinion strikes a good balance regarding information retention without getting too complicated. To make use of the summary feature offered by this module, we need a slight modification: the default progressive-summarization prompt refers to the user as the human and the character as the AI. Let's create a custom PromptTemplate that refers to the participants appropriately, the user as Chaala and the character as Elle.

summarizer_prompt_template = PromptTemplate(
    input_variables=['summary', 'new_lines'],
    output_parser=None,
    partial_variables={},
    template='Progressively summarize the lines of conversation provided, adding onto the previous summary returning a new summary.\n\nEXAMPLE\nCurrent summary:\nChaala asks what Elle thinks of artificial intelligence. Elle thinks artificial intelligence is a force for good.\n\nNew lines of conversation:\nChaala: Why do you think artificial intelligence is a force for good?\nElle: Because artificial intelligence will help humans reach their full potential.\n\nNew summary:\nChaala asks what Elle thinks of artificial intelligence. Elle thinks artificial intelligence is a force for good because it will help humans reach their full potential.\nEND OF EXAMPLE\n\nCurrent summary:\n{summary}\n\nNew lines of conversation:\n{new_lines}\n\nNew summary:',
    template_format='f-string',
    validate_template=True)

Let's set the maximum token limit above which summarization will occur to be 350 tokens and initialize our memory object.

from langchain.memory import ConversationSummaryBufferMemory
memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=350,
    prompt=summarizer_prompt_template,
    ai_prefix="Elle",
    human_prefix="Chaala")

Initializing the ConversationChain

We can now use the parts we have built to create our ConversationChain

from langchain.chains import ConversationChain
conversation = ConversationChain(
    llm=llm,
    prompt=conversation_prompt_template,
    memory=memory
)

and generate responses with

response_text = conversation.predict(input=input_text)

Giving Your Character a Voice

Voice Cloning

You can check out ElevenLabs for a speech synthesis interface which you can use to capture the voice of the character you have chosen. You are going to need some high quality clips of your character speaking for the best results.
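
If you prefer to do this step from code instead of the web interface, here is a minimal sketch; it assumes your version of the elevenlabs Python SDK exposes a clone() helper (argument names may differ between SDK versions), and the clip paths are placeholders:

from elevenlabs import clone, set_api_key

set_api_key(ELEVENLABS_API_KEY)

# clone a voice from a few high quality clips of the character speaking
elle_voice = clone(
    name="Elle Woods",
    description="Voice of Elle Woods from Legally Blonde",
    files=["clips/elle_1.mp3", "clips/elle_2.mp3"]
)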

Speech Generation

Once we have the voice designed to our liking, we can make the API call to have it read our response for us. To start, we need identifiers for the voice we have created and the model we wish to use for generation. I was given access to the Eleven English v2 model by ElevenLabs upon request; although this model is in beta as of the writing of this post, it produced better results for me. Make sure to refer to your voice by the name you gave it on the ElevenLabs platform.

from elevenlabs import set_api_key
from elevenlabs.api import Models
from elevenlabs.api import Voices
from elevenlabs import generate, stream
set_api_key(ELEVENLABS_API_KEY)

models = Models.from_api()
elle_model = [model for model in models if model.name == "Eleven English v2"][0]

voices = Voices.from_api()
elle_voice = [voice for voice in voices if voice.name == "Elle Woods"][0]

For generating speech, we utilize the streaming option to reduce latency and stay under the character limit.

audio_stream = generate(
    text=response_text,
    voice=elle_voice,
    model=elle_model,
    stream=True
)

stream(audio_stream)

The Conversation Loop

We are now ready to put everything together. Since our conversation will consist of the user and the character taking turns to speak, we'll wrap the whole thing in a loop and have it run until interruption. We can also wrap the loop with a try-except block to pickle the conversation memory before exiting the execution.

import pickle

try:
    while True:
        # capture spoken user input
        with sr.Microphone() as source:
            audio = r.listen(source)
            print("Captured next line...")

        # convert the user input into text
        input_text = r.recognize_whisper_api(audio, api_key=OPENAI_API_KEY)
        
        # generate text response to the user input
        response_text = conversation.predict(input=input_text)

        # generate and stream the character's voice
        audio_stream = generate(
            text=response_text,
            voice=elle_voice,
            model=elle_model,
            stream=True
        )
        stream(audio_stream)
        
except KeyboardInterrupt:
    # save the conversation memory to disk
    with open("conversation_memory", "wb") as f:
        pickle.dump(conversation.memory, f)
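
If you ever want to pick the call back up, the pickled memory can be loaded from disk and handed to a fresh chain. Here is a minimal sketch, assuming the llm and conversation_prompt_template objects from earlier are still in scope:

import pickle
from langchain.chains import ConversationChain

# restore the saved memory and rebuild the chain around it
with open("conversation_memory", "rb") as f:
    restored_memory = pickle.load(f)

conversation = ConversationChain(
    llm=llm,
    prompt=conversation_prompt_template,
    memory=restored_memory
)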

With all our components working together, this chain of artificial intelligence models will imitate a conversation with the chosen character.

Stress Points

As you will see upon testing it out yourself, for the most part this system works pretty well: SpeechRecognition knows where to start and stop the recording, Whisper is quite successful in discerning what is said, ChatGPT generates theme-appropriate responses, and ElevenLabs English v2 does a good job of generating convincing speech. However, there are points in the system that require special attention, and one caveat I haven't been able to mitigate.

Finishing Touches

The basic user experience is ready, but we can take it even further by preparing a prelude and an epilogue for our phone call.

Prelude

Since this is likely to be a surprise birthday gift, let's include an introduction to the conversation. For my gift I had one of the default voices of ElevenLabs read the following text.

Initiating contact with Elle Woods. This is an inter-dimensional phone call. Voice delays and awkward pauses are expected. Connecting now...

I followed this introduction with a phone-line sound effect that eventually gets picked up. Once the phone is picked up, a pre-generated audio clip of the character saying "Hello Chaala!" is played. I chose to play the files using mpv.

# introduction
os.system("mpv initiating-2.mp3")
# phone line
os.system("mpv --volume=65 ring.opus")
# character greeting
os.system("mpv elle-greeting.mp3")

Epilogue

Finally, let's have the program play a disconnection sound effect when the conversation ends. In the conversation loop, modify the except block like so:

except KeyboardInterrupt:
    # play disconnection effect
    os.system("mpv end.opus")
    # save the conversation memory to disk
    with open("conversation_memory", "wb") as f:
        pickle.dump(conversation.memory, f)

Moderating the Conversation

You are now ready to give the gift of inter-dimensional conversation. Let me tell you how I moderated the call. To preserve the mystery around what was about to happen, I had Çağla sit across from me so that she couldn't see the screen. I gave her the headphones and went through the Manual Configuration of the SpeechRecognition Energy Threshold. Once I was happy with the configuration, I executed the prelude together with the conversation loop so that recording started as soon as Elle greeted Çağla. When the conversation was over, I interrupted the execution of the program to start the epilogue and save the conversation memory to disk. I also recommend setting up a camera to record a video of the conversation; the reaction I got made the effort of putting this together well worth it.

Appendix

Manual Configuration of the SpeechRecognition Energy Threshold

Unfortunately, the automatic calibration of SpeechRecognition's energy_threshold isn't going to cut it in all cases. Depending on your microphone configuration and environment noise levels, you may need to step in and determine this value manually. You can do this by disabling the dynamic energy threshold property and experimenting with different values to get a feel for the appropriate range. You can then perform a sound check before initiating the conversation to fine-tune the threshold.

r.dynamic_energy_threshold = False
r.energy_threshold = 1300

with sr.Microphone() as source:
    audio = r.listen(source)
    print("Captured next line...")

Customizing the Pause Threshold of the Recognizer

If you find the Recognizer stops listening while the user is still talking because of a long pause in their speech, you can tell it how long to wait by modifying the pause_threshold.

r.pause_threshold = 2

ALSA Microphone Errors (Linux Users)

On some Linux systems running ALSA, the SpeechRecognition microphone interface can produce warnings. In my experience these warnings can be ignored and the system will keep working just fine. However, they can clutter up the output and make it difficult to follow any text feedback you may have added to monitor the status of the conversation. If you will be running this program in Jupyter Notebook/Lab, you can call clear_output(wait=False) to clear the output right after it is generated.

from IPython.display import clear_output

with sr.Microphone() as source:
    clear_output(wait=False)  # clear the accumulated ALSA warnings from the output
    audio = r.listen(source)
    print("Captured next line...")