Creating Real Conversations with Fictional Characters
How I introduced my friend to her favorite movie character by chaining together AI models with LangChain

Introduction
My friend Çağla is a die-hard fan of the movie Legally Blonde and an avid admirer of its heroine, Elle Woods. Elle's indomitable spirit, loyalty to her friends and firmly held moral values are among the reasons Çağla cherishes her so much. So I thought I would use my programming skills and the capabilities of today's artificial intelligence models to facilitate a meeting between them. In this blog post, I will walk you through the process of creating such a birthday gift. I'll share the challenges you may face, the AI models I employed, the points that still need some work and how you can make this gift your own.
You can find all of the code referenced in this post at the GitHub repository of this project.
System Overview
Below is a diagram of the system as a whole. Each turn in the conversation between the user and the virtual character starts with the user's vocal input. The vocal input is turned into text and processed by a conversation chain powered by an LLM. The LLM then generates a text response to what was said, according to the assigned character and given context, which is then used to generate an audio response to be played back to the user.

API Keys & Cost
At three points during the computer's turn in the conversation I chose to outsource the computation. The speech-to-text conversion and the LLM workload of the conversation chain are handled by OpenAI API calls (Whisper and gpt-3.5-turbo, respectively). Text-to-speech generation is done using the ElevenLabs API. If you have the hardware to run open-source alternatives to these models, feel free to swap out parts of the pipeline as necessary.
- OpenAI API pricing: billed according to usage as of the writing of this post; quite reasonably priced
- ElevenLabs API pricing: subscription based; I personally recommend the Starter plan, whose first month costs only $1 as of the writing of this post
Python Environment
Let's start by setting up our Python environment and installing our dependencies. Create a new Python virtual environment with your preferred method and install the dependencies using the requirements.txt file that can be found in the GitHub repository of this project. I recommend using venv if you're on Linux or macOS and Conda if you're on Windows.
Setting API Keys with Environment Variables
In order to authenticate ourselves during the API calls, we'll need to provide our program with our keys. The LangChain OpenAI API module expects to find the API key in its dedicated environment variable, while the SpeechRecognition Whisper API module and the ElevenLabs API module expect their respective keys as function arguments. To suit all needs and prevent possible leakage of the keys, I recommend setting the environment variables for both keys using your preferred method. You can then load the keys into variables within the program.
import os
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") # Your OpenAI API Key goes here
ELEVENLABS_API_KEY = os.getenv("ELEVENLABS_API_KEY") # Your Eleven Labs API Key goes here
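As a small, optional safeguard, you can fail fast if either key is missing from the environment instead of hitting a confusing error later inside an API call:
# stop early with a clear message if either key is missing
if not OPENAI_API_KEY or not ELEVENLABS_API_KEY:
    raise RuntimeError("Set the OPENAI_API_KEY and ELEVENLABS_API_KEY environment variables before running.")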
Voice Input
We'll use the SpeechRecognition library for capturing the user's voice and converting it to text. Every turn the program will start recording audio when the user starts speaking and record until they stop speaking. Once the recording is done the audio file will be sent to Whisper to be converted into text.
We'll start by initializing our recognizer object which we will use to both capture the audio and make the API call to Whisper.
import speech_recognition as sr
# obtain audio from the microphone
r = sr.Recognizer()
Our Recognizer object uses its energy_threshold to detect speech over normal background noise so that it can start recording. Let's configure it automatically for our environment. The code below should work well for most setups; however, if you have issues with the energy threshold after automatic calibration, check out the manual configuration section in the appendix.
with sr.Microphone() as source: # use the default microphone as the audio source
    r.adjust_for_ambient_noise(source, duration=5) # listen for 5 seconds to calibrate the energy threshold for ambient noise levels
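You can print the value the calibration settled on to get a feel for your environment's baseline; this also gives you a reference point if you later decide to tune the threshold by hand.
# inspect the calibrated threshold
print(f"Calibrated energy threshold: {r.energy_threshold}")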
Once the energy threshold is set to the appropriate value, the Recognizer can start recording when it detects speech. Let's listen for some input and store the recording in a variable.
with sr.Microphone() as source:
    audio = r.listen(source)
    print("Captured next line...")
Once the audio is captured, we can convert it to text using the Whisper API.
input_text = r.recognize_whisper_api(audio, api_key=OPENAI_API_KEY)
The Heart, Soul and Brain of the Character
Now that we have the user's input, we will process it and generate a response as our character within the given context, while also paying attention to the history of the conversation. We'll achieve this using the ConversationChain class from the LangChain Python library. ConversationChain will abstract away a lot of details for us, so we can focus on building the character and the story without worrying about the lower-level details of prompting the LLM iteratively or manually handling the history of the conversation.
The LLM
We use the ChatOpenAI module offered by LangChain to initialize our LLM. We need to take care of two things when setting up our OpenAI LLM: which model to use and the temperature of the model. For this application, among the currently available models, gpt-3.5-turbo is the tool for the job. It has the dialogue capabilities we're looking for and it is very reasonably priced. As of the writing of this post, gpt-3.5-turbo is the default model ChatOpenAI uses. As for the temperature, setting it to 0.7 yielded good results for my use case; feel free to try out different values yourself.
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(temperature=0.7)
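If you prefer not to rely on the default and want to pin the model explicitly, you can pass it by name; this is equivalent to the call above as long as gpt-3.5-turbo remains the default.
# pin the model explicitly instead of relying on the library default
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7)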
Creating the Character Within the Context of the Prompt
We'll utilize ChatGPT for creating the character and equipping it with context awareness. We'll weave the details about the character and the initial context together in a single prompt. It is best to keep this prompt as short as possible but not shorter, since we will be adding onto it as the conversation progresses. Crafting the prompt is as much an artistic process as it is an analytical one, so there is no single best way of doing it. Here is the prompt I wrote for this occasion. Since the response audio will be generated using an English-speaking model, I swap out my friend's name Çağla with Chaala, which, when spoken by the model, closely resembles how it is actually pronounced.
The following is a friendly phone call between the character Elle Woods from the movie Legally Blonde and her best friend Chaala.
Elle is compassionate, caring, supportive, talkative and empathetic.
Chaala is on a journey to find herself, Elle will support her and encourage Chaala to believe in herself.
Today is Chaala's birthday.
Elle wants to learn more about Chaala's life and catch-up.
Elle pays attention to details from the context of the conversation and accurately represents her character from Legally Blonde.
Note: Luckily for me, ChatGPT knows about the movie Legally Blonde and is familiar with Elle Woods, so I can reference her directly. However, in the rare case that ChatGPT does not know about your chosen character, you will need to give it characteristic details and background information and build the character from scratch.
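For illustration only, a background block for a character ChatGPT would not know might look something like the following; the character and every detail here are made up:
The following is a friendly phone call between Captain Mira, the heroine of an obscure indie film, and her best friend Chaala.
Captain Mira is a retired cargo pilot: optimistic, blunt and fiercely loyal to her friends.
She grew up in a small coastal town, loves terrible puns and signs off every call with "clear skies".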
Custom Prompt Template for the ConversationChain
With the core of the prompt ready, we will create a custom prompt template with which our ConversationChain object will be able to drive the conversation. We want our prompt template to accept two input variables: history to keep track of the conversation so far and input to inject the user's input into the prompt. We position the input variables in curly braces inside our prompt template, and set the template up to end on Elle: to have our LLM generate the next line as Elle.
from langchain import PromptTemplate
conversation_prompt_template = PromptTemplate(
input_variables=['history', 'input'],
output_parser=None,
partial_variables={},
template="""\
The following is a friendly phone call between the character Elle Woods from the movie Legally Blonde and her best friend Chaala.\
Elle is compassionate, caring, supportive, talkative and empathetic.\
Chaala is on a journey to find herself, Elle will support her and encourage Chaala to believe in herself.\
Today is Chaala's birthday.\
Elle wants to learn more about Chaala's life and catch-up.\
Elle pays attention to details from the context of the conversation and accurately represents her character from Legally Blonde.\
Current conversation:
{history}
Chaala: {input}
Elle:""",
template_format='f-string',
validate_template=True)
Conversation Memory
To generate the response and drive a meaningful conversation, the LLM must have an idea of what has been talked about up to the current line. We can convey this information to our LLM using one of the many memory modules offered by LangChain. Our most notable options are listed below, with a brief initialization sketch for the first two after the list:
- ConversationBufferWindowMemory: Retains the last k turns of the conversation verbatim, where k is the window size
- ConversationSummaryMemory: At each turn, performs a call to the LLM using a custom prompt to keep a running summary of the conversation history
- ConversationSummaryBufferMemory: Keeps the most recent turns of the conversation verbatim and progressively summarizes older lines that fall above a token limit
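For reference, here is roughly how the first two options would be initialized; this is a minimal sketch and the parameter values are placeholders, not recommendations.
from langchain.memory import ConversationBufferWindowMemory, ConversationSummaryMemory
# keep the last 5 turns verbatim
window_memory = ConversationBufferWindowMemory(k=5, ai_prefix="Elle", human_prefix="Chaala")
# keep a running LLM-generated summary of the whole conversation
summary_memory = ConversationSummaryMemory(llm=llm, ai_prefix="Elle", human_prefix="Chaala")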
We will be using the ConversationSummaryBufferMemory module, which in my opinion strikes a good balance regarding information retention without getting too complicated. To utilize the summary feature offered by this module we need a slight modification: the default progressive summarization prompt refers to the user as the human and the character as the AI. Let's create a custom PromptTemplate that references the participants appropriately, the user as Chaala and the character as Elle.
summarizer_prompt_template = PromptTemplate(
input_variables=['summary', 'new_lines'],
output_parser=None,
partial_variables={},
template='Progressively summarize the lines of conversation provided, adding onto the previous summary returning a new summary.\n\nEXAMPLE\nCurrent summary:\nChaala asks what Elle thinks of artificial intelligence. Elle thinks artificial intelligence is a force for good.\n\nNew lines of conversation:\nChaala: Why do you think artificial intelligence is a force for good?\nElle: Because artificial intelligence will help humans reach their full potential.\n\nNew summary:\nChaala asks what Elle thinks of artificial intelligence. Elle thinks artificial intelligence is a force for good because it will help humans reach their full potential.\nEND OF EXAMPLE\n\nCurrent summary:\n{summary}\n\nNew lines of conversation:\n{new_lines}\n\nNew summary:',
template_format='f-string',
validate_template=True)
Let's set the maximum token limit above which summarization will occur to be 350 tokens and initialize our memory object.
from langchain.memory import ConversationSummaryBufferMemory
memory = ConversationSummaryBufferMemory(
llm=llm,
max_token_limit=350,
prompt=summarizer_prompt_template,
ai_prefix="Elle",
human_prefix="Chaala")
Initializing the ConversationChain
We can now use the parts we have built to create our ConversationChain
from langchain.chains import ConversationChain
conversation = ConversationChain(
llm=llm,
prompt = conversation_prompt_template,
memory=memory
)
and generate responses with
response_text = conversation.predict(input=input_text)
Giving Your Character a Voice
Voice Cloning
You can check out ElevenLabs for a speech synthesis interface you can use to capture the voice of the character you have chosen. You are going to need some high-quality clips of your character speaking for the best results.
Speech Generation
Once we have the voice designed to our liking, we can make the API call to have it read for us. To start, we need to create identifiers for the voice we have created and the model we wish to use for generation. I was given access to the Eleven English v2 model from ElevenLabs upon request. Although this model was in beta as of the writing of this post, it produced better results for me. Make sure to refer to your voice with the name you gave it on the ElevenLabs platform.
from elevenlabs import set_api_key
from elevenlabs.api import Models
from elevenlabs.api import Voices
from elevenlabs import generate, stream
set_api_key(ELEVENLABS_API_KEY)
models = Models.from_api()
elle_model = [model for model in models if model.name == "Eleven English v2"][0]
voices = Voices.from_api()
elle_voice = [voice for voice in voices if voice.name == "Elle Woods"][0]
For generating speech, we utilize the streaming option to reduce latency and stay under the character limit.
audio_stream = generate(
text=response_text,
voice=elle_voice,
model=elle_model,
stream=True
)
stream(audio_stream)
The Conversation Loop
We are now ready to put everything together. Since our conversation will consist of the user and the character taking turns to speak, we'll wrap the whole thing in a loop and have it run until interruption. We can also wrap the loop with a try-except block to pickle the conversation memory before exiting the execution.
import pickle

try:
    while True:
        # capture spoken user input
        with sr.Microphone() as source:
            audio = r.listen(source)
            print("Captured next line...")
        # convert the user input into text
        input_text = r.recognize_whisper_api(audio, api_key=OPENAI_API_KEY)
        # generate text response to the user input
        response_text = conversation.predict(input=input_text)
        # generate and stream the character's voice
        audio_stream = generate(
            text=response_text,
            voice=elle_voice,
            model=elle_model,
            stream=True
        )
        stream(audio_stream)
except KeyboardInterrupt:
    # save the conversation memory to disk
    with open("conversation_memory", "wb") as f:
        pickle.dump(conversation.memory, f)
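If you want to pick the conversation back up later, the pickled memory can be loaded back in and handed to a fresh chain. Here is a minimal sketch using the objects defined above.
import pickle

# restore the saved memory and rebuild the chain around it
with open("conversation_memory", "rb") as f:
    saved_memory = pickle.load(f)

conversation = ConversationChain(
    llm=llm,
    prompt=conversation_prompt_template,
    memory=saved_memory
)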
With all our components working together, this chain of artificial intelligence models will imitate a conversation with the chosen character.
Stress Points
As you will see upon testing it out yourself, for the most part this system works pretty well. SpeechRecognition knows where to start and stop the recording, Whisper is quite successful in discerning what is said, ChatGPT generates theme-appropriate responses, and Eleven English v2 does a good job of generating convincing speech. However, there are points in the system that require special attention, and one caveat I haven't been able to mitigate.
- SpeechRecognition Configuration: Timing the recording successfully requires the correct configuration of the energy_threshold and pause_threshold parameters. If the automatic configuration isn't working out for you, check out the manual configuration guides in the appendix.
- ElevenLabs Voice Lab: Getting the voice design right requires high-quality clips of the character speaking as well as a certain amount of experimentation.
- Response Latency: Unfortunately, there is one aspect of the user experience I haven't been able to fix yet: the latency of the spoken response. The biggest culprit here is the ElevenLabs API call. Even with streaming enabled, the speech generation takes a long time to complete. Paired with the pause_threshold amount of seconds SpeechRecognition waits before concluding its recording, you can expect anywhere from 10 to 20 seconds of total response latency. Although this is quite high for normal conversations, the inter-dimensional nature of the call made it acceptable in my experience. If you have powerful enough hardware, you can attempt to get around this problem by running a TTS model locally (such as tortoise-tts), as sketched below.
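Here is a rough idea of what that local alternative could look like with tortoise-tts, assuming you have prepared a custom voice folder (named "elle" here purely as an example) containing reference clips of the character; check the project's README for the exact interface of your installed version.
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

tts = TextToSpeech()

# "elle" is a hypothetical custom voice folder with reference clips of the character
voice_samples, conditioning_latents = load_voice("elle")

# generate speech for the LLM's response and write it to disk
gen = tts.tts_with_preset(
    response_text,
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset="fast"
)
torchaudio.save("response.wav", gen.squeeze(0).cpu(), 24000)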
Finishing Touches
The basic user experience is ready, but we can take it even further by preparing a prelude and an epilogue for our phone call.
Prelude
Since this is likely to be a surprise birthday gift, let's include an introduction to the conversation. For my gift I had one of the default voices of ElevenLabs read the following text.
Initiating contact with Elle Woods. This is an inter-dimensional phone call. Voice delays and awkward pauses are expected. Connecting now...
I followed this introduction with a phone line sound effect that eventually gets picked up. Upon picking up the phone, a pre-generated audio clip of the character saying "Hello Chaala!" is played. I chose to play the files using mpv.
# introduction
os.system("mpv initiating-2.mp3")
# phone line
os.system("mpv --volume=65 ring.opus")
# character greeting
os.system("mpv elle-greeting.mp3")
Epilogue
Finally, let's have the program play a disconnection sound effect when the conversation ends. In the conversation loop modify the except block like so:
except KeyboardInterrupt:
    # play disconnection effect
    os.system("mpv end.opus")
    # save the conversation memory to disk
    with open("conversation_memory", "wb") as f:
        pickle.dump(conversation.memory, f)
Moderating the Conversation
You are now ready to give the gift of inter-dimensional conversation. Let me tell you how I moderated the call. To preserve the mystery around what was about to happen, I had Çağla sit across from me so that she couldn't see the screen. I gave her the headphones and went through the Manual Configuration of the SpeechRecognition Energy Threshold described in the appendix. Once I was happy with the configuration, I executed the prelude together with the conversation loop so that recording started as soon as Elle greeted Çağla. When the conversation was over, I interrupted the execution of the program to start the epilogue and save the conversation memory to disk. I also recommend setting up a camera to record a video of the conversation; the reaction I got made the effort of putting this together well worth it.
Appendix
Manual Configuration of the SpeechRecognition Energy Threshold
Unfortunately, the automatic calibration of the energy_threshold of SpeechRecognition isn't going to cut it in all cases. Depending on your microphone configuration and environment noise levels, you may need to step in and determine this value manually. You can do this by disabling the dynamic energy threshold property and experimenting with different values. I recommend trying out different values yourself to get a feel for the range of appropriate values. You can then perform a sound check before initiating the conversation to fine-tune the threshold.
r.dynamic_energy_threshold = False
r.energy_threshold = 1300
with sr.Microphone() as source:
    audio = r.listen(source)
    print("Captured next line...")
Customizing the Pause Threshold of the Recognizer
If you find that the Recognizer stops listening while the user is still talking due to a long pause in their speech, you can tell it how long to wait by modifying the pause_threshold.
r.pause_threshold = 2
ALSA Microphone Errors (Linux Users)
On some Linux systems running ALSA, the SpeechRecognition microphone interface can produce warnings. In my experience these warnings can be ignored and the system will keep working just fine. However, they can clutter up the output and make it difficult to follow any text feedback you may have put in to monitor the status of the conversation. If you will be running this program in Jupyter Notebook/Lab, you can use IPython's clear_output(wait=False) call to clear the output right after it is generated.
from IPython.display import clear_output

with sr.Microphone() as source:
    clear_output(wait=False)
    audio = r.listen(source)
    print("Captured next line...")