Enable smarter conversations with your AI agents.

The latest and greatest technologies are bound to have some quirks. This is certainly the case with Large Language Models (LLMs) like GPT, a tool many businesses are now leaning on for seamless digital interactions with customers, staff, and the general public. While these models are undoubtedly impressive, they come with their own set of limitations, including the issue of retaining chat history. As a chat progresses, there’s a threshold beyond which these models can’t remember. This threshold is known as the “token limit”: roughly speaking, a cap on how much text (measured in tokens, which are word fragments) the model can consider as input. In practice, this means that as users dive deeper into a conversation with an LLM-based AI agent, earlier parts of the conversation start fading away.
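To make the “token limit” idea concrete, here’s a small sketch using OpenAI’s tiktoken library (assuming it’s installed; the model name is just an example) that counts how many tokens a piece of text would consume:

import tiktoken

def count_tokens(text, model="gpt-4"):
    # Each model family has its own tokenizer; tiktoken looks up the right one
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# A short sentence is only a handful of tokens; a long chat can run into thousands
print(count_tokens("As users dive deeper into a conversation, earlier parts start fading away."))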

It’s akin to conversing with someone who is listening in the moment but has no long-term memory, such as a husband. And it’s part of the reason why, if you have ever experimented online with an LLM such as ChatGPT, you will be familiar with how chats are held in isolated “conversations”. LLMs can only process so many messages as input. The conventional solution? Trim the earlier parts of the chat as we reach the token limit. But this approach isn’t going to work for creating believable, useful autonomous agents. Of course we can do better!
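In code, that conventional trimming looks roughly like this (a minimal sketch: the token budget is an arbitrary placeholder, and count_tokens is the helper from above):

def truncate_history(messages, max_tokens=3000):
    # Walk backwards from the newest message, keeping turns until the budget is spent;
    # everything older than the cutoff is simply thrown away
    kept, used = [], 0
    for msg in reversed(messages):
        cost = count_tokens(msg["content"])
        if used + cost > max_tokens:
            break
        kept.insert(0, msg)
        used += cost
    return kept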

Truncating chat history has obvious problems: essential context gets lost, and the conversation’s depth is compromised. There’s a better way.

Limiting length without information loss: aka “compression”

There’s a better way to handle this if we think of the problem in terms of compression. And it turns out we can even gain some cost efficiency if we do it right. Instead of simply truncating the chat history, what if we summarized it and used the summary to “prime” our AI agent with the backstory? That sounds like another great job for a language model, and by handing the summarization to a cheaper model such as GPT-3.5, we can do exactly that: distill lengthy chats into concise summaries, ensuring that our primary conversational model always has the essence of the discussion at hand.

Here’s a glimpse into how that might look in code (using the openai Python SDK’s ChatCompletion API; the character threshold and the number of recent messages to keep are placeholders to tune):

import openai

SOME_THRESHOLD = 8000  # rough character budget before summarizing; tune to the model's context window

def summarize_chat_history(chat_history):
    # Use the cheaper GPT-3.5 model to condense the conversation so far
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are an AI, tasked with condensing a chat history into a comprehensive paragraph for an LLM such as GPT. Kindly return only the summarized content."},
            {"role": "user", "content": chat_history}
        ]
    )
    return response.choices[0].message['content']

def chat_with_gpt4(messages, keep_recent=4):
    # If the chat history has grown too long, summarize the older messages
    # and keep only the summary plus the most recent turns
    if len(" ".join(msg["content"] for msg in messages)) > SOME_THRESHOLD:
        older, recent = messages[:-keep_recent], messages[-keep_recent:]
        summarized_history = summarize_chat_history(" ".join(msg["content"] for msg in older))
        # The summary becomes a system message that gives GPT-4 the backstory
        messages = [{"role": "system", "content": summarized_history}] + recent

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=messages
    )
    return response.choices[0].message['content']
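And a minimal usage sketch (the conversation contents here are purely illustrative):

conversation = [
    {"role": "system", "content": "You are a helpful support agent."},
    {"role": "user", "content": "Hi, I'd like to change the delivery address on my order."},
]

reply = chat_with_gpt4(conversation)
conversation.append({"role": "assistant", "content": reply})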


With this approach, the challenges posed by LLM token limits are significantly lessened. As often happens, innovative solutions emerge from the very tools we’ve been working with all along.