In [None]:
!python --version

Python 3.10.12


# Claude 3 RAG Agents with LangChain v1

LangChain v1 (the `0.1.x` releases) brought a lot of changes. Comparing LangChain `0.0.3xx` with `0.1.x`, the preferred way of doing many things is different, and that is very much the case for agents.

The way that we initialize and use agents is generally clearer than it was in the past. There are still many abstractions, but we can (and are encouraged to) get closer to the agent logic itself. This can cause some confusion at first, but once understood the new logic is much clearer than in previous versions.

In this example, we'll be building a RAG agent with LangChain v1. We will use Claude 3 for our LLM, Voyage AI for knowledge embeddings, and Pinecone to power our knowledge retrieval.

To begin, let's install the prerequisites:

In [2]:
!pip install -qU \
    langchain==0.1.11 \
    langchain-core==0.1.30 \
    langchain-community==0.0.27 \
    langchain-anthropic==0.1.4 \
    langchainhub==0.1.15 \
    anthropic==0.19.1 \
    voyageai==0.2.1 \
    pinecone-client==3.1.0 \
    datasets==2.16.1

And grab the required API keys. We will need API keys for [Claude](https://docs.anthropic.com/claude/reference/getting-started-with-the-api), [Voyage AI](https://docs.voyageai.com/install/), and [Pinecone](https://docs.pinecone.io/docs/quickstart).

In [None]:
# Insert your API keys here
ANTHROPIC_API_KEY="<YOUR_ANTHROPIC_API_KEY>"
PINECONE_API_KEY="<YOUR_PINECONE_API_KEY>"
VOYAGE_API_KEY="<YOUR_VOYAGE_API_KEY>"
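
Hard-coding keys in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the keys are exported as environment variables with the same names used above. The `load_key` helper is hypothetical and not part of any library used here:

```python
import os
from getpass import getpass


def load_key(name: str, interactive: bool = True) -> str:
    """Return the API key stored in environment variable `name`.

    Falls back to a hidden prompt when the variable is unset and
    `interactive` is True; otherwise raises a KeyError.
    """
    value = os.environ.get(name)
    if value:
        return value
    if interactive:
        return getpass(f"Enter {name}: ")
    raise KeyError(f"{name} is not set")


# Usage (uncomment in the notebook):
# ANTHROPIC_API_KEY = load_key("ANTHROPIC_API_KEY")
# PINECONE_API_KEY = load_key("PINECONE_API_KEY")
# VOYAGE_API_KEY = load_key("VOYAGE_API_KEY")
```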

## Finding Knowledge

The first thing we need for an agent using RAG is somewhere we want to pull knowledge from. We will use v2 of the AI ArXiv dataset, available on Hugging Face Datasets at [`jamescalam/ai-arxiv2-chunks`](https://huggingface.co/datasets/jamescalam/ai-arxiv2-chunks).

_Note: we're using the prechunked dataset. For the raw version see [`jamescalam/ai-arxiv2`](https://huggingface.co/datasets/jamescalam/ai-arxiv2)._

In [4]:
from datasets import load_dataset

dataset = load_dataset("jamescalam/ai-arxiv2-chunks", split="train[:20000]")
dataset

Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 20000
})

In [5]:
dataset[1]

{'doi': '2401.09350',
 'chunk-id': 1,
 'chunk': 'These neural networks and their training algorithms may be complex, and the scope of their impact broad and wide, but nonetheless they are simply functions in a high-dimensional space. A trained neural network takes a vector as input, crunches and transforms it in various ways, and produces another vector, often in some other space. An image may thereby be turned into a vector, a song into a sequence of vectors, and a social network as a structured collection of vectors. It seems as though much of human knowledge, or at least what is expressed as text, audio, image, and video, has a vector representation in one form or another.\nIt should be noted that representing data as vectors is not unique to neural networks and deep learning. In fact, long before learnt vector representations of pieces of dataâ\x80\x94what is commonly known as â\x80\x9cembeddingsâ\x80\x9dâ\x80\x94came along, data was often encoded as hand-crafted feature vectors. E

## Building the Knowledge Base

To build our knowledge base we need _two things_:

1. Embeddings. For this we will use `VoyageEmbeddings` with Voyage AI's embedding models, which require an [API key](https://dash.voyageai.com/api-keys).
2. A vector database, where we store our embeddings and query them. We use Pinecone, which again requires a [free API key](https://app.pinecone.io).

First we initialize our connection to Voyage AI and define an `embed` object for embeddings:

In [7]:
from langchain_community.embeddings import VoyageEmbeddings

embed = VoyageEmbeddings(
    voyage_api_key=VOYAGE_API_KEY, model="voyage-2"
)

Then we initialize our connection to Pinecone:

In [8]:
from pinecone import Pinecone

# configure client
pc = Pinecone(api_key=PINECONE_API_KEY)

Now we set up our index specification. This allows us to define the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [9]:
from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-west-2"
)

Before creating an index, we need the dimensionality of our Voyage AI embedding model, which we can find easily by creating an embedding and checking the length:

In [10]:
vec = embed.embed_documents(["ello"])
len(vec[0])

1024

Now we create the index using our embedding dimensionality, and a metric also compatible with the model (this can be either cosine or dotproduct). We also pass our spec to index initialization.

In [11]:
import time

index_name = "claude-3-rag"

# check if index already exists (it shouldn't if this is first time)
if index_name not in pc.list_indexes().names():
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=len(vec[0]),  # dimensionality of voyage model
        metric='dotproduct',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

{'dimension': 1024,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 20000}},
 'total_vector_count': 20000}
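
On the choice of metric: for unit-length vectors, cosine similarity and dot product give the same score, so either metric ranks results identically. The sketch below checks this with plain Python on made-up vectors. Whether a given embedding model returns unit-length vectors is an assumption worth verifying, for example by computing the norm of `vec[0]`.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine(a, b):
    """Cosine similarity: dot product divided by both lengths."""
    return dot(a, b) / (norm(a) * norm(b))


def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]


# made-up vectors, only for illustration
u = normalize([3.0, 4.0, 0.0])
v = normalize([1.0, 2.0, 2.0])

# for unit-length vectors the two scores agree (up to rounding)
print(math.isclose(dot(u, v), cosine(u, v)))
```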

### Populating our Index

Now our knowledge base is ready to be populated with our data. We will use the `embed` object to embed our documents and then add them to our index.

We will also include metadata from each record.

In [12]:
from tqdm.auto import tqdm

# easier to work with dataset as pandas dataframe
data = dataset.to_pandas()

batch_size = 100

for i in tqdm(range(0, len(data), batch_size)):
    i_end = min(len(data), i+batch_size)
    # get batch of data
    batch = data.iloc[i:i_end]
    # generate unique ids for each chunk
    ids = [f"{x['doi']}-{x['chunk-id']}" for _, x in batch.iterrows()]
    # get text to embed
    texts = [x['chunk'] for _, x in batch.iterrows()]
    # embed text
    embeds = embed.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [
        {'text': x['chunk'],
         'source': x['source'],
         'title': x['title']} for _, x in batch.iterrows()
    ]
    # add to Pinecone
    index.upsert(vectors=zip(ids, embeds, metadata))

  0%|          | 0/200 [00:00<?, ?it/s]
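
The loop above walks the dataframe in fixed-size slices. Here is the same batching pattern reduced to plain Python so it can be run without any API keys. `batched` and `make_id` are hypothetical helper names, and the records are invented:

```python
from typing import Dict, Iterator, List


def batched(records: List[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Yield consecutive slices of `records` with at most `batch_size` items."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def make_id(record: Dict) -> str:
    """Build the `{doi}-{chunk-id}` identifier used for each vector."""
    return f"{record['doi']}-{record['chunk-id']}"


# invented records standing in for dataset rows
records = [{"doi": "0000.00000", "chunk-id": i} for i in range(250)]

sizes = [len(batch) for batch in batched(records, 100)]
print(sizes)  # three batches; the last one is smaller
print(make_id(records[1]))
```

Combining the DOI with the chunk number keeps IDs unique per chunk, so re-running the upsert overwrites existing vectors instead of duplicating them.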

Create a tool for our agent to use when searching for ArXiv papers:

In [13]:
from langchain.agents import tool

@tool
def arxiv_search(query: str) -> str:
    """Use this tool when answering questions about AI, machine learning, data
    science, or other technical questions that may be answered using arXiv
    papers.
    """
    # create query vector
    xq = embed.embed_query(query)
    # perform search
    out = index.query(vector=xq, top_k=5, include_metadata=True)
    # reformat results into string
    results_str = "\n\n".join(
        [x["metadata"]["text"] for x in out["matches"]]
    )
    return results_str

tools = [arxiv_search]
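
The string returned by the tool is simply the retrieved chunks separated by blank lines. A sketch of that formatting step in isolation, using a hand-written dictionary in place of a real Pinecone response. The shape shown is an assumption based on the fields the tool reads, and `format_matches` is a hypothetical name:

```python
def format_matches(response: dict) -> str:
    """Join the `text` metadata of each match into one context string."""
    return "\n\n".join(
        match["metadata"]["text"] for match in response["matches"]
    )


# hand-written stand-in for a query response
fake_response = {
    "matches": [
        {"id": "a-0", "score": 0.9, "metadata": {"text": "first chunk"}},
        {"id": "a-1", "score": 0.8, "metadata": {"text": "second chunk"}},
    ]
}

context = format_matches(fake_response)
print(context)
```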

When our agent uses this tool, it is executed like so:

In [14]:
print(
    arxiv_search.run(tool_input={"query": "can you tell me about llama 2?"})
)

Model Llama 2 Code Llama Code Llama - Python Size FIM LCFT Python CPP Java PHP TypeScript C# Bash Average 7B â 13B â 34B â 70B â 7B â 7B â 7B â 7B â 13B â 13B â 13B â 13B â 34B â 34B â 7B â 7B â 13B â 13B â 34B â 34B â â â â â 14.3% 6.8% 10.8% 9.9% 19.9% 13.7% 15.8% 13.0% 24.2% 23.6% 22.2% 19.9% 27.3% 30.4% 31.6% 34.2% 12.6% 13.2% 21.4% 15.1% 6.3% 3.2% 8.3% 9.5% 3.2% 12.6% 17.1% 3.8% 18.9% 25.9% 8.9% 24.8% â â â â â â â â â â 37.3% 31.1% 36.1% 30.4% 29.2% 29.8% 38.0%

Ethical Considerations and Limitations (Section 5.2) Llama 2 is a new technology that carries risks with use. Testing conducted to date has been in English, and has not covered, nor could it cover all scenarios. For these reasons, as with all LLMs, Llama 2's potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate or objectionable responses to user prompts. Therefore, before deploying any applications of L

## Defining XML Agent

The XML agent is built primarily to support Anthropic models. Anthropic models have been trained to use XML tags like `<input>{some input}</input>`, and when using a tool they use:

```
<tool>{tool name}</tool>
<tool_input>{tool input}</tool_input>
```

This is very different from the format produced by typical ReAct agents, which is not as well supported by Anthropic models.

To create an XML agent we need a `prompt`, `llm`, and list of `tools`. We can download a prebuilt prompt for conversational XML agents from LangChain hub.

In [15]:
from langchain import hub

prompt = hub.pull("hwchase17/xml-agent-convo")
prompt

ChatPromptTemplate(input_variables=['agent_scratchpad', 'input', 'tools'], partial_variables={'chat_history': ''}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['agent_scratchpad', 'chat_history', 'input', 'tools'], template="You are a helpful assistant. Help the user answer any questions.\n\nYou have access to the following tools:\n\n{tools}\n\nIn order to use a tool, you can use <tool></tool> and <tool_input></tool_input> tags. You will then get back a response in the form <observation></observation>\nFor example, if you have a tool called 'search' that could run a google search, in order to search for the weather in SF you would respond:\n\n<tool>search</tool><tool_input>weather in SF</tool_input>\n<observation>64 degrees</observation>\n\nWhen you are done, respond with a final answer between <final_answer></final_answer>. For example:\n\n<final_answer>The weather in SF is 64 degrees</final_answer>\n\nBegin!\n\nPrevious Conversation:\n{chat_history}\n\n

We can see the XML format being used throughout the prompt when explaining to the LLM how it should use tools.

Next we initialize our connection to Anthropic. For this we need an [Anthropic API key](https://console.anthropic.com/).

In [16]:
from langchain_anthropic import ChatAnthropic

# chat completion llm
llm = ChatAnthropic(
    anthropic_api_key=ANTHROPIC_API_KEY,
    model_name="claude-3-opus-20240229",  # change "opus" -> "sonnet" for speed
    temperature=0.0
)

When the agent is run we will provide it with a single `input`: the input text from a user. However, within the agent logic an `agent_scratchpad` value will be passed too, which will include tool information. To feed this information into our LLM we need to transform it into the XML format described above. We define the `convert_intermediate_steps` function to handle that.

In [17]:
def convert_intermediate_steps(intermediate_steps):
    log = ""
    for action, observation in intermediate_steps:
        log += (
            f"<tool>{action.tool}</tool><tool_input>{action.tool_input}"
            f"</tool_input><observation>{observation}</observation>"
        )
    return log
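
To see what the scratchpad string looks like, the function can be called on a hand-built step. In the sketch below `SimpleNamespace` stands in for the action object LangChain passes in (only the `tool` and `tool_input` attributes are used), the observation text is a placeholder, and the function is repeated so the block runs on its own:

```python
from types import SimpleNamespace


def convert_intermediate_steps(intermediate_steps):
    log = ""
    for action, observation in intermediate_steps:
        log += (
            f"<tool>{action.tool}</tool><tool_input>{action.tool_input}"
            f"</tool_input><observation>{observation}</observation>"
        )
    return log


# stand-in for one (action, observation) pair from a previous tool call
step = (
    SimpleNamespace(tool="arxiv_search", tool_input="llama 2"),
    "placeholder observation text",
)

scratchpad = convert_intermediate_steps([step])
print(scratchpad)
```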

We must also parse the tools into a string containing `tool_name: tool_description` — we handle that with the `convert_tools` function.

In [18]:
def convert_tools(tools):
    return "\n".join([f"{tool.name}: {tool.description}" for tool in tools])
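
A quick check of the resulting string, again with stand-in objects. Only the `name` and `description` attributes are read, and the second tool is invented purely to show the one-line-per-tool layout:

```python
from types import SimpleNamespace


def convert_tools(tools):
    return "\n".join([f"{tool.name}: {tool.description}" for tool in tools])


# stand-ins for tool objects; only `name` and `description` are read
fake_tools = [
    SimpleNamespace(name="arxiv_search", description="Search arXiv papers."),
    SimpleNamespace(name="calculator", description="Evaluate arithmetic."),
]

tools_str = convert_tools(fake_tools)
print(tools_str)
```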

With everything ready we can go ahead and initialize our agent object using [**L**ang**C**hain **E**xpression **L**anguage (LCEL)](https://www.pinecone.io/learn/series/langchain/langchain-expression-language/). We add instructions for when the LLM should _stop_ generating with `llm.bind(stop=[...])` and finally we parse the output from the agent using an `XMLAgentOutputParser` object.

In [19]:
from langchain.agents.output_parsers import XMLAgentOutputParser

agent = (
    {
        "input": lambda x: x["input"],
        # without "chat_history", tool usage has no context of prev interactions
        "chat_history": lambda x: x["chat_history"],
        "agent_scratchpad": lambda x: convert_intermediate_steps(
            x["intermediate_steps"]
        ),
    }
    | prompt.partial(tools=convert_tools(tools))
    | llm.bind(stop=["</tool_input>", "</final_answer>"])
    | XMLAgentOutputParser()
)
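
Because generation stops at `</tool_input>` or `</final_answer>`, the text reaching the parser is missing its final closing tag. The sketch below illustrates how such truncated output can be split into either a tool call or a final answer. It is a simplified stand-in written for this explanation, not the `XMLAgentOutputParser` implementation, and `parse_xml_output` is a hypothetical name:

```python
def parse_xml_output(text: str) -> dict:
    """Classify truncated XML agent output as a tool call or a final answer."""
    if "</tool>" in text:
        tool_part, input_part = text.split("</tool>", 1)
        tool = tool_part.split("<tool>", 1)[1].strip()
        tool_input = ""
        if "<tool_input>" in input_part:
            tool_input = input_part.split("<tool_input>", 1)[1]
            # tolerate the closing tag in case no stop sequence was applied
            tool_input = tool_input.split("</tool_input>", 1)[0].strip()
        return {"type": "action", "tool": tool, "tool_input": tool_input}
    if "<final_answer>" in text:
        answer = text.split("<final_answer>", 1)[1]
        answer = answer.split("</final_answer>", 1)[0].strip()
        return {"type": "finish", "output": answer}
    raise ValueError(f"Could not parse agent output: {text!r}")


print(parse_xml_output("<tool>arxiv_search</tool>\n<tool_input>llama 2"))
print(parse_xml_output("<final_answer>Llama 2 is a model from Meta"))
```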

With our `agent` object initialized we pass it to an `AgentExecutor` object alongside our original `tools` list:

In [20]:
from langchain.agents import AgentExecutor

agent_executor = AgentExecutor(
    agent=agent, tools=tools, verbose=True
)

Now we can use the agent via the `invoke` method:

In [25]:
user_msg = "can you tell me about llama 2?"

out = agent_executor.invoke({
    "input": user_msg,
    "chat_history": ""
})

print(out["output"])



> Entering new AgentExecutor chain...
<tool>arxiv_search</tool>
<tool_input>llama 2
Model Llama 2 Code Llama Code Llama - Python Size FIM LCFT Python CPP Java PHP TypeScript C# Bash Average 7B â 13B â 34B â 70B â 7B â 7B â 7B â 7B â 13B â 13B â 13B â 13B â 34B â 34B â 7B â 7B â 13B â 13B â 34B â 34B â â â â â 14.3% 6.8% 10.8% 9.9% 19.9% 13.7% 15.8% 13.0% 24.2% 23.6% 22.2% 19.9% 27.3% 30.4% 31.6% 34.2% 12.6% 13.2% 21.4% 15.1% 6.3% 3.2% 8.3% 9.5% 3.2% 12.6% 17.1% 3.8% 18.9% 25.9% 8.9% 24.8% â â â â â â â â â â 37.3% 31.1% 36.1% 30.4% 29.2% 29.8% 38.0%

2
Cove Liama Long context (7B =, 13B =, 34B) + fine-tuning ; Lrama 2 Code training 20B oes Cope Liama - Instruct Foundation models â> nfilling code training = eee.â (7B =, 13B =, 34B) â 5B (7B, 13B, 348) 5008 Python code Long context Cove Liama - PyrHon (7B, 13B, 34B) > training Â» Fine-tuning > 1008 208
Figure 2: The Code Llama

That looks pretty good, but right now our agent is _stateless_, making it hard to have a conversation with. We can give it memory in many different ways, but one of the easiest is to use `ConversationBufferWindowMemory`.

In [26]:
from langchain.chains.conversation.memory import ConversationBufferWindowMemory

# conversational memory
conversational_memory = ConversationBufferWindowMemory(
    memory_key='chat_history',
    k=5,
    return_messages=True
)

We haven't attached our conversational memory to our agent — so the `conversational_memory` object will remain empty:

In [27]:
conversational_memory.chat_memory.messages

[]

We must manually add the interactions between ourselves and the agent to our memory.

In [28]:
conversational_memory.chat_memory.add_user_message(user_msg)
conversational_memory.chat_memory.add_ai_message(out["output"])

conversational_memory.chat_memory.messages

[HumanMessage(content='can you tell me about llama 2?'),
 AIMessage(content='\n- Llama 2 is a large language model developed by Meta AI. It comes in sizes ranging from 7B to 70B parameters.\n\n- Code Llama is a version of Llama 2 that has been specialized for code generation through fine-tuning on code datasets. Code Llama models are available in Python, C++, Java, PHP, TypeScript, C#, and Bash.\n\n- The Code Llama specialization pipeline involves foundation model pre-training, long context training, code infilling training, and fine-tuning on specific programming languages. \n\n- Code Llama significantly outperforms the base Llama 2 models on code generation benchmarks like HumanEval and MBPP. For example, the 34B parameter Code Llama - Python achieves 48.8% pass@1 on HumanEval compared to 34.1% for the 34B Llama 2.\n\n- As with all large language models, Llama 2 has limitations and potential risks that need to be considered before deploying it in applications. Meta provides a respons

Now we can see that _two_ messages have been added: our `HumanMessage` and the agent's `AIMessage` response. Unfortunately, we cannot send these messages to our XML agent directly. Instead, we need to pass a string in the format:

```
Human: {human message}
AI: {AI message}
```

Let's write a quick `memory2str` helper function to handle this for us:

In [29]:
from langchain_core.messages.human import HumanMessage

def memory2str(memory: ConversationBufferWindowMemory):
    messages = memory.chat_memory.messages
    memory_list = [
        f"Human: {mem.content}" if isinstance(mem, HumanMessage)
        else f"AI: {mem.content}" for mem in messages
    ]
    memory_str = "\n".join(memory_list)
    return memory_str

In [30]:
print(memory2str(conversational_memory))

Human: can you tell me about llama 2?
AI: 
- Llama 2 is a large language model developed by Meta AI. It comes in sizes ranging from 7B to 70B parameters.

- Code Llama is a version of Llama 2 that has been specialized for code generation through fine-tuning on code datasets. Code Llama models are available in Python, C++, Java, PHP, TypeScript, C#, and Bash.

- The Code Llama specialization pipeline involves foundation model pre-training, long context training, code infilling training, and fine-tuning on specific programming languages. 

- Code Llama significantly outperforms the base Llama 2 models on code generation benchmarks like HumanEval and MBPP. For example, the 34B parameter Code Llama - Python achieves 48.8% pass@1 on HumanEval compared to 34.1% for the 34B Llama 2.

- As with all large language models, Llama 2 has limitations and potential risks that need to be considered before deploying it in applications. Meta provides a responsible use guide with recommendations for safe
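
Note that `memory2str` reads `memory.chat_memory.messages` directly. As far as we understand the memory class, the `k` window is applied when history is loaded through the memory object itself, so reading `chat_memory.messages` yields the full history rather than the last `k` exchanges. If the window matters, it can be applied by hand. A sketch using plain `(role, content)` tuples instead of LangChain message objects, where `history2str` is a hypothetical name:

```python
from typing import List, Tuple


def history2str(messages: List[Tuple[str, str]], k: int = 5) -> str:
    """Format the last `k` human/AI exchanges as `Human: ...` / `AI: ...` lines."""
    # each exchange is two messages, so keep the final 2 * k entries
    window = messages[-2 * k:] if k > 0 else []
    return "\n".join(
        f"Human: {content}" if role == "human" else f"AI: {content}"
        for role, content in window
    )


# invented conversation for illustration
history = [
    ("human", "question 1"), ("ai", "answer 1"),
    ("human", "question 2"), ("ai", "answer 2"),
]

print(history2str(history, k=1))
```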

Now let's put together another helper function called `chat` to help us handle the _state_ part of our agent.

In [31]:
def chat(text: str):
    out = agent_executor.invoke({
        "input": text,
        "chat_history": memory2str(conversational_memory)
    })
    conversational_memory.chat_memory.add_user_message(text)
    conversational_memory.chat_memory.add_ai_message(out["output"])
    return out["output"]
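
The order of operations in `chat` matters: the history string is built before the new turn is stored, so the current question reaches the agent only through `input` and is not duplicated in `chat_history`. A sketch of the same pattern with a stub in place of `agent_executor` so it runs offline. `make_chat` and `echo_executor` are hypothetical names:

```python
from typing import Callable, Dict, List, Tuple


def make_chat(invoke: Callable[[Dict], Dict]):
    """Return a `chat` function that keeps its own list of past turns."""
    history: List[Tuple[str, str]] = []

    def chat(text: str) -> str:
        # build the history string from turns that came before this one
        chat_history = "\n".join(f"{role}: {msg}" for role, msg in history)
        out = invoke({"input": text, "chat_history": chat_history})
        # only now record the new exchange
        history.append(("Human", text))
        history.append(("AI", out["output"]))
        return out["output"]

    return chat


def echo_executor(inputs: Dict) -> Dict:
    """Stub executor: reports how many history lines it was given."""
    n_lines = len(inputs["chat_history"].splitlines())
    return {"output": f"saw {n_lines} history lines"}


chat_demo = make_chat(echo_executor)
print(chat_demo("first question"))
print(chat_demo("second question"))
```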

Now we simply chat with our agent and it will remember the context of previous interactions.

In [33]:
print(chat("was any red teaming done with the model?"))



> Entering new AgentExecutor chain...
<tool>arxiv_search</tool>
<tool_input>llama 2 red teaming
After conducting red team exercises, we asked participants (who had also participated in Llama 2 Chat exercises) to also provide qualitative assessment of safety capabilities of the model. Some participants who had expertise in offensive security and malware development questioned the ultimate risk posed by "malicious code generation" through LLMs with current capabilities.
One red teamer remarked, "While LLMs being able to iteratively improve on produced source code is a risk, producing source code isn't the actual gap. That said, LLMs may be risky because they can inform low-skill adversaries in production of scripts through iteration that perform some malicious behavior."
According to another red teamer, "[v]arious scripts, program code, and compiled binaries are readily available on mainstream public websites, hacking forums or on 'the

We can ask follow-up questions that miss key information, but thanks to the conversational history the LLM understands the context and uses it to adjust the search query. For example, we asked about `red teaming` but did not mention `llama 2`. Claude 3 added this context, producing the search query `"llama 2 red teaming"` based on the chat history.

---