# Experimentally Testing Claude's Long-Context QA Abilities

[DISCLAIMER: This notebook was created using Claude 2 models and is considered legacy.]

In this notebook, we will take a look at Claude's ability to answer questions about the meeting notes from a long government document. We will also see how this ability varies depending on the location of the relevant information. The government document is split up into many smaller subsections. Each question will be about information contained in one of those subsections. All the questions and answers will be written by Claude!

Summary of what is to come:

1. Downloading and preprocessing the data
2. Using Claude to write about 430 multiple-choice questions about specific sections of the data
3. Validating that Claude is able to answer those questions when given that section alone
4. Validating that Claude is unable to answer those questions when given a random other chunk
5. Testing Claude's ability to answer the questions even when the context size gets very long

### Data Prep

To start: download the document and split it up into chunks. Each chunk corresponds to a meeting note from one department, such as the Department of Transportation.

In [None]:
import anthropic, os, re, requests, trio, pandas as pd
import numpy as np
from bs4 import BeautifulSoup
API_KEY = os.environ['ANTHROPIC_API_KEY']
CLIENT = anthropic.Anthropic(api_key=API_KEY)

In [3]:
url = 'https://www.govinfo.gov/content/pkg/FR-2023-07-13/xml/FR-2023-07-13.xml'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'xml')

text = soup.get_text()
chunks = text.split('BILLING CODE')
chunks[0] = chunks[0][chunks[0].index('DEPARTMENT OF TRANSPORTATION'):]  # First chunk has some extra material at the beginning.

# We'll throw out the chunks that are extra-long or extra-short.
tokenizer = CLIENT.get_tokenizer()
chunks = [c for c in chunks if 200 < len(tokenizer.encode(c)) <= 5000]  # Tokenize each chunk once.
print(len(chunks))
print(chunks[2])

88
 4910–13–P



NATIONAL AERONAUTICS AND SPACE ADMINISTRATION
14 CFR Part 1204
[NASA Document No: NASA–23–054; NASA Docket No: NASA–2023–0003]
RIN 2700–AE70
Delegations and Designations; Correction

AGENCY:
National Aeronautics and Space Administration.


ACTION:
Direct final rule; correction.


SUMMARY:

                        NASA published a document in the 
                        Federal Register
                         on July 5, 2023, concerning Delegations and Designations. The document contained an error in amendatory instruction 2.a.
                    


DATES:

                        This correction is effective September 5, 2023. If adverse comments are received on the direct final rule published at 88 FR 42870, NASA will publish a timely withdrawal of the rule and this correction to the rule in the 
                        Federal Register
                        .
                    


FOR FURTHER INFORMATION CONTACT:
Daniela Cruzado, 202–295–7589.



SUPPLEMENTARY

### Question and Answer Generation With Claude

Now it's time to use Claude to generate questions and answers! We'll use a two-shot prompt template that includes two example (chunk, questions, answers) groups along with instructions. We'll ask for five questions about each chunk, each with three wrong answers and one right answer.

In [4]:
example_passage1 = """DEPARTMENT OF HOUSING AND URBAN DEVELOPMENT
[Docket No. FR–6381–N–01]
Improving Access to Public Benefit Programs; Request for Comment
AGENCY:
Office of Policy Development and Research, Department of Housing and Urban Development, HUD.
ACTION:
Request for comments.
SUMMARY:
The Department of Housing and Urban Development is seeking comments from the public regarding the burden faced when applying for or maintaining eligibility for HUD's housing programs. HUD recognizes that these administrative hurdles and paperwork burdens disproportionately fall on the most vulnerable populations and prevent individuals and entities from accessing benefits for which they are legally eligible. Public comment submitted in response to this request for comment will assist HUD in better understanding, identifying, and reducing HUD's public program administrative burden and ultimately further its mission to pursue transformative housing and community-building policies and programs.
DATES:
Comment Due Date: August 14, 2023.
ADDRESSES:
Interested persons are invited to submit comments responsive to this request for comment. There are three methods for submitting public comments. All submissions must refer to the above docket number and title.
1. Electronic Submission of Comments. Comments may be submitted electronically through the Federal eRulemaking Portal at www.regulations.gov. HUD strongly encourages commenters to submit comments electronically through www.regulations.gov. Electronic submission of comments allows the commenter maximum time to prepare and submit a comment, ensures timely receipt by HUD, and enables HUD to make comments immediately available to the public. Comments submitted electronically through www.regulations.gov can be viewed by other commenters and interested members of the public. Commenters should follow the instructions provided on that website to submit comments electronically.
2. Submission of Comments by Mail. Comments may be submitted by mail to the Regulations Division, Office of General Counsel, Department of Housing and Urban Development, 451 7th Street SW, Room 10276, Washington, DC 20410–0500.
3. Submission of Comments by Electronic Mail. Comments may be submitted by electronic mail to the Regulations Division, Office of General Counsel, Department of Housing and Urban Development at improvingaccesstopublicbenefitprograms@hud.gov.
Note: To receive consideration as a public comment, comments must be submitted through one of the three methods specified above.
Public Inspection of Public Comments. Copies of all comments submitted will be available for inspection and downloading at www.regulations.gov. HUD will also make all properly submitted comments and communications available for public inspection and copying during regular business hours at the above address. Due to security measures at the HUD Headquarters building, you must schedule an appointment in advance to review the public comments by calling the Regulations Division at 202–708–3055 (this is not a toll-free number). HUD welcomes and is prepared to receive calls from individuals who are deaf or hard of hearing, as well as individuals with speech or communication disabilities. To learn more about how to make an accessible telephone call, please visit https://www.fcc.gov/consumers/guides/telecommunications-relay-service-trs. Copies of all comments submitted are available for inspection and downloading at www.regulations.gov.
FOR FURTHER INFORMATION CONTACT:
Todd Richardson, General Deputy Assistant Secretary, Office of Policy Development and Research, Department of Housing and Urban Development, 451 7th Street SW, Room 8100, Washington, DC 20410, telephone 202–402–5706 (this is not a toll-free number). HUD welcomes and is prepared to receive calls from individuals who are deaf or hard of hearing, as well as individuals with speech or communication disabilities. To learn more about how to make an accessible telephone call, please visit https://www.fcc.gov/consumers/guides/telecommunications-relay-service-trs.
SUPPLEMENTARY INFORMATION:
I. Background
Applying for and maintaining eligibility for public benefits and services, including housing programs, often requires completing and submitting a variety of forms. HUD and its housing partners that administer its programs (including Public Housing Authorities, State and local governments, non-profit recipients of CDBG programs, Multifamily Housing owners, and FHA lenders) use the information collected by these forms to determine whether applicants are eligible or if current recipients continue to be eligible. These forms and other methods of information collections may create burdens that disproportionately fall on the most vulnerable populations and prevent individuals and entities from accessing services for which they are legally eligible. These burdens include the expenditure of time, effort, or financial resources to generate, maintain, or provide information to HUD or its housing partners. For example, individuals may be required to provide a list of family members, the family's total annual family income, the assets available to each family member in the household, and the value of such assets in order to access public housing. Individuals applying for or maintaining eligibility for public benefits or services may also face burdens such as time spent gathering records and documentation needed to prove eligibility, travel time associated with developing and submitting the collection, or even time waiting to speak with agency personnel.
Consistent with the Paperwork Reduction Act of 1995 (PRA), 1 agencies must ensure that both the quantitative burden estimates and the narrative description supporting its information collection requests reflect the beginning-to-end experience of completing the information collection activity. Specifically, the burden faced by individuals applying for and maintaining eligibility for public benefits should also include:
1  Public Law 104–13 (1995) (codified at 44 U.S.C. 3501–3520).
—Information and learning costs, which refer to the time, effort, money, and other resources that individuals need to expend to learn about the existence of a public service or benefit, rules governing their eligibility and application, certification, benefits maintenance, and post-award reporting or recertification processes.
—Compliance costs, which refer to the time, effort, money, and other resources that individuals need to expend to follow through with program application, certification, or recertification, including filling out necessary paperwork, waiting for correspondence from program agencies, planning for in-person meetings, and producing documentation to confirm their eligibility (for instance, records of household composition, income, or assets)."""
questions1 = """<Question 1>
What is the Department of Housing and Urban Development seeking comments from the public about?
</Question 1>
<Answers 1>
1. Difficulties in obtaining access to HUD's housing program.
2. Potential changes in national zoning regulations for mixed-use housing.
3. Minimum notice for evictions of long-time tenants.
4. Insurance requirements for HUD-sponsored new construction in disaster-prone areas.
</Answers 1>
<Question 2>
When is the due date for public comment on the burdens placed on individuals applying for HUD's housing programs?
</Question 2>
<Answers 2>
1. August 14, 2023
2. September 9, 2023
3. January 2, 2024
4. July 31, 2023
</Answers 2>
<Question 3>
What do "compliance costs" refer to in the context of access to HUD's public benefit programs?
</Question 3>
<Answers 3>
1. Time, effort, money, and resources needed to behave in accordance with paperwork requirements.
2. Information and self-education required to familiarize oneself with the public services available.
3. Disclosure requirements for proving your organization has not shared information unduly with others.
4. Cognitive load, distress, anxiety, distrust, or loss of autonomy and dignity.
</Answers 3>
"""
questions2 = """<Question 1>
What agency published the document on July 5 concerning Delegations and Designations?
</Question 1>
<Answers 1>
1. National Aeronautics and Space Administration 
2. Federal Aviation Administration
3. Department of Defense
4. National Oceanic and Atmospheric Administration
</Answers 1>
<Question 2> 
What is the purpose of the document published in the Federal Register by NASA?
</Question 2>
<Answers 2>
1. To correct an error in a previous document regarding Delegations and Designations
2. To announce a new policy regarding procurement of launch services 
3. To solicit public comments on proposed changes to  Rule 210.12(b)(2) regarding astronaut training requirements
4. To provide guidance on sharing satellite data with foreign partners
</Answers 2>
<Question 3>
What will NASA do if it receives adverse comments on the direct final rule published on July 5, 2023?
</Question 3>
<Answers 3>
1. Publish a timely withdrawal of the rule and this correction to the rule
2. Extend the comment period by 30 days
3. Schedule public hearings to discuss the comments and reactions to the comments
4. Proceed with implementing the rule as planned
</Answers 3>
<Question 4>  
What specifically needs to be corrected in the original NASA Federal Register document?
</Question 4>
<Answers 4>
1. The amendatory instruction for section 1204.501 paragraph (a)
2. The chapter heading for section 1107.323 paragraph (b) describing responsible disclosure of satellite data
3. The effective date of the delegations and designations, July 29, 2023
4. The point of contact for further information, Todd Richardson
</Answers 4>"""

example_passage2 = """NATIONAL AERONAUTICS AND SPACE ADMINISTRATION
14 CFR Part 1204
[NASA Document No: NASA–23–054; NASA Docket No: NASA–2023–0003]
RIN 2700–AE70
Delegations and Designations; Correction
AGENCY:
National Aeronautics and Space Administration.
ACTION:
Direct final rule; correction.
SUMMARY:
NASA published a document in the Federal Register on July 5, 2023, concerning Delegations and Designations. The document contained an error in amendatory instruction 2.a.
DATES:
This correction is effective September 5, 2023. If adverse comments are received on the direct final rule published at 88 FR 42870, NASA will publish a timely withdrawal of the rule and this correction to the rule in the Federal Register .
FOR FURTHER INFORMATION CONTACT:
Daniela Cruzado, 202–295–7589.
SUPPLEMENTARY INFORMATION:
Correction
In the Federal Register of July 5, 2023, in FR Doc. 2023–14042, published at 88 FR 42870, the following correction is made:
§ 1204.501
[Amended]
1. On page 42871, in the first column, correct amendatory instruction 2.a. for § 1204.501 to read: “a. In paragraph (a) introductory text, add the words “the Office of” before the word “Strategic” and remove the words “Integrated Asset Management” and add in their place the words “Facilities and Real Estate.”
Nanette Smith,
Team Lead, NASA Directives and Regulations.
[FR Doc. 2023–14794 Filed 7–12–23; 8:45 am]"""
mc_qa3 = """\n\nHuman: Hello Claude. Here is a section from the minutes of a government meeting. Please read it carefully and devise five factual questions about it, along with three wrong answers and the right answer for each. Put questions in <Question></Question> tags and answers in <Answer></Answer> tags, as in the examples.

Here are two examples.

<Example>
<Passage>
{example_passage1}
</Passage>
{questions1}
</Example>
<Example>
<Passage>
{example_passage2}
</Passage>
{questions2}
</Example>

Now here is the passage I would like you to write questions for.

<Passage>
{test_passage}
</Passage>

Please write five factual questions about this document that can be answered with reference to it and without any outside knowledge. For each question, give three wrong answers and the right answer. Always put the correct answer first. Write 4 non-numerical questions and one numerical one. Make sure the wrong answers are highly detailed. Put the question inside <Question N></Question N> tags, and the answers inside <Answers N></Answers N> tags, where N is the index of the question, as in the examples. 

Guidelines:
Make sure that each question clearly and independently identifies the section/minutes/government meeting from which it derives; avoid terms like "this document", "this passage", "this notice" in favor of more specific descriptions. The goal is to future-proof the questions and answers in the event that they became divorced from their subject in the filing system.
Make the questions specific to their source text. Eschew generic questions about date of publication or name of agency. Instead, prefer questions that could not apply to notes produced by any other department/agency.

Assistant:
"""

A key detail to pay attention to in the prompt above: the instruction to make the wrong answers "highly detailed". Without this instruction, the wrong answers tended to be relatively short and the right answer stood out on length alone. Put a pin in the instruction to "Make sure that each question clearly and independently identifies the section/minutes/government meeting from which it derives"; we'll come back to it later.
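One way to check whether the right answer still stands out on length is to measure how often it is the longest option. The sketch below is a hypothetical helper (`longest_is_right_rate` is not part of the notebook) that assumes each question is a dictionary with `right_answer` and `wrong_answers` keys, the same shape the parsing code in this notebook produces. With four options, a rate far above 25% would suggest that length is leaking the answer.

```python
def longest_is_right_rate(qas):
    """Fraction of questions whose right answer is strictly longer than every wrong answer."""
    if not qas:
        return 0.0
    hits = 0
    for qa in qas:
        longest_wrong = max(len(w) for w in qa["wrong_answers"])
        if len(qa["right_answer"]) > longest_wrong:
            hits += 1
    return hits / len(qas)


# Toy data for illustration only (not output from Claude).
toy_qas = [
    {"right_answer": "A long and very specific correct answer", "wrong_answers": ["No", "Maybe", "Yes"]},
    {"right_answer": "Short", "wrong_answers": ["A much longer distractor", "Another long distractor", "Tiny"]},
]
print(longest_is_right_rate(toy_qas))
```

Character length is only a rough proxy; comparing token counts would work the same way.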

Now, we'll make a dataframe with a column where we fill in the prompt template for each chunk, excluding the two chunks we used in the two-shot.

In [5]:
chunks = [c for c in chunks if example_passage1[20:80] not in c and example_passage2[20:80] not in c]
df = pd.DataFrame(
    {'chunk': chunks, 'chunk_idx': range(len(chunks))}
)
df['prompt'] = [mc_qa3.format(
    example_passage1=example_passage1, example_passage2=example_passage2, questions1=questions1, questions2=questions2, test_passage=c
    ) for c in chunks]
print(len(df))

86


In this notebook, we'll use Claude Instant, which has a 100K context window just like Claude 2. You can also run it with Claude 2 and get similar results. First, we write helper code that calls the API in parallel, if your organization's rate limits allow it. If not, set the CapacityLimiter to 1.

In [6]:
def get_completion(client, prompt, max_tokens=3000, model='claude-instant-1.2', temperature=0):
    return client.completions.create(
        prompt=prompt, max_tokens_to_sample=max_tokens, model=model, temperature=temperature, stop_sequences=['\n\nHuman:', '\n\nAssistant:']
    ).completion

async def process_case(limiter, client, prompt, results, output_col_name='completion'):

    async with limiter:
        completion = await trio.to_thread.run_sync(get_completion, client, prompt)

    results.append({'prompt': prompt, output_col_name: completion})

    if len(results) % 10 == 0:
        print(f"{len(results)} test cases processed")  # Optional "progress bar"

async def get_completions_parallel(client, prompts, output_col_name='completion'):
    async with trio.open_nursery() as nursery:
        limiter = trio.CapacityLimiter(10)  # Set this to the maximum concurrency allowed on your API key, which may just be 1.
        results = []
        for prompt in prompts:
            nursery.start_soon(process_case, limiter, client, prompt, results, output_col_name)  # Use the client passed in, not the global.
    return results
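If trio is not available, a similar bounded-concurrency pattern can be sketched with the standard library's thread pool. This is a swapped-in technique, not the notebook's method: `fake_completion` is a hypothetical stand-in for `get_completion`, and `get_completions_threaded` is an illustrative name. Here `max_workers` plays the role of the `CapacityLimiter`.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_completion(prompt):
    # Hypothetical stand-in for get_completion(client, prompt); no API call is made.
    return prompt.upper()


def get_completions_threaded(prompts, complete=fake_completion, max_workers=10):
    # At most max_workers calls run at the same time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        completions = list(pool.map(complete, prompts))  # map preserves input order
    return [{"prompt": p, "completion": c} for p, c in zip(prompts, completions)]


results = get_completions_threaded(["first prompt", "second prompt"], max_workers=2)
print(results)
```

Unlike the trio version, which appends results in completion order and relies on a later merge on `prompt`, this sketch returns results in the same order as the input prompts.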

In [None]:
# Get questions and answers for every prompt
qas = await get_completions_parallel(CLIENT, df.prompt.values, output_col_name='qas')
df = df.merge(pd.DataFrame(qas), on='prompt')

Next, we'll do some minor cleanup on the output:
- Remove the numbers for ease of reshuffling
- Extract the material between XML tags
- Make a separate row for every (question + answers) pair

In [8]:
def remove_numbered_bullets(answer):
    return re.sub(r'^\d+\. ', '', answer)

In [9]:
def extract_between_tags(tag: str, string: str, strip: bool = True, alt=True) -> list[str]:
    # Helper function for parsing Claude's output
    try:
        ext_list = re.findall(rf"<{tag}\s?>(.+?)</{tag}\s?>", string, re.DOTALL)  # Raw f-string so \s is a regex escape.
        if strip:
            ext_list = [e.strip() for e in ext_list]
        if alt and not ext_list:
            ext_list = re.findall(rf"<{tag}\s?>(.+?)<{tag}\s?>", string, re.DOTALL)  # Fallback: closing tag written without the slash.
            if strip:
                ext_list = [e.strip() for e in ext_list]
        return ext_list
    except:
        return extract_between_tags(tag, string+'</' + tag + '>', strip, alt)

def extract_answer(sample):
    return extract_between_tags('Answer', sample)[0][0] if extract_between_tags(
        'Answer', sample) else extract_between_tags('Answer', sample + '</Answer>')[0][0] if extract_between_tags('Answer', sample + '</Answer>') else '_'

def extract_qs_as(qas, n=5):
    # Parse each of Claude's answers to the QA generation prompt into a question and a list of answers.
    flattened_qas = []
    for i in range(1, n + 1):
        try:
            question = extract_between_tags(f'Question {i}', qas)[0]
            answers = extract_between_tags(f'Answers {i}', qas)[0]
        except IndexError:  # Tag missing from the completion; skip this question.
            continue
        flattened_qas.append({
          'question': question,
          'right_answer': remove_numbered_bullets(answers.split('\n')[0].strip()),
          'wrong_answers': [remove_numbered_bullets(a.strip()) for a in answers.split('\n')[1:]]
        })
    return flattened_qas
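To make the parsing steps concrete, here is a self-contained sketch that runs the same idea on a toy completion. The `sample` text is made up for illustration (it is not real model output), and `between` is a simplified, hypothetical version of `extract_between_tags` without the fallback logic.

```python
import re

# Toy completion in the same tag format as the two-shot examples (not real model output).
sample = """<Question 1>
Which agency issued the correction to 14 CFR Part 1204?
</Question 1>
<Answers 1>
1. National Aeronautics and Space Administration
2. Federal Aviation Administration
3. Department of Defense
4. National Oceanic and Atmospheric Administration
</Answers 1>"""


def between(tag, text):
    # Non-greedy match of everything between <tag> and </tag>, across newlines.
    return [m.strip() for m in re.findall(rf"<{tag}\s?>(.+?)</{tag}\s?>", text, re.DOTALL)]


question = between("Question 1", sample)[0]
# Strip the leading "1. ", "2. ", ... so the answers can be reshuffled later.
answers = [re.sub(r"^\d+\. ", "", a.strip()) for a in between("Answers 1", sample)[0].split("\n")]
right_answer, wrong_answers = answers[0], answers[1:]  # the prompt asks for the correct answer first
print(question)
print(right_answer, wrong_answers)
```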

We started out with 86 sections after devoting 2 of the original 88 to examples, yielding 86 * 5 = 430 questions.

In [10]:
qs_as = df['qas'].apply(extract_qs_as)
df['questions'] = [[q['question'] for q in qa] for qa in qs_as]
df['right_answers'] = [[q['right_answer'] for q in qa] for qa in qs_as]
df['wrong_answers'] = [[q['wrong_answers'] for q in qa] for qa in qs_as]
qa_df_rows = []
for i, row in df.iterrows():
    for j, q in enumerate(row.questions):
        qa_df_rows.append(row.to_dict() | {'question': q, 'right_answer': row['right_answers'][j], 'wrong_answers_q': row['wrong_answers'][j]})
qa_df = pd.DataFrame(qa_df_rows)
print(len(qa_df))

430


It's a good time to look at some of the questions and answers to make sure they look mostly reasonable.

In [12]:
for i in range(28, 38):
    for c in ['question', 'right_answer', 'wrong_answers_q']:
        print(qa_df.iloc[i][c])

### Establishing Baselines + Quality Control

Let's create an answering prompt that tells Claude to read the material and answer a question about it.

In [13]:
mc_answer_one_chunk_prompt = """\n\nHuman: Please read the following government record closely and then answer the multiple choice question below.
<Government Record>
{chunk}
</Government Record>
Here is the question:
<Question>
{question}
</Question>
Based on the government record above, select the correct answer to the question from the list below and write the corresponding letter (A, B, C, or D) in <Answer></Answer> tags.
<Answers>
{answers}
</Answers>

Assistant: Based on the government record provided, the correct answer to the question is:
"""

Randomize answers and track which one is correct in the 'correct_answer_letter' column.

In [14]:
def randomize_answers(answers_list):
    # Assign a letter A-D randomly to each answer
    shuffled = np.random.permutation(answers_list[:4])
    letters = ['A. ', 'B. ', 'C. ', 'D. ']
    numbered = [letters[i] + answer for i, answer in enumerate(shuffled)]
    s_numbered = sorted(numbered)
    return s_numbered

qa_df['randomized_answers'] = qa_df.apply(lambda row: randomize_answers(row['wrong_answers_q'] + [row['right_answer']]), axis=1)

def pluck_answer_letter(qa_df_row):
    # Find the letter of the correct answer
    answer = qa_df_row['right_answer']
    for ra in qa_df_row['randomized_answers']:
        if ra[3:] == answer:
            return ra[0]

qa_df['correct_answer_letter'] = qa_df.apply(lambda row: pluck_answer_letter(row), axis=1)
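The cell above recovers the correct letter by comparing answer text after shuffling. An alternative, sketched below with the standard library, is to carry a flag through the shuffle so the correct letter is known even if a wrong answer happens to share the right answer's text. `shuffle_with_letter` is a hypothetical helper, not part of the notebook.

```python
import random


def shuffle_with_letter(right_answer, wrong_answers, rng=random):
    """Shuffle the options and return (lettered options, letter of the right answer)."""
    options = [(right_answer, True)] + [(w, False) for w in wrong_answers[:3]]
    rng.shuffle(options)
    letters = "ABCD"
    lettered = [f"{letters[i]}. {text}" for i, (text, _) in enumerate(options)]
    correct_letter = next(letters[i] for i, (_, is_right) in enumerate(options) if is_right)
    return lettered, correct_letter


# Seeded generator so the example is reproducible.
lettered, correct = shuffle_with_letter("212", ["200", "100", "287"], rng=random.Random(0))
print(lettered, correct)
```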

First, we will test Claude's ability to answer the question when it sees the relevant chunk and only the relevant chunk.

In [15]:
qa_df['qa_with_right_chunk_prompt'] = qa_df.apply(lambda row: mc_answer_one_chunk_prompt.format(
    chunk=row['chunk'], question=row['question'], answers=row['randomized_answers']),
    axis=1
) # Populate prompt column

In [None]:
qa_answer_right_chunk = await get_completions_parallel(CLIENT, qa_df['qa_with_right_chunk_prompt'].values, output_col_name='qa_answer_right_chunk')

In [17]:
qa_df = qa_df.merge(pd.DataFrame(qa_answer_right_chunk), left_on='qa_with_right_chunk_prompt', right_on='prompt', suffixes=['', '_x']).drop(columns=['prompt_x'])

Now let's see how many it got right.

In [18]:
def print_results(df, results):
    # Compare answers to the correct letters by position, so a non-contiguous index is safe.
    cs = sum(r == c for r, c in zip(results, df['correct_answer_letter']))
    ics = len(df) - cs
    print("Results:", cs, ics)  # (correct, incorrect)

In [19]:
qa_df['qa_answer_right_chunk'] = [extract_answer(sample) for sample in qa_df['qa_answer_right_chunk'].values]
print_results(qa_df, qa_df['qa_answer_right_chunk'])

Results: 387 43


It got 90% of them right. Now, we'll see how Claude does when, instead of giving Claude the chunk with the answer, we give it some random other chunk. Poor Claude!

In [20]:
shift_val = int(len(qa_df) / 2)
# Rotate the chunk column by half its length so each question is paired with a different chunk.
# np.roll avoids chained assignment, which triggers pandas' SettingWithCopyWarning.
qa_df['shifted_chunk'] = np.roll(qa_df['chunk'].values, shift_val)

In [21]:
qa_df['qa_with_shift_chunk_prompt'] = qa_df.apply(
    lambda row: mc_answer_one_chunk_prompt.format(chunk=row['shifted_chunk'], question=row['question'], answers=row['randomized_answers']),
    axis=1
)

In [None]:
qa_answer_shift_chunk = await get_completions_parallel(CLIENT, qa_df['qa_with_shift_chunk_prompt'].values, output_col_name='qa_answer_shift_chunk')

In [23]:
qa_df = qa_df.merge(pd.DataFrame(qa_answer_shift_chunk), left_on='qa_with_shift_chunk_prompt', right_on='prompt', suffixes=['', '_x']).drop(columns=['prompt_x'])

In [24]:
qa_df['qa_answer_shift_chunk'] = [extract_answer(sample) for sample in qa_df['qa_answer_shift_chunk'].values]
print_results(qa_df, qa_df['qa_answer_shift_chunk'])

Results: 155 275


By chance alone, Claude would be expected to get 25% right. In practice, Claude got 36% right. Just as smart humans like us can guess above chance on a standardized test, so can Claude. That is still a far cry from Claude's 90% accuracy when given the right chunk, so the experiment is meaningful. We'll filter out the questions that Claude answered incorrectly even with the relevant chunk, as those are "too difficult" for testing the impact of long context.

In [25]:
too_hard_qa_df = qa_df[qa_df.correct_answer_letter != qa_df.qa_answer_right_chunk]
qa_df = qa_df[qa_df.correct_answer_letter == qa_df.qa_answer_right_chunk]
len(qa_df)

387
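As a rough check on the wrong-chunk baseline above, we can ask how far 155 correct out of 430 sits from pure guessing. The sketch below uses a normal approximation to the binomial and treats the questions as independent; both are simplifying assumptions, not part of the original experiment.

```python
import math

n, correct, chance = 430, 155, 0.25  # counts from the wrong-chunk run above

expected = n * chance                            # 107.5 correct by guessing alone
std_dev = math.sqrt(n * chance * (1 - chance))   # binomial standard deviation
z = (correct - expected) / std_dev               # distance from chance in standard deviations

print(f"accuracy = {correct / n:.1%}, z = {z:.2f}")
```

The z-score comes out at roughly 5.3, so the above-chance guessing is unlikely to be noise, which is consistent with the idea that some questions can be answered from general knowledge or from cues in the answer options.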

### Test Time

Now for the long context part! We will create long contexts by taking random chunks until we've made a nice big pile of tokens. We will create a different long context for each question. We try two different prompts here: one basic prompt, and one including a "scratchpad" where we ask Claude to pull relevant quotes from the document that may be helpful.

In [26]:
mc_answer_one_chunk_prompt = """\n\nHuman: Please read the following government record closely and then answer the multiple choice question below.
<Government Record>
{chunk}
</Government Record>
Here is the question:
<Question>
{question}
</Question>
Based on the government record above, select the correct answer to the question from the list below and write the corresponding letter (A, B, C, or D) in <Answer></Answer> tags.
<Answers>
{answers}
</Answers>

Assistant: Based on the government record provided, the correct answer to the question is:
"""

In [27]:
mc_answer_one_chunk_prompt_scratchpad = """\n\nHuman: Please read the following government record closely and then answer the multiple choice question below.
<Government Record>
{chunk}
</Government Record>
Now here is the question for you to answer:
<Question>
{question}
</Question>
Pull 2-3 relevant quotes from the record that pertain to the question and write them inside <scratchpad></scratchpad> tags. Then, select the correct answer to the question from the list below and write the corresponding letter (A, B, C, or D) in <Answer></Answer> tags.
<Answers>
{answers}
</Answers>

Assistant:
"""

To create long contexts, we use a technique we call "randomized collage": start with the relevant chunk, concatenate random chunks until we reach the maximum length we want to test on, randomize the chunks, then move the relevant chunk to the desired location in the collage. We will experiment with putting the relevant chunk at the beginning, middle, and end of the context.

In [None]:
def create_long_context(chunk, other_chunks, main_chunk_idx, max_tokens=70000):  # Can also use 95000.
    doc_len = len(tokenizer.encode(chunk))
    distractors = []
    np.random.shuffle(other_chunks)
    i = 0
    # Add distractor chunks until we exceed the context length (or run out of chunks).
    while doc_len < max_tokens and i < len(other_chunks):
        distractors.append(other_chunks[i])
        doc_len += len(tokenizer.encode(other_chunks[i]))
        i += 1
    # Drop the last distractor if it pushed us over the limit.
    if doc_len >= max_tokens:
        distractors = distractors[:-1]
    # Put the relevant chunk in the desired position. It appears exactly once,
    # including when main_chunk_idx is 0 (the old slicing duplicated it in that case).
    chunks_ctx_ordered = distractors[:main_chunk_idx] + [chunk] + distractors[main_chunk_idx:]
    return '\n\n\n\n'.join(chunks_ctx_ordered)

In [29]:
qa_df['long_context_end'] = qa_df.apply(lambda row: create_long_context(row['chunk'], [c for c in chunks if c != row['chunk']], len(chunks)), axis=1)
qa_df['long_context_middle'] = qa_df.apply(lambda row: create_long_context(row['chunk'], [c for c in chunks if c != row['chunk']], 20), axis=1)
qa_df['long_context_beginning'] = qa_df.apply(lambda row: create_long_context(row['chunk'], [c for c in chunks if c != row['chunk']], 0), axis=1)
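The placement logic is easy to check on toy data. The sketch below is a simplified, hypothetical version of the collage (`toy_collage` and `word_count` are illustrative names): it counts whitespace-separated words instead of real tokens, and it stops before exceeding the budget rather than adding a chunk and then dropping it.

```python
import random


def word_count(text):
    # Assumption: whitespace-separated words stand in for tokenizer tokens.
    return len(text.split())


def toy_collage(chunk, other_chunks, main_chunk_idx, max_tokens, rng):
    """Build a list of chunks under max_tokens with `chunk` at main_chunk_idx (clamped by slicing)."""
    pool = list(other_chunks)  # copy so the caller's list is not shuffled in place
    rng.shuffle(pool)
    distractors, total = [], word_count(chunk)
    for c in pool:
        if total + word_count(c) > max_tokens:
            break
        distractors.append(c)
        total += word_count(c)
    return distractors[:main_chunk_idx] + [chunk] + distractors[main_chunk_idx:]


others = [f"filler {i} " * 5 for i in range(20)]  # 20 toy chunks of 10 words each
for name, idx in [("beginning", 0), ("middle", 3), ("end", len(others))]:
    ctx = toy_collage("the relevant chunk", others, idx, max_tokens=75, rng=random.Random(0))
    print(name, ctx.index("the relevant chunk"), len(ctx))
```

Passing an index larger than the number of distractors, as the notebook does with `len(chunks)`, simply places the relevant chunk last.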

In [30]:
# Create prompts for each question/context
qa_df['qa_long_ctx_prompt_end'] = qa_df.apply(lambda row: mc_answer_one_chunk_prompt.format(
    chunk=row['long_context_end'], question=row['question'], answers=row['randomized_answers']),
    axis=1
)

qa_df['qa_long_ctx_prompt_middle'] = qa_df.apply(lambda row: mc_answer_one_chunk_prompt.format(
    chunk=row['long_context_middle'], question=row['question'], answers=row['randomized_answers']),
    axis=1
)

qa_df['qa_long_ctx_prompt_beginning'] = qa_df.apply(lambda row: mc_answer_one_chunk_prompt.format(
    chunk=row['long_context_beginning'], question=row['question'], answers=row['randomized_answers']),
    axis=1
)

Now we'll do another round of sampling for beginning, middle, and end. 

*Note: Each of these cells takes a while to run.* If you're just following along for fun, you probably want to run this only on a few rows of qa_df.

In [31]:
async def sample_from_prompt(exp_name, prompt_col):
    global qa_df
    answers = await get_completions_parallel(CLIENT, qa_df[prompt_col].values, output_col_name=exp_name)
    qa_df = qa_df.merge(pd.DataFrame(answers), left_on=prompt_col, right_on='prompt', suffixes=['', '_x'], how='left').drop(columns=['prompt_x'])
    qa_df[exp_name] = [extract_answer(sample) for sample in qa_df[exp_name].values]

In [None]:
# We reuse this code block throughout to first sample each prompt and get Claude's answer to each question, then analyze the results
# ...and to do this for the relevant chunk being in the beginning, middle, or end.
# Note: for a table with results for each row, see the blog post on Anthropic's website.
# Note: if this block takes unacceptably long for you, you can downsample qa_df.
for position in ['beginning', 'middle', 'end']:
    exp_name = 'qa_answers_long_ctx_' + position
    prompt_col = 'qa_long_ctx_prompt_' + position
    _ = await sample_from_prompt(exp_name, prompt_col)
    print("Results for " + exp_name)
    print_results(qa_df, qa_df[exp_name].values)

Now we'll repeat the experiment, but with Claude having access to a scratchpad in which to put exact quotes from the context.

In [36]:
qa_df['qa_long_ctx_prompt_scratchpad_end'] = qa_df.apply(lambda row: mc_answer_one_chunk_prompt_scratchpad.format(
    chunk=row['long_context_end'], question=row['question'], answers=row['randomized_answers']),
    axis=1
)

qa_df['qa_long_ctx_prompt_scratchpad_middle'] = qa_df.apply(lambda row: mc_answer_one_chunk_prompt_scratchpad.format(
    chunk=row['long_context_middle'], question=row['question'], answers=row['randomized_answers']),
    axis=1
)

qa_df['qa_long_ctx_prompt_scratchpad_beginning'] = qa_df.apply(lambda row: mc_answer_one_chunk_prompt_scratchpad.format(
    chunk=row['long_context_beginning'], question=row['question'], answers=row['randomized_answers']),
    axis=1
)

In [None]:
for position in ['beginning', 'middle', 'end']:
    exp_name = 'qa_answers_long_ctx_scratchpad_' + position
    prompt_col = 'qa_long_ctx_prompt_scratchpad_' + position
    _ = await sample_from_prompt(exp_name, prompt_col)
    print("Results for " + exp_name)
    print_results(qa_df, qa_df[exp_name].values)

Next, we'll try adding some examples of correctly answered multiple-choice questions to the prompt. To start, we'll use some made-up examples that have nothing to do with the government record. We'll test with and without a scratchpad.

In [42]:
mc_answer_lc_with_nongov_examples_prompt = """\n\nHuman: Please read the following government record closely and then answer the multiple choice question below.
<Government Record>
{chunk}
</Government Record>
First, here are two example questions with correct answers.
<Question>
Who was the first president of the United States?
</Question>
<Answers>
A. Thomas Jefferson
B. George Washington
C. Abraham Lincoln
D. John Adams
</Answers>
Here, the correct answer is:
<Answer>
B. George Washington
</Answer>
<Question>
What is the boiling temperature of water, in degrees Fahrenheit?
</Question>
<Answers>
A. 200
B. 100
C. 287
D. 212
</Answers>
Here, the correct answer is:
<Answer>
D. 212
</Answer>
Now, based on the government record you've just read, please answer this question:
<Question>
{question}
</Question>
Select the correct answer to the question from the list below and write the corresponding letter (A, B, C, or D) in <Answer></Answer> tags.
<Answers>
{answers}
</Answers>

Assistant:
"""

In [41]:
mc_answer_lc_with_nongov_examples_prompt_scratchpad = """\n\nHuman: Please read the following government record closely and then answer the multiple choice question below.
<Government Record>
{chunk}
</Government Record>
Based on the government record above, select the correct answer to the question from the list below and write the corresponding letter (A, B, C, or D) in <Answer></Answer> tags.
First, here are two example questions.
<Question>
Who was the first president of the United States?
</Question>
<Answers>
A. Thomas Jefferson
B. George Washington
C. Abraham Lincoln
D. John Adams
</Answers>
Here, the correct answer is:
<Answer>
B. George Washington
</Answer>
<Question>
What is the boiling temperature of water, in degrees Fahrenheit?
</Question>
<Answers>
A. 200
B. 100
C. 287
D. 212
</Answers>
Here, the correct answer is:
<Answer>
D. 212
</Answer>
Now, based on the government record you've just read, please answer this question:
<Question>
{question}
</Question>
Pull 2-3 relevant quotes from the record that pertain to the question and write them inside <scratchpad></scratchpad> tags. Then, select the correct answer to the question from the list below and write the corresponding letter (A, B, C, or D) in <Answer></Answer> tags.
<Answers>
{answers}
</Answers>

Assistant:
"""

In [43]:
# Create prompts, non-scratchpad version
qa_df['qa_long_ctx_prompt_nongov_examples_end'] = qa_df.apply(lambda row: mc_answer_lc_with_nongov_examples_prompt.format(
    chunk=row['long_context_end'], question=row['question'], answers=row['randomized_answers']),
    axis=1
)

qa_df['qa_long_ctx_prompt_nongov_examples_middle'] = qa_df.apply(lambda row: mc_answer_lc_with_nongov_examples_prompt.format(
    chunk=row['long_context_middle'], question=row['question'], answers=row['randomized_answers']),
    axis=1
)

qa_df['qa_long_ctx_prompt_nongov_examples_beginning'] = qa_df.apply(lambda row: mc_answer_lc_with_nongov_examples_prompt.format(
    chunk=row['long_context_beginning'], question=row['question'], answers=row['randomized_answers']),
    axis=1
)

In [None]:
# Get answers and print accuracy.
for position in ['beginning', 'middle', 'end']:
    exp_name = 'qa_long_ctx_answers_nongov_examples_' + position
    prompt_col = 'qa_long_ctx_prompt_nongov_examples_' + position
    _ = await sample_from_prompt(exp_name, prompt_col)
    print("Results for " + exp_name)
    print_results(qa_df, qa_df[exp_name].values)

In [45]:
# Create prompts, with-scratchpad version
qa_df['qa_long_ctx_prompt_nongov_examples_scratchpad_end'] = qa_df.apply(lambda row: mc_answer_lc_with_nongov_examples_prompt_scratchpad.format(
    chunk=row['long_context_end'], question=row['question'], answers=row['randomized_answers']),
    axis=1
)

qa_df['qa_long_ctx_prompt_nongov_examples_scratchpad_middle'] = qa_df.apply(lambda row: mc_answer_lc_with_nongov_examples_prompt_scratchpad.format(
    chunk=row['long_context_middle'], question=row['question'], answers=row['randomized_answers']),
    axis=1
)

qa_df['qa_long_ctx_prompt_nongov_examples_scratchpad_beginning'] = qa_df.apply(lambda row: mc_answer_lc_with_nongov_examples_prompt_scratchpad.format(
    chunk=row['long_context_beginning'], question=row['question'], answers=row['randomized_answers']),
    axis=1
)

In [None]:
# Get answers and print accuracy.
for position in ['beginning', 'middle', 'end']:
    exp_name = 'qa_long_ctx_answers_nongov_examples_scratchpad_' + position
    prompt_col = 'qa_long_ctx_prompt_nongov_examples_scratchpad_' + position
    _ = await sample_from_prompt(exp_name, prompt_col)
    print("Results for " + exp_name)
    print_results(qa_df, qa_df[exp_name].values)

The results show little, if any, improvement. Can we do better by adding "few-shot" examples that are more germane to the task?

The procedure for generating these few-shot examples is as follows. For each question, find its associated chunk, then choose random question-answer pairs that were written about other chunks in the same long context.

We will experiment with using 2 and 5 examples, with and without a scratchpad.
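The selection rule can be sketched on toy data as follows. This is an illustration under stated assumptions rather than the notebook's implementation: `pick_examples` and the toy records are hypothetical, and the explicit check for too few candidates is an addition for clarity.

```python
import random


def pick_examples(target_chunk, long_context, qa_pairs, num_examples, seed=0):
    """Pick random QA pairs whose chunk is in the long context but is not the target chunk."""
    candidates = [
        qa for qa in qa_pairs
        if qa["chunk"] in long_context and qa["chunk"] != target_chunk
    ]
    if len(candidates) < num_examples:
        raise ValueError("Not enough QA pairs from other chunks in this context.")
    # Seeded generator so the same examples are chosen on every run.
    return random.Random(seed).sample(candidates, num_examples)


# Toy data: chunk D is not part of the long context, so it can never be picked.
toy_qas = [
    {"chunk": "chunk A", "question": "Question about A?"},
    {"chunk": "chunk B", "question": "Question about B?"},
    {"chunk": "chunk C", "question": "Question about C?"},
    {"chunk": "chunk D", "question": "Question about D?"},
]
long_context = "chunk A\nchunk B\nchunk C"
examples = pick_examples("chunk A", long_context, toy_qas, num_examples=2)
print([e["question"] for e in examples])
```

Excluding the target chunk matters: an example drawn from the same chunk could leak information about the question being tested.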

In [54]:
# Function to generate a prompt using examples from the context.
def gen_mc_answer_lc_with_examples_prompt(num_examples): 
    examples_section = "some example questions that refer to the government record above, along with correct answers."
    for i in range(num_examples):
        examples_section += """
<Question>
{sample_question""" + str(i+1) + """}
</Question>
<Answers>
{sample_answers""" + str(i+1) + """}
</Answers>
Here, the correct answer is:
<Answer>
{correct_answer""" + str(i+1) + """}
</Answer>"""
    return """\n\nHuman: Please read the following government record closely and then answer the multiple choice question below.
<Government Record>
{chunk}
</Government Record>
First, here are """ + examples_section + """
Now here is the question for you to answer.
<Question>
{question}
</Question>
Select the correct answer to the question from the list below and write the corresponding letter (A, B, C, or D) in <Answer></Answer> tags.
<Answers>
{answers}
</Answers>

Assistant:
"""

In [55]:
# Same as above, but includes scratchpad.
def gen_mc_answer_lc_with_examples_prompt_scratchpad(num_examples): 
    examples_section = "some example questions that refer to the government record above, along with correct answers."
    for i in range(num_examples):
        examples_section += """
<Question>
{sample_question""" + str(i+1) + """}
</Question>
<Answers>
{sample_answers""" + str(i+1) + """}
</Answers>
Here, the correct answer is:
<Answer>
{correct_answer""" + str(i+1) + """}
</Answer>"""
    return """\n\nHuman: Please read the following government record closely and then answer the multiple choice question below.
<Government Record>
{chunk}
</Government Record>
First, here are """ + examples_section + """
Now here is the question for you to answer.
<Question>
{question}
</Question>
Pull 2-3 relevant quotes from the record that pertain to the question and write them inside <scratchpad></scratchpad> tags. Then, select the correct answer to the question from the list below and write the corresponding letter (A, B, C, or D) in <Answer></Answer> tags.
<Answers>
{answers}
</Answers>

Assistant:
"""

In [56]:
# Get examples randomly
def grab_example_qas(long_context_row, long_context_col, qa_df, num_examples=2):
    examples = []
    for i, row in qa_df.sample(frac=1).iterrows():  # Randomize order of questions
        if row['chunk'] in long_context_row[long_context_col] and row['chunk'] != long_context_row.chunk:
            # Examples must come from chunks included in the long context, but not from the chunk the current question is about.
            examples.append({
                'question': row.question, 'answers': row.randomized_answers, 
                'correct_answer': [a for a in row.randomized_answers if row.right_answer in a][0][0]})
        if len(examples) >= num_examples:
            break
    examples_numbered = {}
    for i in range(num_examples):
        examples_numbered['sample_question' + str(i+1)] = examples[i]['question']
        examples_numbered['sample_answers' + str(i+1)] = examples[i]['answers']
        examples_numbered['correct_answer' + str(i+1)] = examples[i]['correct_answer']
    return examples_numbered

In [57]:
def format_for_long_ctx_with_examples(row, chunk_col, long_context_col, qa_df, num_examples=2):
    # Get example QA pairs and plug them into the prompt
    example_qas = grab_example_qas(long_context_row=row, long_context_col=long_context_col, qa_df=qa_df, num_examples=num_examples)
    format_args = {}
    for i in range(1, num_examples+1):
        format_args['sample_question'+str(i)] = example_qas['sample_question'+str(i)] 
        format_args['sample_answers'+str(i)] = example_qas['sample_answers'+str(i)]
        format_args['correct_answer'+str(i)] = example_qas['correct_answer'+str(i)]
    return gen_mc_answer_lc_with_examples_prompt(num_examples).format(
        chunk=row[chunk_col], question=row['question'], answers=row['randomized_answers'],
        **format_args
    )

In [58]:
def format_for_long_ctx_with_examples_scratchpad(row, chunk_col, long_context_col, qa_df, num_examples=2):
    # Same as above, but with scratchpad.
    example_qas = grab_example_qas(long_context_row=row, long_context_col=long_context_col, qa_df=qa_df, num_examples=num_examples)
    format_args = {}
    for i in range(1, num_examples+1):
        # The examples are indexed from 1.
        format_args['sample_question'+str(i)] = example_qas['sample_question'+str(i)] 
        format_args['sample_answers'+str(i)] = example_qas['sample_answers'+str(i)]
        format_args['correct_answer'+str(i)] = example_qas['correct_answer'+str(i)]
    return gen_mc_answer_lc_with_examples_prompt_scratchpad(num_examples).format(
        chunk=row[chunk_col], question=row['question'], answers=row['randomized_answers'],
        **format_args
    )

First, we'll experiment with just 2 examples.

In [None]:
num_examples = 2
# Generate prompts that include examples, have Claude answer questions, print accuracy numbers for (beginning, middle, end)
qa_df[f'long_ctx_with_{num_examples}_examples_prompt_end'] = qa_df.apply(
    lambda row: format_for_long_ctx_with_examples(row, 'long_context_end', 'qa_long_ctx_prompt_end', qa_df, num_examples=num_examples), axis=1)

qa_df[f'long_ctx_with_{num_examples}_examples_prompt_middle'] = qa_df.apply(
    lambda row: format_for_long_ctx_with_examples(row, 'long_context_middle', 'qa_long_ctx_prompt_middle', qa_df, num_examples=num_examples), axis=1)

qa_df[f'long_ctx_with_{num_examples}_examples_prompt_beginning'] = qa_df.apply(
    lambda row: format_for_long_ctx_with_examples(row, 'long_context_beginning', 'qa_long_ctx_prompt_beginning', qa_df, num_examples=num_examples), axis=1)

for position in ['beginning', 'middle', 'end']:
    exp_name = f'long_ctx_with_{num_examples}_examples_answers_' + position
    prompt_col = f'long_ctx_with_{num_examples}_examples_prompt_' + position
    _ = await sample_from_prompt(exp_name, prompt_col)
    print("Results for " + exp_name)
    print_results(qa_df, qa_df[exp_name].values)

Definitely better! What if we increase the number of examples to 5?

In [None]:
num_examples = 5
# Same as above, but with 5 examples
qa_df[f'long_ctx_with_{num_examples}_examples_prompt_end'] = qa_df.apply(
    lambda row: format_for_long_ctx_with_examples(row, 'long_context_end', 'qa_long_ctx_prompt_end', qa_df, num_examples=num_examples), axis=1)

qa_df[f'long_ctx_with_{num_examples}_examples_prompt_middle'] = qa_df.apply(
    lambda row: format_for_long_ctx_with_examples(row, 'long_context_middle', 'qa_long_ctx_prompt_middle', qa_df, num_examples=num_examples), axis=1)

qa_df[f'long_ctx_with_{num_examples}_examples_prompt_beginning'] = qa_df.apply(
    lambda row: format_for_long_ctx_with_examples(row, 'long_context_beginning', 'qa_long_ctx_prompt_beginning', qa_df, num_examples=num_examples), axis=1)

for position in ['beginning', 'middle', 'end']:
    exp_name = f'long_ctx_with_{num_examples}_examples_answers_' + position
    prompt_col = f'long_ctx_with_{num_examples}_examples_prompt_' + position
    _ = await sample_from_prompt(exp_name, prompt_col)
    print("Results for " + exp_name)
    print_results(qa_df, qa_df[exp_name].values)

Now we'll try 2 and 5 examples with a scratchpad.

In [None]:
num_examples = 2
qa_df[f'long_ctx_with_{num_examples}_examples_scratchpad_prompt_end'] = qa_df.apply(
    lambda row: format_for_long_ctx_with_examples_scratchpad(row, 'long_context_end', 'qa_long_ctx_prompt_end', qa_df, num_examples=num_examples), axis=1)

qa_df[f'long_ctx_with_{num_examples}_examples_scratchpad_prompt_middle'] = qa_df.apply(
    lambda row: format_for_long_ctx_with_examples_scratchpad(row, 'long_context_middle', 'qa_long_ctx_prompt_middle', qa_df, num_examples=num_examples), axis=1)

qa_df[f'long_ctx_with_{num_examples}_examples_scratchpad_prompt_beginning'] = qa_df.apply(
    lambda row: format_for_long_ctx_with_examples_scratchpad(row, 'long_context_beginning', 'qa_long_ctx_prompt_beginning', qa_df, num_examples=num_examples), axis=1)

for position in ['beginning', 'middle', 'end']:
    exp_name = f'long_ctx_with_{num_examples}_examples_scratchpad_answers_' + position
    prompt_col = f'long_ctx_with_{num_examples}_examples_scratchpad_prompt_' + position
    _ = await sample_from_prompt(exp_name, prompt_col)
    print("Results for " + exp_name)
    print_results(qa_df, qa_df[exp_name].values)

In [None]:
num_examples = 5
qa_df[f'long_ctx_with_{num_examples}_examples_scratchpad_prompt_end'] = qa_df.apply(
    lambda row: format_for_long_ctx_with_examples_scratchpad(row, 'long_context_end', 'qa_long_ctx_prompt_end', qa_df, num_examples=num_examples), axis=1)

qa_df[f'long_ctx_with_{num_examples}_examples_scratchpad_prompt_middle'] = qa_df.apply(
    lambda row: format_for_long_ctx_with_examples_scratchpad(row, 'long_context_middle', 'qa_long_ctx_prompt_middle', qa_df, num_examples=num_examples), axis=1)

qa_df[f'long_ctx_with_{num_examples}_examples_scratchpad_prompt_beginning'] = qa_df.apply(
    lambda row: format_for_long_ctx_with_examples_scratchpad(row, 'long_context_beginning', 'qa_long_ctx_prompt_beginning', qa_df, num_examples=num_examples), axis=1)

for position in ['beginning', 'middle', 'end']:
    exp_name = f'long_ctx_with_{num_examples}_examples_scratchpad_answers_' + position
    prompt_col = f'long_ctx_with_{num_examples}_examples_scratchpad_prompt_' + position
    _ = await sample_from_prompt(exp_name, prompt_col)
    print("Results for " + exp_name)
    print_results(qa_df, qa_df[exp_name].values)

# Conclusion

- Including a scratchpad always helps.
- Including made-up examples unrelated to the document does not particularly help.
- Including contextual examples does help, and 5 is better than 2.
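To compare conditions side by side, the per-experiment accuracies can be gathered into one table. The sketch below is a minimal illustration on made-up data: `accuracy_table`, the `correct_letter` column, and the toy answer columns are all hypothetical and do not reproduce any results from this notebook.

```python
import pandas as pd


def accuracy_table(df, answer_cols, correct_col):
    """Return the fraction of rows where each answer column matches the correct letter."""
    return pd.Series({col: (df[col] == df[correct_col]).mean() for col in answer_cols})


# Made-up answers for four questions under two hypothetical conditions.
toy = pd.DataFrame({
    "correct_letter": ["A", "B", "C", "D"],
    "no_scratchpad_end": ["A", "B", "D", "D"],
    "scratchpad_end": ["A", "B", "C", "D"],
})
table = accuracy_table(toy, ["no_scratchpad_end", "scratchpad_end"], "correct_letter")
print(table)
```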

We hope you've enjoyed reading through this notebook and that the tips and code it contains are useful to you.