Evaluating Claude 3.5, GPT-4o and Gemini 1.5 Pro: How Well Do They Extract Information and Count Items?
Testing LLMs for information extraction and item counting
Frontier models now have longer context windows, enabling users to contextualize interactions by uploading their own documents. But how good are they at extracting information from those documents? In this test, we evaluate three leading LLMs: Anthropic’s Claude 3.5 Sonnet, OpenAI’s GPT-4o and Google’s Gemini 1.5 Pro.
Method
The three LLMs were tasked with extracting information from a document containing a list of job titles. Each LLM had to identify the distinct job titles, group them accordingly and determine the count for each title. The prompt used for this task is as follows:
Analyze the attached document containing job titles. For each unique job title:
1. Count the number of employees
- Think through this step by step; show your steps.
- Find and count rows for each job title (1 job title = 1 employee).
- Sum the counts to determine the total number of employees for each job title.
2. Present the results in a table format with these columns:
- No (serial number)
- Job Title (Example: Chief Technology Officer, Manager, etc.)
- Number of Employees
You must:
- Double-check each row individually.
- Use a systematic method (e.g., striking out counted rows).
- Be extra cautious with non-consecutive, similar roles.
- Think through step by step, as accuracy is top priority.
To verify the accuracy of your output:
1. Add up the numbers in the ‘Number of Employees’ column in your table.
2. Show your step-by-step calculation.
3. The total should match the attached document. If not, review the document again for any missed entries.
The LLMs were tested on two dimensions:
Listing format: one list in a somewhat randomized, non-alphabetical order, and another in alphabetical order
List length: four lists with 100, 75, 50 and 25 job titles
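Lists with these two properties are easy to generate programmatically. Below is a minimal sketch in Python; the title pool and helper function are hypothetical stand-ins, not the actual test data:

```python
import random

# Hypothetical pool of job titles; the real test documents are not reproduced here.
TITLE_POOL = [
    "Chief Technology Officer", "Manager", "Software Engineer",
    "Data Analyst", "Product Manager", "Accountant",
]

def make_list(n: int, alphabetical: bool, seed: int = 0) -> list[str]:
    """Return n job-title rows, either alphabetically sorted or shuffled."""
    rng = random.Random(seed)
    rows = [rng.choice(TITLE_POOL) for _ in range(n)]  # titles repeat across rows
    return sorted(rows) if alphabetical else rows

# One list per tested combination: two formats x four lengths.
test_lists = {
    (fmt, n): make_list(n, alphabetical=(fmt == "alphabetical"))
    for fmt in ("alphabetical", "random")
    for n in (100, 75, 50, 25)
}
```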
Below are examples of an alphabetically ordered list (left) and a non-alphabetically ordered list (right):
Each LLM was tested on alphabetical lists with 100, 75, 50 and 25 records, followed by non-alphabetical lists with 100, 75, 50 and 25 records. Results were captured after a single prompt was sent to each LLM.
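For illustration, here is a minimal sketch of how such a single-shot run could be scripted, using the Anthropic Python SDK as the example; the model ID, file name and token limit are illustrative, and the tests could equally be run through the models’ chat interfaces with a file attachment:

```python
import anthropic

# PROMPT holds the full prompt shown above.
PROMPT = "Analyze the attached document containing job titles. ..."

with open("job_titles_random_100.txt") as f:  # hypothetical test file
    document = f.read()

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # illustrative model ID
    max_tokens=4096,
    messages=[{
        "role": "user",
        # Inline the list, since the messages API takes text rather than file uploads.
        "content": f"{PROMPT}\n\n<document>\n{document}\n</document>",
    }],
)
print(response.content[0].text)  # step-by-step reasoning plus the final table
```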
Results
The best performer was Claude 3.5 Sonnet, achieving 100% accuracy in information extraction in 7 of the 8 tests. It failed only on the non-alphabetical list with 100 records.
GPT-4o achieved 100% accuracy only for lists of 50 records and below; it fell short on the 75- and 100-record lists, both alphabetical and non-alphabetical. Gemini 1.5 Pro performed the worst of the three: during the test runs, it achieved 100% accuracy only on the alphabetical list of 50 records.
While all three LLMs accurately extracted every job title listed in the source document, they struggled to determine the correct count for each title.
The table below summarizes the test results, where ✅ indicates 100% accuracy and ❌ indicates less than 100% accuracy:
The table below shows the number of errors made by the LLMs during testing, specifically related to incorrect counts of job titles.
Key Takeaways
Even with long context windows, such as Gemini 1.5 Pro’s 2 million input tokens, LLMs have difficulty extracting and processing long lists to perform item counts. This task, which is trivial in a spreadsheet, appears to be hit-or-miss for LLMs.
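For comparison, the deterministic version of this task takes only a few lines of code. A minimal sketch, assuming one job title per line in the input file (the file name is illustrative):

```python
from collections import Counter

# Ground truth: read one job title per line and count occurrences exactly.
with open("job_titles_random_100.txt") as f:
    titles = [line.strip() for line in f if line.strip()]

counts = Counter(titles)
for no, (title, n) in enumerate(sorted(counts.items()), start=1):
    print(f"{no}\t{title}\t{n}")  # No | Job Title | Number of Employees

# The same verification step the prompt asks the LLM to perform:
# per-title counts must sum to the total number of rows.
assert sum(counts.values()) == len(titles)
```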
As more companies connect LLMs to internal data, they must define an acceptable error tolerance for LLM output. For companies with low tolerance for errors, higher output accuracy can be pursued through a combination of model fine-tuning, a retrieval-augmented generation (RAG) setup, and advanced prompting techniques.