
Benchmarking Local Language Models for Named Entity Recognition

With large language models at the top of the hype cycle, it is an interesting exercise to benchmark locally runnable models on named entity recognition (NER) and compare their speed by running them on a MacBook Pro. This post walks through the process, from selecting the models to evaluating their performance.

This post is by Chris McCabe; you can contact me on Twitter/X.

Models Used

The models used for this benchmark were pulled from the Ollama models page, sorted by popularity, with tool variants and earlier versions removed. Only models up to roughly 70B parameters were used, as the MacBook Pro used for testing has 64GB of RAM. The models tested were codellama (7b, 13b, 70b), qwen2 (0.5b, 1.5b, 7b, 72b), llama3.1 (8b, 70b), mistral, phi3 (latest and 14b), codestral, gemma2 (9b, 27b), codegemma, glm4, nuextract, and moondream; full details are in the results table below.

Test Data

A snippet from the UK register of MPs' declarations of members' benefits was used as the test string. It contains details of payments and royalties received by MPs. The data was pulled from the source page using Beautiful Soup; a sketch of that step follows the test string below.


test_string = """
Payments from Hodder and Stoughton UK, Carmelite House, 50 Victoria Embankment, London EC4Y 0DZ, via United Agents, 12-26 Lexington St, London W1F 0LE: 12 July 2022, received \u00a3439.82 for royalties on book already written. Hours: no additional hours. (Registered 28 July 2022) 10 August 2022, received \u00a3519.69 for royalties on book already written. Hours: no additional hours. (Registered 23 August 2022) 5 October 2022, received \u00a31,771.82 for royalties on book already written. Hours: no additional hours. (Registered 27 October 2022) 14 March 2023, received \u00a3673.11 for royalties on book already written. Hours: no additional hours. (Registered 03 April 2023) 5 April 2023, received \u00a32,590.85 for royalties on book already written. Hours: no additional hours. (Registered 24 April 2023)",
      "Payments from HarperCollins UK, 1 London Bridge St, London SE1 9GF, via Rogers, Coleridge and White Ltd, 20 Powis Mews, London W11 1JN: 30 April 2022, received \u00a3382.03 for royalties on books already written. Hours: no additional hours. (Registered 27 May 2022) 18 October 2022, received \u00a3171.03 for royalties on books already written. Hours: no additional hours. (Registered 27 October 2022) 6 January 2023, received \u00a3510,000 as an advance on an upcoming book yet to be published. Hours: approx. 10 hrs to date. (Registered 12 January 2023) 4 May 2023, received \u00a3402.81 for royalties on books already written. Hours: no additional hours. (Registered 16 May 2023)
"""

Test Cases

Six items were selected as the basis for rudimentary test cases, to check whether each response contains these specific entities; the check itself is a simple substring match, sketched after the list.


test_cases = [
  'Rogers, Coleridge and White Ltd',
  'HarperCollins UK',
  '27 May 2022',
  'Hodder and Stoughton UK',
  'EC4Y 0DZ',
  '27 October 2022'
]
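
The post doesn't spell out the check, but a substring count consistent with the description above would be something like this (the helper name is hypothetical):


def count_test_cases(response_content: str, cases: list) -> int:
    # A test case passes if the expected entity appears verbatim in the response.
    return sum(1 for case in cases if case in response_content)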

Prompt Used

To keep things simple, the prompt used was:

return as json all the entities in the following string:

Execution and Timing

The call to Ollama was made using the basic Python client, with the temperature passed as an option (both settings are compared below) and the call wrapped in a simple timer.


import time

import ollama

prompt = 'return as json all the entities in the following string:'

start = time.time()
response = ollama.chat(
    model=model_name,
    messages=[
        {
            'role': 'user',
            # Prepend the instruction to the test string.
            'content': f'{prompt} {test_string}',
        },
    ],
    options={'temperature': temperature},
)
time_taken = time.time() - start

response_content = response['message']['content']

JSON Validation

To check whether the returned response was valid JSON, a small checker was created.


import json

def is_valid_json(response_content: str) -> bool:
    try:
        json.loads(response_content)
        return True
    except json.JSONDecodeError:
        return False
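
Note that a response wrapped in markdown fences fails this check even though it contains JSON, which is what motivates the extraction step below:


print(is_valid_json('{"entities": []}'))                  # True
print(is_valid_json('```json\n{"entities": []}\n```'))    # False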

Response Processing

A response handler class was created to strip markdown fences and other wrappers from responses, and to avoid exceptions when the entity count was checked.


import json
from typing import Any, Dict, List

class ResponseProcessor:
    @staticmethod
    def extract_json(response_content: str) -> List[Dict[str, Any]]:
        # Case 1: JSON wrapped in markdown fences, e.g. ```json ... ```
        if '```json' in response_content:
            json_start = response_content.index('```json') + len('```json')
            json_end = response_content.rindex('```')
            json_str = response_content[json_start:json_end].strip()
        # Case 2: output prefixed with 'json' and terminated by an end-of-output token
        elif response_content.startswith('json') and '<|end-output|>' in response_content:
            json_start = response_content.index('json') + len('json')
            json_end = response_content.index('<|end-output|>')
            json_str = response_content[json_start:json_end].strip()
        # Case 3: assume the response is bare JSON
        else:
            json_str = response_content

        try:
            entities = json.loads(json_str)
        except json.JSONDecodeError:
            # Unparseable responses count as zero entities rather than raising.
            entities = []

        return entities
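
The Entity Count column in the results table is, as far as the snippets show, just the length of whatever this returns; roughly:


content = response['message']['content']
entities = ResponseProcessor.extract_json(content)
# json.loads can return a dict as well as a list; len() covers both.
entity_count = len(entities)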

Temperature Variations

Every model was run twice, with the temperature set to 0.0 and then 1.0, to see which setting gave better results.


# OllamaChat is a thin wrapper around the ollama.chat call shown earlier
# (defined in the accompanying repo).
for model in ollama.list()['models']:
    parameter_size = model['details']['parameter_size']
    for temperature in [0.0, 1.0]:
        start = time.time()
        chat = OllamaChat(model_name=model['name'], temperature=temperature)
        response = chat.get_response(test_string)
        time_taken = time.time() - start
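
Tying the helpers together, each run can then be recorded as one row of the results table below. The field names here are assumptions; the repo may organise this differently.


# Inside the temperature loop above, with results = [] initialised before the model loop:
content = response['message']['content']
entities = ResponseProcessor.extract_json(content)

results.append({
    'model': model['name'],
    'temperature': temperature,
    'time_taken': round(time_taken, 2),
    'entity_count': len(entities),
    'test_case_count': count_test_cases(content, test_cases),
    'valid_json': is_valid_json(content),
    'parameter_size': parameter_size,
})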

Results and Analysis

| Model | Temperature | Time Taken (s) | Entity Count | Test Case Count | Valid JSON | Parameter Size |
|---|---|---|---|---|---|---|
| codellama:13b | 0.0 | 46.30 | 47 | 6 | True | 13B |
| qwen2:7b | 0.0 | 15.58 | 1 | 6 | True | 7.6B |
| mistral:7.2B | 1.0 | 17.69 | 1 | 4 | True | 7.2B |
| phi3:latest | 1.0 | 5.13 | 2 | 4 | True | 3.8B |
| codestral:22.2B | 0.0 | 60.03 | 2 | 4 | True | 22.2B |
| codestral:22.2B | 1.0 | 10.36 | 1 | 4 | True | 22.2B |
| qwen2:1.5b | 0.0 | 5.16 | 3 | 3 | True | 1.5B |
| qwen2:1.5b | 1.0 | 0.63 | 3 | 3 | True | 1.5B |
| mistral:7.2B | 0.0 | 14.27 | 2 | 3 | True | 7.2B |
| qwen2:72b | 0.0 | 87.58 | 0 | 0 | False | 72.7B |
| qwen2:72b | 1.0 | 72.39 | 0 | 0 | False | 72.7B |
| codellama:70b | 0.0 | 77.14 | 0 | 0 | False | 69B |
| codellama:70b | 1.0 | 22.08 | 0 | 0 | False | 69B |
| codellama:13b | 1.0 | 3.12 | 0 | 0 | False | 13B |
| codellama:7b | 0.0 | 9.17 | 0 | 0 | False | 7B |
| codellama:7b | 1.0 | 2.72 | 0 | 0 | False | 7B |
| qwen2:0.5b | 0.0 | 2.60 | 0 | 0 | False | 494.03M |
| qwen2:0.5b | 1.0 | 1.95 | 0 | 0 | False | 494.03M |
| gemma2:9b | 0.0 | 41.70 | 0 | 0 | False | 9.2B |
| gemma2:9b | 1.0 | 32.40 | 0 | 0 | False | 9.2B |
| glm4:9.4B | 0.0 | 25.73 | 0 | 0 | False | 9.4B |
| glm4:9.4B | 1.0 | 15.27 | 0 | 0 | False | 9.4B |
| phi3:14b | 0.0 | 18.50 | 0 | 0 | False | 14.0B |
| phi3:14b | 1.0 | 9.33 | 0 | 0 | False | 14.0B |
| llama3.1:70b | 0.0 | 107.26 | 0 | 0 | False | 70.6B |
| llama3.1:70b | 1.0 | 82.40 | 0 | 0 | False | 70.6B |
| nuextract:3.8B | 0.0 | 6.20 | 0 | 0 | False | 3.8B |
| nuextract:3.8B | 1.0 | 6.12 | 0 | 0 | False | 3.8B |
| gemma2:27b | 0.0 | 107.83 | 0 | 0 | False | 27.2B |
| gemma2:27b | 1.0 | 79.57 | 0 | 0 | False | 27.2B |
| codegemma:9B | 0.0 | 24.52 | 0 | 0 | False | 9B |
| codegemma:9B | 1.0 | 15.72 | 0 | 0 | False | 9B |
| phi3:latest | 0.0 | 6.99 | 0 | 0 | False | 3.8B |
| qwen2:7b | 1.0 | 12.07 | 0 | 0 | False | 7.6B |
| llama3.1:8.0B | 0.0 | 11.30 | 0 | 0 | False | 8.0B |
| llama3.1:8.0B | 1.0 | 11.91 | 0 | 0 | False | 8.0B |
| moondream:latest | 0.0 | 2.10 | 4 | 0 | True | 1B |
| moondream:latest | 1.0 | 0.02 | 0 | 0 | False | 1B |

Two runs matched all six test cases: codellama:13b and qwen2:7b, both at temperature 0.0, and of those qwen2:7b was roughly three times faster. However, if you check the GitHub repo, you will see the returned JSON is not always structured the way you might expect, even when it parses. I ran this test against every model I had downloaded through Ollama, so the selection includes some models that were already on my machine.

The full run was timed with the shell's time command (the low CPU figure reflects the script mostly waiting on the Ollama server): python3 ollama_test.py 0.27s user 0.13s system 0% cpu 17:51.24 total

Possible Improvements
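
A few directions seem worth trying: prompting with an explicit entity schema (person, organisation, date, and so on), averaging timings over several runs, and using Ollama's built-in JSON mode, which constrains output to valid JSON and should make the fence-stripping in ResponseProcessor unnecessary. A minimal sketch of the JSON-mode variant, with the same prompt as before:


import ollama

response = ollama.chat(
    model='qwen2:7b',
    messages=[{'role': 'user',
               'content': f'return as json all the entities in the following string: {test_string}'}],
    # format='json' constrains decoding so the reply is always parseable JSON.
    format='json',
    options={'temperature': 0.0},
)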

Conclusion

Among the models tested, qwen2:7b at a temperature of 0.0 performed best overall, matching all six test cases in only 15.58 seconds; codellama:13b also matched all six but took three times as long. This experiment highlights the capabilities and limitations of the various models and leaves plenty of room for further optimization and improvements in NER tasks.

By no means am I an expert in using LLMs, but I thought this would be a fun experiment. Everything was run locally on a 2023 MacBook Pro with 64GB of RAM. I am open to ideas for improving speed and performance; you can get in touch on Twitter/X.

GitHub Repository

You can find the complete code and the JSON result files in our GitHub repository, linked at the commit used for this post. Visit the repository to explore the raw data in full.

View the GitHub Repository

Announcement

Data Signal is going all in on the latest technologies and techniques in named entity recognition, aiming for best-in-class performance. We are building a secure REST API around named entity recognition and are offering early previews.

If you are interested, then check out our home page.

Sign Up Now