Benchmarking Local Language Models for Named Entity Recognition
With large language models at the top of the hype cycle, it is an interesting exercise to benchmark locally runnable models on named entity recognition (NER) and compare their speed by running them on a MacBook Pro. This post walks through the process, from selecting the models to evaluating their performance.
This post is by Chris McCabe, and you can contact me on Twitter/X.
Models Used
The models used for this benchmark were pulled from the Ollama models page, sorted by popularity, with tool variants and earlier versions removed. Only models up to 70B parameters were included, as the MacBook Pro used for testing has 64GB of RAM. The models tested include:
- Mistral: 7B
- Phi3: 3.8B, 14B
- glm4: 9B
- codellama: 7B, 13B, 70B
- llama3.1: 8B, 70B
- Gemma2: 9B, 27B
- qwen2: 0.5B, 1.5B, 7B, 72B
Test Data
A snippet from the UK MPs' declarations of members' benefits was used as the test string. The snippet contains information about payments and royalties received by an MP. The data was pulled from the source page using Beautiful Soup.
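The scraping code itself is not shown in this post; below is a rough sketch of how the text could be pulled with requests and Beautiful Soup. The URL and the paragraph selector are placeholders rather than the ones actually used.
# Rough sketch only; the URL and selector below are placeholders, not the actual source.
import requests
from bs4 import BeautifulSoup

SOURCE_URL = 'https://example.org/register-of-members-interests'

html = requests.get(SOURCE_URL, timeout=30).text
soup = BeautifulSoup(html, 'html.parser')

# Join the declaration paragraphs into a single string to feed to the models.
scraped_text = '\n'.join(p.get_text(strip=True) for p in soup.find_all('p'))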
test_string = """
Payments from Hodder and Stoughton UK, Carmelite House, 50 Victoria Embankment, London EC4Y 0DZ, via United Agents, 12-26 Lexington St, London W1F 0LE: 12 July 2022, received \u00a3439.82 for royalties on book already written. Hours: no additional hours. (Registered 28 July 2022) 10 August 2022, received \u00a3519.69 for royalties on book already written. Hours: no additional hours. (Registered 23 August 2022) 5 October 2022, received \u00a31,771.82 for royalties on book already written. Hours: no additional hours. (Registered 27 October 2022) 14 March 2023, received \u00a3673.11 for royalties on book already written. Hours: no additional hours. (Registered 03 April 2023) 5 April 2023, received \u00a32,590.85 for royalties on book already written. Hours: no additional hours. (Registered 24 April 2023)
Payments from HarperCollins UK, 1 London Bridge St, London SE1 9GF, via Rogers, Coleridge and White Ltd, 20 Powis Mews, London W11 1JN: 30 April 2022, received \u00a3382.03 for royalties on books already written. Hours: no additional hours. (Registered 27 May 2022) 18 October 2022, received \u00a3171.03 for royalties on books already written. Hours: no additional hours. (Registered 27 October 2022) 6 January 2023, received \u00a3510,000 as an advance on an upcoming book yet to be published. Hours: approx. 10 hrs to date. (Registered 12 January 2023) 4 May 2023, received \u00a3402.81 for royalties on books already written. Hours: no additional hours. (Registered 16 May 2023)
"""
Test Cases
Six items were selected as the basis for rudimentary test cases to see if the responses contain these specific entities.
test_cases = [
'Rogers, Coleridge and White Ltd',
'HarperCollins UK',
'27 May 2022',
'Hodder and Stoughton UK',
'EC4Y 0DZ',
'27 October 2022'
]
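The results table further down includes a test-case count, but the matching code is not shown in the post. A minimal sketch, assuming each expected entity is simply checked as a substring of the model's raw response (the repo may instead match against the parsed entity values):
def count_test_case_hits(response_content: str, cases: list[str]) -> int:
    # Assumed implementation: count how many expected entities appear verbatim in the response.
    return sum(1 for case in cases if case in response_content)

# e.g. count_test_case_hits(response['message']['content'], test_cases)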
Prompt Used
To keep things simple, the prompt used was:
return as json all the entities in the following string:
Execution and Timing
The call to Ollama was made using a basic Python snippet, with the temperature passed as an option so that different values could be compared for quality of results.
import ollama
import time

prompt = 'return as json all the entities in the following string: '

start = time.time()
# Make the request, prepending the prompt to the text being analysed
response = ollama.chat(
    model=model_name,
    messages=[
        {
            'role': 'user',
            'content': prompt + test_string,
        },
    ],
    options={'temperature': temperature},
)
time_taken = time.time() - start
JSON Validation
To check if the returned response was valid JSON, a JSON checker method was created.
import json

@staticmethod
def is_valid_json(response_content: str) -> bool:
    # Defined as a static method on the response-handling class; returns True if the body parses as JSON.
    try:
        json.loads(response_content)
        return True
    except json.JSONDecodeError:
        return False
Response Processing
A response handler class was created to filter responses and prevent exceptions when the entity count was checked.
import json
from typing import Any, Dict, List

class ResponseProcessor:
    @staticmethod
    def extract_json(response_content: str) -> List[Dict[str, Any]]:
        # Strip a Markdown ```json fence if the model wrapped its output in one
        if '```json' in response_content and '```' in response_content:
            json_start = response_content.index('```json') + len('```json')
            json_end = response_content.rindex('```')
            json_str = response_content[json_start:json_end].strip()
        # Some models prefix the output with "json" and close with an end-of-output token
        elif response_content.startswith('json') and '<|end-output|>' in response_content:
            json_start = response_content.index('json') + len('json')
            json_end = response_content.index('<|end-output|>')
            json_str = response_content[json_start:json_end].strip()
        else:
            json_str = response_content
        # Fall back to an empty list so counting entities never raises
        try:
            entities = json.loads(json_str)
        except json.JSONDecodeError:
            entities = []
        return entities
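As an example, a hypothetical response wrapped in a Markdown code fence is unwrapped before parsing:
raw = '```json\n[{"entity": "HarperCollins UK", "type": "ORG"}]\n```'
entities = ResponseProcessor.extract_json(raw)
print(entities)  # [{'entity': 'HarperCollins UK', 'type': 'ORG'}]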
Temperature Variations
All models were iterated through with the temperature set to 0.0 and 1.0 to see which value would give better results.
# Run every locally available model at both temperatures, timing each call
for model in ollama.list()['models']:
    parameter_size = model['details']['parameter_size']
    for temperature in [0.0, 1.0]:
        start = time.time()
        chat = OllamaChat(model_name=model['name'], temperature=temperature)
        response = chat.get_response(test_string)
        time_taken = time.time() - start
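The OllamaChat wrapper used above lives in the repo rather than in this post; here is a minimal sketch of what it might look like, based only on how it is called here, so the exact implementation may differ:
class OllamaChat:
    # Assumed shape of the wrapper: hold the model name and temperature, and
    # forward a single user message (prompt + text) to ollama.chat.
    def __init__(self, model_name: str, temperature: float):
        self.model_name = model_name
        self.temperature = temperature

    def get_response(self, text: str) -> str:
        prompt = 'return as json all the entities in the following string: '
        response = ollama.chat(
            model=self.model_name,
            messages=[{'role': 'user', 'content': prompt + text}],
            options={'temperature': self.temperature},
        )
        return response['message']['content']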
Results and Analysis
| Model | Temperature | Time Taken (s) | Entity Count | Test Case Count | Valid JSON | Parameter Size |
|---|---|---|---|---|---|---|
| codellama:13b | 0.0 | 46.30 | 47 | 6 | True | 13B |
| qwen2:7b | 0.0 | 15.58 | 1 | 6 | True | 7.6B |
| mistral:7.2B | 1.0 | 17.69 | 1 | 4 | True | 7.2B |
| phi3:latest | 1.0 | 5.13 | 2 | 4 | True | 3.8B |
| codestral:22.2B | 0.0 | 60.03 | 2 | 4 | True | 22.2B |
| codestral:22.2B | 1.0 | 10.36 | 1 | 4 | True | 22.2B |
| qwen2:1.5b | 0.0 | 5.16 | 3 | 3 | True | 1.5B |
| qwen2:1.5b | 1.0 | 0.63 | 3 | 3 | True | 1.5B |
| mistral:7.2B | 0.0 | 14.27 | 2 | 3 | True | 7.2B |
| qwen2:72b | 0.0 | 87.58 | 0 | 0 | False | 72.7B |
| qwen2:72b | 1.0 | 72.39 | 0 | 0 | False | 72.7B |
| codellama:70b | 0.0 | 77.14 | 0 | 0 | False | 69B |
| codellama:70b | 1.0 | 22.08 | 0 | 0 | False | 69B |
| codellama:13b | 1.0 | 3.12 | 0 | 0 | False | 13B |
| codellama:7b | 0.0 | 9.17 | 0 | 0 | False | 7B |
| codellama:7b | 1.0 | 2.72 | 0 | 0 | False | 7B |
| qwen2:0.5b | 0.0 | 2.60 | 0 | 0 | False | 494.03M |
| qwen2:0.5b | 1.0 | 1.95 | 0 | 0 | False | 494.03M |
| gemma2:9b | 0.0 | 41.70 | 0 | 0 | False | 9.2B |
| gemma2:9b | 1.0 | 32.40 | 0 | 0 | False | 9.2B |
| glm4:9.4B | 0.0 | 25.73 | 0 | 0 | False | 9.4B |
| glm4:9.4B | 1.0 | 15.27 | 0 | 0 | False | 9.4B |
| phi3:14b | 0.0 | 18.50 | 0 | 0 | False | 14.0B |
| phi3:14b | 1.0 | 9.33 | 0 | 0 | False | 14.0B |
| llama3.1:70b | 0.0 | 107.26 | 0 | 0 | False | 70.6B |
| llama3.1:70b | 1.0 | 82.40 | 0 | 0 | False | 70.6B |
| nuextract:3.8B | 0.0 | 6.20 | 0 | 0 | False | 3.8B |
| nuextract:3.8B | 1.0 | 6.12 | 0 | 0 | False | 3.8B |
| gemma2:27b | 0.0 | 107.83 | 0 | 0 | False | 27.2B |
| gemma2:27b | 1.0 | 79.57 | 0 | 0 | False | 27.2B |
| codegemma:9B | 0.0 | 24.52 | 0 | 0 | False | 9B |
| codegemma:9B | 1.0 | 15.72 | 0 | 0 | False | 9B |
| phi3:latest | 0.0 | 6.99 | 0 | 0 | False | 3.8B |
| qwen2:7b | 1.0 | 12.07 | 0 | 0 | False | 7.6B |
| llama3.1:8.0B | 0.0 | 11.30 | 0 | 0 | False | 8.0B |
| llama3.1:8.0B | 1.0 | 11.91 | 0 | 0 | False | 8.0B |
| moondream:latest | 0.0 | 2.10 | 4 | 0 | True | 1B |
| moondream:latest | 1.0 | 0.02 | 0 | 0 | False | 1B |
If we are looking for the fastest model that also returned all six test-case entities, qwen2:7b comes out ahead. However, if you check the GitHub repo, you will see that its results are not always formatted the way we expected. I ran this test against every model I had downloaded through Ollama, so a few models that were already on my machine appear in the results even though they are not in the list above.
Timing the full run with the shell's time command gave: python3 ollama_test.py 0.27s user 0.13s system 0% cpu 17:51.24 total
Possible Improvements
- Expand Model Range: Testing a broader range of models, including those with more than 70B parameters, on machines with higher RAM capacity.
- Advanced Metrics: Incorporating additional metrics such as precision, recall, and F1-score for a more comprehensive evaluation of NER performance (a starting point is sketched after this list).
- Diverse Datasets: Using more varied datasets to assess model performance across different text types and domains.
- Better Prompts: The newer Llama models should do better, but it remains an open question which prompts would reliably generate well-structured responses.
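As a starting point for the metrics item above, here is a minimal sketch of precision, recall, and F1 computed against a hand-labelled gold set. It uses exact string matching; a proper NER evaluation would also compare spans and entity types.
def precision_recall_f1(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    # Exact-match scoring: a predicted entity only counts if it matches a gold entity verbatim.
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# e.g. precision_recall_f1({'HarperCollins UK', '27 May 2022'}, set(test_cases))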
Conclusion
Among the models tested, qwen2:7b with a temperature of 0.0 performed best overall, taking only 15.58 seconds and matching all six test cases. This experiment highlights the capabilities and limitations of various models and paves the way for further optimization and improvements in NER tasks.
By no means am I an expert in using LLMs, but I thought this would be a fun experiment. All these were run on my local machine, which is a 2023 MacBook Pro with 64GB RAM. I am open to new ideas to improve speed and performance. You can get in touch on Twitter/X.
GitHub Repository
You can find the complete code and the JSON result files in the GitHub repository, at the commit used for this post. Visit the repository to explore the data in more detail.
Announcement
Data Signal is going all in on the latest technologies and techniques in named entity recognition, aiming for best-in-class performance. We are building a secure REST API around named entity recognition and are offering early previews.
If you are interested, check out our home page.