How not to Evaluate NER Systems

nlp
hnbooks
ner
Published

January 16, 2023

A typical way to evaluate NER (Named Entity Recognition) systems is to look at the F1 score. However, this is a bad idea, as argued in Chris Manning’s 2006 blog post Doing Named Entity Recognition? Don’t optimize for F1. The F1 score penalises a partial match twice (once as a false negative and once as a false positive), but in many cases a partial match is a useful result and may make the overall system the NER model is part of better.

I’ve been building a system for surfacing book mentions in HackerNews posts. The core idea is to count all the mentions of books and show the posts that mention them. This requires finding the books (via the NER system), linking them to a knowledge base (such as Open Library), and then aggregating them. The NER system doesn’t need to be perfect: some false negatives are acceptable, and some false positives will be filtered out because they can’t be linked.
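As a rough sketch, the pipeline looks something like this (the helper names and signatures here are hypothetical, not the real project code):

from collections import Counter

def count_book_mentions(posts, nlp, link_to_open_library):
    # NER -> entity linking -> aggregation over a list of HackerNews posts
    counts = Counter()
    for post in posts:
        doc = nlp(post['text'])
        for ent in doc.ents:
            if ent.label_ != 'WORK_OF_ART':
                continue
            # link_to_open_library is assumed to return None when no match is found,
            # which is where many false positives naturally drop out
            book = link_to_open_library(ent.text)
            if book is not None:
                counts[book] += 1
    return counts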

This article will look at some example data from this project and demonstrate why the F1 score isn’t ideal, and explore some alternatives.

The data

The data I’m using is a set of manually annotated HackerNews posts (to find out more about the methodology, look at my other posts on this or the code), annotated with every PERSON (the name of a person) and WORK_OF_ART (the name of a book, album, artwork, movie, etc.) span.

These were annotated with Prodigy and then exported to this JSON lines file:

  • text is the cleaned text of the post (e.g. with HTML tags removed)
  • meta.id is the HackerNews comment id (you can view the original at https://news.ycombinator.com/item?id={meta.id})
  • tokens are the SpaCy tokens from the Prodigy annotation, using the English tokenizer
  • spans contains the annotations:
    • token_start, token_end: the span range in terms of tokens (inclusive)
    • start, end: the span range in terms of characters
    • text is the plain text of the span
    • label is the category label
    • source indicates where the label came from; I corrected labels from en_core_web_trf, and this value means this particular label came from that model
  • answer: should be accept if the example is correct
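With those fields in mind, the annotations can be read with the standard json module (the file name below is a placeholder for wherever the Prodigy export was saved):

import json

# Placeholder path: point this at the exported JSON lines file
with open('hn_book_annotations.jsonl') as f:
    data = [json.loads(line) for line in f]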
data[0]
{'text': 'Second this, Becoming Steve Jobs is the superior book.',
 'meta': {'id': 29022116},
 'tokens': [{'text': 'Second', 'start': 0, 'end': 6, 'id': 0, 'ws': True},
  {'text': 'this', 'start': 7, 'end': 11, 'id': 1, 'ws': False},
  {'text': ',', 'start': 11, 'end': 12, 'id': 2, 'ws': True},
  {'text': 'Becoming', 'start': 13, 'end': 21, 'id': 3, 'ws': True},
  {'text': 'Steve', 'start': 22, 'end': 27, 'id': 4, 'ws': True},
  {'text': 'Jobs', 'start': 28, 'end': 32, 'id': 5, 'ws': True},
  {'text': 'is', 'start': 33, 'end': 35, 'id': 6, 'ws': True},
  {'text': 'the', 'start': 36, 'end': 39, 'id': 7, 'ws': True},
  {'text': 'superior', 'start': 40, 'end': 48, 'id': 8, 'ws': True},
  {'text': 'book', 'start': 49, 'end': 53, 'id': 9, 'ws': False},
  {'text': '.', 'start': 53, 'end': 54, 'id': 10, 'ws': True}],
 'spans': [{'token_start': 3,
   'token_end': 5,
   'start': 13,
   'end': 32,
   'text': 'Becoming Steve Jobs',
   'label': 'WORK_OF_ART',
   'source': 'en_core_web_trf',
   'input_hash': 1983767390}],
 'answer': 'accept'}

Most answers were accepted; some were rejected because bad tokenization made them impossible to annotate, and a few edge cases were ignored.

from collections import Counter

Counter(d['answer'] for d in data)
Counter({'accept': 305, 'reject': 12, 'ignore': 8})

Filter only to accepted data

data = [d for d in data if d['answer'] == 'accept']

Making predictions with SpaCy

The SpaCy en_core_web_trf model provides a good baseline; let’s make predictions with it to score against the ground truth annotations.

Use the GPU when it’s available to speed things up.

import spacy
from spacy.tokens import Doc, Span

spacy.prefer_gpu()
True

Convert every annotation into a SpaCy Doc to make predictions on.

This is straightforward, but we need to be careful: Prodigy span ends are inclusive, while SpaCy Span ends are exclusive.

def annotation_to_doc(vocab, annotation, set_ents=True):
    doc = Doc(
        vocab=vocab,
        words=[token['text'] for token in annotation['tokens']],
        spaces=[token['ws'] for token in annotation['tokens']]
    )
    
    spans = [Span(doc=doc,
                  start=span['token_start'],
                  # N.B. Off by one due to Prodigy including the end but SpaCy excluding it
                  end=span['token_end'] + 1, 
                  label=span['label'])
             for span in annotation['spans']]
    
    if set_ents:
        doc.set_ents(spans)
    
    return doc

Use en_core_web_trf, an English Transformer model that predicts PERSON, WORK_OF_ART and many other named entity types.

nlp = spacy.load('en_core_web_trf')

Convert all the annotations into docs

docs = [annotation_to_doc(nlp.vocab, d) for d in data]
len(docs)
305

An example annotated document.

from spacy import displacy

displacy.render(docs[0], style='ent')
Second this, Becoming Steve Jobs WORK_OF_ART is the superior book.

Run these through the SpaCy model to make our predictions.

%%time
preds = list(nlp.pipe(annotation_to_doc(nlp.vocab, d, set_ents=False) for d in data))
CPU times: user 3.76 s, sys: 116 ms, total: 3.88 s
Wall time: 3.94 s

An example prediction; it got the Work of Art and an additional ORDINAL.

displacy.render(preds[0], style='ent')
Second ORDINAL this, Becoming Steve Jobs WORK_OF_ART is the superior book.

Getting BILOU tags

For evaluating NER the tags generally need to be in some standard form like Inside-Outside-Beginning (IOB) (also known as BIO).

The function below converts them to the equivalent Beginning-Inside-Last-Outside-Unit (BILOU, also known as IOBES or BMEWO) scheme, because SpaCy has a handy helper to produce it.
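To see the difference, here is how the first document would be tagged under each scheme (the BIO tags are written out by hand for illustration; the BILOU tags are exactly what SpaCy produces later on):

# BIO/IOB tags for "Second this, Becoming Steve Jobs is the superior book."
['O', 'O', 'O', 'B-WORK_OF_ART', 'I-WORK_OF_ART', 'I-WORK_OF_ART', 'O', 'O', 'O', 'O', 'O']

# BILOU tags for the same sentence: the last token of the span becomes L-,
# and a single-token span would get U- instead of B-
['O', 'O', 'O', 'B-WORK_OF_ART', 'I-WORK_OF_ART', 'L-WORK_OF_ART', 'O', 'O', 'O', 'O', 'O']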

from spacy.training import offsets_to_biluo_tags

def get_biluo(doc, include_labels=None):
    # Convert the doc's entities into BILOU tags, optionally keeping only some labels
    if include_labels is None:
        include_labels = [ent.label_ for ent in doc.ents]

    entities = [(ent.start_char, ent.end_char, ent.label_)
                for ent in doc.ents
                if ent.label_ in include_labels]
    return offsets_to_biluo_tags(doc, entities)

Filter to the entities of interest in the annotation.

ENTS = ['WORK_OF_ART', 'PERSON']

So this

displacy.render(docs[0], style='ent')
Second this, Becoming Steve Jobs WORK_OF_ART is the superior book.

Becomes

get_biluo(docs[0], ENTS)
['O',
 'O',
 'O',
 'B-WORK_OF_ART',
 'I-WORK_OF_ART',
 'L-WORK_OF_ART',
 'O',
 'O',
 'O',
 'O',
 'O']

It’s clearer when the tokens and labels are lined up

import pandas as pd

pd.DataFrame({'token': [token.text for token in docs[0]],
              'label': get_biluo(docs[0], ENTS)}).T
0 1 2 3 4 5 6 7 8 9 10
token Second this , Becoming Steve Jobs is the superior book .
label O O O B-WORK_OF_ART I-WORK_OF_ART L-WORK_OF_ART O O O O O

Now extract the ground truth and predictions in BILOU format from the SpaCy docs.

y_true = [get_biluo(doc, ENTS) for doc in docs]
y_pred = [get_biluo(doc, ENTS) for doc in preds]

The tag counts in the ground truth documents seem plausible.

Counter(l for ls in y_true for l in ls)
Counter({'O': 6038,
         'B-WORK_OF_ART': 72,
         'I-WORK_OF_ART': 162,
         'L-WORK_OF_ART': 72,
         'U-PERSON': 26,
         'B-PERSON': 50,
         'L-PERSON': 50,
         'I-PERSON': 16,
         'U-WORK_OF_ART': 11})

And in the predictions.

Counter(l for ls in y_pred for l in ls)
Counter({'O': 6108,
         'B-WORK_OF_ART': 57,
         'I-WORK_OF_ART': 122,
         'L-WORK_OF_ART': 57,
         'U-PERSON': 28,
         'B-PERSON': 51,
         'L-PERSON': 51,
         'I-PERSON': 16,
         'U-WORK_OF_ART': 7})

Seqeval

The seqeval library is a well-tested library for calculating standard classification metrics such as F1, recall, and precision on sequence data.

from seqeval.metrics import classification_report, f1_score, precision_score, recall_score

from seqeval.scheme import BILOU

The model has an F1 score of about 89%.

f1_score(y_true, y_pred, mode='strict', scheme=BILOU)
0.8874172185430464

But the full classification report tells us a lot more:

  • The metrics are much worse on WORK_OF_ART, which is the most important part of our system
  • Precision is higher than recall for WORK_OF_ART, and precision is likely more important in our use case
print(classification_report(y_true, y_pred, digits=3, mode='strict', scheme=BILOU))
              precision    recall  f1-score   support

      PERSON      0.949     0.987     0.968        76
 WORK_OF_ART      0.922     0.711     0.803        83

   micro avg      0.937     0.843     0.887       159
   macro avg      0.936     0.849     0.885       159
weighted avg      0.935     0.843     0.882       159

But what about partial matches? These count against both precision and recall (so a partial match scores worse than predicting nothing at all), but for our use case a partial match is often still good enough to link the entity.
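To see why a partial match scores worse than no prediction, here is a toy calculation (illustrative numbers, not from this dataset):

# Suppose there are 10 gold entities and the model predicts 9 of them exactly,
# plus one span that only partially overlaps the tenth gold entity.
# A strict scorer counts that prediction as a false positive *and* leaves the
# tenth gold entity as a false negative.
tp, fp, fn = 9, 1, 1
print(tp / (tp + fp), tp / (tp + fn))  # precision 0.9, recall 0.9

# If the model had predicted nothing for that entity instead, the false positive
# disappears and strict precision goes up (recall is unchanged).
tp, fp, fn = 9, 0, 1
print(tp / (tp + fp), tp / (tp + fn))  # precision 1.0, recall 0.9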

Let’s look more closely at ways to evaluate NER systems.

Nervaluate

David Batista has an excellent blog post on Named Entity Evaluation. In short, the F1 score treats NER like a binary classification problem, but it isn’t one: there are lots of ways to get an entity almost right.

This has been implemented in the nervaluate Python library.

from nervaluate import Evaluator

evaluator = Evaluator(y_true, y_pred, ENTS, loader='list')

results, results_by_tag = evaluator.evaluate()

To make the results a bit easier to visualise, we’re going to switch the rows and columns.

from collections import defaultdict

def flip_nested_dict(dd):
    result = defaultdict(dict)
    for k1, d in dd.items():
        for k2, v in d.items():
            result[k2][k1] = v
    return dict(result)

The types are, from David Batista’s post (a small worked example follows the list):

  • Strict: exact boundary surface string match and entity type
  • Exact: exact boundary match over the surface string, regardless of the type
  • Partial: partial boundary match over the surface string, regardless of the type
  • Type: some overlap between the system tagged entity and the gold annotation is required
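As a concrete, hand-worked illustration (not output from nervaluate): suppose the gold annotation is Becoming Steve Jobs tagged WORK_OF_ART and the model predicts only Steve Jobs, also tagged WORK_OF_ART. The four schemes score that prediction differently:

# Hand-worked illustration, not nervaluate output
gold = {'text': 'Becoming Steve Jobs', 'label': 'WORK_OF_ART'}
pred = {'text': 'Steve Jobs', 'label': 'WORK_OF_ART'}  # right type, boundaries only overlap

outcome = {
    'strict':   'incorrect',  # boundaries and type must both match exactly
    'exact':    'incorrect',  # boundaries must match exactly (type is ignored)
    'partial':  'partial',    # overlapping boundaries count as a partial match
    'ent_type': 'correct',    # the spans overlap and the type matches
}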

Strict is the same scheme seqeval uses, and its scores match the micro average.

pd.DataFrame(flip_nested_dict(results))
correct incorrect partial missed spurious possible actual precision recall f1
ent_type 138 2 0 19 3 159 143 0.965035 0.867925 0.913907
partial 134 0 6 19 3 159 143 0.958042 0.861635 0.907285
strict 134 6 0 19 3 159 143 0.937063 0.842767 0.887417
exact 134 6 0 19 3 159 143 0.937063 0.842767 0.887417

Looking by entity type is even more revealing: for WORK_OF_ART we have very high precision on the partial matches, suggesting the model could actually be better than it first appears, with a precision closer to 94% than the strict 89%.

Note there’s a discrepancy here: the strict F1 for WORK_OF_ART is 79.2%, whereas seqeval gave 80.3%. This is because seqeval ignores the other types of tags when evaluating at a tag level, but nervaluate includes them.

from IPython.display import display

for tag, tag_results in results_by_tag.items():
    display(pd.DataFrame(flip_nested_dict(tag_results)).style.set_caption(tag))
WORK_OF_ART
  correct incorrect partial missed spurious possible actual precision recall f1
ent_type 63 2 0 18 1 83 66 0.954545 0.759036 0.845638
partial 59 0 6 18 1 83 66 0.939394 0.746988 0.832215
strict 59 6 0 18 1 83 66 0.893939 0.710843 0.791946
exact 59 6 0 18 1 83 66 0.893939 0.710843 0.791946
PERSON
  correct incorrect partial missed spurious possible actual precision recall f1
ent_type 75 0 0 1 2 76 77 0.974026 0.986842 0.980392
partial 75 0 0 1 2 76 77 0.974026 0.986842 0.980392
strict 75 0 0 1 2 76 77 0.974026 0.986842 0.980392
exact 75 0 0 1 2 76 77 0.974026 0.986842 0.980392

Comparing the evaluations by tag

They first disagree at index 133

idx = 133
subevaluator = Evaluator(y_true[:idx], y_pred[:idx], ENTS, loader='list')

sub_results, sub_results_by_tag = subevaluator.evaluate()

sub_results_by_tag['WORK_OF_ART']['strict']['precision'], classification_report(y_true[:idx], y_pred[:idx], mode='strict', scheme=BILOU, output_dict=True)['WORK_OF_ART']['precision']
(0.8918918918918919, 0.9166666666666666)

But they agree at 132

idx = 132

subevaluator = Evaluator(y_true[:idx], y_pred[:idx], ENTS, loader='list')

sub_results, sub_results_by_tag = subevaluator.evaluate()

sub_results_by_tag['WORK_OF_ART']['strict']['precision'], classification_report(y_true[:idx], y_pred[:idx], mode='strict', scheme=BILOU, output_dict=True)['WORK_OF_ART']['precision']
(0.9166666666666666, 0.9166666666666666)

As can be seen, this is where the wrong entity type is predicted: the model tags Edith Finch as a PERSON instead of the full title as a WORK_OF_ART.

pd.DataFrame([docs[idx], y_true[idx], y_pred[idx]])
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 What Remains of Edith Finch is one of my favorite games of all time .
1 B-WORK_OF_ART I-WORK_OF_ART I-WORK_OF_ART I-WORK_OF_ART L-WORK_OF_ART O O O O O O O O O O
2 O O O B-PERSON L-PERSON O O O O O O O O O O

Looking at examples

Aggregate statistics are great for summaries, but nothing beats actually looking at examples for diagnostics.

Unfortunately the nervaluate code doesn’t make this easy to do, so we’ll implement it ourselves.

Let’s focus on partial matches rather than type mismatches.

Comparing entities

Internally, nervaluate converts the tag sequences back into Prodigy-style spans (with inclusive ends).

y_true[0]
['O',
 'O',
 'O',
 'B-WORK_OF_ART',
 'I-WORK_OF_ART',
 'L-WORK_OF_ART',
 'O',
 'O',
 'O',
 'O',
 'O']
evaluator.true[0]
[{'label': 'WORK_OF_ART', 'start': 3, 'end': 5}]

Find if two segments overlap

def overlaps(a, b):
    # Ends are inclusive, so touching spans (a['end'] == b['start']) count as overlapping
    if a['start'] > b['start']:
        # Swap so that a always starts first
        return overlaps(b, a)
    assert a['start'] <= b['start']
    return a['end'] >= b['start']

Run some tests

assert not overlaps({'start': 0, 'end': 1}, {'start': 2, 'end': 3})
assert not overlaps({'start': 2, 'end': 3}, {'start': 0, 'end': 1})
assert overlaps({'start': 0, 'end': 1}, {'start': 1, 'end': 3})
assert overlaps({'start': 1, 'end': 3}, {'start': 0, 'end': 1})
assert overlaps({'start': 1, 'end': 3}, {'start': 2, 'end': 2})

A partial overlap is when two items that are not identical have the same label and overlap

def same_label(a, b):
    return a['label'] == b['label']

def has_partial_overlaps_with_same_label(true_item, pred_item):
    return ((true_item != pred_item)
            and same_label(true_item, pred_item)
            and overlaps(true_item, pred_item))

Get all the document indices with a partial overlap

partial_overlap_idx = [
    idx
    for idx in range(len(evaluator.true))
    if any(has_partial_overlaps_with_same_label(true_item, pred_item)
           for true_item in evaluator.true[idx]
           for pred_item in evaluator.pred[idx])
]
len(partial_overlap_idx)
4

Three of the four examples differ only in surrounding punctuation and whitespace, and the other also captures the author (so it is not ambiguous).

All these examples would work perfectly in the entity linking stage.

for idx in partial_overlap_idx:
    display(idx)
    displacy.render(docs[idx], style='ent')
    displacy.render(preds[idx], style='ent')
98
After I posted I remembered the book " How to Invent Everything WORK_OF_ART " which takes the case of a time traveler stuck in the past with a guide to invent civilization from scratch.
After I posted I remembered the book " How to Invent Everything" WORK_OF_ART which takes the case of a time traveler stuck in the past with a guide to invent civilization from scratch.
105
I found this: https://us.macmillan.com/books/9781250280374

<<
The Vaccine: Inside the Race to Conquer the COVID-19 Pandemic WORK_OF_ART

Author: Joe Miller PERSON with Dr. Özlem Türeci PERSON and Dr. Ugur Sahin PERSON
>>
I found this: https://us.macmillan.com/books/9781250280374

<<
The Vaccine: Inside the Race to Conquer the COVID-19 Pandemic WORK_OF_ART Author: Joe Miller PERSON with Dr. Özlem Türeci PERSON and Dr. Ugur Sahin PERSON
>>
234
The ideas in the article are similar (in a good way) to Shape Up WORK_OF_ART by Basecamp [0].
The ideas in the article are similar (in a good way) to Shape Up by Basecamp WORK_OF_ART [0].
278
From " The Elements of Journalism WORK_OF_ART " by Bill Kovach PERSON and Tom Rosenstiel PERSON : "Originality is a bulwark of better journalism, deeper understanding, and more accurate reporting.
From "The Elements of Journalism" WORK_OF_ART by Bill Kovach PERSON and Tom Rosenstiel PERSON : "Originality is a bulwark of better journalism, deeper understanding, and more accurate reporting.

Final Thoughts

When choosing a metric, always make sure it aligns with your final goals. The F1 score is fine when comparing similar systems, but here it penalises predictions that are actually useful for the use case; a better metric would give credit to substantial partial overlaps.