I am new to spaCy. I noticed that the documentation lists a number of NER categories:
'CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART'
I need to access the raw data used to assign each word the correct category. In other words, what is the list of words labelled as 'WORK_OF_ART', and is this list available?
The reason I ask this question is that I want to build a custom model that uses some of the default NER categories, as well as my own.
The training data varies depending on which variant of en_core_web you use. Each ✓ or ✕ below marks whether a given variant uses that source:
| OntoNotes 5 | LDC Non-Members | https://catalog.ldc.upenn.edu/LDC2013T19 | ✓ | ✓ | ✓ | ✓ |
| Wordnet 3.0 | WordNet License | https://wordnet.princeton.edu/download | ✓ | ✓ | ✓ | ✓ |
| ClearNLP Constituent-to-Dependency Conversion | Apache 2.0 | dependency_conversion.md | ✓ | ✓ | ✓ | ✓ |
| GloVe Common Crawl | Apache 2.0 | https://nlp.stanford.edu/projects/glove/ | ✕ | ✓ | ✓ | ✕ |
| Roberta Base | ??? | Fairseq Roberta |
The NER labelling scheme described at https://spacy.io/models/en comes from OntoNotes, which contains the NER annotations; see Section 2.6 of https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf
The NER tags adopt the CoNLL BIO format, see https://github.com/yuchenlin/OntoNotes-5.0-NER-BIO. When read properly, each sentence should be a list of (token, tag) tuples, as in the related question "Get Stanford NER result through NLTK with IOB format".
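To make the "list of tuples" idea concrete, here is a minimal sketch in plain Python. It assumes each line holds a token followed by its BIO tag in the last whitespace-separated column, with blank lines between sentences; the real files may carry extra columns, so check them first. The sample sentences and the helper names `read_bio` and `entities_by_label` are made up for illustration and are not taken from OntoNotes or spaCy.

```python
from collections import defaultdict


def read_bio(lines):
    """Group 'token ... tag' lines into sentences of (token, tag) tuples."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:  # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # Assumption: token is the first column, BIO tag is the last one.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def entities_by_label(sentences):
    """Collect the text of every entity span, keyed by its label."""
    found = defaultdict(list)
    for sentence in sentences:
        span, label = [], None
        # The trailing ("", "O") sentinel flushes a span that ends the sentence.
        for token, tag in sentence + [("", "O")]:
            prefix, _, tag_label = tag.partition("-")
            if prefix == "I" and tag_label == label:
                span.append(token)  # continue the open span
                continue
            if span:  # close the span that was open
                found[label].append(" ".join(span))
            span, label = [], None
            if prefix in ("B", "I"):  # open a new span (tolerates a stray I- tag)
                span, label = [token], tag_label
    return dict(found)


# Illustrative data only, not real OntoNotes content.
sample = """\
She O
read O
War B-WORK_OF_ART
and I-WORK_OF_ART
Peace I-WORK_OF_ART
in O
Paris B-GPE
. O

Hamlet B-WORK_OF_ART
is O
a O
play O
. O
""".splitlines()

sentences = read_bio(sample)
entities = entities_by_label(sentences)
print(entities.get("WORK_OF_ART", []))
```

Run over the full corpus, `entities["WORK_OF_ART"]` would give the annotated spans for that category. Keep in mind that this is a list of training examples in context, not a lookup table that the model consults.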
Also take a look at https://github.com/flairNLP/flair/, which might help when it comes to training NER on OntoNotes.