SpaCy 3: how to get the raw data used to train en_core_web_sm?

I am new to SpaCy. I noticed that there are a number of NER categories listed in the documentation of all en_core_web models:


I need to access the raw data used to assign each word the correct category. In other words, what’s the list of words labelled as 'WORK_OF_ART', and is this list available?

The reason I ask this question is that I want to build a custom model that uses some of the default NER categories, as well as my own.


The training data depends on which variant of en_core_web you use:

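| Dataset | License | URL | web_sm | web_md | web_lg | web_trf |
|---|---|---|---|---|---|---|
| OntoNotes 5 | LDC Non-Members | | | | | |
| WordNet 3.0 | WordNet License | | | | | |
| ClearNLP Constituent-to-Dependency Conversion | Apache 2.0 | | | | | |
| GloVe Common Crawl | Apache 2.0 | | | | | |
| RoBERTa Base | ??? | Fairseq Roberta | | | | |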

The NER labelling scheme described in the documentation comes from OntoNotes, which defines the NER tags; see Section 2.6 of the OntoNotes documentation.

The NER tags adopt the CoNLL BIO format. When read properly, each sentence should be a list of (token, tag) tuples, as in "Get Stanford NER result through NLTK with IOB format".
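To make the "list of tuples" shape concrete, here is a minimal sketch of a reader for BIO-tagged text. It assumes a simplified layout with one token per line, the token in the first column and the tag in the last, and a blank line between sentences; real OntoNotes CoNLL files carry more columns. The function name `read_bio` and the sample sentences are made up for illustration and are not taken from OntoNotes.

```python
from typing import List, Tuple


def read_bio(text: str) -> List[List[Tuple[str, str]]]:
    """Parse BIO-tagged text into sentences of (token, tag) tuples.

    Assumes one token per line (token first, tag last) and a blank
    line between sentences.
    """
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Illustrative sample only; these sentences are not from OntoNotes.
sample = """\
I O
read O
War B-WORK_OF_ART
and I-WORK_OF_ART
Peace I-WORK_OF_ART
. O

Tolstoy B-PERSON
wrote O
it O
. O
"""

sentences = read_bio(sample)
```

`B-` marks the first token of an entity, `I-` a continuation, and `O` a token outside any entity, so "War and Peace" is one WORK_OF_ART span covering three tokens.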

It may also help to look at existing examples of training NER on OntoNotes.
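For training a custom model, BIO tuples usually need to be turned into character-offset spans, since spaCy 3 accepts entity annotations of the form `(start_char, end_char, label)`, for example through `Example.from_dict`. The sketch below does that conversion in plain Python. It assumes tokens are joined with single spaces, which will not reproduce the original text exactly; `bio_to_offsets` is a hypothetical helper name, and the sentence is invented for illustration.

```python
from typing import List, Tuple


def bio_to_offsets(
    tagged: List[Tuple[str, str]]
) -> Tuple[str, List[Tuple[int, int, str]]]:
    """Convert (token, BIO tag) pairs to text plus (start, end, label) spans.

    Assumption: tokens are joined with a single space.
    """
    entities: List[Tuple[int, int, str]] = []
    pos = 0        # character offset of the current token
    start = end = 0
    label = None   # label of the entity currently open, if any
    for token, tag in tagged:
        continues = tag.startswith("I-") and tag[2:] == label
        if label is not None and not continues:
            # The open entity ends before this token.
            entities.append((start, end, label))
            label = None
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            # Open a new entity (a stray I- tag is treated like B-).
            start, label = pos, tag[2:]
        if label is not None:
            end = pos + len(token)
        pos += len(token) + 1  # +1 for the joining space
    if label is not None:
        entities.append((start, end, label))
    text = " ".join(token for token, _ in tagged)
    return text, entities


tagged = [
    ("I", "O"), ("read", "O"),
    ("War", "B-WORK_OF_ART"), ("and", "I-WORK_OF_ART"),
    ("Peace", "I-WORK_OF_ART"),
    ("by", "O"), ("Tolstoy", "B-PERSON"), (".", "O"),
]
text, entities = bio_to_offsets(tagged)
# text     -> "I read War and Peace by Tolstoy ."
# entities -> [(7, 20, "WORK_OF_ART"), (24, 31, "PERSON")]
```

Your own categories can go through the same conversion alongside the OntoNotes labels, as long as the training data is annotated with both sets.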