Introduction
Computers are bad at language. spaCy makes them less bad. That's the honest pitch -- don't expect magic.
spaCy is opinionated. Unlike NLTK, which hands you every algorithm and lets you figure it out, spaCy picks one approach per task and optimizes it for speed and accuracy. Less flexibility. But the prototype code actually survives into the deployed version, which is more than I can say for NLTK.
We're covering what spaCy actually does: tokenization, NER, dependency parsing, text classification, word vectors, custom pipelines. Tokenization gets a short section because it works out of the box. NER and text classification get longer ones because those are the features people actually ship.
Setting Up spaCy and Language Models
The linguistic knowledge lives in separate model packages. Three English model sizes: small (en_core_web_sm, ~12 MB) for development, medium (en_core_web_md, ~40 MB) which adds word vectors, large (en_core_web_lg, ~560 MB) for maximum accuracy. Use medium. The extra 30 MB over small gets you word vectors, and retrofitting later means re-running your entire pipeline.
# Install spaCy
pip install spacy

# Download the medium English model (includes word vectors)
python -m spacy download en_core_web_md

# For lighter tasks, the small model works fine:
# python -m spacy download en_core_web_sm

Loading a model takes one line. Call the resulting nlp object on any string and it runs tokenizer, POS tagger, dependency parser, NER, and lemmatizer in sequence:
import spacy
# Load the medium English model
nlp = spacy.load("en_core_web_md")
# Process a text string -- this runs the full pipeline
doc = nlp("Apple is looking at buying a startup in San Francisco for $1 billion.")
# The doc object contains tokens with rich annotations
for token in doc:
    print(f"{token.text:<15} {token.pos_:<8} {token.dep_:<12} {token.head.text}")

# Output:
# Apple           PROPN    nsubj        looking
# is              AUX      aux          looking
# looking         VERB     ROOT         looking
# at              ADP      prep         looking
# buying          VERB     pcomp        at
# a               DET      det          startup
# startup         NOUN     dobj         buying
# in              ADP      prep         startup
# San             PROPN    compound     Francisco
# Francisco       PROPN    pobj         in
# for             ADP      prep         buying
# $               SYM      quantmod     billion
# 1               NUM      compound     billion
# billion         NUM      pobj         for
# .               PUNCT    punct        looking

Each token carries its part of speech, syntactic role, and grammatical head. Full linguistic analysis, almost no setup. That pipeline architecture is spaCy's defining feature.
Tokenization and Linguistic Features
Don't write your own tokenizer. Feed spaCy a URL with query parameters, "Dr. Smith's", and an emoji. It handles all of it. Rule-based system tuned over years.
import spacy
nlp = spacy.load("en_core_web_md")
doc = nlp("The developers couldn't deploy the v2.0 update to https://api.example.com.")
# Explore token attributes
for token in doc:
    print(
        f"Text: {token.text:<15}"
        f"Lemma: {token.lemma_:<12}"
        f"POS: {token.pos_:<8}"
        f"Stop: {token.is_stop}"
    )

# Notice how spaCy handles "couldn't" --
# it splits into "could" and "n't" automatically

# Sentence detection comes built in
for sent in doc.sents:
    print(f"Sentence: {sent.text}")
# Noun chunks extract meaningful phrases
doc2 = nlp("The quick brown fox jumped over the lazy dog near the old wooden fence.")
for chunk in doc2.noun_chunks:
    print(f"Chunk: {chunk.text:<30} Root: {chunk.root.text:<10} Dep: {chunk.root.dep_}")

# Output:
# Chunk: The quick brown fox            Root: fox        Dep: nsubj
# Chunk: the lazy dog                   Root: dog        Dep: pobj
# Chunk: the old wooden fence           Root: fence      Dep: pobj

lemma_ gives base forms: "running" to "run", "better" to "good", "was" to "be". Essential for search. is_stop flags common words that carry little meaning. Noun chunks extract multi-word noun phrases -- what you want for topic extraction.
For finer-grained POS tags, use tag_ instead of pos_. Penn Treebank set with distinctions like singular noun (NN) versus plural (NNS). But pos_ is usually enough.
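You don't even need a loaded model to explore the tag set -- spacy.explain() doubles as a glossary lookup for the Penn Treebank tags (the tags chosen below are just a sample):

```python
import spacy

# spacy.explain() is a plain dictionary lookup -- no model load required
for tag in ["NN", "NNS", "VB", "VBD", "JJR"]:
    print(f"{tag:<5} {spacy.explain(tag)}")
```

Handy when a token's tag_ value is opaque and you don't want to dig through the Treebank manual.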
Named Entity Recognition
This is the NLP feature that actually ships to production. What the model is doing: scanning text left to right, identifying spans that refer to real-world objects, and classifying each span into a category. People, companies, locations, dates, monetary values. Under the hood, spaCy's NER uses a transition-based parser -- it reads tokens sequentially and decides at each step whether to begin an entity, continue one, or end one. Trained on the OntoNotes 5 corpus, 18 entity types out of the box.
Extracting company names and dollar amounts from SEC filings? Without NER, that's weeks of regex. With spaCy, a screenful of code:
import spacy
from collections import defaultdict
nlp = spacy.load("en_core_web_md")
text = """
Microsoft announced on Tuesday that CEO Satya Nadella will visit the
European Union headquarters in Brussels next month. The tech giant,
valued at over $3 trillion, is expected to discuss AI regulation with
EU Commissioner Thierry Breton. The meeting follows similar talks
Google held with officials in Berlin last September.
"""
doc = nlp(text)
# Extract all named entities
print("=== Named Entities ===")
for ent in doc.ents:
    print(f"{ent.text:<25} {ent.label_:<12} ({spacy.explain(ent.label_)})")

# Output:
# Microsoft                 ORG          (Companies, agencies, institutions, etc.)
# Tuesday                   DATE         (Absolute or relative dates or periods)
# Satya Nadella             PERSON       (People, including fictional)
# European Union            ORG          (Companies, agencies, institutions, etc.)
# Brussels                  GPE          (Countries, cities, states)
# next month                DATE         (Absolute or relative dates or periods)
# over $3 trillion          MONEY        (Monetary values, including unit)
# Thierry Breton            PERSON       (People, including fictional)
# Google                    ORG          (Companies, agencies, institutions, etc.)
# Berlin                    GPE          (Countries, cities, states)
# last September            DATE         (Absolute or relative dates or periods)

# Group entities by type -- useful for building knowledge bases
entities_by_type = defaultdict(list)
for ent in doc.ents:
    entities_by_type[ent.label_].append(ent.text)

print("\n=== Entities Grouped by Type ===")
for label, entities in entities_by_type.items():
    print(f"{label}: {', '.join(entities)}")

# Check whether a specific token is part of an entity
for token in doc:
    if token.ent_type_:
        print(f"{token.text} -> {token.ent_iob_}-{token.ent_type_}")

# B = beginning of entity, I = inside entity, O = outside

Call spacy.explain() when you hit an unfamiliar label. Main types: PERSON, ORG, GPE (countries, cities, states), DATE, MONEY, PRODUCT. Each entity carries start and end character offsets for UI highlighting or mapping back to the original text.
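Those character offsets are easy to see on a hand-built Doc, which runs without any model download (the words and labels below are invented for illustration):

```python
import spacy
from spacy.tokens import Doc, Span

nlp = spacy.blank("en")  # tokenizer-only pipeline, no model download

# Construct a Doc manually and attach entity spans ourselves
doc = Doc(nlp.vocab, words=["Apple", "hired", "Jane", "Doe", "."])
doc.ents = [Span(doc, 0, 1, label="ORG"), Span(doc, 2, 4, label="PERSON")]

for ent in doc.ents:
    # start_char/end_char index into doc.text -- what you'd use for highlighting
    print(f"{ent.text:<10} {ent.label_:<8} chars {ent.start_char}-{ent.end_char}")
```

Span boundaries are token indices; start_char and end_char are derived character positions in the raw text.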
NER models are not perfect. "Apple" -- company or fruit? Domain-specific jargon breaks them. Text that looks nothing like the training data breaks them. But here's the thing: even a small set of domain-specific annotated examples goes a long way after fine-tuning. Fifty examples per entity type can be enough to fix the worst misses.
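The training-data shape for that fine-tuning is just text plus character-offset annotations. A minimal sketch -- the TICKER label and the example sentences are invented, the data is far too small for a real model, and in practice you'd resume from en_core_web_md rather than a blank pipeline:

```python
import spacy
from spacy.training import Example

# Blank pipeline so this sketch runs without a model download
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("TICKER")  # hypothetical domain-specific entity type

# Annotations are (start_char, end_char, label) offsets into the text
train_data = [
    ("Shares of MSFT rose today.", {"entities": [(10, 14, "TICKER")]}),
    ("Analysts upgraded AAPL overnight.", {"entities": [(18, 22, "TICKER")]}),
]

examples = [Example.from_dict(nlp.make_doc(t), ann) for t, ann in train_data]
nlp.initialize(lambda: examples)

for _ in range(20):
    losses = {}
    nlp.update(examples, losses=losses)

print(f"Final NER loss: {losses['ner']:.4f}")
```

The offsets must align exactly with token boundaries, which is where most annotation bugs hide.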
Dependency Parsing and Sentence Structure
Who did what to whom. Dependency parsing identifies how each word relates to every other word -- arrows from modifiers to heads, verb as the root of the tree.
Not academic. Processing customer complaints: "The delivery driver broke my package." Agent is "driver," action is "broke," object is "package." Information you can act on programmatically.
import spacy
nlp = spacy.load("en_core_web_md")
# Analyze sentence structure
doc = nlp("The senior engineer quickly fixed the critical bug in production.")
# Walk the dependency tree
print("=== Dependency Tree ===")
for token in doc:
    print(
        f"{token.text:<15}"
        f"dep: {token.dep_:<12}"
        f"head: {token.head.text:<12}"
        f"children: {[child.text for child in token.children]}"
    )
# Extract subject-verb-object triples
def extract_svo(doc):
    """Extract subject-verb-object triples from a parsed document."""
    triples = []
    for token in doc:
        if token.dep_ == "ROOT":
            verb = token
            subject = None
            obj = None
            for child in verb.children:
                if child.dep_ in ("nsubj", "nsubjpass"):
                    # Get the full noun phrase for the subject
                    subject = " ".join(t.text for t in child.subtree)
                elif child.dep_ in ("dobj", "attr"):
                    # Get the full noun phrase for the object
                    obj = " ".join(t.text for t in child.subtree)
            if subject and obj:
                triples.append((subject, verb.lemma_, obj))
    return triples
# Test with multiple sentences
texts = [
    "The senior engineer quickly fixed the critical bug in production.",
    "Our team released a new feature last Friday.",
    "The client approved the final design mockup.",
]
print("\n=== Subject-Verb-Object Triples ===")
for text in texts:
    doc = nlp(text)
    triples = extract_svo(doc)
    for subj, verb, obj in triples:
        print(f"  Subject: {subj}")
        print(f"  Verb:    {verb}")
        print(f"  Object:  {obj}")
        print()

subtree is what makes this practical. It yields every token descending from a given token in the dependency tree -- so instead of just "engineer" as the subject, you get "The senior engineer" with all modifiers.
Key labels: nsubj (who's acting), dobj (what's acted on), amod (adjectival modifier), ROOT (main verb). Visualize with spacy.displacy.serve(doc, style="dep") -- renders SVG dependency arcs in a browser or Jupyter notebook.
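You can experiment with those labels without a model by constructing a Doc with a hand-annotated parse (the sentence and annotations below are made up):

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# heads are absolute token indices; a token that heads itself is the ROOT
doc = Doc(
    nlp.vocab,
    words=["Engineers", "fixed", "the", "bug"],
    heads=[1, 1, 3, 1],
    deps=["nsubj", "ROOT", "det", "dobj"],
)

root = next(t for t in doc if t.dep_ == "ROOT")
print(root.text, "->", [(child.text, child.dep_) for child in root.children])
```

Useful for unit-testing tree-walking code like extract_svo against known parses instead of whatever the model happens to predict.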
Word Vectors and Similarity
Words as numbers. Each word becomes a dense vector in high-dimensional space, positioned so similar meanings cluster together. "King" near "queen." "Python" near "programming." "Banana" nowhere near "aircraft." Semantic search, document similarity, analogy detection -- all possible through basic vector math.
Medium and large models include pre-trained 300-dimensional vectors. The small model doesn't. Similarity between tokens, spans, or entire documents:
import spacy
nlp = spacy.load("en_core_web_md")
# Compare individual words
tokens = nlp("cat dog car bicycle king queen")
print("=== Word Similarity Matrix ===")
for token1 in tokens:
    similarities = []
    for token2 in tokens:
        similarities.append(f"{token1.similarity(token2):.2f}")
    print(f"{token1.text:<10} {' '.join(similarities)}")
# Compare entire documents -- spaCy averages the word vectors
doc1 = nlp("I love building machine learning models with Python.")
doc2 = nlp("Python is great for developing AI and deep learning systems.")
doc3 = nlp("The restaurant serves excellent Italian pasta dishes.")
print(f"\nML sentence vs AI sentence: {doc1.similarity(doc2):.4f}")
print(f"ML sentence vs food sentence: {doc1.similarity(doc3):.4f}")
# Practical example: find the most similar document
def find_most_similar(query, documents):
    """Find the document most similar to a query string."""
    query_doc = nlp(query)
    best_match = None
    best_score = -1
    for doc_text in documents:
        doc = nlp(doc_text)
        score = query_doc.similarity(doc)
        if score > best_score:
            best_score = score
            best_match = doc_text
    return best_match, best_score
articles = [
    "How to train a neural network for image classification",
    "Best practices for REST API design and documentation",
    "Understanding database indexing and query optimization",
    "Introduction to convolutional neural networks for computer vision",
]
query = "deep learning for recognizing objects in photos"
match, score = find_most_similar(query, articles)
print(f"\nQuery: '{query}'")
print(f"Best match ({score:.4f}): '{match}'")

Scores range 0 to 1 (occasionally slightly negative). Same-topic documents typically land above 0.7; unrelated ones below 0.4.
spaCy's default similarity averages word vectors across the whole document. Works for short texts. Loses nuance for longer ones. Sentence transformers or spaCy's Transformer pipeline component will do better there. But averaged word vectors are underrated. Too many teams jump straight to transformer models, burn time on GPU setup, and end up with something marginally better for their use case. Cosine similarity on averaged vectors gets you most of the way. Try it first.
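What doc.similarity computes reduces to a few lines of vector math: average the word vectors, then take the cosine. A sketch with invented 4-dimensional toy vectors (real spaCy vectors are 300-dimensional):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Invented toy vectors -- two tech-ish words and two food words
vecs = {
    "python": np.array([0.9, 0.1, 0.0, 0.2]),
    "coding": np.array([0.8, 0.3, 0.1, 0.1]),
    "pasta":  np.array([0.1, 0.0, 0.9, 0.6]),
    "dinner": np.array([0.0, 0.2, 0.8, 0.7]),
}

# Document vector = mean of its word vectors, then cosine between documents
doc_tech = np.mean([vecs["python"], vecs["coding"]], axis=0)
doc_food = np.mean([vecs["pasta"], vecs["dinner"]], axis=0)

print(f"tech doc vs 'python': {cosine(doc_tech, vecs['python']):.2f}")
print(f"tech doc vs food doc: {cosine(doc_tech, doc_food):.2f}")
```

That's the entire trick -- no training, no GPU, just averaging and a dot product. Which is why it's worth trying before anything heavier.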
Text Classification with spaCy
Train it on your labeled data. The API is the easy part. Spam filters, support ticket routing, content moderation, sentiment analysis -- the workflow is always the same: prepare labeled data, configure pipeline, train, evaluate. The example below builds a sentiment classifier for product reviews:
import spacy
from spacy.training import Example
import random
# Create a blank English model for training
nlp = spacy.blank("en")
# Add the text classifier to the pipeline
textcat = nlp.add_pipe("textcat")
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
# Prepare training data -- each item is (text, annotations)
train_data = [
    ("This product is amazing, best purchase I ever made!",
     {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("Absolutely love it, works perfectly every time.",
     {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("Great quality, highly recommend to everyone.",
     {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("Terrible product, broke after one day of use.",
     {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
    ("Worst purchase ever, complete waste of money.",
     {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
    ("Disappointed with quality, returning immediately.",
     {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
    ("Excellent build and fast shipping, very happy.",
     {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("Does not work as advertised, very frustrating.",
     {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]
# Train the model
nlp.initialize(lambda: [Example.from_dict(nlp.make_doc(t), a) for t, a in train_data])

for epoch in range(20):
    random.shuffle(train_data)
    losses = {}
    for text, annotations in train_data:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], losses=losses)
    if epoch % 5 == 0:
        print(f"Epoch {epoch:<4} Loss: {losses['textcat']:.4f}")
# Test the trained model
test_texts = [
    "I really enjoy using this, it exceeded my expectations.",
    "Cheap material, fell apart within a week.",
    "Solid product with great customer support.",
]
print("\n=== Predictions ===")
for text in test_texts:
    doc = nlp(text)
    sentiment = max(doc.cats, key=doc.cats.get)
    confidence = doc.cats[sentiment]
    print(f"Text: {text}")
    print(f"  Prediction: {sentiment} ({confidence:.2%})\n")

Training data here is deliberately tiny. A real project needs hundreds of examples per category. Ideally thousands. We're using spacy.blank("en") because we're training from scratch; in production, start from a pre-trained model and add the classifier on top.
Multi-label classification -- a support ticket that's both "billing" and "urgent" -- swap in textcat_multilabel. Same API.
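A minimal sketch of that swap -- labels and texts are invented, and the data is far too small to train anything real:

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
# Multilabel: each category is scored independently, so both can be high
textcat = nlp.add_pipe("textcat_multilabel")
for label in ("BILLING", "URGENT"):
    textcat.add_label(label)

train_data = [
    ("Charge failed and the site is down, fix now!",
     {"cats": {"BILLING": 1.0, "URGENT": 1.0}}),
    ("Please update my invoice address.",
     {"cats": {"BILLING": 1.0, "URGENT": 0.0}}),
    ("Server outage, customers affected!",
     {"cats": {"BILLING": 0.0, "URGENT": 1.0}}),
]

examples = [Example.from_dict(nlp.make_doc(t), a) for t, a in train_data]
nlp.initialize(lambda: examples)
for _ in range(20):
    nlp.update(examples)

doc = nlp("My invoice is wrong and I need it fixed immediately!")
print(doc.cats)  # independent per-label scores, not a probability distribution
```

The only structural difference from the single-label version: scores no longer sum to 1, so you threshold each label instead of taking the max.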
And for anything going to production, use spaCy's config system and the spacy train CLI. Not the manual loop above. The CLI handles batching, learning rate scheduling, evaluation, model selection. Config files make experiments reproducible. The manual loop is for understanding what's happening. The CLI is for shipping.
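The shape of that workflow, sketched: the subcommands and flags below are the standard spaCy v3 CLI, but the file paths are placeholders you'd substitute with your own data (converted to the binary .spacy format):

```shell
# Generate a starter config for a text classification pipeline
python -m spacy init config config.cfg --lang en --pipeline textcat

# Train: batching, LR scheduling, evaluation, and best-model
# selection all happen inside the CLI
python -m spacy train config.cfg --output ./output \
    --paths.train ./train.spacy --paths.dev ./dev.spacy

# Evaluate the best checkpoint on held-out data
python -m spacy evaluate ./output/model-best ./dev.spacy
```

Every hyperparameter lives in config.cfg, so an experiment is reproducible from that one file plus the data.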
Building Custom Pipeline Components
Every nlp(text) call passes text through a sequence: tokenizer, tagger, parser, NER. You can insert your own components anywhere.
Take a Doc, modify it, return it. Business logic baked directly into the NLP pipeline. A component that detects programming language mentions and tags them as custom entities:
import spacy
from spacy.language import Language
from spacy.tokens import Doc, Span
from spacy.matcher import PhraseMatcher
# Register the custom component with spaCy
# Register the custom component with spaCy
@Language.factory("programming_language_detector")
def create_lang_detector(nlp, name):
    return ProgrammingLanguageDetector(nlp)

class ProgrammingLanguageDetector:
    def __init__(self, nlp):
        self.languages = {
            "Python": "PROG_LANG",
            "JavaScript": "PROG_LANG",
            "TypeScript": "PROG_LANG",
            "Rust": "PROG_LANG",
            "Go": "PROG_LANG",
            "Java": "PROG_LANG",
            "C++": "PROG_LANG",
            "Ruby": "PROG_LANG",
            "Kotlin": "PROG_LANG",
            "Swift": "PROG_LANG",
        }
        # Use PhraseMatcher for efficient matching
        self.matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
        patterns = [nlp.make_doc(lang) for lang in self.languages]
        self.matcher.add("PROG_LANG", patterns)

    def __call__(self, doc):
        matches = self.matcher(doc)
        new_ents = list(doc.ents)  # Keep existing entities
        for match_id, start, end in matches:
            span = Span(doc, start, end, label="PROG_LANG")
            # Only add if it doesn't overlap with existing entities
            if not any(
                span.start < ent.end and span.end > ent.start
                for ent in new_ents
            ):
                new_ents.append(span)
        doc.ents = new_ents
        return doc
# Use the custom component
nlp = spacy.load("en_core_web_md")
nlp.add_pipe("programming_language_detector", after="ner")
# Process text with our enhanced pipeline
text = """
Our team uses Python for data processing and JavaScript for the frontend.
We are considering migrating our backend services from Java to Rust for
better performance. The mobile app is built with Kotlin and Swift.
"""
doc = nlp(text)
print("=== All Entities (including programming languages) ===")
for ent in doc.ents:
print(f"{ent.text:<15} {ent.label_}")
# Filter for just programming languages
prog_langs = [ent.text for ent in doc.ents if ent.label_ == "PROG_LANG"]
print(f"\nProgramming languages found: {', '.join(prog_langs)}")
# Inspect the pipeline
print(f"\nPipeline components: {nlp.pipe_names}")

@Language.factory registers the component, making it serializable -- models with your custom component save and load without extra work. Placing it after="ner" means it runs after built-in NER so we can merge custom entities with spaCy's detections. The overlap check prevents conflicts when "Go" gets tagged by both detectors.
You can also attach custom attributes to tokens, spans, documents. Whatever your domain needs:
import spacy
from spacy.tokens import Doc
# Register a custom extension on the Doc class
Doc.set_extension("word_count", getter=lambda doc: len([t for t in doc if not t.is_punct]))
Doc.set_extension("reading_time_seconds", getter=lambda doc: doc._.word_count / 4.0)
nlp = spacy.load("en_core_web_md")
doc = nlp("Natural language processing with spaCy makes text analysis straightforward and efficient.")
print(f"Word count: {doc._.word_count}")
print(f"Reading time: {doc._.reading_time_seconds:.1f} seconds")

Custom extensions live under doc._. Getters compute dynamically on access; you can also set static defaults or use setters. Domain-specific annotations layered on top of spaCy's standard features without touching the library source.
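A default-valued extension looks like this -- the extension name and term list are invented for illustration, and a blank pipeline is enough to run it:

```python
import spacy
from spacy.tokens import Token

# A default-valued extension: stored per token, set explicitly by your code
Token.set_extension("is_tech_term", default=False)

nlp = spacy.blank("en")  # tokenizer only -- no model download needed
doc = nlp("We deploy the API with Docker")

TECH_TERMS = {"api", "docker"}
for token in doc:
    if token.lower_ in TECH_TERMS:
        token._.is_tech_term = True

print([t.text for t in doc if t._.is_tech_term])
```

Unlike a getter, the default-plus-assignment pattern stores the value on the token, so downstream components see whatever earlier components set.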
Next Steps
If you need a chatbot, use an LLM. If you need translation, use a translation API. If you need to generate text, spaCy is the wrong tool entirely. spaCy is for structured extraction from text -- entity recognition, classification, parsing. Don't try to make it do everything. It's good at a narrow set of things and bad at the things outside that set, and knowing the boundary will save you from building something that doesn't work.