How Accurate Is Automated Entity Extraction in 2026
Accuracy by Approach
Fine-tuned transformer NER (BERT, DeBERTa): 90 to 93% F1 on standard types (OntoNotes benchmark). This is the gold standard for accuracy on entity types covered by training data. Accuracy drops to 82 to 88% on domain-specific types, even with sufficient training data (300+ examples per type).
SpaCy pre-trained models: 87 to 90% F1 with the transformer pipeline, 82 to 85% with the smaller convolutional models. SpaCy trades a few points of accuracy for dramatically better throughput and simpler deployment.
LLM zero-shot extraction: 85 to 88% F1 on standard types, 80 to 87% on domain-specific types. Adding 2 to 3 examples (few-shot) pushes standard types to 88 to 92%. The accuracy gap with fine-tuned models has narrowed substantially since 2024, when zero-shot LLM extraction scored 75 to 82% F1.
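As an illustration, a few-shot extraction call can be as simple as a prompt template plus JSON parsing. This is a minimal sketch: the entity types, example passage, and response shape below are hypothetical, and the actual LLM call is left to whichever client you use.

```python
import json

# Hypothetical entity types and few-shot example; substitute your own domain.
ENTITY_TYPES = ["service", "database", "team"]

FEW_SHOT_EXAMPLES = [
    ("The billing service writes to the orders Postgres database.",
     [{"text": "billing service", "type": "service"},
      {"text": "orders Postgres database", "type": "database"}]),
]

def build_prompt(passage: str) -> str:
    """Assemble a few-shot extraction prompt to send to an LLM."""
    lines = [
        "Extract entities of types " + ", ".join(ENTITY_TYPES) + ".",
        'Respond with a JSON list of {"text", "type"} objects.',
    ]
    for text, entities in FEW_SHOT_EXAMPLES:
        lines.append(f"Passage: {text}")
        lines.append(f"Entities: {json.dumps(entities)}")
    lines.append(f"Passage: {passage}")
    lines.append("Entities:")
    return "\n".join(lines)

def parse_response(raw: str) -> list[dict]:
    """Parse the model's JSON reply, dropping entity types we didn't ask for."""
    entities = json.loads(raw)
    return [e for e in entities if e.get("type") in ENTITY_TYPES]
```

Usage looks like `build_prompt("The search team owns the query service.")`, sending the result to your model of choice, then feeding the raw reply to `parse_response`. Filtering out unrequested types is one of the cheap guards that helps close the gap between zero-shot and few-shot accuracy.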
Relationship extraction (LLM): 75 to 85% F1 depending on relationship complexity. Simple, explicitly stated relationships ("X depends on Y") score at the high end. Implicit or multi-sentence relationships score at the low end. There is no good non-LLM baseline for typed relationship extraction, making the LLM the only practical option for most applications.
What the Numbers Mean in Practice
An F1 of 90% (with balanced precision and recall) means that for every 100 entities in the text, the system correctly extracts about 90 and misses about 10, while roughly 10 of its extractions are hallucinated. For a knowledge graph, this translates to: about 90% of the true entities become correct nodes, 10% are missing (with their connections), and a comparable number of spurious nodes appear (with potentially misleading connections).
At 85% F1, you have roughly 85 correct, 15 missing, and 15 spurious per 100 entities. This is the threshold where most applications find the graph useful despite the errors. Below 80% F1, the error rate starts to undermine confidence in graph traversal results.
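The arithmetic behind these numbers can be sketched in a few lines, under the simplifying assumption that precision and recall both equal the F1 score (the balanced case):

```python
def extraction_error_profile(f1: float, true_entities: int = 100) -> dict:
    """Estimate correct, missed, and spurious counts per `true_entities`
    gold entities, assuming precision == recall == F1 (balanced case)."""
    precision = recall = f1  # simplifying assumption
    correct = round(true_entities * recall)
    missed = true_entities - correct          # gold entities never extracted
    # With precision == recall, total extractions roughly equal the number
    # of gold entities, so spurious extractions mirror the misses.
    spurious = round(correct / precision) - correct
    return {"correct": correct, "missed": missed, "spurious": spurious}
```

For example, `extraction_error_profile(0.9)` gives 90 correct, 10 missed, 10 spurious, and `extraction_error_profile(0.85)` gives 85, 15, 15. When precision and recall diverge, the split between missed and spurious shifts, but the totals stay in this range.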
Relationship extraction at 75% F1 means about 1 in 4 extracted relationships has an error. This sounds high, but the errors are distributed: some are wrong predicate types (depends_on vs uses), some are reversed directions, and some are hallucinated. For graph traversal purposes, wrong predicate types are the least harmful error (the edge still connects the right nodes), while hallucinated relationships are the most harmful (they create false connections).
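One way to make the error distribution concrete is a toy scorer that buckets predicted (subject, predicate, object) triples against a gold set. The triples and bucket names below are illustrative assumptions; a real evaluation would also need to align entity mentions first.

```python
def categorize_edge_errors(predicted, gold):
    """Bucket predicted (subject, predicate, object) triples by error type.

    Toy scoring sketch: assumes entities are already normalized strings.
    """
    gold_set = set(gold)
    gold_pairs = {(s, o): p for s, p, o in gold}
    buckets = {"correct": [], "wrong_predicate": [], "reversed": [], "hallucinated": []}
    for s, p, o in predicted:
        if (s, p, o) in gold_set:
            buckets["correct"].append((s, p, o))
        elif (s, o) in gold_pairs:
            # Least harmful: the edge still connects the right nodes.
            buckets["wrong_predicate"].append((s, p, o))
        elif (o, s) in gold_pairs:
            buckets["reversed"].append((s, p, o))
        else:
            # Most harmful: creates a false connection in the graph.
            buckets["hallucinated"].append((s, p, o))
    return buckets
```

Tracking which bucket dominates tells you whether a 75% F1 pipeline is merely imprecise about edge labels or actively polluting the graph with false connections.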
Factors That Affect Accuracy
Text quality: Clean, well-structured documentation yields 5 to 10 points higher F1 than noisy sources like chat logs, meeting transcripts, or OCR'd documents. Preprocessing to clean up noise before extraction makes a measurable difference.
Entity type specificity: Well-defined, distinct entity types score higher than vague, overlapping types. "AWS S3 bucket" is easier to extract correctly than "infrastructure component."
Chunk size: Passages of 500 to 1,000 tokens score 3 to 5 points higher than very short passages (under 200 tokens) because longer context helps the model resolve ambiguity. Very long passages (over 2,000 tokens) show diminishing returns as the model loses focus.
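A rough way to hit that sweet spot is to pack sentences greedily toward a target length. This sketch uses whitespace tokens as a crude stand-in for model tokens (a simplifying assumption; real tokenizers count differently), and the target and ceiling values are just the ranges cited above.

```python
import re

def chunk_passages(text: str, target: int = 750, max_tokens: int = 1000) -> list[str]:
    """Greedily pack sentences into passages near the 500-1,000 token sweet
    spot, approximating model tokens with whitespace-separated words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_tokens:
            # Adding this sentence would blow past the ceiling: flush first.
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
        if count >= target:
            chunks.append(" ".join(current))
            current, count = [], 0
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Splitting on sentence boundaries rather than fixed character offsets matters: an entity mention cut in half at a chunk boundary is an entity the extractor cannot recover.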
Domain familiarity: LLMs perform better on domains well-represented in their training data (software engineering, general business) than on highly specialized domains (semiconductor manufacturing, marine biology). For niche domains, fine-tuned models or few-shot prompting with domain examples is essential.
Improving Over Time
First-pass extraction typically achieves 70 to 80% F1. Two to three iterations of prompt refinement (for LLMs) or annotation correction (for fine-tuned models) bring accuracy to 85 to 90%. Further improvement becomes incremental: each additional iteration yields 1 to 2 points of F1 at increasing effort. Most teams reach their target accuracy within a week of iteration.
Adaptive Recall's extraction pipeline is tuned for accuracy on the entity types that matter for AI memory systems. As your memory store grows, the entity inventory provides better context for disambiguation, which improves extraction accuracy on new memories. The system gets more accurate the more you use it.
Adaptive Recall handles entity extraction at production-quality accuracy from the first memory. Start storing and the extraction improves with every interaction.
Try It Free