How to Extract Entities from PDFs and Documents
The Document Extraction Challenge
Entity extraction tools (NER models, LLM-based extraction) work on text. PDFs, Word documents, PowerPoint files, and scanned documents are not text. They are visual layouts that happen to contain text. The gap between the visual representation and clean, extractable text is where most quality problems originate. A PDF that looks perfectly readable to a human can produce garbled text output that confuses entity extraction, yielding missed entities or false positives from layout artifacts.
The most common problems: multi-column layouts where text from adjacent columns is interleaved, tables where cell boundaries are lost and columns merge into a single line, headers and footers that repeat on every page and get inserted into the middle of paragraphs, hyphenated words at line breaks, and scanned documents where OCR introduces character recognition errors. Each problem requires a specific handling strategy.
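For example, repeated headers and footers can be stripped by finding lines that recur on most pages before any further processing. A minimal sketch, assuming per-page text has already been extracted (the function name and the 60% threshold are illustrative, not from the original):

import re
from collections import Counter

def strip_repeated_lines(pages, min_ratio=0.6):
    """Drop lines that recur on most pages -- likely headers or footers."""
    def normalize(line):
        # Normalize digits so "Page 3" and "Page 4" count as the same line.
        return re.sub(r'\d+', '#', line.strip())

    counts = Counter()
    for text in pages:
        # Count each distinct line once per page.
        for line in {normalize(l) for l in text.splitlines() if l.strip()}:
            counts[line] += 1

    threshold = max(2, int(len(pages) * min_ratio))
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for text in pages:
        kept = [l for l in text.splitlines() if normalize(l) not in repeated]
        cleaned.append('\n'.join(kept))
    return cleaned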
Step-by-Step Process
Different document formats require different extraction approaches. Classify each document before processing: text-based PDFs (most modern PDFs created from word processors), scanned PDFs (images of paper documents, requiring OCR), Word documents (.docx), HTML pages, and plain text files. The classification determines which extraction library to use.
import magic
import fitz  # PyMuPDF

def classify_document(file_path):
    mime = magic.from_file(file_path, mime=True)
    if mime == 'application/pdf':
        # A text-based PDF yields extractable text on the first page;
        # a scanned PDF yields little or none and needs OCR.
        doc = fitz.open(file_path)
        text = doc[0].get_text() if len(doc) > 0 else ""
        if len(text.strip()) < 50:
            return "scanned_pdf"
        return "text_pdf"
    elif mime in ('application/msword',
                  'application/vnd.openxmlformats-officedocument.wordprocessingml.document'):
        return "word"
    elif mime == 'text/html':
        return "html"
    elif mime.startswith('text/'):
        return "text"
    return "unknown"

Use the right library for each format. PyMuPDF (fitz) is the best general-purpose PDF text extractor in 2026, handling most layouts correctly. For scanned PDFs, use Tesseract OCR or a cloud OCR service. For Word documents, use python-docx. For HTML, use BeautifulSoup.
import fitz
from docx import Document
from bs4 import BeautifulSoup

def extract_text_pdf(file_path):
    doc = fitz.open(file_path)
    pages = []
    for page_num, page in enumerate(doc):
        text = page.get_text("text")
        pages.append({
            "page": page_num + 1,
            "text": text
        })
    return pages

def extract_text_word(file_path):
    doc = Document(file_path)
    paragraphs = []
    for para in doc.paragraphs:
        if para.text.strip():
            paragraphs.append(para.text)
    return [{"page": 1, "text": "\n".join(paragraphs)}]

def extract_text_html(file_path):
    with open(file_path, 'r') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
    # Drop non-content elements before extracting text.
    for tag in soup(['script', 'style', 'nav', 'footer']):
        tag.decompose()
    return [{"page": 1, "text": soup.get_text("\n")}]

Tables are the biggest source of extraction errors in PDFs. Standard text extraction merges table cells into a single line, losing the row/column structure that gives the data meaning. Use a table-aware extraction method that detects table boundaries and extracts cell values with their positions.
import pdfplumber

def extract_tables(file_path):
    tables = []
    with pdfplumber.open(file_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            for table in page.extract_tables():
                if table and len(table) > 1:
                    # Treat the first row as headers and render each data row as
                    # "header: value" pairs so the column context stays attached
                    # to each cell value.
                    headers = table[0]
                    for row in table[1:]:
                        description = ", ".join(
                            f"{h}: {v}" for h, v in
                            zip(headers, row) if v
                        )
                        tables.append({
                            "page": page_num + 1,
                            "text": description
                        })
    return tables

Remove extraction artifacts: repeated headers/footers, page numbers, hyphenation at line breaks, excessive whitespace. Then segment the cleaned text into passages of 500 to 1,000 tokens for entity extraction. Respect paragraph and section boundaries when possible.
import re

def clean_text(text):
    text = re.sub(r'\n{3,}', '\n\n', text)       # collapse runs of blank lines
    text = re.sub(r'[ \t]+', ' ', text)          # collapse spaces and tabs
    text = re.sub(r'-\n(\w)', r'\1', text)       # rejoin words hyphenated at line breaks
    text = re.sub(r'Page \d+ of \d+', '', text)  # strip "Page X of Y" footers
    return text.strip()

def segment_into_passages(text, target_words=600):
    # Word count is used as a rough proxy for tokens.
    paragraphs = text.split('\n\n')
    passages = []
    current = []
    current_len = 0
    for para in paragraphs:
        words = len(para.split())
        # Start a new passage once the target size is reached, keeping
        # paragraphs intact rather than splitting mid-paragraph.
        if current_len + words > target_words and current:
            passages.append('\n\n'.join(current))
            current = [para]
            current_len = words
        else:
            current.append(para)
            current_len += words
    if current:
        passages.append('\n\n'.join(current))
    return passages

Feed the cleaned passages through your entity extraction pipeline (NER model or LLM-based extraction). Attach document-level metadata (file name, page number, section heading) to each extracted entity so you know where it came from.
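As one concrete option (spaCy is used here purely for illustration; any NER model or an LLM extraction call slots into the same loop), run each passage through the model and tag every entity with its source:

import spacy

nlp = spacy.load("en_core_web_sm")  # any NER model works here

def extract_entities(passages, file_name, page):
    """Run NER over cleaned passages, attaching source metadata to each hit."""
    entities = []
    for idx, passage in enumerate(passages):
        doc = nlp(passage)
        for ent in doc.ents:
            entities.append({
                "text": ent.text,
                "label": ent.label_,
                "char_start": ent.start_char,  # offsets are relative to the passage
                "char_end": ent.end_char,
                "file": file_name,
                "page": page,
                "passage": idx,
            })
    return entities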
Store the page number, section heading, and character offsets for each extracted entity mention. This provenance data is essential for two things: verifying extraction accuracy by tracing entities back to their source, and providing context when an entity is returned in a search result so the user can find the original passage.
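A small illustration of why the offsets matter: given a stored mention record (the dict shape from the sketch above) and the original passages, you can pull back the surrounding text for spot-checking or for display next to a search result. The helper name and window size are arbitrary:

def mention_context(mention, passages, window=100):
    """Return the text surrounding a stored entity mention for verification."""
    passage = passages[mention["passage"]]
    start = max(0, mention["char_start"] - window)
    end = min(len(passage), mention["char_end"] + window)
    return passage[start:end]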
For scanned PDFs, render each page to an image and run Tesseract with the --psm 1 flag (automatic page segmentation with OSD) for best results on mixed layouts. Cloud OCR services (Google Document AI, AWS Textract) handle complex layouts better than Tesseract but cost $1.50 to $3.00 per 1,000 pages.
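A minimal OCR sketch along those lines, using pdf2image and pytesseract (the 300 DPI value is a common default, not a requirement from the text above):

from pdf2image import convert_from_path
import pytesseract

def extract_text_scanned_pdf(file_path):
    """OCR a scanned PDF page by page, mirroring the extract_text_* functions."""
    pages = []
    images = convert_from_path(file_path, dpi=300)  # render each page as an image
    for page_num, image in enumerate(images):
        text = pytesseract.image_to_string(image, config="--psm 1")
        pages.append({"page": page_num + 1, "text": text})
    return pages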
Store text from any source as memories. Adaptive Recall extracts entities automatically, whether the text comes from PDFs, code, conversations, or any other source.