How to Extract Entities from PDFs and Documents
The Document Extraction Challenge
Entity extraction tools (NER models, LLM-based extraction) work on text. PDFs, Word documents, PowerPoint files, and scanned documents are not text. They are visual layouts that happen to contain text. The gap between the visual representation and clean, extractable text is where most quality problems originate. A PDF that looks perfectly readable to a human can produce garbled text output that confuses entity extraction, yielding missed entities or false positives from layout artifacts.
The most common problems: multi-column layouts where text from adjacent columns is interleaved, tables where cell boundaries are lost and columns merge into a single line, headers and footers that repeat on every page and get inserted into the middle of paragraphs, hyphenated words at line breaks, and scanned documents where OCR introduces character recognition errors. Each problem requires a specific handling strategy.
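For example, repeated headers and footers can be stripped by finding lines that recur on most pages before any further processing. A minimal sketch, assuming per-page text has already been extracted (the function name and the 60% threshold are illustrative, not from the original):

import re
from collections import Counter

def strip_repeated_lines(pages, min_ratio=0.6):
    """Drop lines that recur on most pages -- likely headers or footers."""
    def normalize(line):
        # Normalize digits so "Page 3" and "Page 4" count as the same line.
        return re.sub(r'\d+', '#', line.strip())

    counts = Counter()
    for text in pages:
        # Count each distinct line once per page.
        for line in {normalize(l) for l in text.splitlines() if l.strip()}:
            counts[line] += 1

    threshold = max(2, int(len(pages) * min_ratio))
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for text in pages:
        kept = [l for l in text.splitlines() if normalize(l) not in repeated]
        cleaned.append('\n'.join(kept))
    return cleaned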
Step-by-Step Process
Different document formats require different extraction approaches. Classify each document before processing: text-based PDFs (most modern PDFs created from word processors), scanned PDFs (images of paper documents, requiring OCR), Word documents (.docx), HTML pages, and plain text files. The classification determines which extraction library to use.
import magic
import fitz  # PyMuPDF

def classify_document(file_path):
    mime = magic.from_file(file_path, mime=True)
    if mime == 'application/pdf':
        # A text-based PDF yields extractable text on the first page;
        # a scanned PDF yields little or none and needs OCR.
        doc = fitz.open(file_path)
        text = doc[0].get_text() if len(doc) > 0 else ""
        if len(text.strip()) < 50:
            return "scanned_pdf"
        return "text_pdf"
    elif mime in ('application/msword',
                  'application/vnd.openxmlformats-officedocument.wordprocessingml.document'):
        return "word"
    elif mime == 'text/html':
        return "html"
    elif mime.startswith('text/'):
        return "text"
    return "unknown"

Use the right library for each format. PyMuPDF (fitz) is the best general-purpose PDF text extractor in 2026, handling most layouts correctly. For scanned PDFs, use Tesseract OCR or a cloud OCR service. For Word documents, use python-docx. For HTML, use BeautifulSoup.
import fitz
from docx import Document
from bs4 import BeautifulSoup

def extract_text_pdf(file_path):
    doc = fitz.open(file_path)
    pages = []
    for page_num, page in enumerate(doc):
        text = page.get_text("text")
        pages.append({
            "page": page_num + 1,
            "text": text
        })
    return pages

def extract_text_word(file_path):
    doc = Document(file_path)
    paragraphs = []
    for para in doc.paragraphs:
        if para.text.strip():
            paragraphs.append(para.text)
    return [{"page": 1, "text": "\n".join(paragraphs)}]

def extract_text_html(file_path):
    with open(file_path, 'r') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
    # Drop non-content elements before extracting text.
    for tag in soup(['script', 'style', 'nav', 'footer']):
        tag.decompose()
    return [{"page": 1, "text": soup.get_text("\n")}]

Tables are the biggest source of extraction errors in PDFs. Standard text extraction merges table cells into a single line, losing the row/column structure that gives the data meaning. Use a table-aware extraction method that detects table boundaries and extracts cell values with their positions.
import pdfplumber

def extract_tables(file_path):
    tables = []
    with pdfplumber.open(file_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            for table in page.extract_tables():
                if table and len(table) > 1:
                    # Treat the first row as headers and render each data row as
                    # "header: value" pairs so the column context stays attached
                    # to each cell value.
                    headers = table[0]
                    for row in table[1:]:
                        description = ", ".join(
                            f"{h}: {v}" for h, v in
                            zip(headers, row) if v
                        )
                        tables.append({
                            "page": page_num + 1,
                            "text": description
                        })
    return tables

Remove extraction artifacts: repeated headers/footers, page numbers, hyphenation at line breaks, excessive whitespace. Then segment the cleaned text into passages of 500 to 1,000 tokens for entity extraction. Respect paragraph and section boundaries when possible.
import re

def clean_text(text):
    text = re.sub(r'\n{3,}', '\n\n', text)       # collapse runs of blank lines
    text = re.sub(r'[ \t]+', ' ', text)          # collapse spaces and tabs
    text = re.sub(r'-\n(\w)', r'\1', text)       # rejoin words hyphenated at line breaks
    text = re.sub(r'Page \d+ of \d+', '', text)  # strip "Page X of Y" footers
    return text.strip()

def segment_into_passages(text, target_words=600):
    # Word count is used as a rough proxy for tokens.
    paragraphs = text.split('\n\n')
    passages = []
    current = []
    current_len = 0
    for para in paragraphs:
        words = len(para.split())
        # Start a new passage once the target size is reached, keeping
        # paragraphs intact rather than splitting mid-paragraph.
        if current_len + words > target_words and current:
            passages.append('\n\n'.join(current))
            current = [para]
            current_len = words
        else:
            current.append(para)
            current_len += words
    if current:
        passages.append('\n\n'.join(current))
    return passages

Feed the cleaned passages through your entity extraction pipeline (NER model or LLM-based extraction). Attach document-level metadata (file name, page number, section heading) to each extracted entity so you know where it came from.
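As one concrete option (spaCy is used here purely for illustration; any NER model or an LLM extraction call slots into the same loop), run each passage through the model and tag every entity with its source:

import spacy

nlp = spacy.load("en_core_web_sm")  # any NER model works here

def extract_entities(passages, file_name, page):
    """Run NER over cleaned passages, attaching source metadata to each hit."""
    entities = []
    for idx, passage in enumerate(passages):
        doc = nlp(passage)
        for ent in doc.ents:
            entities.append({
                "text": ent.text,
                "label": ent.label_,
                "char_start": ent.start_char,  # offsets are relative to the passage
                "char_end": ent.end_char,
                "file": file_name,
                "page": page,
                "passage": idx,
            })
    return entities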
Store the page number, section heading, and character offsets for each extracted entity mention. This provenance data is essential for two things: verifying extraction accuracy by tracing entities back to their source, and providing context when an entity is returned in a search result so the user can find the original passage.
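A small illustration of why the offsets matter: given a stored mention record (the dict shape from the sketch above) and the original passages, you can pull back the surrounding text for spot-checking or for display next to a search result. The helper name and window size are arbitrary:

def mention_context(mention, passages, window=100):
    """Return the text surrounding a stored entity mention for verification."""
    passage = passages[mention["passage"]]
    start = max(0, mention["char_start"] - window)
    end = min(len(passage), mention["char_end"] + window)
    return passage[start:end]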
For scanned PDFs, render each page to an image and run Tesseract with the --psm 1 flag (automatic page segmentation with OSD) for best results on mixed layouts. Cloud OCR services (Google Document AI, AWS Textract) handle complex layouts better than Tesseract but cost $1.50 to $3.00 per 1,000 pages.
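A minimal OCR sketch along those lines, using pdf2image and pytesseract (the 300 DPI value is a common default, not a requirement from the text above):

from pdf2image import convert_from_path
import pytesseract

def extract_text_scanned_pdf(file_path):
    """OCR a scanned PDF page by page, mirroring the extract_text_* functions."""
    pages = []
    images = convert_from_path(file_path, dpi=300)  # render each page as an image
    for page_num, image in enumerate(images):
        text = pytesseract.image_to_string(image, config="--psm 1")
        pages.append({"page": page_num + 1, "text": text})
    return pages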
Store text from any source as memories. Adaptive Recall extracts entities automatically, whether the text comes from PDFs, code, conversations, or any other source.