Introducing PaperShift: An Open Source PDF Parser with OCR

PaperShift is a powerful open-source tool for converting PDF documents to well-formatted Markdown with advanced OCR capabilities.

Introducing PaperShift: Transforming PDFs into Accessible Markdown

I remember the first time I had to extract content from a PDF. It was 2023, and I was working on a project that required analyzing hundreds of legal documents. I thought it would be straightforward: open the PDF, copy the text, paste it somewhere useful. Three hours later, I was still manually fixing garbled text, reconstructing tables, and cursing whoever invented the PDF format.

PDFs are like those fancy desserts that look perfect on the plate but fall apart the moment you try to eat them. They're designed for viewing, not for interaction. And yet, in 2025, they remain the lingua franca of document exchange.

The PDF Paradox

There's a curious paradox with PDFs. They're simultaneously ubiquitous and frustrating. Everyone uses them, but no one enjoys working with their contents. It's like how email became the standard for business communication despite everyone complaining about it.

Last month, one of our developers at askjunior spent three days extracting data from a 200-page PDF report. The document contained handwritten notes in the margins, tables nested within tables, and images with embedded text. By the end, he had created a custom Python script that worked for that specific document—and would likely never be useful again.

This scenario repeats itself across thousands of organizations daily. Someone receives a PDF, needs the content in a different format, and ends up in one of three situations:

Manual retyping (slow and error-prone)
Using generic OCR tools (mediocre results requiring extensive cleanup)
Writing custom extraction code (effective but not reusable)

It's like everyone is repeatedly reinventing slightly different wheels, none of which roll particularly well.

The Birth of PaperShift

At askjunior, we've been working with case files for the past year. These documents come in every conceivable format: handwritten notes, evidence photos, single-column reports, multi-column academic papers. For each project, we found ourselves building custom PDF-to-Markdown conversion pipelines.

It reminded me of the early days of web development, when every project required writing the same authentication system from scratch. Eventually, someone created libraries that solved the problem once and for all. That's what we needed for PDF conversion.

After the fifth time recreating a PDF extraction pipeline, I had that moment of clarity that comes from sufficient frustration: this problem needed a generalized solution.

PaperShift emerged from this realization. Not as a perfect solution to all PDF problems—no such thing exists—but as a tool that handles 90% of cases well enough that you can focus on the remaining 10% that actually matters to your specific use case.

What Makes PaperShift Different

Most PDF tools feel like they were designed by people who have never actually had to extract content from a complex document under deadline pressure.

I once watched a colleague try to extract data from a financial statement using a popular PDF tool. The tool proudly announced it had completed the extraction, but somehow managed to merge two columns of a table, interpret a logo as text, and completely ignore a crucial footnote. My colleague ended up doing it manually.

PaperShift takes a different approach. It combines advanced PDF processing with state-of-the-art AI models, but more importantly, it's designed by people who have felt the pain of poor PDF extraction firsthand.

The key features emerged organically from our own needs:

High-Quality Conversion: Because we got tired of fixing broken formatting
Parallel Processing: Because waiting for large documents to process is maddening
Memory Optimization: Because we crashed our systems one too many times with large files
Fast Mode Option: Because sometimes "good enough now" beats "perfect later"
Detailed Progress Reporting: Because staring at a progress bar that says "processing" for 20 minutes makes you question your life choices
Customizable AI Models: Because different documents need different approaches
Adaptive Resolution: Because one-size-fits-all settings rarely fit anything well

Getting Started: Simpler Than You'd Expect

One thing I've noticed about developer tools is that they often solve one problem while creating another: a steep learning curve. We wanted PaperShift to be different.

Installation is as simple as:

pip install papershift

And basic usage isn't much more complicated:

from papershift import convert_pdf_to_markdown
 
# Basic usage
markdown_content = convert_pdf_to_markdown(
    pdf_path="path/to/your/document.pdf",
    api_key="your-openrouter-api-key"
)
 
# Save the output
with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_content)

Of course, for those who need more control (and in PDF processing, that's often necessary), we offer more advanced options:

markdown_content = convert_pdf_to_markdown(
    pdf_path="path/to/your/document.pdf",
    output_dir="output_folder",
    dpi=300,
    target_height_px=2048,
    model="openrouter/google/gemini-2.0-flash-001",
    api_key="your-openrouter-api-key",
    max_workers=4,
    batch_size=5,
    fast_mode=True
)

The Engine Room: How PaperShift Works

Under the hood, PaperShift is like a well-orchestrated assembly line. Each component has a specific job:

Document Analysis: Think of this as the foreman, examining the document and planning the approach
Image Extraction: The photographer, capturing high-quality images of each page
Parallel Processing: The team of workers, each handling different pages simultaneously
AI-Powered OCR: The linguists, interpreting the text from the images
Markdown Formatting: The editors, arranging everything into clean, usable format

We've built this on top of several excellent libraries:

PyMuPDF: For robust PDF processing
litellm: For seamless LLM API integration
openrouter: For accessing various AI models
python-dotenv: For environment variable management

The Development Journey: Lessons from the Trenches

Building PaperShift has been a journey of discovery, much like raising a child. You start with certain expectations, and reality quickly teaches you how naive those were.

When we began, we thought the main challenge would be text recognition. It wasn't. Modern OCR is surprisingly good. The real challenges were:

Layout Understanding: Recognizing when text is in multiple columns, when it spans pages, when it's a header vs. body text
Table Detection: Identifying tables and preserving their structure
Context Preservation: Maintaining the relationship between text elements
Edge Cases: Handling documents that seem designed specifically to break OCR systems

Each of these problems taught us something valuable. Table detection, for instance, showed us that what humans easily recognize as a table is actually a complex pattern of spatial relationships that algorithms struggle to identify consistently.

Working with case files at askjunior provided the perfect testing ground. Each new project brought documents that broke our system in new and interesting ways, forcing us to adapt and improve.

Measuring Success: Benchmarking Against Reality

One question that kept coming up was: "How do we know if PaperShift is actually good?"

It's easy to test a tool against simple documents and declare victory. It's much harder to evaluate performance on the messy, complex documents people encounter in the real world.

That's why we're benchmarking PaperShift against the OmniAI OCR Benchmark, a comprehensive dataset designed to evaluate OCR and data extraction capabilities across different systems.

This benchmark is valuable because it:

Compares OCR and data extraction capabilities across different multimodal LLMs
Evaluates both text extraction accuracy and JSON formatting precision
Uses a methodology that measures how well a model can OCR a document and return content in a format that an LLM can parse
Includes diverse document types that match our real-world use cases

But benchmarks only tell part of the story. The real test is how PaperShift performs on the documents people actually need to convert. That's why we're also testing against:

Academic papers with complex formatting
Legal documents with multi-column layouts
Handwritten notes with varying legibility
Forms with tables and structured data
Documents with mixed text and image content

Based on these benchmarks, we're developing a secure API service that will provide enterprise-grade security, consistent performance, scalable processing, and detailed quality metrics.

Real-World Applications: Beyond the Technical

The technical aspects of PaperShift are interesting, but what really matters is how it helps people solve real problems.

A researcher I know spent weeks manually extracting data from published papers for a meta-analysis. PaperShift reduced that process from days to hours, allowing her to include twice as many papers in her study.

A legal team used it to convert case files dating back to the 1990s into searchable text, uncovering precedents that had been effectively lost in their paper archives.

A documentation team integrated it into their workflow to incorporate legacy PDF manuals into their modern knowledge base.

These examples highlight something important: tools like PaperShift don't just save time; they unlock possibilities that were previously impractical.

The Road Ahead: What's Next for PaperShift

PaperShift today is useful, but it's far from finished. The PDF ecosystem is vast and complex, and there are many areas where we can improve.

Our roadmap includes:

Advanced Table Extraction: We already handle standard tables effectively, but we're working on more complex scenarios:
- Tables containing embedded images or diagrams
- Nested tables (tables within tables)
- Tables with merged cells or irregular structures
- Preserving table styling and formatting in Markdown output
Support for Complex Layouts:
- Multi-column academic papers with footnotes
- Magazine-style layouts with text flowing around images
- Documents with sidebars and callout boxes
- Forms with complex field arrangements
Enhanced Image Handling:
- Improved image extraction and placement in Markdown
- Caption detection and preservation
- Diagram and chart recognition with optional alt text generation
- Image quality optimization for web publishing
Additional Output Formats:
- HTML with preserved formatting
- Structured JSON for data extraction
- Integration with popular documentation platforms
- Custom templating options
Performance Optimizations:
- Reduced memory footprint for large documents
- Faster processing through enhanced parallelization
- Smarter batching strategies for multi-page documents
- Optimized AI model selection based on document characteristics

Each of these improvements represents dozens of edge cases we've encountered and want to handle better.

Contributing: Join the PDF Liberation Front

As an open-source project, PaperShift welcomes contributions from the community. Whether you're interested in adding features, fixing bugs, or improving documentation, your help is valuable.

I've found that the best open-source projects are those that scratch a real itch for their creators. If you've ever struggled with PDF extraction, you probably have insights that could make PaperShift better.

Final Thoughts: The Future of Document Processing

PDFs aren't going away anytime soon. Despite their limitations, they remain the standard for document exchange across industries. But that doesn't mean we have to accept the status quo of painful extraction processes.

PaperShift represents a step toward a future where the content of documents is as accessible as their appearance. Where information isn't locked away in formats designed primarily for printing.

I invite you to try PaperShift for your PDF conversion needs and share your feedback. The tool will improve fastest if it's shaped by the real needs of its users.

In the end, tools like PaperShift aren't just about converting documents from one format to another. They're about making information more accessible, more useful, and more valuable. And that's a goal worth pursuing.

Note: This post will be updated with additional details and examples. Stay tuned for more information about PaperShift!