Repository Radar

olmOCR

BTR ACTIVE BREAKOUT

allenai/olmocr

Toolkit for linearizing PDFs for LLM datasets/training

stars 18k
last activity 3mo ago
open issues 55
language Python
license Apache-2.0
latest release v0.4.27
momentum · per month since covered: +642/mo (+8%/mo) · +10k total since PR#3

metrics as of today

star history

Exact curve on star-history.com ↗
[star-history chart: Sep 2024 to Jul 2026, from 8k★ at PR#3 to 18k★ now]
  1. PR#3 8k★ 2025-03-05
  2. now 18k★ + 10k since first covered

curve is sampled from GitHub's star history; the dashed stretch is before we first covered it, the solid line since. figures at coverage are the numbers we printed then (approx.), current count is live.

covered in

  • PR#3 2025-03-05 below the radar

    Toolkit for linearizing PDFs for LLM datasets/training
