How to chunk a PDF document for a RAG pipeline
- Step 1: Upload the PDF. Drop the document into the PDF chunker.
- Step 2: Set chunk size and overlap. Configure chunk size (e.g., 512 tokens) and overlap (e.g., 50 tokens) based on your embedding model's context window.
- Step 3: Download the chunks as JSON. Save the chunked output.
- Step 4: Ingest into your vector database. Feed the chunks to your embedding model and store the resulting vectors in Pinecone, Chroma, or pgvector.
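To make the chunk size and overlap settings concrete, here is a minimal Python sketch of fixed-size chunking with overlap. It is not the chunker's own implementation. It assumes whitespace-separated words as a stand-in for model tokens (a real pipeline would count tokens with the embedding model's tokenizer), and the function names and the JSON field names `id` and `text` are hypothetical.

```python
import json


def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    stride = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


def chunks_to_json(text, chunk_size=512, overlap=50):
    """Chunk plain text and serialize the chunks as a JSON array."""
    # Assumption: whitespace-separated words approximate tokens.
    tokens = text.split()
    records = [
        {"id": i, "text": " ".join(chunk)}  # hypothetical field names
        for i, chunk in enumerate(chunk_tokens(tokens, chunk_size, overlap))
    ]
    return json.dumps(records, indent=2)
```

Each chunk repeats the last `overlap` tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.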
Frequently asked questions
What chunk size should I use for OpenAI's text-embedding-ada-002?
text-embedding-ada-002 supports up to 8191 tokens per input. A chunk size of 256 to 512 tokens with a 50-token overlap stays well under that limit and balances retrieval precision and context.
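As a rough guide to how these settings affect the number of embeddings you will store, the following sketch estimates the chunk count for a document of a given length. It assumes fixed-size windows that advance by chunk size minus overlap; the function name `estimate_chunk_count` is hypothetical, and the actual chunker may count differently.

```python
import math


def estimate_chunk_count(total_tokens, chunk_size=512, overlap=50):
    """Estimate how many overlapping fixed-size chunks a document yields."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1  # the whole document fits in a single chunk
    stride = chunk_size - overlap
    # Every chunk after the first covers `stride` new tokens.
    return math.ceil((total_tokens - overlap) / stride)


print(estimate_chunk_count(10_000, chunk_size=512, overlap=50))  # 22
print(estimate_chunk_count(10_000, chunk_size=256, overlap=50))  # 49
```

Halving the chunk size roughly doubles the number of vectors, which is the cost of the finer retrieval granularity.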
Should chunks respect sentence boundaries?
Yes. Semantic chunking that splits at sentence or paragraph boundaries produces more coherent chunks than fixed-character splits, which can cut a sentence in half.
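One simple way to respect sentence boundaries is to pack whole sentences into a chunk until a token budget is reached. The sketch below illustrates that idea only. It assumes a naive regex sentence splitter and whitespace word counts in place of real tokens, and the function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_by_sentence(text, max_tokens=512):
    """Group whole sentences into chunks of at most max_tokens words."""
    chunks, current, current_len = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())  # assumption: words approximate tokens
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))  # close the chunk before it overflows
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A single sentence longer than `max_tokens` is kept whole as its own chunk here; a production splitter would need a fallback for that case, and abbreviations such as "e.g." will confuse the regex.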
Does the chunker handle multi-column PDFs?
For single-column PDFs, reading order is preserved. Multi-column PDFs may require pre-processing to restore correct reading order before chunking.
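As an illustration of what such pre-processing might look like, the sketch below reorders text blocks on a two-column page so the left column is read before the right. It assumes each block is a hypothetical `(x0, y0, text)` tuple with `y0` measured from the top of the page, and that the columns split at the page midpoint; real layouts with full-width headings, figures, or more than two columns need more careful handling.

```python
def restore_reading_order(blocks, page_width):
    """Order text blocks left column first, then right, top to bottom.

    blocks: list of (x0, y0, text) tuples; y0 grows downward (assumption).
    """
    midpoint = page_width / 2
    left = [b for b in blocks if b[0] < midpoint]
    right = [b for b in blocks if b[0] >= midpoint]
    ordered = sorted(left, key=lambda b: b[1]) + sorted(right, key=lambda b: b[1])
    return [text for _, _, text in ordered]


blocks = [
    (320, 100, "R1"),
    (40, 100, "L1"),
    (40, 300, "L2"),
    (320, 300, "R2"),
]
print(restore_reading_order(blocks, page_width=600))  # ['L1', 'L2', 'R1', 'R2']
```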
Privacy first
All PDF processing runs locally in your browser using pdf-lib and pdf.js. No file is ever uploaded to a server; only metadata counters are saved for signed-in dashboard stats.