How to convert a PDF to text chunks for vector database ingestion
- Step 1: Upload the PDF. Drop the document into the chunker.
- Step 2: Configure chunk size and output format. Set the chunk size and enable JSON output with metadata.
- Step 3: Download the chunks JSON. Save the output file.
- Step 4: Embed and upsert to your vector database. Iterate over the chunks, call your embedding API on each one, and upsert the resulting vectors to your vector store.
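Step 4 can be sketched in Python as follows. This is a minimal illustration, not the tool's own code: the JSON layout (a list of objects with "text" and "metadata" keys), the `embed_stub` function, and the `InMemoryStore` class are all assumptions made so the example runs offline. Swap in your real embedding API and vector database client.

```python
import json
from typing import Callable


def embed_stub(text: str) -> list[float]:
    """Hypothetical placeholder: replace with a call to your embedding API."""
    # Toy 3-dimensional vector so the sketch runs without network access.
    return [float(len(text)), float(text.count(" ")), float(sum(map(ord, text)) % 97)]


class InMemoryStore:
    """Hypothetical stand-in for a vector database client with an upsert method."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}

    def upsert(self, row_id: str, vector: list[float], metadata: dict) -> None:
        # Writing to the same ID twice overwrites the earlier row (upsert semantics).
        self.rows[row_id] = {"vector": vector, "metadata": metadata}


def ingest(chunks_json: str, embed: Callable[[str], list[float]], store: InMemoryStore) -> int:
    """Embed every chunk in the JSON string and upsert it; return the chunk count."""
    chunks = json.loads(chunks_json)
    for chunk in chunks:
        meta = chunk["metadata"]
        # Assumed ID scheme: document name plus chunk index.
        row_id = f'{meta["document"]}:{meta["chunk_index"]}'
        store.upsert(row_id, embed(chunk["text"]), meta)
    return len(chunks)


# Assumed shape of the downloaded chunks file.
sample = json.dumps([
    {"text": "First chunk of text.", "metadata": {"document": "report.pdf", "chunk_index": 0}},
    {"text": "Second chunk of text.", "metadata": {"document": "report.pdf", "chunk_index": 1}},
])

store = InMemoryStore()
count = ingest(sample, embed_stub, store)
```

Deriving the row ID from the document name and chunk index means re-running the ingestion overwrites existing rows instead of creating duplicates.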
Frequently asked questions
What metadata should I include per chunk?
Include document name, page number, chunk index, and section heading. This allows precise citation when chunks are retrieved in a RAG response.
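One way to model these four fields is a small record type. The sketch below is illustrative only: the field names and the citation format are assumptions, not the chunker's actual output schema.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ChunkMetadata:
    """Hypothetical per-chunk metadata record with the four suggested fields."""
    document: str
    page: int
    chunk_index: int
    section: str


def format_citation(meta: ChunkMetadata) -> str:
    """Build a human-readable citation for a retrieved chunk."""
    return f'{meta.document}, p. {meta.page}, section "{meta.section}" (chunk {meta.chunk_index})'


meta = ChunkMetadata(document="report.pdf", page=12, chunk_index=37, section="Results")
citation = format_citation(meta)
payload = asdict(meta)  # plain dict, ready to store alongside the vector
```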
Should I deduplicate chunks?
Yes. If you process multiple PDFs with overlapping content, compute a hash of each chunk's text and skip duplicates before upserting to the vector database.
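Hash-based deduplication might look like the sketch below. It uses SHA-256 from Python's standard library. Normalizing whitespace and case before hashing is an added assumption here, so that trivially different copies collide. Drop that step if you only want exact matches removed.

```python
import hashlib


def chunk_hash(text: str) -> str:
    """Return a SHA-256 hex digest of the chunk text after light normalization."""
    # Assumption: collapse whitespace runs and lowercase so near-identical copies match.
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(chunks: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct chunk, preserving order."""
    seen: set[str] = set()
    unique: list[str] = []
    for text in chunks:
        digest = chunk_hash(text)
        if digest in seen:
            continue  # duplicate: skip before embedding and upserting
        seen.add(digest)
        unique.append(text)
    return unique


chunks = ["Revenue grew 8%.", "revenue  grew 8%.", "Costs fell 3%."]
unique_chunks = deduplicate(chunks)
```

Deduplicating before the embedding call also avoids paying to embed the same text twice.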
What is the correct chunk overlap for a RAG pipeline?
An overlap of 10-15% of the chunk size is common (e.g., roughly 50 to 75 tokens for 512-token chunks). Overlap keeps context from being lost at chunk boundaries.
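The sliding-window arithmetic behind overlapping chunks can be sketched as follows. The function name and the pre-tokenized input are assumptions for illustration. In practice the token list would come from the tokenizer that matches your embedding model.

```python
def chunk_with_overlap(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with their neighbor."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# 1000 placeholder tokens, 512-token chunks, 51-token overlap (about 10%).
tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_with_overlap(tokens, chunk_size=512, overlap=51)
```

With these numbers each window advances by 461 tokens, so the last 51 tokens of one chunk reappear as the first 51 tokens of the next.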
Privacy first
All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded to a server; only metadata counters are saved for signed-in dashboard stats.