The Chunker (GitHub)
the_chunker is a focused chunking engine for building LLM-friendly datasets without tying you to any embedding or vector store. It uses Tree-sitter for AST-aware chunking when possible, with reliable fallbacks for everything else.
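As a rough illustration of AST-aware chunking with a fallback, here is a minimal sketch. It swaps in Python's standard-library `ast` module as a stand-in for Tree-sitter, so it only understands Python source. The function name `chunk_python_source` is hypothetical and is not part of the_chunker's API.

```python
import ast
from typing import List


def chunk_python_source(source: str) -> List[str]:
    """Return one chunk per top-level statement (hypothetical helper).

    Uses the stdlib `ast` module as a stand-in for Tree-sitter. If the
    source does not parse, falls back to splitting on blank lines.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Fallback path: treat the input as plain text blocks.
        return [block.strip() for block in source.split("\n\n") if block.strip()]

    lines = source.splitlines()
    chunks: List[str] = []
    for node in tree.body:
        # lineno is 1-based and end_lineno is inclusive (Python 3.8+).
        # Decorators sit above the `def`/`class` line, so include them.
        decorator_lines = [d.lineno for d in getattr(node, "decorator_list", [])]
        start = min([node.lineno] + decorator_lines)
        chunks.append("\n".join(lines[start - 1 : node.end_lineno]))
    return chunks
```

This simplified version drops comments and blank lines that sit between top-level statements; a real chunker would likely attach them to a neighboring chunk.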
What it does
- Creates semantic chunks from code and text files
- Merges chunks with configurable overlap to preserve context across chunk boundaries
- Counts tokens in a model-aware way so chunks land within a target size range
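The merge and token-counting steps above can be sketched as follows. This is a simplified illustration and not the_chunker's actual implementation. It assumes whitespace-separated words as a stand-in for model-aware token counting, and the name `merge_with_overlap` and its parameters are hypothetical.

```python
from typing import List


def merge_with_overlap(
    chunks: List[str], max_tokens: int, overlap_tokens: int
) -> List[str]:
    """Greedily merge adjacent chunks up to a soft token budget.

    Assumption: a "token" is a whitespace-separated word. A real pipeline
    would count tokens with the target model's tokenizer instead.
    """
    merged: List[str] = []
    current: List[str] = []
    for chunk in chunks:
        words = chunk.split()
        if current and len(current) + len(words) > max_tokens:
            # Budget would be exceeded: emit what we have so far.
            merged.append(" ".join(current))
            # Carry the tail of the emitted chunk into the next one.
            current = current[-overlap_tokens:] if overlap_tokens > 0 else []
        current.extend(words)
    if current:
        merged.append(" ".join(current))
    return merged


if __name__ == "__main__":
    pieces = ["a b c", "d e", "f g h", "i"]
    print(merge_with_overlap(pieces, max_tokens=5, overlap_tokens=1))
    # prints ['a b c d e', 'e f g h i']
```

The budget here is soft: an input chunk is never split, so a merged chunk can exceed `max_tokens` when a single input chunk is large or when the carried-over overlap pushes it past the limit.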
Why it exists
Most pipelines mix chunking with embedding and storage logic. the_chunker keeps chunking independent so you can plug it into any RAG, summarization, or code search workflow without rewriting core logic.