Document Ingestion
Uploading PDFs, DOCX files, spreadsheets, and presentations for automatic chunking, embedding, and knowledge extraction.
How Document Ingestion Works
When you send a document to your AI employee, the document goes through a multi-step ingestion process. First, it is parsed to extract its text content. Then the text is split into meaningful chunks: sections, paragraphs, or logical blocks. Each chunk is converted into a vector embedding that captures its semantic meaning. Finally, the chunks and their embeddings are stored in your knowledge base, where they become searchable. This lets the AI reference specific parts of your documents when answering questions.
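The four steps above (parse, chunk, embed, store) can be sketched in a few lines of Python. This is only an illustration and not the product's implementation: the names `parse`, `split_into_chunks`, `embed`, and `KnowledgeBase` are hypothetical, and the "embedding" here is a plain word-count vector, which matches keywords rather than meaning. A real embedding model would take its place.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def parse(raw: bytes) -> str:
    """Step 1: extract text. This sketch assumes the file is already UTF-8 text."""
    return raw.decode("utf-8")


def split_into_chunks(text: str) -> list[str]:
    """Step 2: split at blank lines, i.e. paragraph boundaries."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def embed(text: str) -> dict[str, float]:
    """Step 3: toy stand-in for an embedding model (normalized word counts)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two already-normalized sparse vectors."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


@dataclass
class KnowledgeBase:
    """Step 4: in-memory store of (chunk, vector) pairs."""
    entries: list[tuple[str, dict[str, float]]] = field(default_factory=list)

    def ingest(self, raw: bytes) -> int:
        """Run the whole pipeline on one document; return the number of chunks stored."""
        chunks = split_into_chunks(parse(raw))
        for chunk in chunks:
            self.entries.append((chunk, embed(chunk)))
        return len(chunks)

    def search(self, query: str, top_k: int = 1) -> list[str]:
        """Return the top_k stored chunks most similar to the query."""
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: similarity(q, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]


if __name__ == "__main__":
    kb = KnowledgeBase()
    kb.ingest(b"Employees get 20 vacation days per year.\n\n"
              b"Expense reports are due on the first Monday of each month.")
    print(kb.search("How many vacation days?"))
```

Keeping each step as a separate function mirrors the description: any one stage (for example, the embedding) can be swapped out without touching the others.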
Supported Document Types
The ingestion system supports a wide range of document formats: PDF files (including scanned documents), Microsoft Word documents (DOCX, DOC), plain text files (TXT, MD), spreadsheets (XLSX, CSV), PowerPoint presentations (PPTX), and HTML pages. For scanned PDFs and images containing text, the system runs OCR (optical character recognition) to extract the text before processing. Each format is handled by a specialized parser to keep extraction as accurate as possible.
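One common way to give each format its own parser is a lookup table keyed by file extension. The sketch below is an assumption about how such a dispatch could look, not a description of the actual system: `PARSERS` and `extract_text` are hypothetical names, and only the formats that Python's standard library can read (TXT, MD, CSV, HTML) are covered. PDF, DOCX, XLSX, PPTX, and OCR would need additional libraries registered in the same table.

```python
import csv
import io
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML page (does not skip <script> or <style>)."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.parts.append(data.strip())


def parse_text(raw: bytes) -> str:
    return raw.decode("utf-8")


def parse_csv(raw: bytes) -> str:
    # Flatten each row into one line so rows stay together when chunked later.
    rows = csv.reader(io.StringIO(raw.decode("utf-8")))
    return "\n".join(" | ".join(row) for row in rows)


def parse_html(raw: bytes) -> str:
    extractor = _TextExtractor()
    extractor.feed(raw.decode("utf-8"))
    extractor.close()
    return "\n".join(extractor.parts)


# Hypothetical registry: one parser function per supported extension.
PARSERS = {
    ".txt": parse_text,
    ".md": parse_text,
    ".csv": parse_csv,
    ".html": parse_html,
}


def extract_text(filename: str, raw: bytes) -> str:
    """Pick a parser from the file extension and return the extracted text."""
    ext = Path(filename).suffix.lower()
    if ext not in PARSERS:
        raise ValueError(f"no parser registered for {ext!r}")
    return PARSERS[ext](raw)
```

For example, `extract_text("prices.csv", b"item,price\nwidget,4\n")` returns the two rows as `item | price` and `widget | 4`, one per line.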
For best results with scanned documents, scan at a resolution of at least 300 DPI. Blurry or low-resolution scans may produce less accurate OCR results.
Uploading Documents
To ingest a document, send it to your AI employee through Telegram and tell it to learn from the document. You can say "learn this," "add this to your knowledge base," or "ingest this document." The AI will process the file, confirm when ingestion is complete, and summarize what it learned. You can also send multiple documents at once; the AI will process them in sequence.
Ingesting a company handbook
Upload a document for the AI to learn from.
Auto-Chunking & Embedding
The chunking strategy is optimized for retrieval accuracy. Documents are split at natural boundaries: section headings, paragraph breaks, and topic changes. Each chunk is sized to contain enough context to be useful on its own (typically 500 to 1,000 tokens). Adjacent chunks overlap, so information that falls on a boundary is not lost. Each chunk is then embedded with an embedding model, enabling semantic search that matches on meaning rather than exact keywords.
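A minimal sketch of boundary-aware chunking with overlap, under some simplifying assumptions: tokens are approximated by whitespace-separated words (real tokenizers count differently), the only boundary recognized is a blank line between paragraphs, and the overlap is the last paragraph of the previous chunk. The function name and parameters are hypothetical, and heading or topic-change detection is not attempted.

```python
def word_count(paragraphs: list[str]) -> int:
    """Approximate token count: whitespace-separated words."""
    return sum(len(p.split()) for p in paragraphs)


def chunk_paragraphs(text: str, max_tokens: int = 1000,
                     overlap_paragraphs: int = 1) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_tokens words.

    Each new chunk starts with the last `overlap_paragraphs` paragraphs of the
    previous one, so content near a boundary appears in both chunks. A single
    paragraph longer than max_tokens is kept whole and becomes its own chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    fresh = 0  # paragraphs in `current` that no emitted chunk contains yet

    for para in paragraphs:
        if fresh and word_count(current + [para]) > max_tokens:
            chunks.append("\n\n".join(current))
            carry = current[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
            # Drop the overlap if it would push the next chunk over budget.
            if word_count(carry + [para]) > max_tokens:
                carry = []
            current, fresh = carry, 0
        current.append(para)
        fresh += 1

    if fresh:
        chunks.append("\n\n".join(current))
    return chunks
```

With four three-word paragraphs A, B, C, D and `max_tokens=6`, this yields the chunks A+B, B+C, and C+D: every paragraph boundary is covered by one chunk that contains the text on both sides of it.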