Document Ingestion

Uploading PDFs, DOCX files, spreadsheets, and presentations for automatic chunking, embedding, and knowledge extraction.

Last updated: April 14, 2026

Tags: documents, upload, pdf, docx, ingestion, chunking, embedding, spreadsheet

How Document Ingestion Works

When you send a document to your AI employee, it goes through a multi-step ingestion process. First, the document is parsed to extract all text content. Then the text is split into meaningful chunks — sections, paragraphs, or logical blocks. Each chunk is converted into a vector embedding that captures its semantic meaning. Finally, the chunks are stored in your knowledge base and become searchable. This process enables the AI to reference specific parts of your documents when answering questions.
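
The four steps above (parse, chunk, embed, store) can be sketched in a few lines of Python. This is a minimal illustration, not Sarudo's actual pipeline: every function name is hypothetical, the parser only handles plain text, and the "embedding" is a normalized word-count vector standing in for a real embedding model (so it matches shared words, not meaning).

```python
import math
import re
from collections import Counter


def parse(raw: bytes) -> str:
    """Step 1: extract text. This sketch only handles UTF-8 plain text."""
    return raw.decode("utf-8")


def split_into_chunks(text: str) -> list[str]:
    """Step 2: split at blank lines, i.e. paragraph breaks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def embed(chunk: str) -> dict[str, float]:
    """Step 3: toy stand-in for an embedding model.

    Builds an L2-normalized word-count vector, stored sparsely as a dict.
    """
    counts = Counter(re.findall(r"[a-z0-9]+", chunk.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def ingest(raw: bytes, store: list) -> int:
    """Step 4: store each chunk with its vector; return the chunk count."""
    chunks = split_into_chunks(parse(raw))
    for chunk in chunks:
        store.append((chunk, embed(chunk)))
    return len(chunks)


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two already-normalized sparse vectors."""
    return sum(value * b.get(word, 0.0) for word, value in a.items())


def search(query: str, store: list, top_k: int = 1) -> list[str]:
    """Return the top_k stored chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


knowledge_base: list = []
document = b"PTO policy: employees get 20 days per year.\n\nExpense reports are due monthly."
chunk_count = ingest(document, knowledge_base)
best_match = search("How many PTO days per year?", knowledge_base)[0]
```

Here `chunk_count` is 2 and `best_match` is the PTO paragraph, because it is the only chunk that shares words with the question. A real embedding model would also match paraphrases that share no words.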

Supported Document Types

The ingestion system supports a wide range of document formats: PDF files (including scanned documents, via OCR), Microsoft Word documents (DOCX, DOC), plain text files (TXT, MD), spreadsheets (XLSX, CSV), PowerPoint presentations (PPTX), and HTML pages. For scanned PDFs and images containing text, the system uses OCR (optical character recognition) to extract the text before processing. Each format is handled by a specialized parser to maximize extraction accuracy.
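
If you prepare files with a script before sending them, a small pre-upload check can catch unsupported formats early. The sketch below is an assumption, not part of the product: the mapping and the `pick_parser` function are hypothetical, and the parser family names are illustrative labels for the formats listed above.

```python
from pathlib import Path

# Extension -> parser family, mirroring the supported formats listed above.
# The family names are illustrative labels, not real component names.
PARSER_BY_EXTENSION = {
    ".pdf": "pdf",
    ".docx": "word",
    ".doc": "word",
    ".txt": "text",
    ".md": "text",
    ".xlsx": "spreadsheet",
    ".csv": "spreadsheet",
    ".pptx": "presentation",
    ".html": "html",
}


def pick_parser(filename: str) -> str:
    """Return the parser family for a file, based on its extension."""
    extension = Path(filename).suffix.lower()
    if extension not in PARSER_BY_EXTENSION:
        raise ValueError(f"Unsupported document type: {filename}")
    return PARSER_BY_EXTENSION[extension]
```

For example, `pick_parser("employee-handbook.pdf")` returns `"pdf"`, while an unlisted extension raises `ValueError`.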

For best results with scanned documents, ensure the scan quality is at least 300 DPI. Blurry or low-resolution scans may produce less accurate OCR results.
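
To check a scan against the 300 DPI guideline, divide the image width in pixels by the physical page width in inches. The helper names below are hypothetical, and the sketch assumes the scan covers the full page width.

```python
MIN_OCR_DPI = 300  # guideline from the paragraph above


def effective_dpi(pixel_width: int, page_width_inches: float) -> float:
    """Dots per inch of a scan that spans the full page width."""
    if page_width_inches <= 0:
        raise ValueError("page_width_inches must be positive")
    return pixel_width / page_width_inches


def meets_ocr_guideline(pixel_width: int, page_width_inches: float) -> bool:
    """True if the scan reaches the recommended minimum resolution."""
    return effective_dpi(pixel_width, page_width_inches) >= MIN_OCR_DPI
```

A US Letter page (8.5 inches wide) scanned to a 2550-pixel-wide image works out to exactly 300 DPI. The same page at 1275 pixels is only 150 DPI and is likely to produce weaker OCR results.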

Uploading Documents

To ingest a document, simply send it to your AI employee through Telegram and tell it to learn from the document. You can say "learn this," "add this to your knowledge base," or "ingest this document." The AI will process the file and confirm when ingestion is complete, including a summary of what it learned. You can also send multiple documents at once — the AI will process them in sequence.

Ingesting a company handbook

Upload a document for the AI to learn from.

You say:

[Attaches employee-handbook.pdf] Learn this document. It's our company handbook.

Sarudo responds:

I've ingested the employee handbook (47 pages, 128 sections). Key topics stored include: company policies, PTO rules, expense reporting procedures, code of conduct, benefits overview, and organizational structure. I can now answer questions about any of these topics. For example, I know your PTO policy allows 20 days per year with a 5-day carryover maximum. Anything specific you'd like me to look up?

Auto-Chunking & Embedding

The chunking strategy is optimized for retrieval accuracy. Documents are split at natural boundaries — section headings, paragraph breaks, and topic changes. Each chunk is sized to contain enough context to be useful on its own (typically 500-1000 tokens). Overlapping text between chunks ensures that information at boundaries is not lost. Each chunk is embedded using a high-quality embedding model, enabling semantic search that understands meaning rather than just matching keywords.
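
The idea of splitting at natural boundaries while overlapping neighboring chunks can be sketched as follows. This is an illustration under stated assumptions, not the production chunker: the function name is hypothetical, whitespace-separated words stand in for model tokens, and only paragraph breaks are treated as boundaries.

```python
import re


def chunk_with_overlap(text: str, max_tokens: int = 1000, overlap: int = 50) -> list[str]:
    """Pack whole paragraphs into chunks of roughly max_tokens words.

    Each new chunk starts with the last `overlap` words of the previous
    one, so content near a boundary appears in both chunks. max_tokens is
    a target, not a hard limit: a chunk can exceed it by up to `overlap`
    words, and a single paragraph longer than max_tokens is kept whole.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")

    paragraphs = [p.split() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []

    for words in paragraphs:
        # Close the current chunk if this paragraph would overflow it.
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap:] if overlap else []
        current.extend(words)

    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "a1 a2 a3 a4\n\nb1 b2 b3 b4\n\nc1 c2 c3 c4"
sample_chunks = chunk_with_overlap(sample, max_tokens=8, overlap=2)
```

With these small limits, `sample_chunks` contains two chunks: the first holds paragraphs one and two, and the second begins with `b3 b4` (the overlap) followed by paragraph three.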
