Ingest 50+ File Types

From PDFs and spreadsheets to raw text and web pages, Orckai parses virtually any document format your organization works with. Every file is automatically chunked, embedded, and indexed for instant semantic retrieval.

PDF Documents

Full-text extraction from single-page and multi-hundred-page PDFs. Orckai handles scanned documents, embedded tables, multi-column layouts, and form fields. Headers, footers, and page numbers are stripped automatically so your search index stays clean and relevant.

Word Documents (DOCX)

Native DOCX parsing preserves heading hierarchy, bullet lists, numbered lists, and table structures. Orckai reads styled content and uses heading levels to create semantically meaningful chunks, so a section titled "Refund Policy" stays together as a single retrievable unit.

Plain Text (TXT)

Orckai ingests raw text files, log files, configuration files, and any other unstructured text content. It applies intelligent splitting to produce high-quality searchable segments even when the source has no formatting or markup to guide segmentation.
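As a rough illustration of splitting text that has no markup, here is a minimal sketch of a fixed-size splitter with overlap that prefers to break at whitespace. The function name `split_text` and the `max_chars` and `overlap` parameters are hypothetical and chosen for illustration; this is not Orckai's actual splitting logic, which is not described here.

```python
def split_text(text: str, max_chars: int = 200, overlap: int = 40) -> list[str]:
    """Split unformatted text into overlapping segments (illustrative only)."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    segments = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to end the segment at a space so the last word is not cut.
            # Searching past start + overlap guarantees forward progress.
            cut = text.rfind(" ", start + overlap + 1, end)
            if cut != -1:
                end = cut
        segment = text[start:end].strip()
        if segment:
            segments.append(segment)
        if end >= len(text):
            break
        # Step back by `overlap` characters so neighbouring segments share
        # some context (the shared part may begin mid-word).
        start = end - overlap
    return segments


log_text = "word " * 100
parts = split_text(log_text, max_chars=50, overlap=10)
```

The overlap is a common way to avoid losing a sentence that straddles two segments; the right sizes depend on the content and the embedding model in use.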

CSV & Tabular Data

Each row in your CSV becomes a searchable record with column headers preserved as context. Product catalogs, employee directories, inventory lists, and pricing tables are all indexed with their column names so queries like "price of Widget X" return the exact row.
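The idea of keeping column headers as context can be sketched with Python's standard `csv` module. This is an assumption-labeled illustration, not Orckai's implementation: the function `csv_rows_to_records`, the "column: value" text layout, and the sample data are all hypothetical.

```python
import csv
import io


def csv_rows_to_records(csv_text: str) -> list[str]:
    """Turn each CSV row into one text record labeled with its column names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        # Skip empty cells and any overflow fields that have no header.
        parts = [
            f"{column}: {value}"
            for column, value in row.items()
            if column is not None and value
        ]
        records.append("; ".join(parts))
    return records


sample = "product,price,stock\nWidget X,19.99,120\nWidget Y,24.50,\n"
records = csv_rows_to_records(sample)
# records[0] is "product: Widget X; price: 19.99; stock: 120"
```

Because each record carries its own column names, a query such as "price of Widget X" has both the product name and the word "price" available to match against in a single row.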

Excel Spreadsheets (XLSX)

Multi-sheet workbooks are processed sheet by sheet. Orckai reads cell values, respects merged cells, and uses header rows to label data. Financial reports, project trackers, and data exports are converted into text representations that preserve the tabular relationships your team relies on.

Markdown & Web Scraping

Markdown files retain their heading structure for natural chunk boundaries. Website scraping fetches live pages, strips navigation and boilerplate, and indexes the meaningful content. Follow links to crawl multiple pages from a single URL, building a knowledge base from an entire documentation site or wiki.
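To make "heading structure as chunk boundaries" concrete, here is a minimal sketch that starts a new chunk at every Markdown heading. The function name `chunk_markdown` and the output shape are hypothetical, and the sketch deliberately ignores edge cases such as `#` characters inside fenced code blocks.

```python
import re

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_markdown(md: str) -> list[dict]:
    """Group lines under the nearest preceding heading (illustrative only)."""
    chunks = []
    heading, lines = None, []

    def flush():
        # Keep a chunk if it has a heading or any non-blank content.
        if heading is not None or any(line.strip() for line in lines):
            chunks.append({"heading": heading, "text": "\n".join(lines).strip()})

    for line in md.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, lines = match.group(2), []
        else:
            lines.append(line)
    flush()
    return chunks


doc = "# Refund Policy\nReturns within 30 days.\n\n## Exceptions\nFinal sale items.\n"
chunks = chunk_markdown(doc)
```

Each chunk keeps its heading alongside its text, so a section such as "Refund Policy" can be retrieved as one unit.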

Semantic Search That Understands Meaning

Traditional keyword search fails when users phrase questions differently from how the document was written. Orckai converts your documents and every query into embeddings (numeric representations of meaning), then finds the closest matches by meaning rather than exact words.

A search for "how do I get a refund" will match a document section titled "Return and Cancellation Policy" even though the words barely overlap. That is the difference between semantic search and keyword grep.

Meaning-based matching — results are ranked by semantic relevance, not keyword frequency
Sub-second queries on knowledge bases with thousands of documents and hundreds of thousands of segments
No external services required — search runs entirely within your Orckai deployment
Scales with your data — optimized indexing keeps queries fast as your knowledge base grows
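Ranking by meaning is commonly done by comparing vectors with cosine similarity. The sketch below shows that general technique with tiny hand-written vectors; the vectors, the function names, and the three-dimensional space are made up for illustration, and the page does not say which embedding model or similarity measure Orckai uses.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_segments(query_vec, segments, top_k=2):
    """Return the top_k (score, title) pairs, highest similarity first."""
    scored = [(cosine_similarity(query_vec, vec), title) for title, vec in segments]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings (assumed values).
segments = [
    ("Return and Cancellation Policy", [0.9, 0.1, 0.0]),
    ("Office Parking Rules", [0.0, 0.2, 0.9]),
    ("Shipping Times", [0.3, 0.8, 0.1]),
]
query_vec = [0.8, 0.2, 0.1]  # stands in for "how do I get a refund"
results = rank_segments(query_vec, segments)
```

With these toy values the refund query lands closest to "Return and Cancellation Policy" even though the two share no words, which is the behavior described above.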

How It Works

User asks: "What is the refund window?"

Best match (high relevance): "Returns must be initiated within 30 days of purchase." (refund-policy.pdf, page 2)

Related (high relevance): "Cancellations after 14 days are subject to a restocking fee of 15%." (refund-policy.pdf, page 3)

No keyword overlap needed — matched by meaning.

Every Answer Backed by a Source

Orckai doesn't just generate answers — it grounds every response in your actual documents. When your AI agent or widget answers a question, each claim is backed by a traceable citation like [Source: refund-policy.pdf].

Your users can verify every answer against the original document. No hallucination, no guessing, no "I think" responses. Just facts from your data with clear attribution.

Inline citations — every fact links back to its source document and page
Zero hallucination — responses are generated only from your uploaded content
Multi-document synthesis — answers can pull from multiple sources in a single response
Works everywhere — citations appear in AI agents, widgets, and workflow steps automatically
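One way to picture inline citations is to carry the source along with every retrieved passage and attach it when the answer is assembled. The sketch below assumes a hypothetical `Segment` record and helper functions; only the `[Source: refund-policy.pdf]` tag format comes from this page, and a real answer would be composed by a language model rather than by joining passages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    text: str
    source: str  # filename of the originating document
    page: int


def build_cited_answer(segments: list[Segment]) -> str:
    """Join retrieved passages, tagging each with its source document."""
    if not segments:
        # Refuse rather than guess when nothing was retrieved.
        return "No supporting passage was found in the knowledge base."
    return " ".join(f"{seg.text} [Source: {seg.source}]" for seg in segments)


def list_sources(segments: list[Segment]) -> list[str]:
    """Deduplicated 'file, page N' references in first-seen order."""
    seen = []
    for seg in segments:
        ref = f"{seg.source}, page {seg.page}"
        if ref not in seen:
            seen.append(ref)
    return seen


retrieved = [
    Segment("Returns must be initiated within 30 days of purchase.", "refund-policy.pdf", 2),
    Segment("Cancellations after 14 days are subject to a restocking fee of 15%.", "refund-policy.pdf", 3),
]
answer = build_cited_answer(retrieved)
sources = list_sources(retrieved)
```

Keeping the source on the segment itself means the citation cannot drift away from the text it supports.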

The retrieval and answer steps run automatically every time a question is asked. Upload your documents, connect to an agent or widget, and your users get cited, accurate answers immediately.

From Question to Cited Answer

Upload: Files & URLs
Process: Parse & prep
Index: Ready to search
Ask: User query
Retrieve: Best matches
Answer: Cited facts

Intelligent Document Processing

Orckai doesn't just dump your files into a search index. It understands document structure — headings, paragraphs, tables, lists — and preserves that structure so search results are coherent, complete, and useful.

A section titled "Benefits" stays together as one retrievable unit instead of being split arbitrarily across multiple fragments. Tables remain intact. Lists stay grouped. The result: when your AI retrieves information, it gets full, meaningful context rather than broken snippets.

Structure-aware processing — headings, paragraphs, and tables are preserved as logical units
Clean extraction — headers, footers, page numbers, and boilerplate are stripped automatically
Source tracking — every segment retains its source filename, page number, and section heading
Reprocessing on demand — re-index any knowledge base after adding new documents or changing settings
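The clean-extraction step above can be approximated with a simple heuristic: a line that repeats on most pages is probably a running header or footer, and a line that holds only a page number can be dropped. The sketch below is one such heuristic under stated assumptions; the function name, the `min_share` threshold, and the sample pages are hypothetical, and Orckai's actual method is not documented here.

```python
import math
import re
from collections import Counter

# Lines such as "7", "Page 7", or "Page 7 of 12".
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+(\s+of\s+\d+)?\s*$", re.IGNORECASE)


def strip_repeated_lines(pages: list[list[str]], min_share: float = 0.6) -> list[list[str]]:
    """Drop page numbers and lines that recur on at least min_share of pages."""
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        for line in {raw.strip() for raw in page if raw.strip()}:
            counts[line] += 1
    # Require at least two pages so single-page files keep all their lines.
    threshold = max(2, math.ceil(min_share * len(pages)))
    repeated = {line for line, n in counts.items() if n >= threshold}
    return [
        [
            raw for raw in page
            if raw.strip()
            and raw.strip() not in repeated
            and not PAGE_NUMBER.match(raw)
        ]
        for page in pages
    ]


pages = [
    ["Employee Handbook", "Welcome to the team.", "Page 1"],
    ["Employee Handbook", "Benefits start after 90 days.", "Page 2"],
    ["Employee Handbook", "Dental is included.", "Page 3"],
]
cleaned = strip_repeated_lines(pages)
```

A heuristic like this can also remove a body line that consists only of a number, so production systems usually combine it with layout information such as position on the page.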

employee-handbook.pdf

Section 1: Company Overview (page 1)

"Orckai was founded with a mission to simplify enterprise AI operations..."

Section 2: Benefits (page 4)

"All full-time employees are eligible for health insurance after 90 days. Dental and vision are included at no extra cost."

Each section = one searchable unit with source, page, and heading.

Enterprise-Ready Knowledge Management

Orckai knowledge bases are designed for organizations that need data isolation, auditability, and integration with existing storage infrastructure. Every knowledge base is scoped to your organization and protected by role-based access control.

Organization-Scoped KBs

Each organization has its own isolated set of knowledge bases. Documents uploaded by one organization are never visible to another. Row-level security in PostgreSQL enforces this at the database layer, not just the application layer. Reprocess any knowledge base at any time to re-chunk and re-embed documents after you change chunking settings or upgrade embedding models.

Website Scraping with Link Following

Point Orckai at a URL and it will fetch the page content, strip navigation chrome and boilerplate, and index the meaningful text. Enable link following to crawl an entire documentation site, help center, or wiki from a single seed URL. Scraped content is chunked and embedded just like uploaded files, so your knowledge base can include both internal documents and public web content.
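Link following from a single seed URL is typically a breadth-first crawl restricted to the seed's host. The sketch below shows that general pattern with the Python standard library; the `crawl` function, the `max_pages` limit, the same-host rule, and the example site are all assumptions for illustration, not Orckai's crawler. The `fetch` callable is injected so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed: str, fetch, max_pages: int = 50) -> dict:
    """Breadth-first crawl of pages on the seed's host; returns {url: html}."""
    seed_host = urlparse(seed).netloc
    queue = deque([seed])
    seen = {seed}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)  # expected to return None when a page is unavailable
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == seed_host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# In-memory stand-in for a documentation site (hypothetical URLs).
site = {
    "https://docs.example.com/": '<a href="/guide">Guide</a> <a href="https://other.example.org/">Ext</a>',
    "https://docs.example.com/guide": '<a href="/">Home</a> <a href="/faq#top">FAQ</a>',
    "https://docs.example.com/faq": "<p>FAQ</p>",
}
crawled = crawl("https://docs.example.com/", site.get)
```

The `seen` set prevents loops between pages that link to each other, and the page limit keeps a crawl from growing without bound on large sites.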

Storage Provider Integration

Connect Google Drive, OneDrive, SharePoint, or Dropbox as document sources. Orckai pulls files directly from your cloud storage, processes them through the RAG pipeline, and keeps your knowledge base current. Combined with the built-in MinIO S3-compatible storage and local filesystem backends, you can centralize knowledge from every system your organization uses.

AI Knowledge Base with RAG-Powered Search