Training Sources

Three ways to train Orac on your content.

File Upload

Drag and drop files directly into the Sources tab. Supported formats:

PDF — product manuals, whitepapers, guides. Text is extracted from all pages. Scanned PDFs without selectable text are not supported.

TXT / Markdown — plain text files, README files, knowledge base articles.

DOCX — Word documents; text is extracted and formatting is flattened to plain text.

Maximum file size is 10 MB. Files are processed immediately — you'll see the status change from “Processing” to “Ready” within a few seconds.
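The upload checks above can be sketched as a small validation function. This is illustrative only: the function name, error messages, and the idea of validating by extension are assumptions, not Orac's actual API.

```python
# Hypothetical sketch of the upload rules above; real validation is server-side.
MAX_BYTES = 10 * 1024 * 1024  # 10 MB limit
ALLOWED = {".pdf", ".txt", ".md", ".docx"}

def validate_upload(filename: str, size_bytes: int) -> str:
    """Return "Processing" if the file is accepted, else raise ValueError."""
    ext = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext not in ALLOWED:
        raise ValueError(f"Unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        raise ValueError("File exceeds the 10 MB limit")
    return "Processing"  # status flips to "Ready" once extraction finishes
```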

Single URL

Paste a specific page URL and Orac will fetch the page, extract text content, and add it to your knowledge base. This is useful for importing specific pages from external sites, competitor pages for comparison training, or individual help articles.

Orac extracts clean text from the HTML — stripping navigation, footers, sidebars, and scripts. Only the main content is indexed.
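A minimal sketch of that extraction idea, using only the standard library and assuming the page chrome lives in standard HTML5 landmark tags (nav, footer, aside, and so on). Orac's real extractor is more involved; this just shows the skip-the-chrome principle.

```python
from html.parser import HTMLParser

# Elements whose text is dropped; assumed standard HTML5 landmarks.
SKIP = {"script", "style", "nav", "footer", "aside", "header"}

class MainTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0   # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```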

Full Site Crawl

The most powerful training method. Enter your domain and Orac discovers all pages via sitemap.xml, crawls each one, and builds a complete knowledge base.

How it works: Orac fetches your sitemap.xml, extracts all page URLs, then visits each page to extract text content. Each page becomes a separate document in your project.

Crawl limits by plan:

Free — up to 500 pages.

Pro — up to 5,000 pages.

Business — up to 25,000 pages.

Supported platforms: WordPress, Shopify, Webflow, Next.js, Hugo, Jekyll, and any site with a publicly accessible sitemap.xml.

No sitemap? Orac will attempt to discover pages by following links from your homepage. For best results, add a sitemap.xml to your site.
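The discovery step can be sketched as follows, assuming a standard single-file sitemap.xml (sitemap index files are not handled here). The function name and the limit parameter are illustrative; the default of 500 mirrors the Free plan's cap.

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace from the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(xml_text: str, limit: int = 500) -> list[str]:
    """Collect every <loc> URL, capped at the plan's crawl limit."""
    root = ET.fromstring(xml_text)
    locs = [el.text.strip() for el in root.findall(".//sm:loc", NS) if el.text]
    return locs[:limit]
```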

24h Auto-Sync

After the initial crawl, Orac automatically re-crawls your site every 24 hours at 3am UTC. The sync uses content hash diffing — a SHA-256 hash of each page's content is compared to the stored hash:

Hash matches — page is unchanged, skipped (zero cost).

Hash differs — page has been updated. Old chunks are deleted, new content is chunked and re-embedded.

URL removed — page was deleted (404/410). All chunks for that URL are removed.

New URL found — new page added to sitemap. Content is extracted and embedded.

Only the delta is processed. If 5 out of 4,000 pages changed, only those 5 are re-processed. Typical sync cost: less than $0.001 for 5 pages.
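The four cases above can be sketched as a classification pass over the stored hashes and the freshly crawled pages. Function and field names are illustrative, not Orac's internals.

```python
import hashlib

def content_hash(text: str) -> str:
    """SHA-256 hash of a page's extracted content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def diff_sync(stored: dict[str, str], crawled: dict[str, str]) -> dict[str, list[str]]:
    """stored: url -> saved hash; crawled: url -> freshly extracted text."""
    result = {"unchanged": [], "updated": [], "removed": [], "added": []}
    for url, text in crawled.items():
        if url not in stored:
            result["added"].append(url)        # new URL found in sitemap
        elif stored[url] == content_hash(text):
            result["unchanged"].append(url)    # hash matches: skip, zero cost
        else:
            result["updated"].append(url)      # delete old chunks, re-embed
    for url in stored:
        if url not in crawled:
            result["removed"].append(url)      # 404/410: drop all chunks
    return result
```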

You can also trigger a manual sync anytime by clicking “Sync Now” in the Sources tab, which also shows sync history: pages added, updated, and removed.

RAG Pipeline

Behind the scenes, Orac processes your content through a RAG (Retrieval-Augmented Generation) pipeline:

1. Extract — text is extracted from your file, URL, or crawled page.

2. Chunk — text is split into ~375-token chunks with overlap to avoid breaking mid-sentence.

3. Embed — each chunk is converted to a 1536-dimensional vector using OpenAI text-embedding-3-small.

4. Store — vectors are stored in pgvector alongside metadata (source URL, page title, content hash).

5. Query — when a visitor asks a question, the query is rewritten for clarity, embedded, then matched against your chunks using hybrid search (vector similarity + keyword matching with RRF fusion). The top chunks are re-ranked using a 7-signal scoring system, then fed to the LLM for a grounded response with source citations.
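The chunking step (step 2) can be approximated with a toy splitter. Orac chunks by tokens; here whitespace words stand in for tokens, and the overlap value is an assumption chosen only to show how consecutive chunks share context.

```python
def chunk(words: list[str], size: int = 375, overlap: int = 50) -> list[list[str]]:
    """Split a word list into overlapping chunks of at most `size` words."""
    step = size - overlap      # each new chunk starts `overlap` words early
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break              # last chunk reached the end of the text
    return chunks
```

Because each chunk begins 50 words before the previous one ends, a sentence that straddles a boundary appears whole in at least one chunk.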