Preparing Documents

Before you can upload documents to Archivist, they need to be converted into plain text that the AI can read. The Pre-Process tab handles this conversion for you.

Think of it as making a text-only copy of a document. A PDF might have complex formatting, images, and layouts; pre-processing extracts the readable text from all of that.

Supported File Formats

Archivist can convert a wide range of file types:

Category	Formats
Documents	PDF, Word (.docx, .doc), PowerPoint (.pptx, .ppt), Excel (.xlsx, .xls), OpenDocument (.odt, .odp, .ods), RTF
Text files	Plain text (.txt), Markdown (.md), CSV, TSV, JSON, YAML
Web files	HTML, XHTML, XML
Images	PNG, JPG, JPEG, WebP, BMP, TIFF, GIF

Note

Image files are processed using OCR (optical character recognition), which reads printed text from pictures. This also works on scanned PDFs that contain images of text rather than actual text.
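If you have a large folder of mixed files, it can help to sort them against the table above before queuing them. The following is a minimal sketch and not part of Archivist: the `categorize` and `needs_ocr` helpers are hypothetical names, and the extension spellings for formats the table lists by name only (such as `.pdf`, `.html`, or `.yaml`) are assumptions.

```python
from pathlib import Path
from typing import Optional

# Extension map transcribed from the Supported File Formats table.
# Spellings for formats listed by name only (PDF, RTF, CSV, HTML, YAML, ...)
# are assumptions; Archivist may accept other variants as well.
FORMATS = {
    "Documents": {".pdf", ".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls",
                  ".odt", ".odp", ".ods", ".rtf"},
    "Text files": {".txt", ".md", ".csv", ".tsv", ".json", ".yaml"},
    "Web files": {".html", ".xhtml", ".xml"},
    "Images": {".png", ".jpg", ".jpeg", ".webp", ".bmp", ".tiff", ".gif"},
}


def categorize(filename: str) -> Optional[str]:
    """Return the table category for a file name, or None if it is not listed."""
    suffix = Path(filename).suffix.lower()  # compare case-insensitively
    for category, extensions in FORMATS.items():
        if suffix in extensions:
            return category
    return None


def needs_ocr(filename: str) -> bool:
    """Image files are read with OCR, per the note above."""
    return categorize(filename) == "Images"


if __name__ == "__main__":
    for name in ["Report.PDF", "scan.jpeg", "notes.xyz"]:
        print(name, "->", categorize(name) or "not in the table")
```

A file that comes back as "not in the table" is not necessarily unsupported; it only means its extension is missing from this hand-copied map.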

How to Convert Files

Open the Pre-Process tab.
Drag and drop your files onto the upload area, or click it to open a file browser.
Your files will appear in a queue. You can remove individual files or clear the entire queue if you change your mind.
Click Convert to start processing.

A timer shows how long the conversion has been running. When it finishes, you'll get a summary showing which files succeeded or failed and how long each one took.

Using Force OCR

If a scanned PDF or other document didn't extract cleanly with the normal conversion, click the Force OCR button instead of Convert. This tells Archivist to treat the document as an image and read the text with optical character recognition, which can produce better results for scanned or image-heavy documents.

Reviewing Results

After conversion, each file appears in a results table showing:

Whether the conversion succeeded or failed
The page count (for documents that have pages)
How long it took to process

You can click the expand button next to any file to preview the extracted text right in the app. This is useful for checking that the conversion captured everything correctly before you upload.
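For larger batches, a rough automated check on the downloaded text can supplement the preview. The sketch below is an illustration only, not something Archivist runs: the `extraction_warnings` helper is hypothetical, and both thresholds are arbitrary assumptions that you would need to tune for your own documents. It flags text that is empty, unusually short for its page count, or full of characters that do not look like ordinary text, which are typical signs that a file may be a candidate for Force OCR.

```python
import string
from typing import List


def extraction_warnings(text: str, page_count: int = 1,
                        min_chars_per_page: int = 200,
                        min_readable_ratio: float = 0.85) -> List[str]:
    """Return a list of reasons the extracted text looks suspicious.

    Both thresholds are illustrative guesses, not values used by Archivist.
    """
    stripped = text.strip()
    if not stripped:
        return ["no text extracted"]

    warnings = []

    # Very little text per page often means the pages are images of text.
    chars_per_page = len(stripped) / max(page_count, 1)
    if chars_per_page < min_chars_per_page:
        warnings.append(f"only {chars_per_page:.0f} characters per page")

    # Count letters, digits, whitespace, and ASCII punctuation as readable;
    # replacement characters and control codes are not.
    readable = sum(
        1 for ch in stripped
        if ch.isalnum() or ch.isspace() or ch in string.punctuation
    )
    ratio = readable / len(stripped)
    if ratio < min_readable_ratio:
        warnings.append(f"only {ratio:.0%} of characters look readable")

    return warnings
```

An empty list means neither check fired; it does not guarantee the conversion captured everything. Short documents such as a one-line note will trigger the characters-per-page warning even when they converted correctly, so treat the output as a prompt to open the preview, not as a verdict.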

Downloading Converted Text

You have several options for saving the converted text:

Individual files — Click the TXT or MD button next to any file to download just that one as plain text or Markdown.
All files at once — Use the Download All as Text or Download All as Markdown buttons to save everything in a single file.

These downloaded text files are what you'll upload to Archivist in the next step.

Tip

Markdown format preserves some structure like headings and lists, which can make the text easier to review. Either format works for uploading.
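If you download Markdown, one quick way to review the preserved structure is to list the headings in the file. This is a minimal sketch under the assumption that the converted Markdown marks headings with leading `#` characters; the `outline` helper is a hypothetical name, not an Archivist feature.

```python
import re
from typing import List, Tuple

# A heading line is one to six "#" characters, whitespace, then the title.
HEADING = re.compile(r"(#{1,6})\s+(.+)")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def outline(markdown_text: str) -> List[Tuple[int, str]]:
    """Return (level, title) pairs for each heading found outside code fences."""
    headings = []
    in_fence = False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # skip anything inside fenced code
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2).strip()))
    return headings


if __name__ == "__main__":
    sample = "# Annual Report\n\nIntro text.\n\n## Findings\n- first item\n"
    for level, title in outline(sample):
        print("  " * (level - 1) + title)
```

An outline that is empty or much shorter than you expect suggests the headings did not survive conversion, which is worth checking in the preview before you upload.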

Starting Over

Click Reset and Start Over to clear the results and process a new batch of files. If you haven't downloaded your results yet, you'll get a reminder before they're cleared.

Next Steps

Once you have your converted text files, head to the Upload tab to load them into Archivist's database.