File Management Guide

This guide covers uploading, processing, and managing files in Satori enclaves.

Supported File Types

Satori supports a wide variety of file types:

Documents

PDF: application/pdf
Text: text/plain, text/csv, text/tsv
Word: .docx, .doc
Excel: .xlsx, .xls
PowerPoint: .pptx, .ppt
OpenDocument: .odt, .ods, .odp
Other: JSON, XML, RTF

Images

JPEG, PNG, GIF, WebP, SVG, TIFF, BMP

Video (with transcription)

MP4, MPEG, AVI, MOV, WMV, WebM, MKV, FLV

Audio (with transcription)

MP3, WAV, OGG, M4A, AAC, MIDI
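
A client can screen files locally before uploading instead of waiting for the server to reject them. The sketch below is a minimal extension-based check. The extension sets are an assumption derived from the lists above (for example, JPEG is mapped to .jpg and .jpeg), and the helper name file_category is hypothetical; the server remains the authority on what is accepted.

```python
from pathlib import Path

# Assumed extension mapping derived from the supported-types lists above.
# This is not an official list published by the API.
SUPPORTED_EXTENSIONS = {
    "document": {
        ".pdf", ".txt", ".csv", ".tsv", ".docx", ".doc", ".xlsx", ".xls",
        ".pptx", ".ppt", ".odt", ".ods", ".odp", ".json", ".xml", ".rtf",
    },
    "image": {".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".tiff", ".tif", ".bmp"},
    "video": {".mp4", ".mpeg", ".mpg", ".avi", ".mov", ".wmv", ".webm", ".mkv", ".flv"},
    "audio": {".mp3", ".wav", ".ogg", ".m4a", ".aac", ".mid", ".midi"},
}


def file_category(path):
    """Return the category for a path, or None if the extension is not recognised."""
    suffix = Path(path).suffix.lower()
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None
```

A None result is a hint to skip the upload or convert the file first.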

Uploading Files

Basic Upload

curl -X POST "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
  -F "file=@/path/to/document.pdf"

Upload with Metadata

curl -X POST "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
  -F "file=@document.pdf" \
  -F 'metadata={"author": "John Doe", "category": "research", "date": "2025-01-15"}'

Upload with Webhook

curl -X POST "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
  -F "file=@document.pdf" \
  -F "webhook_url=https://your-server.com/webhook/file-processed"

Python Example

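import json

import requests

# BASE_URL, JWT_TOKEN, and tenant_id are assumed to be defined elsewhere in your application.
def upload_file(file_path, enclave_id, metadata=None, webhook_url=None):
    url = f"{BASE_URL}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/"
    headers = {"Authorization": f"Bearer {JWT_TOKEN}"}

    data = {}
    if metadata:
        # Metadata is sent as a JSON-encoded form field
        data["metadata"] = json.dumps(metadata)
    if webhook_url:
        data["webhook_url"] = webhook_url

    # Open the file in a context manager so the handle is closed after the request
    with open(file_path, "rb") as f:
        response = requests.post(url, headers=headers, files={"file": f}, data=data)

    # Raise on 4xx/5xx instead of trying to parse an error body as file info
    response.raise_for_status()
    return response.json()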

# Usage
file_info = upload_file(
    "document.pdf",
    enclave_id,
    metadata={"author": "John Doe", "category": "research"},
    webhook_url="https://myapp.com/webhook"
)
print(f"File uploaded: {file_info['id']}, Status: {file_info['status']}")

JavaScript/TypeScript Example

async function uploadFile(
  file: File,
  enclaveId: string,
  metadata?: Record<string, any>,
  webhookUrl?: string
) {
  const formData = new FormData();
  formData.append("file", file);

  if (metadata) {
    formData.append("metadata", JSON.stringify(metadata));
  }
  if (webhookUrl) {
    formData.append("webhook_url", webhookUrl);
  }

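  // tenantId and token are assumed to be available in the enclosing scope
  const response = await fetch(
    `/api/tenants/${tenantId}/enclaves/${enclaveId}/files/`,
    {
      method: "POST",
      headers: {
        // Do not set Content-Type manually; the browser adds the multipart boundary
        Authorization: `Bearer ${token}`,
      },
      body: formData,
    }
  );

  if (!response.ok) {
    throw new Error(`Upload failed: ${response.statusText}`);
  }

  return await response.json();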
}

File Processing Pipeline

Files go through several processing stages:

pending → File uploaded, queued for processing
processing → Content extraction in progress
clearing_artifacts → Cleaning up temporary files
building_artifacts → Creating vector embeddings
classifying → AI classification (optional)
ready → File ready for queries
failed → Processing failed (check logs)
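
The stage list above can be encoded as constants so client code does not scatter status strings. This is a minimal sketch: the status names come from the list, while the helper names and the idea of a rough progress fraction are illustrative assumptions. It also assumes stages occur in the order listed, with classifying possibly skipped.

```python
# Status names are taken from the stage list above; "failed" can occur at any point.
PIPELINE_STAGES = [
    "pending",
    "processing",
    "clearing_artifacts",
    "building_artifacts",
    "classifying",
    "ready",
]
TERMINAL_STATUSES = {"ready", "failed"}


def is_terminal(status):
    """True when no further status changes are expected for the file."""
    return status in TERMINAL_STATUSES


def progress_fraction(status):
    """Rough progress in [0, 1] from the stage position; None for failed or unknown."""
    if status not in PIPELINE_STAGES:
        return None
    return PIPELINE_STAGES.index(status) / (len(PIPELINE_STAGES) - 1)
```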

Processing Times

Small PDFs (< 10MB): 30-60 seconds
Large PDFs (> 100MB): 2-5 minutes
Videos: 1-10 minutes (depends on length)
Audio: 30 seconds - 3 minutes
Images: 10-30 seconds

Monitoring File Status

Check Single File Status

curl -X GET "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>"

List All Files

curl -X GET "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>"

Polling for Ready Status

import time

import requests

# BASE_URL, JWT_TOKEN, and tenant_id are assumed to be defined elsewhere in your application.
def wait_for_file_ready(file_id, enclave_id, max_wait=300, poll_interval=5):
    """Wait for file to be ready, with timeout."""
    # Same endpoint as "Check Single File Status"
    url = f"{BASE_URL}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}"
    start_time = time.time()

    while time.time() - start_time < max_wait:
        response = requests.get(url, headers={"Authorization": f"Bearer {JWT_TOKEN}"})
        response.raise_for_status()
        file = response.json()

        if file["status"] == "ready":
            return file
        elif file["status"] == "failed":
            raise Exception(f"File processing failed: {file_id}")

        time.sleep(poll_interval)

    raise TimeoutError(f"File not ready within {max_wait} seconds")
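
For long-running jobs such as video transcription, a fixed polling interval sends many requests that return nothing new. The sketch below is one possible variation that doubles the wait between polls up to a cap. The function name, its parameters, and the injected fetch_status callable are assumptions made for illustration, not part of the Satori API; fetch_status is expected to return the file's status string.

```python
import time


def poll_until_terminal(fetch_status, max_wait=300.0, initial_interval=2.0,
                        max_interval=30.0, sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it returns 'ready' or 'failed', backing off between calls."""
    deadline = clock() + max_wait
    interval = initial_interval
    while True:
        status = fetch_status()
        if status in ("ready", "failed"):
            return status
        # Give up if the next sleep would run past the deadline
        if clock() + interval > deadline:
            raise TimeoutError(f"File not ready within {max_wait} seconds")
        sleep(interval)
        # Double the interval after each poll, up to max_interval
        interval = min(interval * 2, max_interval)
```

Passing sleep and clock as parameters keeps the loop testable without real delays. In production code, fetch_status would wrap the single-file status request.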

Webhooks

Webhooks notify your server when file processing completes.

Webhook Payload

{
  "event": "file.status_changed",
  "file_id": "850e8400-e29b-41d4-a716-446655440000",
  "status": "ready",
  "tenant_id": "550e8400-e29b-41d4-a716-446655440000",
  "enclave_id": "750e8400-e29b-41d4-a716-446655440000",
  "timestamp": "2025-01-15T10:05:30Z",
  "metadata": {
    "file_name": "contract.pdf",
    "size_bytes": 245000
  }
}
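
A receiver may want to validate the payload before acting on it. The following sketch parses the fields shown in the sample payload into a typed object. The class and function names are hypothetical, and treating all six top-level fields as required is an assumption based only on the sample.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class FileStatusEvent:
    event: str
    file_id: str
    status: str
    tenant_id: str
    enclave_id: str
    timestamp: datetime
    metadata: dict = field(default_factory=dict)


def parse_file_event(payload):
    """Check for the fields in the sample payload and return a typed event."""
    required = ("event", "file_id", "status", "tenant_id", "enclave_id", "timestamp")
    missing = [key for key in required if key not in payload]
    if missing:
        raise ValueError(f"Webhook payload missing fields: {missing}")
    # Replace a trailing 'Z' so fromisoformat() also works on Python versions before 3.11
    timestamp = datetime.fromisoformat(payload["timestamp"].replace("Z", "+00:00"))
    return FileStatusEvent(
        event=payload["event"],
        file_id=payload["file_id"],
        status=payload["status"],
        tenant_id=payload["tenant_id"],
        enclave_id=payload["enclave_id"],
        timestamp=timestamp,
        metadata=payload.get("metadata", {}),
    )
```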

Webhook Implementation

from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/webhook/file-processed")
async def handle_file_webhook(request: Request):
    payload = await request.json()

    if payload["status"] == "ready":
        file_id = payload["file_id"]
        # File is ready - start querying
        await process_ready_file(file_id)
    elif payload["status"] == "failed":
        # Handle failure
        await handle_failed_upload(payload["file_id"])

    return {"status": "received"}

Webhook Requirements

HTTPS only: Webhook URLs must use HTTPS
Retry logic: Satori retries failed webhooks (3 attempts with exponential backoff)
Response: Your endpoint should return 200 OK
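
Because webhook URLs must use HTTPS, it can help to check the URL on the client before sending it with an upload. This is a minimal sketch using the standard library; the function name is hypothetical, and the server may apply further checks that are not modelled here.

```python
from urllib.parse import urlparse


def is_valid_webhook_url(url):
    """Accept only absolute HTTPS URLs that include a host."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and bool(parsed.netloc)
```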

File Metadata

Adding Metadata

Metadata is stored as JSON and can include any key-value pairs:

metadata = {
    "author": "John Doe",
    "date": "2025-01-15",
    "category": "research",
    "department": "engineering",
    "project": "project-alpha",
    "version": "1.0",
    "tags": ["important", "reviewed"]
}

Best Practices

Keep under 10KB: Large metadata can slow processing
Use searchable fields: Include fields you might want to filter by
Consistent structure: Use the same fields across similar files
Include timestamps: Track when files were created/uploaded
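
The 10KB guideline can be checked before upload by measuring the JSON-encoded metadata. The sketch below assumes that 10KB means 10 * 1024 bytes of UTF-8 encoded JSON; how the server actually measures metadata size is not specified in this guide, and the function names are illustrative.

```python
import json

# Assumption: "10KB" is interpreted as 10 * 1024 bytes of UTF-8 encoded JSON.
MAX_METADATA_BYTES = 10 * 1024


def metadata_size_bytes(metadata):
    """Size of the metadata as it would be sent in the form field."""
    return len(json.dumps(metadata).encode("utf-8"))


def check_metadata(metadata):
    """Return the encoded size, or raise ValueError if it reaches the guideline limit."""
    size = metadata_size_bytes(metadata)
    if size >= MAX_METADATA_BYTES:
        raise ValueError(
            f"Metadata is {size} bytes; keep it under {MAX_METADATA_BYTES} bytes"
        )
    return size
```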

Retrieving Metadata

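# Same endpoint as "Check Single File Status"; tenant_id and enclave_id are assumed to be defined
response = requests.get(
    f"{BASE_URL}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}",
    headers={"Authorization": f"Bearer {JWT_TOKEN}"}
)
response.raise_for_status()
file = response.json()
metadata = file.get("file_meta", {})
print(f"Author: {metadata.get('author')}")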

File Limits

Size Limits

Maximum file size: 512MB
Recommended: Keep files under 100MB for faster processing
Large files: Consider splitting into multiple files
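
The limits above can be applied locally before an upload is attempted. The sketch below classifies a file by size. It assumes 1MB means 1024 * 1024 bytes, and the function names and return labels are hypothetical.

```python
import os

# Assumption: MB is interpreted as 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 512 * 1024 * 1024
RECOMMENDED_BYTES = 100 * 1024 * 1024


def classify_size(size_bytes):
    """Return 'too_large', 'large', or 'ok' for a size in bytes."""
    if size_bytes > MAX_UPLOAD_BYTES:
        return "too_large"  # Would exceed the 512MB maximum
    if size_bytes > RECOMMENDED_BYTES:
        return "large"  # Allowed, but expect slower processing
    return "ok"


def classify_file(path):
    """Classify a file on disk by its size."""
    return classify_size(os.path.getsize(path))
```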

Handling Large Files

import os

def split_large_pdf(file_path, max_size_mb=100):
    """Split large PDF into smaller chunks (placeholder: splitting is not implemented here)."""
    file_size_mb = os.path.getsize(file_path) / (1024 * 1024)

    if file_size_mb > max_size_mb:
        # Split the file with a PDF library of your choice,
        # then upload each chunk separately
        pass

Getting Transcripts

For video and audio files, retrieve transcripts:

curl -X GET "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}/transcript" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>"

Response:

{
  "file_id": "850e8400-e29b-41d4-a716-446655440000",
  "filename": "meeting_recording.mp4",
  "content_type": "video/mp4",
  "transcript": "Welcome everyone to today's meeting...",
  "keywords": ["quarterly results", "revenue increase"],
  "created_at": "2025-01-15T10:05:00Z",
  "updated_at": "2025-01-15T10:05:00Z"
}

Downloading Files

Download files in their original format with their original filename. Files are streamed directly from storage for efficiency.

Basic Download

curl -X GET "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}/download" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
  -O -J

Flags:

-O: Save file to disk
-J: Use Content-Disposition filename from response

Python Example

import os

import requests

# BASE_URL, JWT_TOKEN, and tenant_id are assumed to be defined elsewhere in your application.
def download_file(file_id, enclave_id, output_path=None):
    """Download a file from the enclave."""
    url = f"{BASE_URL}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}/download"
    headers = {"Authorization": f"Bearer {JWT_TOKEN}"}

    # Stream download
    response = requests.get(url, headers=headers, stream=True)
    response.raise_for_status()

    # Get filename from Content-Disposition header or use default
    if output_path is None:
        content_disp = response.headers.get('Content-Disposition', '')
        if 'filename=' in content_disp:
            # Extract filename from header; drop trailing parameters and directory components
            raw_name = content_disp.split('filename=')[1].split(';')[0].strip().strip('"')
            filename = os.path.basename(raw_name)
        else:
            # Fallback: get from file metadata
            file_info = requests.get(
                f"{BASE_URL}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}",
                headers=headers
            ).json()
            filename = file_info['name']
    else:
        filename = output_path

    # Stream to file
    with open(filename, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:
                f.write(chunk)

    return filename

# Usage
downloaded_file = download_file(
    file_id="850e8400-e29b-41d4-a716-446655440000",
    enclave_id="750e8400-e29b-41d4-a716-446655440000"
)
print(f"Downloaded: {downloaded_file}")

JavaScript/TypeScript Example

async function downloadFile(
  fileId: string,
  enclaveId: string
): Promise<void> {
  const url = `/api/tenants/${tenantId}/enclaves/${enclaveId}/files/${fileId}/download`;

  const response = await fetch(url, {
    headers: {
      Authorization: `Bearer ${token}`
    }
  });

  if (!response.ok) {
    throw new Error(`Download failed: ${response.statusText}`);
  }

  // Extract filename from Content-Disposition header
  const contentDisp = response.headers.get('Content-Disposition') || '';
  const filenameMatch = contentDisp.match(/filename="(.+)"/);
  const filename = filenameMatch ? filenameMatch[1] : 'download';

  // Get the blob
  const blob = await response.blob();

  // Create download link
  const downloadUrl = window.URL.createObjectURL(blob);
  const link = document.createElement('a');
  link.href = downloadUrl;
  link.download = filename;

  // Trigger download
  document.body.appendChild(link);
  link.click();

  // Cleanup
  document.body.removeChild(link);
  window.URL.revokeObjectURL(downloadUrl);
}

How It Works

Validation: Verifies file exists and user has access
Retrieve from Storage: Fetches file from S3
Stream Response: Streams file content in chunks
Original Filename: Returned via Content-Disposition header
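
Since the original filename arrives in the Content-Disposition header, a client needs to parse that header safely. The sketch below uses the standard library's email header parser rather than manual string splitting, and strips any directory components before the name is used as a local path. The function name and the 'download' fallback are assumptions for illustration.

```python
import os
from email.message import Message


def filename_from_content_disposition(header_value, default="download"):
    """Extract a safe local filename from a Content-Disposition header value."""
    if not header_value:
        return default
    message = Message()
    message["Content-Disposition"] = header_value
    # get_filename() reads the 'filename' parameter and handles quoting
    filename = message.get_filename()
    if not filename:
        return default
    # Keep only the final path component so the server cannot choose a directory
    return os.path.basename(filename) or default
```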

Streaming Benefits

Memory Efficient: Files streamed in chunks, not loaded fully into memory
No CORS Issues: Download handled entirely by API server
Progress Tracking: Can monitor download progress in frontend
Large File Support: Handles large files efficiently, up to the 512MB upload limit

Security

JWT token required for download request
Multi-layer validation (token → tenant → file ownership)
Streaming prevents full file buffering in memory
No download-specific size restriction (stored files are capped at 512MB on upload)

Use Cases

Export documents for external processing
Download processed files for archival
Retrieve original files for comparison
Backup file retrieval
Integration with external systems
Browser-based downloads without CORS issues

Comparing Files

Compare two versions of a file to identify differences and generate a structured changeset.

Basic File Comparison

curl -X POST "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/compare" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
  -F "base_file=@contract_v1.pdf" \
  -F "target_file=@contract_v2.pdf" \
  -F "webhook_url=https://your-server.com/webhook/comparison-complete"

Using File Hashes

For files already uploaded, use their SHA-256 hashes:

curl -X POST "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/compare" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
  -F "base_file_hash=sha256:abc123def456..." \
  -F "target_file_hash=sha256:789ghi012jkl..."
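
To compare by hash, you need the SHA-256 digest of each file. The sketch below computes it locally in chunks so large files are not loaded into memory at once. The "sha256:" prefix mirrors the example values above and is an assumption about the exact format the API expects; the function name is hypothetical.

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Return the file's SHA-256 digest as 'sha256:<hex>', reading in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns empty bytes
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"
```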

Python Example

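# BASE_URL, JWT_TOKEN, and tenant_id are assumed to be defined elsewhere in your application.
def compare_files(base_path, target_path, enclave_id, webhook_url=None):
    url = f"{BASE_URL}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/compare"
    headers = {"Authorization": f"Bearer {JWT_TOKEN}"}

    data = {}
    if webhook_url:
        data["webhook_url"] = webhook_url

    # Keep both files open only for the duration of the request
    with open(base_path, "rb") as base_f, open(target_path, "rb") as target_f:
        files = {"base_file": base_f, "target_file": target_f}
        response = requests.post(url, headers=headers, files=files, data=data)

    response.raise_for_status()
    return response.json()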

# Compare two versions of a contract
result = compare_files(
    "contract_v1.pdf",
    "contract_v2.pdf",
    enclave_id,
    webhook_url="https://myapp.com/webhook/comparison"
)

Comparison Webhook Payload

Results are delivered via webhook when processing completes:

{
  "event": "file.comparison_complete",
  "status": "success",
  "base_file_hash": "sha256:abc123...",
  "target_file_hash": "sha256:def456...",
  "changeset": {
    "added": [
      {"block_id": "1", "content": "New paragraph added..."}
    ],
    "removed": [
      {"block_id": "2", "content": "Deleted content..."}
    ],
    "replaced": [
      {"block_id": "3", "old": "Original text", "new": "Updated text"}
    ]
  },
  "metadata": {...}
}
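
Once the comparison payload arrives, a receiver will often want a quick summary before inspecting individual blocks. This minimal sketch counts the entries in each changeset category shown in the sample payload. The function name is hypothetical, and categories missing from a payload are treated as empty.

```python
def summarize_changeset(payload):
    """Count added, removed, and replaced blocks in a comparison payload."""
    changeset = payload.get("changeset", {})
    return {
        kind: len(changeset.get(kind, []))
        for kind in ("added", "removed", "replaced")
    }
```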

Use Cases

Document versioning: Track changes between document versions
Contract comparison: Identify modifications in legal documents
Content auditing: Review what changed between file uploads

Deleting Files

⚠️ Warning: Deletion is permanent and cannot be undone.

curl -X DELETE "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>"

What gets deleted:

File record from database
Stored file in object storage
Vector embeddings
Transcripts
All processing artifacts

Duplicate Handling

Files are deduplicated by SHA-256 hash:

Uploading the same file twice returns the existing file
Use a different file_id to force a new upload
Duplicate detection happens automatically

Error Handling

Common Errors

413 Payload Too Large

File exceeds 512MB limit
Solution: Split or compress the file

415 Unsupported Media Type

File type not allowed
Solution: Check supported file types list

400 Bad Request

Invalid metadata JSON
Invalid webhook URL (must be HTTPS)
Missing required fields

404 Not Found

File doesn't exist
Solution: Check file_id and enclave_id
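
Client code can turn these status codes into actionable messages. The sketch below maps the four codes listed above to the causes and solutions given in this section. The function name and message wording are illustrative, and other status codes fall through to a generic message.

```python
# Hints are paraphrased from the Common Errors list above.
UPLOAD_ERROR_HINTS = {
    400: "Bad request: check the metadata JSON, the webhook URL (must be HTTPS), and required fields.",
    404: "Not found: check file_id and enclave_id.",
    413: "Payload too large: the file exceeds the 512MB limit; split or compress it.",
    415: "Unsupported media type: check the supported file types list.",
}


def explain_error(status_code):
    """Return a human-readable hint for an HTTP error status."""
    return UPLOAD_ERROR_HINTS.get(status_code, f"Unexpected HTTP status {status_code}")
```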

Best Practices

✅ DO:

Use webhooks for async processing
Add meaningful metadata
Monitor file processing status
Handle file size limits
Use appropriate file types

❌ DON'T:

Upload files without checking status
Upload duplicate files unnecessarily
Upload files larger than 512MB
Ignore failed processing status
Use insecure webhook URLs (HTTP)

Next Steps

Querying Documents Guide - Query your uploaded files
Best Practices Guide - General usage best practices
API Reference - Full API documentation
