File Management Guide

This guide covers uploading, processing, and managing files in Satori enclaves.

Supported File Types

Satori accepts the following file types:

Documents

  • PDF: application/pdf
  • Text: text/plain, text/csv, text/tsv
  • Word: .docx, .doc
  • Excel: .xlsx, .xls
  • PowerPoint: .pptx, .ppt
  • OpenDocument: .odt, .ods, .odp
  • Other: JSON, XML, RTF

Images

  • JPEG, PNG, GIF, WebP, SVG, TIFF, BMP

Video (with transcription)

  • MP4, MPEG, AVI, MOV, WMV, WebM, MKV, FLV

Audio (with transcription)

  • MP3, WAV, OGG, M4A, AAC, MIDI

Archives

  • ZIP, RAR, 7Z, TAR, GZIP
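A client can reject unsupported files before spending bandwidth on an upload. The sketch below is a minimal pre-check. The extension set is an assumption derived from the format names listed above (for example, .jpg and .jpeg for JPEG), and `is_supported` is a hypothetical helper name; the server remains the authority on what it accepts.

```python
from pathlib import Path

# Assumed extensions for the formats listed above; the server-side list may differ.
SUPPORTED_EXTENSIONS = {
    # Documents
    ".pdf", ".txt", ".csv", ".tsv", ".docx", ".doc", ".xlsx", ".xls",
    ".pptx", ".ppt", ".odt", ".ods", ".odp", ".json", ".xml", ".rtf",
    # Images
    ".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".tiff", ".tif", ".bmp",
    # Video
    ".mp4", ".mpeg", ".mpg", ".avi", ".mov", ".wmv", ".webm", ".mkv", ".flv",
    # Audio
    ".mp3", ".wav", ".ogg", ".m4a", ".aac", ".midi", ".mid",
    # Archives
    ".zip", ".rar", ".7z", ".tar", ".gz",
}


def is_supported(file_path: str) -> bool:
    """Return True if the file's extension is in the assumed supported set."""
    return Path(file_path).suffix.lower() in SUPPORTED_EXTENSIONS
```

The check is case-insensitive and looks only at the final extension, so a name such as archive.tar.gz is matched on .gz.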

Uploading Files

Basic Upload

curl -X POST "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
  -F "file=@/path/to/document.pdf"

Upload with Metadata

curl -X POST "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
  -F "file=@document.pdf" \
  -F 'metadata={"author": "John Doe", "category": "research", "date": "2025-01-15"}'

Upload with Webhook

curl -X POST "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
  -F "file=@document.pdf" \
  -F "webhook_url=https://your-server.com/webhook/file-processed"

Python Example

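import json

import requests

def upload_file(file_path, enclave_id, metadata=None, webhook_url=None):
    # BASE_URL, tenant_id, and JWT_TOKEN are assumed to be defined by the caller
    url = f"{BASE_URL}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/"
    headers = {"Authorization": f"Bearer {JWT_TOKEN}"}

    data = {}
    if metadata:
        data["metadata"] = json.dumps(metadata)
    if webhook_url:
        data["webhook_url"] = webhook_url

    # Open the file in a context manager so the handle is always closed
    with open(file_path, "rb") as f:
        response = requests.post(url, headers=headers, files={"file": f}, data=data)
    response.raise_for_status()  # Surface 4xx/5xx errors instead of parsing an error body
    return response.json()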

# Usage
file_info = upload_file(
    "document.pdf",
    enclave_id,
    metadata={"author": "John Doe", "category": "research"},
    webhook_url="https://myapp.com/webhook"
)
print(f"File uploaded: {file_info['id']}, Status: {file_info['status']}")

JavaScript/TypeScript Example

async function uploadFile(
  file: File,
  enclaveId: string,
  metadata?: Record<string, any>,
  webhookUrl?: string
) {
  const formData = new FormData();
  formData.append("file", file);

  if (metadata) {
    formData.append("metadata", JSON.stringify(metadata));
  }
  if (webhookUrl) {
    formData.append("webhook_url", webhookUrl);
  }

  const response = await fetch(
    `/api/tenants/${tenantId}/enclaves/${enclaveId}/files/`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token}`,
      },
      body: formData,
    }
  );

  if (!response.ok) {
    throw new Error(`Upload failed: ${response.status} ${response.statusText}`);
  }

  return await response.json();
}

File Processing Pipeline

After upload, a file moves through the following statuses and ends in either ready or failed:

  1. pending → File uploaded, queued for processing
  2. processing → Content extraction in progress
  3. clearing_artifacts → Cleaning up temporary files
  4. building_artifacts → Creating vector embeddings
  5. classifying → AI classification (optional)
  6. ready → File ready for queries
  7. failed → Processing failed (check logs)
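Client code usually needs to know two things about a status: whether it is final, and roughly how far along the file is. The sketch below encodes the statuses listed above. It assumes statuses are reported in the listed order, and the helper names `is_terminal` and `progress_fraction` are hypothetical. Because classification is optional, the fraction is only a rough indicator.

```python
from typing import Optional

# Status names are taken from the list above, in pipeline order.
PIPELINE_STAGES = [
    "pending",
    "processing",
    "clearing_artifacts",
    "building_artifacts",
    "classifying",
    "ready",
]
TERMINAL_STATUSES = {"ready", "failed"}


def is_terminal(status: str) -> bool:
    """Return True once no further status changes are expected."""
    return status in TERMINAL_STATUSES


def progress_fraction(status: str) -> Optional[float]:
    """Map a status to a rough 0.0-1.0 progress value.

    Returns None for "failed" or any status not in the list.
    """
    if status not in PIPELINE_STAGES:
        return None
    return PIPELINE_STAGES.index(status) / (len(PIPELINE_STAGES) - 1)
```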

Processing Times

  • Small PDFs (< 10MB): 30-60 seconds
  • Large PDFs (> 100MB): 2-5 minutes
  • Videos: 1-10 minutes (depends on length)
  • Audio: 30 seconds - 3 minutes
  • Images: 10-30 seconds

Monitoring File Status

Check Single File Status

curl -X GET "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>"

List All Files

curl -X GET "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>"

Polling for Ready Status

import time

import requests

def wait_for_file_ready(file_id, enclave_id, max_wait=300, poll_interval=5):
    """Wait for file to be ready, with timeout."""
    # BASE_URL, tenant_id, and JWT_TOKEN are assumed to be defined by the caller
    url = f"{BASE_URL}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}"
    start_time = time.time()

    while time.time() - start_time < max_wait:
        response = requests.get(url, headers={"Authorization": f"Bearer {JWT_TOKEN}"})
        response.raise_for_status()
        file = response.json()

        if file["status"] == "ready":
            return file
        elif file["status"] == "failed":
            raise Exception(f"File processing failed: {file_id}")

        time.sleep(poll_interval)

    raise TimeoutError(f"File not ready within {max_wait} seconds")
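Given that processing times range from seconds to minutes, a fixed polling interval either wastes requests or reacts slowly. The variant below is a sketch of polling with exponential backoff. It is not part of the Satori API: `wait_until_terminal` is a hypothetical helper, and `fetch_status` is any callable you supply that returns the current status string (for example, by calling the single-file endpoint above). The sleep and clock functions are injectable so the loop can be tested without waiting.

```python
import time
from typing import Callable


def wait_until_terminal(
    fetch_status: Callable[[], str],
    max_wait: float = 300.0,
    initial_interval: float = 2.0,
    max_interval: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll fetch_status() until it returns "ready" or "failed".

    The delay doubles after each attempt, capped at max_interval.
    Raises TimeoutError if max_wait seconds elapse first.
    """
    deadline = clock() + max_wait
    interval = initial_interval
    while True:
        status = fetch_status()
        if status in ("ready", "failed"):
            return status
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(f"File not ready within {max_wait} seconds")
        # Never sleep past the deadline
        sleep(min(interval, remaining))
        interval = min(interval * 2, max_interval)
```

The function returns the terminal status rather than raising on "failed", which leaves the failure policy to the caller.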

Webhooks

Webhooks notify your server when file processing completes.

Webhook Payload

{
  "event": "file.status_changed",
  "file_id": "850e8400-e29b-41d4-a716-446655440000",
  "status": "ready",
  "tenant_id": "550e8400-e29b-41d4-a716-446655440000",
  "enclave_id": "750e8400-e29b-41d4-a716-446655440000",
  "timestamp": "2025-01-15T10:05:30Z",
  "metadata": {
    "file_name": "contract.pdf",
    "size_bytes": 245000
  }
}
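Validating the payload before acting on it keeps malformed requests out of downstream code. The sketch below parses the payload shown above into a typed object. The `FileEvent` class and `parse_file_event` function are hypothetical names, and treating every field except metadata as required is an assumption based on the sample payload only.

```python
from dataclasses import dataclass
from datetime import datetime

# Assumed required, based on the sample payload above.
REQUIRED_FIELDS = ("event", "file_id", "status", "tenant_id", "enclave_id", "timestamp")


@dataclass(frozen=True)
class FileEvent:
    event: str
    file_id: str
    status: str
    tenant_id: str
    enclave_id: str
    timestamp: datetime
    metadata: dict


def parse_file_event(payload: dict) -> FileEvent:
    """Convert a decoded webhook payload into a FileEvent, or raise ValueError."""
    missing = [key for key in REQUIRED_FIELDS if key not in payload]
    if missing:
        raise ValueError(f"Webhook payload missing fields: {missing}")
    # datetime.fromisoformat() only accepts a trailing "Z" on newer Python
    # versions, so normalize it to an explicit UTC offset first.
    timestamp = datetime.fromisoformat(payload["timestamp"].replace("Z", "+00:00"))
    return FileEvent(
        event=payload["event"],
        file_id=payload["file_id"],
        status=payload["status"],
        tenant_id=payload["tenant_id"],
        enclave_id=payload["enclave_id"],
        timestamp=timestamp,
        metadata=dict(payload.get("metadata") or {}),
    )
```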

Webhook Implementation

from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/webhook/file-processed")
async def handle_file_webhook(request: Request):
    payload = await request.json()

    if payload["status"] == "ready":
        file_id = payload["file_id"]
        # File is ready - start querying
        await process_ready_file(file_id)
    elif payload["status"] == "failed":
        # Handle failure
        await handle_failed_upload(payload["file_id"])

    return {"status": "received"}

Webhook Requirements

  • HTTPS only: Webhook URLs must use HTTPS
  • Retry logic: Satori retries failed webhooks (3 attempts with exponential backoff)
  • Response: Your endpoint should return 200 OK
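Because non-HTTPS webhook URLs are rejected, it can be convenient to check the URL on the client before uploading. The following is a minimal sketch using only the standard library; `validate_webhook_url` is a hypothetical helper, and the server may apply further checks that are not covered here.

```python
from urllib.parse import urlparse


def validate_webhook_url(url: str) -> str:
    """Return the URL unchanged if it is an HTTPS URL with a host.

    Raises ValueError otherwise.
    """
    parsed = urlparse(url)
    if parsed.scheme != "https":
        raise ValueError(f"Webhook URL must use HTTPS: {url!r}")
    if not parsed.hostname:
        raise ValueError(f"Webhook URL has no host: {url!r}")
    return url
```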

File Metadata

Adding Metadata

Metadata is stored as JSON and can include any key-value pairs:

metadata = {
    "author": "John Doe",
    "date": "2025-01-15",
    "category": "research",
    "department": "engineering",
    "project": "project-alpha",
    "version": "1.0",
    "tags": ["important", "reviewed"]
}

Best Practices

  • Keep under 10KB: Large metadata can slow processing
  • Use searchable fields: Include fields you might want to filter by
  • Consistent structure: Use the same fields across similar files
  • Include timestamps: Track when files were created/uploaded

Retrieving Metadata

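response = requests.get(
    f"{BASE_URL}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}",
    headers={"Authorization": f"Bearer {JWT_TOKEN}"}
)
response.raise_for_status()
file = response.json()
# Fall back to an empty dict if file_meta is missing or null
metadata = file.get("file_meta") or {}
print(f"Author: {metadata.get('author')}")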

File Limits

Size Limits

  • Maximum file size: 512MB
  • Recommended: Keep files under 100MB for faster processing
  • Large files: Consider splitting into multiple files
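These limits can be checked locally before an upload is attempted. The sketch below classifies a file by size. It assumes 1MB = 1024 × 1024 bytes; the names `classify_size` and `check_file` are hypothetical.

```python
import os

# Limits from the list above, assuming 1MB = 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 512 * 1024 * 1024
RECOMMENDED_BYTES = 100 * 1024 * 1024


def classify_size(size_bytes: int) -> str:
    """Return "ok", "large" (allowed but slow), or "too_large" (over the limit)."""
    if size_bytes > MAX_UPLOAD_BYTES:
        return "too_large"
    if size_bytes > RECOMMENDED_BYTES:
        return "large"
    return "ok"


def check_file(file_path: str) -> str:
    """Classify a file on disk by its size."""
    return classify_size(os.path.getsize(file_path))
```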

Handling Large Files

import os

def split_large_pdf(file_path, max_size_mb=100):
    """Split large PDF into smaller chunks."""
    file_size_mb = os.path.getsize(file_path) / (1024 * 1024)

    if file_size_mb > max_size_mb:
        # Placeholder: split the file with a PDF library of your choice,
        # then upload each chunk separately
        pass

Getting Transcripts

For video and audio files, retrieve transcripts:

curl -X GET "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}/transcript" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>"

Response:

{
  "file_id": "850e8400-e29b-41d4-a716-446655440000",
  "filename": "meeting_recording.mp4",
  "content_type": "video/mp4",
  "transcript": "Welcome everyone to today's meeting...",
  "keywords": ["quarterly results", "revenue increase"],
  "created_at": "2025-01-15T10:05:00Z",
  "updated_at": "2025-01-15T10:05:00Z"
}

Downloading Files

Download files in their original format with their original filename. Files are streamed directly from storage for efficiency.

Basic Download

curl -X GET "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}/download" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
  -O -J

Flags:

  • -O: Save the file to disk
  • -J: Use the filename from the response's Content-Disposition header

Python Example

import requests

def download_file(file_id, enclave_id, output_path=None):
    """Download a file from the enclave."""
    url = f"{BASE_URL}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}/download"
    headers = {"Authorization": f"Bearer {JWT_TOKEN}"}

    # Stream download
    response = requests.get(url, headers=headers, stream=True)
    response.raise_for_status()

    # Get filename from Content-Disposition header or use default
    if output_path is None:
        content_disp = response.headers.get('Content-Disposition', '')
        if 'filename=' in content_disp:
            # Extract filename from header, ignoring any parameters that follow it
            filename = content_disp.split('filename=')[1].split(';')[0].strip().strip('"')
        else:
            # Fallback: get from file metadata
            file_info = requests.get(
                f"{BASE_URL}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}",
                headers=headers
            ).json()
            filename = file_info['name']
    else:
        filename = output_path

    # Stream to file
    with open(filename, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:
                f.write(chunk)

    return filename

# Usage
downloaded_file = download_file(
    file_id="850e8400-e29b-41d4-a716-446655440000",
    enclave_id="750e8400-e29b-41d4-a716-446655440000"
)
print(f"Downloaded: {downloaded_file}")

JavaScript/TypeScript Example

async function downloadFile(
  fileId: string,
  enclaveId: string
): Promise<void> {
  const url = `/api/tenants/${tenantId}/enclaves/${enclaveId}/files/${fileId}/download`;

  const response = await fetch(url, {
    headers: {
      Authorization: `Bearer ${token}`
    }
  });

  if (!response.ok) {
    throw new Error(`Download failed: ${response.statusText}`);
  }

  // Extract filename from Content-Disposition header
  const contentDisp = response.headers.get('Content-Disposition') || '';
  const filenameMatch = contentDisp.match(/filename="(.+)"/);
  const filename = filenameMatch ? filenameMatch[1] : 'download';

  // Get the blob
  const blob = await response.blob();

  // Create download link
  const downloadUrl = window.URL.createObjectURL(blob);
  const link = document.createElement('a');
  link.href = downloadUrl;
  link.download = filename;

  // Trigger download
  document.body.appendChild(link);
  link.click();

  // Cleanup
  document.body.removeChild(link);
  window.URL.revokeObjectURL(downloadUrl);
}

How It Works

  1. Validation: Verifies file exists and user has access
  2. Retrieve from Storage: Fetches file from S3
  3. Stream Response: Streams file content in chunks
  4. Original Filename: Returned via Content-Disposition header

Streaming Benefits

  • Memory Efficient: Files streamed in chunks, not loaded fully into memory
  • No CORS Issues: Download handled entirely by API server
  • Progress Tracking: Can monitor download progress in frontend
  • Large File Support: Handles files of any size efficiently
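Progress tracking follows naturally from chunked streaming: count bytes as each chunk is written. The sketch below is transport-agnostic and takes any iterable of byte chunks. `save_stream` and its callback signature are hypothetical, and whether the server sends a Content-Length header to supply the total is an assumption.

```python
from typing import BinaryIO, Callable, Iterable, Optional


def save_stream(
    chunks: Iterable[bytes],
    out: BinaryIO,
    total_bytes: Optional[int] = None,
    on_progress: Optional[Callable[[int, Optional[int]], None]] = None,
) -> int:
    """Write chunks to out, reporting cumulative bytes after each write.

    Returns the total number of bytes written.
    """
    written = 0
    for chunk in chunks:
        if not chunk:
            continue  # Skip empty keep-alive chunks
        out.write(chunk)
        written += len(chunk)
        if on_progress is not None:
            on_progress(written, total_bytes)
    return written
```

With the requests library, the chunks argument could be response.iter_content(chunk_size=8192) from a request made with stream=True.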

Security

  • JWT token required for download request
  • Multi-layer validation (token → tenant → file ownership)
  • Streaming prevents full file buffering in memory
  • No download-specific size restriction (any stored file can be retrieved)

Use Cases

  • Export documents for external processing
  • Download processed files for archival
  • Retrieve original files for comparison
  • Backup file retrieval
  • Integration with external systems
  • Browser-based downloads without CORS issues

Comparing Files

Compare two versions of a file to identify differences and generate a structured changeset.

Basic File Comparison

curl -X POST "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/compare" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
  -F "base_file=@contract_v1.pdf" \
  -F "target_file=@contract_v2.pdf" \
  -F "webhook_url=https://your-server.com/webhook/comparison-complete"

Using File Hashes

For files already uploaded, use their SHA-256 hashes:

curl -X POST "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/compare" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
  -F "base_file_hash=sha256:abc123def456..." \
  -F "target_file_hash=sha256:789ghi012jkl..."
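To obtain the hash of a local file, compute its SHA-256 digest in chunks so large files are not loaded into memory. This is a sketch: the "sha256:" prefix mirrors the placeholder values above, but the lowercase hexadecimal encoding is an assumption, and `sha256_ref` is a hypothetical helper name.

```python
import hashlib


def sha256_ref(file_path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return "sha256:<hex digest>" for the file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes at EOF
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"
```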

Python Example

def compare_files(base_path, target_path, enclave_id, webhook_url=None):
    url = f"{BASE_URL}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/compare"
    headers = {"Authorization": f"Bearer {JWT_TOKEN}"}

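    data = {}
    if webhook_url:
        data["webhook_url"] = webhook_url

    # Open both files in a context manager so the handles are always closed
    with open(base_path, "rb") as base, open(target_path, "rb") as target:
        files = {"base_file": base, "target_file": target}
        response = requests.post(url, headers=headers, files=files, data=data)
    response.raise_for_status()
    return response.json()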

# Compare two versions of a contract
result = compare_files(
    "contract_v1.pdf",
    "contract_v2.pdf",
    enclave_id,
    webhook_url="https://myapp.com/webhook/comparison"
)

Comparison Webhook Payload

Results are delivered via webhook when processing completes:

{
  "event": "file.comparison_complete",
  "status": "success",
  "base_file_hash": "sha256:abc123...",
  "target_file_hash": "sha256:def456...",
  "changeset": {
    "added": [
      {"block_id": "1", "content": "New paragraph added..."}
    ],
    "removed": [
      {"block_id": "2", "content": "Deleted content..."}
    ],
    "replaced": [
      {"block_id": "3", "old": "Original text", "new": "Updated text"}
    ]
  },
  "metadata": {...}
}
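A receiver will often want a quick summary of the changeset before inspecting individual blocks. The sketch below counts changes by kind and renders one line per change. It relies only on the fields shown in the sample payload above; `summarize_changeset` and `describe_changes` are hypothetical helpers.

```python
from typing import Dict, List

CHANGE_KINDS = ("added", "removed", "replaced")


def summarize_changeset(payload: dict) -> Dict[str, int]:
    """Count the blocks in each changeset category."""
    changeset = payload.get("changeset") or {}
    return {kind: len(changeset.get(kind) or []) for kind in CHANGE_KINDS}


def describe_changes(payload: dict) -> List[str]:
    """Render each change as a single diff-style line."""
    changeset = payload.get("changeset") or {}
    lines = []
    for block in changeset.get("added") or []:
        lines.append(f"+ [{block['block_id']}] {block['content']}")
    for block in changeset.get("removed") or []:
        lines.append(f"- [{block['block_id']}] {block['content']}")
    for block in changeset.get("replaced") or []:
        lines.append(f"~ [{block['block_id']}] {block['old']} -> {block['new']}")
    return lines
```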

Use Cases

  • Document versioning: Track changes between document versions
  • Contract comparison: Identify modifications in legal documents
  • Content auditing: Review what changed between file uploads

Deleting Files

⚠️ Warning: Deletion is permanent and cannot be undone.

curl -X DELETE "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>"

What gets deleted:

  • The file record in the database
  • The stored object in object storage
  • Vector embeddings
  • Transcripts
  • All processing artifacts

Duplicate Handling

Files are deduplicated by SHA-256 hash:

  • Uploading the same file twice returns the existing file
  • Use a different file_id to force a new upload
  • Duplicate detection happens automatically

Error Handling

Common Errors

413 Payload Too Large

  • File exceeds 512MB limit
  • Solution: Split or compress the file

415 Unsupported Media Type

  • File type not allowed
  • Solution: Check supported file types list

400 Bad Request

  • Invalid metadata JSON
  • Invalid webhook URL (must be HTTPS)
  • Missing required fields

404 Not Found

  • File doesn't exist
  • Check file_id and enclave_id
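Client code can turn these status codes into actionable messages in one place. The sketch below maps each code listed above to its suggested remedy. `UploadError` and `raise_for_upload_status` are hypothetical names, and the fallback message for unlisted codes is an assumption, since this guide documents only these four.

```python
# Hints are taken from the error list above.
ERROR_HINTS = {
    400: "Check the metadata JSON, the webhook URL (must be HTTPS), and required fields.",
    404: "File not found; check file_id and enclave_id.",
    413: "File exceeds the 512MB limit; split or compress it.",
    415: "File type not allowed; check the supported file types list.",
}


class UploadError(Exception):
    """Raised for a non-2xx response, carrying the status code and a hint."""

    def __init__(self, status_code: int, hint: str):
        super().__init__(f"HTTP {status_code}: {hint}")
        self.status_code = status_code
        self.hint = hint


def raise_for_upload_status(status_code: int) -> None:
    """Do nothing for 2xx responses; raise UploadError with a hint otherwise."""
    if 200 <= status_code < 300:
        return
    hint = ERROR_HINTS.get(status_code, "Unexpected error; inspect the response body.")
    raise UploadError(status_code, hint)
```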

Best Practices

✅ DO:

  • Use webhooks for async processing
  • Add meaningful metadata
  • Monitor file processing status
  • Handle file size limits
  • Use appropriate file types

❌ DON'T:

  • Upload files without checking status
  • Upload duplicate files unnecessarily
  • Upload files larger than 512MB
  • Ignore failed processing status
  • Use insecure webhook URLs (HTTP)

Next Steps