File Management Guide
This guide covers uploading, processing, and managing files in Satori enclaves.
Supported File Types
Satori supports a wide variety of file types:
Documents
- PDF: application/pdf
- Text: text/plain, text/csv, text/tsv
- Word: .docx, .doc
- Excel: .xlsx, .xls
- PowerPoint: .pptx, .ppt
- OpenDocument: .odt, .ods, .odp
- Other: JSON, XML, RTF
Images
- JPEG, PNG, GIF, WebP, SVG, TIFF, BMP
Video (with transcription)
- MP4, MPEG, AVI, MOV, WMV, WebM, MKV, FLV
Audio (with transcription)
- MP3, WAV, OGG, M4A, AAC, MIDI
Archives
- ZIP, RAR, 7Z, TAR, GZIP
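A client can reject obviously unsupported files before spending an upload on them. The sketch below checks a filename's extension against the formats listed above. The `SUPPORTED_EXTENSIONS` set and the `is_supported` helper are illustrative names, not part of the Satori API, and the extension spellings (for example `.jpg` for JPEG or `.gz` for GZIP) are assumptions. The server remains the authority on what it accepts.

```python
from pathlib import Path

# Assumed extension spellings for the formats listed in this guide.
# This is a client-side convenience check, not the server's own rule set.
SUPPORTED_EXTENSIONS = {
    # Documents
    ".pdf", ".txt", ".csv", ".tsv", ".docx", ".doc", ".xlsx", ".xls",
    ".pptx", ".ppt", ".odt", ".ods", ".odp", ".json", ".xml", ".rtf",
    # Images
    ".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".tiff", ".tif", ".bmp",
    # Video
    ".mp4", ".mpeg", ".mpg", ".avi", ".mov", ".wmv", ".webm", ".mkv", ".flv",
    # Audio
    ".mp3", ".wav", ".ogg", ".m4a", ".aac", ".mid", ".midi",
    # Archives
    ".zip", ".rar", ".7z", ".tar", ".gz",
}


def is_supported(filename: str) -> bool:
    """Return True if the file's final extension is in the allow-list."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

The check is case-insensitive and looks only at the last extension, so `archive.tar.gz` is matched on `.gz`.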
Uploading Files
Basic Upload
curl -X POST "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/" \
-H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
-F "file=@/path/to/document.pdf"
Upload with Metadata
curl -X POST "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/" \
-H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
-F "file=@document.pdf" \
-F 'metadata={"author": "John Doe", "category": "research", "date": "2025-01-15"}'
Upload with Webhook
curl -X POST "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/" \
-H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
-F "file=@document.pdf" \
-F "webhook_url=https://your-server.com/webhook/file-processed"
Python Example
import json
import requests

def upload_file(file_path, enclave_id, metadata=None, webhook_url=None):
    # BASE_URL, JWT_TOKEN, and tenant_id are assumed to be defined elsewhere
    url = f"{BASE_URL}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/"
    headers = {"Authorization": f"Bearer {JWT_TOKEN}"}
    data = {}
    if metadata:
        data["metadata"] = json.dumps(metadata)
    if webhook_url:
        data["webhook_url"] = webhook_url
    # Open the file in a context manager so the handle is always closed
    with open(file_path, "rb") as f:
        response = requests.post(url, headers=headers, files={"file": f}, data=data)
    # Raise on 4xx/5xx instead of parsing an error body as a file record
    response.raise_for_status()
    return response.json()

# Usage
file_info = upload_file(
    "document.pdf",
    enclave_id,
    metadata={"author": "John Doe", "category": "research"},
    webhook_url="https://myapp.com/webhook"
)
print(f"File uploaded: {file_info['id']}, Status: {file_info['status']}")
JavaScript/TypeScript Example
async function uploadFile(
  file: File,
  enclaveId: string,
  metadata?: Record<string, any>,
  webhookUrl?: string
) {
  const formData = new FormData();
  formData.append("file", file);
  if (metadata) {
    formData.append("metadata", JSON.stringify(metadata));
  }
  if (webhookUrl) {
    formData.append("webhook_url", webhookUrl);
  }
  const response = await fetch(
    `/api/tenants/${tenantId}/enclaves/${enclaveId}/files/`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token}`,
      },
      body: formData,
    }
  );
  return await response.json();
}
File Processing Pipeline
Files go through several processing stages:
- pending → File uploaded, queued for processing
- processing → Content extraction in progress
- clearing_artifacts → Cleaning up temporary files
- building_artifacts → Creating vector embeddings
- classifying → AI classification (optional)
- ready → File ready for queries
- failed → Processing failed (check logs)
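Client code usually needs to know two things about a status: whether it is final, and roughly how far along the file is. The sketch below encodes the stages listed above. The names `PIPELINE_STAGES`, `is_terminal`, and `progress_fraction` are illustrative, and the sketch assumes the stages occur in the order listed.

```python
from typing import Optional

# Stage names come from the list above; the order is assumed to match
# the order in which the guide lists them. "failed" can occur at any point.
PIPELINE_STAGES = [
    "pending",
    "processing",
    "clearing_artifacts",
    "building_artifacts",
    "classifying",
    "ready",
]
TERMINAL_STATUSES = {"ready", "failed"}


def is_terminal(status: str) -> bool:
    """True once the file will not change status again."""
    return status in TERMINAL_STATUSES


def progress_fraction(status: str) -> Optional[float]:
    """Rough position in the pipeline from 0.0 to 1.0, or None if failed/unknown."""
    if status not in PIPELINE_STAGES:
        return None
    return PIPELINE_STAGES.index(status) / (len(PIPELINE_STAGES) - 1)
```

Because classification is optional, a file may skip from `building_artifacts` straight to `ready`, so the fraction is only a coarse indicator.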
Processing Times
- Small PDFs (< 10MB): 30-60 seconds
- Large PDFs (> 100MB): 2-5 minutes
- Videos: 1-10 minutes (depends on length)
- Audio: 30 seconds - 3 minutes
- Images: 10-30 seconds
Monitoring File Status
Check Single File Status
curl -X GET "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}" \
-H "Authorization: Bearer <YOUR_JWT_TOKEN>"
List All Files
curl -X GET "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/" \
-H "Authorization: Bearer <YOUR_JWT_TOKEN>"
Polling for Ready Status
import time
import requests

def wait_for_file_ready(file_id, enclave_id, max_wait=300, poll_interval=5):
    """Wait for file to be ready, with timeout."""
    # BASE_URL, JWT_TOKEN, and tenant_id are assumed to be defined elsewhere
    url = f"{BASE_URL}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}"
    start_time = time.time()
    while time.time() - start_time < max_wait:
        response = requests.get(
            url,
            headers={"Authorization": f"Bearer {JWT_TOKEN}"}
        )
        response.raise_for_status()
        file = response.json()
        if file["status"] == "ready":
            return file
        elif file["status"] == "failed":
            raise Exception(f"File processing failed: {file_id}")
        time.sleep(poll_interval)
    raise TimeoutError(f"File not ready within {max_wait} seconds")
Webhooks
Webhooks notify your server when file processing completes.
Webhook Payload
{
  "event": "file.status_changed",
  "file_id": "850e8400-e29b-41d4-a716-446655440000",
  "status": "ready",
  "tenant_id": "550e8400-e29b-41d4-a716-446655440000",
  "enclave_id": "750e8400-e29b-41d4-a716-446655440000",
  "timestamp": "2025-01-15T10:05:30Z",
  "metadata": {
    "file_name": "contract.pdf",
    "size_bytes": 245000
  }
}
Webhook Implementation
from fastapi import FastAPI, Request

app = FastAPI()

# process_ready_file and handle_failed_upload are placeholders for
# coroutines defined by your application.
@app.post("/webhook/file-processed")
async def handle_file_webhook(request: Request):
    payload = await request.json()
    if payload["status"] == "ready":
        file_id = payload["file_id"]
        # File is ready - start querying
        await process_ready_file(file_id)
    elif payload["status"] == "failed":
        # Handle failure
        await handle_failed_upload(payload["file_id"])
    return {"status": "received"}
Webhook Requirements
- HTTPS only: Webhook URLs must use HTTPS
- Retry logic: Satori retries failed webhooks (3 attempts with exponential backoff)
- Response: Your endpoint should return 200 OK
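Since an invalid webhook URL is rejected, it can be worth validating the URL on the client before sending the upload. The following is a minimal sketch using only the standard library; `is_valid_webhook_url` is an illustrative helper, and the exact validation Satori performs server-side may be stricter.

```python
from urllib.parse import urlparse


def is_valid_webhook_url(url: str) -> bool:
    """Accept only absolute URLs that use the https scheme and name a host."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and bool(parsed.netloc)
```

For example, `is_valid_webhook_url("http://your-server.com/webhook")` is rejected because of the plain HTTP scheme.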
File Metadata
Adding Metadata
Metadata is stored as JSON and can include any key-value pairs:
metadata = {
    "author": "John Doe",
    "date": "2025-01-15",
    "category": "research",
    "department": "engineering",
    "project": "project-alpha",
    "version": "1.0",
    "tags": ["important", "reviewed"]
}
Best Practices
- Keep under 10KB: Large metadata can slow processing
- Use searchable fields: Include fields you might want to filter by
- Consistent structure: Use the same fields across similar files
- Include timestamps: Track when files were created/uploaded
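The 10KB guideline is easy to check before upload by measuring the serialized JSON. The sketch below assumes that the limit applies to the UTF-8 encoded JSON string and that 1KB means 1024 bytes; `MAX_METADATA_BYTES` and `check_metadata` are illustrative names.

```python
import json

# Assumption: the guideline refers to the UTF-8 size of the JSON text,
# with 1KB = 1024 bytes.
MAX_METADATA_BYTES = 10 * 1024


def check_metadata(metadata: dict) -> int:
    """Return the serialized size in bytes, or raise ValueError if over the guideline."""
    size = len(json.dumps(metadata).encode("utf-8"))
    if size > MAX_METADATA_BYTES:
        raise ValueError(
            f"Metadata is {size} bytes; keep it under {MAX_METADATA_BYTES} bytes"
        )
    return size
```

Calling `check_metadata(metadata)` just before building the upload request catches oversized metadata without a round trip to the server.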
Retrieving Metadata
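# BASE_URL, JWT_TOKEN, tenant_id, enclave_id, and file_id are assumed to be defined elsewhere
response = requests.get(
    f"{BASE_URL}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}",
    headers={"Authorization": f"Bearer {JWT_TOKEN}"}
)
file = response.json()
metadata = file.get("file_meta", {})
print(f"Author: {metadata.get('author')}")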
File Limits
Size Limits
- Maximum file size: 512MB
- Recommended: Keep files under 100MB for faster processing
- Large files: Consider splitting into multiple files
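These limits can be checked locally before an upload is attempted. The sketch below classifies a size in bytes against the 512MB maximum and the 100MB recommendation, assuming 1MB means 1024 × 1024 bytes; the function names and the returned labels are illustrative.

```python
import os

# Assumption: 1MB = 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 512 * 1024 * 1024
RECOMMENDED_BYTES = 100 * 1024 * 1024


def classify_size(size_bytes: int) -> str:
    """Return 'too_large', 'large', or 'ok' for a file size in bytes."""
    if size_bytes > MAX_UPLOAD_BYTES:
        return "too_large"  # would be rejected; split or compress first
    if size_bytes > RECOMMENDED_BYTES:
        return "large"      # allowed, but expect slower processing
    return "ok"


def classify_file(path: str) -> str:
    """Classify a file on disk by its size."""
    return classify_size(os.path.getsize(path))
```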
Handling Large Files
import os

def split_large_pdf(file_path, max_size_mb=100):
    """Split large PDF into smaller chunks."""
    file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
    if file_size_mb > max_size_mb:
        # Use PDF splitting library
        # Upload each chunk separately
        pass
Getting Transcripts
For video and audio files, retrieve transcripts:
curl -X GET "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}/transcript" \
-H "Authorization: Bearer <YOUR_JWT_TOKEN>"
Response:
{
  "file_id": "850e8400-e29b-41d4-a716-446655440000",
  "filename": "meeting_recording.mp4",
  "content_type": "video/mp4",
  "transcript": "Welcome everyone to today's meeting...",
  "keywords": ["quarterly results", "revenue increase"],
  "created_at": "2025-01-15T10:05:00Z",
  "updated_at": "2025-01-15T10:05:00Z"
}
Downloading Files
Download files in their original format with their original filename. Files are streamed directly from storage for efficiency.
Basic Download
curl -X GET "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}/download" \
-H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
-O -J
Flags:
- -O: Save file to disk
- -J: Use Content-Disposition filename from response
Python Example
import os
import requests

def download_file(file_id, enclave_id, output_path=None):
    """Download a file from the enclave."""
    # BASE_URL, JWT_TOKEN, and tenant_id are assumed to be defined elsewhere
    url = f"{BASE_URL}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}/download"
    headers = {"Authorization": f"Bearer {JWT_TOKEN}"}

    # Stream download
    response = requests.get(url, headers=headers, stream=True)
    response.raise_for_status()

    # Get filename from Content-Disposition header or use default
    if output_path is None:
        content_disp = response.headers.get('Content-Disposition', '')
        if 'filename=' in content_disp:
            # Extract filename from header, dropping trailing parameters and quotes
            filename = content_disp.split('filename=')[1].split(';')[0].strip().strip('"')
        else:
            # Fallback: get from file metadata
            file_info = requests.get(
                f"{BASE_URL}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}",
                headers=headers
            ).json()
            filename = file_info['name']
        # Keep only the final path component so the file lands in the current directory
        filename = os.path.basename(filename)
    else:
        filename = output_path

    # Stream to file
    with open(filename, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:
                f.write(chunk)
    return filename

# Usage
downloaded_file = download_file(
    file_id="850e8400-e29b-41d4-a716-446655440000",
    enclave_id="750e8400-e29b-41d4-a716-446655440000"
)
print(f"Downloaded: {downloaded_file}")
JavaScript/TypeScript Example
async function downloadFile(
  fileId: string,
  enclaveId: string
): Promise<void> {
  const url = `/api/tenants/${tenantId}/enclaves/${enclaveId}/files/${fileId}/download`;
  const response = await fetch(url, {
    headers: {
      Authorization: `Bearer ${token}`
    }
  });
  if (!response.ok) {
    throw new Error(`Download failed: ${response.statusText}`);
  }
  // Extract filename from Content-Disposition header
  const contentDisp = response.headers.get('Content-Disposition') || '';
  const filenameMatch = contentDisp.match(/filename="(.+)"/);
  const filename = filenameMatch ? filenameMatch[1] : 'download';
  // Get the blob
  const blob = await response.blob();
  // Create download link
  const downloadUrl = window.URL.createObjectURL(blob);
  const link = document.createElement('a');
  link.href = downloadUrl;
  link.download = filename;
  // Trigger download
  document.body.appendChild(link);
  link.click();
  // Cleanup
  document.body.removeChild(link);
  window.URL.revokeObjectURL(downloadUrl);
}
How It Works
- Validation: Verifies file exists and user has access
- Retrieve from Storage: Fetches file from S3
- Stream Response: Streams file content in chunks
- Original Filename: Returned via Content-Disposition header
Streaming Benefits
- Memory Efficient: Files streamed in chunks, not loaded fully into memory
- No CORS Issues: Download handled entirely by API server
- Progress Tracking: Can monitor download progress in frontend
- Large File Support: Handles files of any size efficiently
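Chunked streaming is what makes progress tracking possible: the client can count bytes as each chunk arrives. The sketch below shows the idea with a generic chunk iterator. `write_chunks` and its `on_progress` callback are illustrative names, and in-memory data stands in for a network response.

```python
import io
from typing import BinaryIO, Callable, Iterable, Optional


def write_chunks(
    chunks: Iterable[bytes],
    out: BinaryIO,
    total_bytes: Optional[int] = None,
    on_progress: Optional[Callable[[int, Optional[int]], None]] = None,
) -> int:
    """Write chunks to `out`, reporting cumulative bytes after each non-empty chunk."""
    written = 0
    for chunk in chunks:
        if not chunk:  # skip empty keep-alive chunks
            continue
        out.write(chunk)
        written += len(chunk)
        if on_progress is not None:
            on_progress(written, total_bytes)
    return written


# Example: in-memory chunks standing in for a streamed download
buffer = io.BytesIO()
progress_log = []
total_written = write_chunks(
    [b"abc", b"", b"defg"],
    buffer,
    total_bytes=7,
    on_progress=lambda done, total: progress_log.append(done),
)
```

With the `requests` library, `chunks` would be `response.iter_content(chunk_size=8192)`. Taking `total_bytes` from a `Content-Length` header is an assumption, since a streamed response may not include one.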
Security
- JWT token required for download request
- Multi-layer validation (token → tenant → file ownership)
- Streaming prevents full file buffering in memory
- No size restrictions (handles files of any size)
Use Cases
- Export documents for external processing
- Download processed files for archival
- Retrieve original files for comparison
- Backup file retrieval
- Integration with external systems
- Browser-based downloads without CORS issues
Comparing Files
Compare two versions of a file to identify differences and generate a structured changeset.
Basic File Comparison
curl -X POST "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/compare" \
-H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
-F "base_file=@contract_v1.pdf" \
-F "target_file=@contract_v2.pdf" \
-F "webhook_url=https://your-server.com/webhook/comparison-complete"
Using File Hashes
For files already uploaded, use their SHA-256 hashes:
curl -X POST "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/compare" \
-H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
-F "base_file_hash=sha256:abc123def456..." \
-F "target_file_hash=sha256:789ghi012jkl..."
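To build these hash values locally, a file can be hashed in chunks with the standard library. This is a sketch under two assumptions: that Satori hashes the raw file bytes, and that it expects a lowercase hex digest after the `sha256:` prefix shown above. `hash_stream` and `hash_file` are illustrative helper names.

```python
import hashlib
import io
from typing import BinaryIO


def hash_stream(stream: BinaryIO, chunk_size: int = 8192) -> str:
    """Return 'sha256:<hex digest>' for a binary stream, read in chunks."""
    digest = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def hash_file(path: str) -> str:
    """Hash a file on disk without loading it fully into memory."""
    with open(path, "rb") as f:
        return hash_stream(f)


# Example with in-memory bytes standing in for a file
example_hash = hash_stream(io.BytesIO(b"hello"))
```

Reading in fixed-size chunks keeps memory use flat even for files near the upload limit.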
Python Example
import requests

def compare_files(base_path, target_path, enclave_id, webhook_url=None):
    # BASE_URL, JWT_TOKEN, and tenant_id are assumed to be defined elsewhere
    url = f"{BASE_URL}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/compare"
    headers = {"Authorization": f"Bearer {JWT_TOKEN}"}
    data = {}
    if webhook_url:
        data["webhook_url"] = webhook_url
    # Context managers ensure both file handles are closed after the request
    with open(base_path, "rb") as base, open(target_path, "rb") as target:
        files = {"base_file": base, "target_file": target}
        response = requests.post(url, headers=headers, files=files, data=data)
    response.raise_for_status()
    return response.json()

# Compare two versions of a contract
result = compare_files(
    "contract_v1.pdf",
    "contract_v2.pdf",
    enclave_id,
    webhook_url="https://myapp.com/webhook/comparison"
)
Comparison Webhook Payload
Results are delivered via webhook when processing completes:
{
  "event": "file.comparison_complete",
  "status": "success",
  "base_file_hash": "sha256:abc123...",
  "target_file_hash": "sha256:def456...",
  "changeset": {
    "added": [
      {"block_id": "1", "content": "New paragraph added..."}
    ],
    "removed": [
      {"block_id": "2", "content": "Deleted content..."}
    ],
    "replaced": [
      {"block_id": "3", "old": "Original text", "new": "Updated text"}
    ]
  },
  "metadata": {...}
}
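A webhook receiver will typically reduce this payload to something a reviewer can scan. The sketch below counts the changes of each kind and renders the replacements as one line each. It assumes the payload has the shape shown above; `summarize_changeset` and `describe_replacements` are illustrative helpers.

```python
from typing import Dict, List


def summarize_changeset(payload: dict) -> Dict[str, int]:
    """Count added, removed, and replaced blocks in a comparison payload."""
    changeset = payload.get("changeset", {})
    return {
        kind: len(changeset.get(kind, []))
        for kind in ("added", "removed", "replaced")
    }


def describe_replacements(payload: dict) -> List[str]:
    """Render each replaced block as a single human-readable line."""
    replaced = payload.get("changeset", {}).get("replaced", [])
    return [
        f'block {item["block_id"]}: "{item["old"]}" -> "{item["new"]}"'
        for item in replaced
    ]
```

Both helpers tolerate a payload with no `changeset` key, returning zero counts and an empty list respectively.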
Use Cases
- Document versioning: Track changes between document versions
- Contract comparison: Identify modifications in legal documents
- Content auditing: Review what changed between file uploads
Deleting Files
⚠️ Warning: Deletion is permanent and cannot be undone.
curl -X DELETE "__API_HOST__/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}" \
-H "Authorization: Bearer <YOUR_JWT_TOKEN>"
What gets deleted:
- File record from database
- Object storage object
- Vector embeddings
- Transcripts
- All processing artifacts
Duplicate Handling
Files are deduplicated by SHA-256 hash:
- Uploading the same file twice returns the existing file
- Use a different file_id to force a new upload
- Duplicate detection happens automatically
Error Handling
Common Errors
413 Payload Too Large
- File exceeds 512MB limit
- Solution: Split or compress the file
415 Unsupported Media Type
- File type not allowed
- Solution: Check supported file types list
400 Bad Request
- Invalid metadata JSON
- Invalid webhook URL (must be HTTPS)
- Missing required fields
404 Not Found
- File doesn't exist
- Check file_id and enclave_id
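In client code, the errors above can be turned into actionable messages with a simple lookup. The following is a minimal sketch; the `ERROR_HINTS` table and `explain_error` function are illustrative, and the wording of each hint paraphrases this guide rather than any message returned by the API.

```python
# Hints paraphrase the "Common Errors" list; they are not API responses.
ERROR_HINTS = {
    400: "Bad request: check the metadata JSON, the webhook URL (must be HTTPS), and required fields.",
    404: "Not found: check file_id and enclave_id.",
    413: "Payload too large: the file exceeds the 512MB limit; split or compress it.",
    415: "Unsupported media type: check the supported file types list.",
}


def explain_error(status_code: int) -> str:
    """Return a hint for a known upload error, or a generic message otherwise."""
    return ERROR_HINTS.get(status_code, f"Unexpected HTTP status {status_code}.")
```

A caller could log `explain_error(response.status_code)` whenever a request does not succeed.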
Best Practices
✅ DO:
- Use webhooks for async processing
- Add meaningful metadata
- Monitor file processing status
- Handle file size limits
- Use appropriate file types
❌ DON'T:
- Upload files without checking status
- Upload duplicate files unnecessarily
- Upload files larger than 512MB
- Ignore failed processing status
- Use insecure webhook URLs (HTTP)
Next Steps
- Querying Documents Guide - Query your uploaded files
- Best Practices Guide - General usage best practices
- API Reference - Full API documentation