Cost Analysis
Comprehensive breakdown of infrastructure costs for the NorthBuilt RAG System.
Note: This system uses S3 Vectors with Bedrock Knowledge Base for vector storage (pay-per-use). See ADR-010 for migration details.
Monthly Cost Breakdown
Base Infrastructure (Always On)
| Service | Component | Monthly Cost | Notes |
|---|---|---|---|
| S3 Vectors | Knowledge Base storage | Variable | Pay-per-use (~$0.10/1M vectors stored) |
| S3 | Documents bucket | $0.25 | ~10GB documents |
| S3 | Terraform state storage | $0.10 | ~1GB data |
| DynamoDB | Terraform state locks | $0.01 | On-demand, minimal usage |
| DynamoDB | Classify table | $0.25 | On-demand, ~1000 writes/month |
| Secrets Manager | 4 secrets | $1.60 | $0.40/secret/month |
| CloudFront | Distribution | $1.00 | First 1TB free, minimal overage |
| S3 | Web hosting | $0.05 | ~500MB static assets |
| Cognito | User pool | $0.00 | First 50K MAU free |
| API Gateway | HTTP API | $1.00 | $1.00/million requests; ~10K requests/month is ~$0.01, so this line is a generous allowance |
| Lambda | Chat function (reserved) | $10.00 | Reserved concurrency |
| Lambda | Other functions | $5.00 | On-demand |
| CloudWatch | Logs (7-day retention) | $2.00 | ~5GB/month |
| Bedrock | Inference + embeddings | $45.00 | See usage breakdown below |
| TOTAL BASE | | ~$66/month | Pay-per-use model |
Usage-Based Costs
Bedrock Inference Costs
Claude Sonnet 4.5 Pricing
- Input: $3.00 per million tokens
- Output: $15.00 per million tokens
Titan Embeddings V2 Pricing
- Input: $0.0001 per 1000 tokens
Example Monthly Usage (1000 queries)
Queries: 1000/month
Average query: 50 tokens
Average context: 2000 tokens (5 documents × 400 tokens each)
Average response: 300 tokens
Claude Sonnet Costs (response generation):
- Input: (50 + 2000) × 1000 = 2.05M tokens × $3/M = $6.15
- Output: 300 × 1000 = 0.3M tokens × $15/M = $4.50
- Total Claude Sonnet: $10.65/month
Claude Haiku Costs (query understanding):
- Input: 500 tokens × 1000 = 0.5M tokens × $0.25/M = $0.125
- Output: 100 tokens × 1000 = 0.1M tokens × $1.25/M = $0.125
- Total Claude Haiku: $0.25/month
Titan Costs (for retrieval):
- Embeddings: 50 tokens × 1000 queries = 50K tokens × $0.0001/1K = $0.005
- Total Titan: $0.01/month (negligible)
Total Bedrock: $10.91/month for 1000 queries
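The per-query arithmetic above can be wrapped in a small calculator, which makes other volumes or token profiles easy to try. This is a minimal sketch. It assumes the list prices quoted in this section (which change over time) and the average token counts from the 1,000-query example. `QueryProfile` and `bedrock_monthly_cost` are hypothetical helper names, not part of the system.

```python
from dataclasses import dataclass

# USD per 1M tokens, as quoted in this section (assumed; check current Bedrock pricing).
SONNET_IN, SONNET_OUT = 3.00, 15.00
HAIKU_IN, HAIKU_OUT = 0.25, 1.25
TITAN_IN = 0.10  # $0.0001 per 1K tokens == $0.10 per 1M tokens


@dataclass(frozen=True)
class QueryProfile:
    """Average tokens per query (defaults match the 1,000-query example)."""
    query_tokens: int = 50
    context_tokens: int = 2000
    response_tokens: int = 300
    qu_input_tokens: int = 500   # Haiku query-understanding prompt
    qu_output_tokens: int = 100  # Haiku structured JSON response


def bedrock_monthly_cost(queries: int, p: QueryProfile = QueryProfile()) -> dict[str, float]:
    """Monthly Bedrock cost in USD, split by model."""
    millions = queries / 1_000_000  # turns "tokens per query" into "millions of tokens"
    sonnet = millions * ((p.query_tokens + p.context_tokens) * SONNET_IN
                         + p.response_tokens * SONNET_OUT)
    haiku = millions * (p.qu_input_tokens * HAIKU_IN + p.qu_output_tokens * HAIKU_OUT)
    titan = millions * p.query_tokens * TITAN_IN  # query embedding for retrieval
    return {"sonnet": sonnet, "haiku": haiku, "titan": titan,
            "total": sonnet + haiku + titan}


if __name__ == "__main__":
    for volume in (100, 1_000, 10_000):
        print(volume, round(bedrock_monthly_cost(volume)["total"], 2))
```

Passing `QueryProfile(context_tokens=1200)` models the 3-document context discussed under the optimization strategies.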
Query Understanding Cost Breakdown
Query understanding extracts client filters from natural language queries using Claude Haiku for cost efficiency.
Per Query Cost
Input tokens: ~500 (query + entity list + prompt)
Output tokens: ~100 (structured JSON response)
Cost per query: $0.00025 (Claude Haiku)
Monthly Cost by Volume
| Monthly Queries | Haiku Input | Haiku Output | Total QU Cost |
|---|---|---|---|
| 100 | $0.01 | $0.01 | $0.02 |
| 1,000 | $0.13 | $0.13 | $0.25 |
| 5,000 | $0.63 | $0.63 | $1.25 |
| 10,000 | $1.25 | $1.25 | $2.50 |
| 50,000 | $6.25 | $6.25 | $12.50 |
Why Claude Haiku?
- 12x cheaper than Claude Sonnet for entity extraction
- Fast response time (~200ms)
- Structured JSON output is reliable
- Entity extraction doesn’t require Sonnet’s reasoning capabilities
Scaling Examples (including Query Understanding)
| Monthly Queries | Sonnet (Generation) | Haiku (QU) | Titan | Total Bedrock |
|---|---|---|---|---|
| 100 | $1.07 | $0.02 | $0.00 | $1.09 |
| 1,000 | $10.65 | $0.25 | $0.01 | $10.91 |
| 5,000 | $53.25 | $1.25 | $0.03 | $54.53 |
| 10,000 | $106.50 | $2.50 | $0.05 | $109.05 |
| 50,000 | $532.50 | $12.50 | $0.25 | $545.25 |
Document Ingestion Costs
Per Document
Document size: 5000 tokens (typical)
Chunks: 10 chunks × 500 tokens each
Titan Embedding Costs:
- 10 embeddings × 500 tokens = 5000 tokens
- Cost: 5000 tokens × $0.0001/1K = $0.0005
S3 Vectors Storage:
- Minimal storage cost (~$0.10/1M vectors)
Total per document: $0.0005 (negligible)
Monthly Ingestion Examples
| Documents/Month | Embeddings | Total Cost |
|---|---|---|
| 100 | 500K tokens | $0.05 |
| 1,000 | 5M tokens | $0.50 |
| 10,000 | 50M tokens | $5.00 |
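The per-document figure is chunk count × chunk size × the Titan price. This sketch assumes fixed-size token chunks as in the example above. The `overlap_tokens` parameter is a hypothetical addition: it shows how chunk overlap, which the estimates above do not include, inflates the number of embedded tokens.

```python
TITAN_PRICE_PER_1K = 0.0001  # USD per 1K input tokens, as quoted above (assumed)


def embedded_tokens(doc_tokens: int, chunk_tokens: int = 500, overlap_tokens: int = 0) -> int:
    """Total tokens sent to the embedding model for one document."""
    step = chunk_tokens - overlap_tokens
    if step <= 0:
        raise ValueError("overlap must be smaller than the chunk size")
    total, start = 0, 0
    while start < doc_tokens:
        total += min(chunk_tokens, doc_tokens - start)  # the last chunk may be short
        if start + chunk_tokens >= doc_tokens:
            break  # this chunk reached the end of the document
        start += step
    return total


def ingestion_cost(docs: int, doc_tokens: int = 5000, chunk_tokens: int = 500,
                   overlap_tokens: int = 0) -> float:
    """Monthly embedding cost in USD for `docs` documents of `doc_tokens` each."""
    tokens = docs * embedded_tokens(doc_tokens, chunk_tokens, overlap_tokens)
    return tokens / 1000 * TITAN_PRICE_PER_1K
```

With no overlap, `embedded_tokens(5000)` is 5,000 (10 chunks × 500). With a 100-token overlap it rises to 6,200 (13 chunks), about 24% more embedding cost.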
Combined Monthly Estimates
Base is the always-on infrastructure above without its Bedrock line ($66.26 - $45.00 = $21.26). Bedrock figures come from the scaling table and include query understanding.
| Usage Profile | Base | Bedrock | Ingestion | Total |
|---|---|---|---|---|
| Development (100 queries, 100 docs) | $21.26 | $1.09 | $0.05 | $22.40 |
| Light Production (1K queries, 1K docs) | $21.26 | $10.91 | $0.50 | $32.67 |
| Medium Production (5K queries, 5K docs) | $21.26 | $54.53 | $2.50 | $78.29 |
| Heavy Production (10K queries, 10K docs) | $21.26 | $109.05 | $5.00 | $135.31 |
| Enterprise (50K queries, 50K docs) | $21.26 | $545.25 | $25.00 | $591.51 |
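Each row is base + Bedrock + ingestion, and only the last two scale with usage. This sketch keeps the base as an explicit input, because the always-on line items change independently of query volume. The default rates are this document's quoted figures: the 1,000-query Bedrock total including query understanding, and $0.50 per 1,000 documents ingested. `monthly_total` is a hypothetical helper.

```python
def monthly_total(base: float, queries: int, docs: int,
                  bedrock_per_1k_queries: float = 10.905,
                  ingestion_per_1k_docs: float = 0.50) -> float:
    """Estimated monthly cost in USD: fixed base plus usage-driven Bedrock and ingestion."""
    bedrock = bedrock_per_1k_queries * queries / 1000
    ingestion = ingestion_per_1k_docs * docs / 1000
    return base + bedrock + ingestion


if __name__ == "__main__":
    # Base table total without its Bedrock line ($66.26 - $45.00); adjust as line items change.
    base = 21.26
    for q, d in [(100, 100), (1_000, 1_000), (10_000, 10_000)]:
        print(f"{q:>6} queries, {d:>6} docs: ${monthly_total(base, q, d):,.2f}")
```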
Cost Optimization Strategies
Short-Term Optimizations (No Architecture Changes)
1. Optimize Lambda Memory
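Current: Chat Lambda = 1024MB
Strategy:
- Profile memory usage via CloudWatch (each invocation's REPORT log line includes Max Memory Used)
- Reduce to 512MB if under-utilized
- Lambda pricing: $0.0000166667 per GB-second
Calculation:
Current: 1GB (1024MB) × 3s × 1000 invocations = 3,000 GB-seconds = $0.050
Optimized: 0.5GB (512MB) × 3s × 1000 invocations = 1,500 GB-seconds = $0.025
Savings: $0.025 per 1,000 invocations (~$0.25/month at 10K queries). This assumes duration stays at 3s; Lambda allocates CPU in proportion to memory, so a smaller function may run longer.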
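Lambda bills duration in GB-seconds, so memory must be converted from MB to GB (divide by 1024) before multiplying by billed seconds and invocations. This sketch shows that conversion. It assumes the x86 per-GB-second price quoted above and ignores the per-request charge, which does not depend on memory size. `lambda_duration_cost` is a hypothetical helper.

```python
PRICE_PER_GB_SECOND = 0.0000166667  # x86 price quoted above (assumed; varies by region/arch)


def lambda_duration_cost(memory_mb: int, seconds: float, invocations: int) -> tuple[float, float]:
    """Return (GB-seconds, USD) for the duration portion of a Lambda bill."""
    gb_seconds = (memory_mb / 1024) * seconds * invocations
    return gb_seconds, gb_seconds * PRICE_PER_GB_SECOND


if __name__ == "__main__":
    for mb in (1024, 512):
        gbs, usd = lambda_duration_cost(mb, 3, 1000)
        print(f"{mb}MB: {gbs:,.0f} GB-s, ${usd:.3f}")
```

Run as a script, this prints 3,000 GB-s / $0.050 for 1024MB and 1,500 GB-s / $0.025 for 512MB.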
2. Reduce Context Window
Current: 5 documents × 400 tokens = 2000 tokens
Strategy:
- Reduce to 3 documents = 1200 tokens
- Improves response time
- Reduces token costs
Calculation:
Current: 2050 tokens input × $3/M × 1000 = $6.15
Optimized: 1250 tokens input × $3/M × 1000 = $3.75
Savings: $2.40 per 1,000 queries (~$24/month at 10K queries). Fewer retrieved documents may also lower answer quality.
3. Implement Response Caching
Strategy:
- Cache identical queries for 1 hour
- Estimate 20% cache hit rate
- Store in DynamoDB or ElastiCache
Calculation:
Queries: 1000/month
Cache hits: 200/month (20%)
Saved Bedrock calls: 200 × $0.01 = $2.00
DynamoDB storage cost: 1MB × $0.25/GB = negligible
Savings: ~$2/month at 1K queries, scales linearly
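The caching strategy can be sketched as a TTL lookup in front of the Bedrock call. This in-memory version assumes a one-hour TTL. Its cache key combines the client ID with a whitespace- and case-normalized query, so tenants never share cached answers under the client-level multi-tenancy described later. `TTLQueryCache` is hypothetical. A DynamoDB-backed version would store the same key with an expiry attribute. DynamoDB's TTL deletion is not immediate, though, so reads should still check the expiry as `get` does here.

```python
import hashlib
import time
from typing import Callable, Optional


class TTLQueryCache:
    """In-memory cache of answers keyed by (client, normalized query), with expiry."""

    def __init__(self, ttl_seconds: float = 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: dict[str, tuple[float, str]] = {}

    @staticmethod
    def key(client_id: str, query: str) -> str:
        normalized = " ".join(query.lower().split())  # collapse case and whitespace
        return hashlib.sha256(f"{client_id}\x00{normalized}".encode()).hexdigest()

    def get(self, client_id: str, query: str) -> Optional[str]:
        k = self.key(client_id, query)
        entry = self._store.get(k)
        if entry is None:
            return None
        expires_at, answer = entry
        if self.clock() >= expires_at:
            del self._store[k]  # expired: treat as a miss
            return None
        return answer

    def put(self, client_id: str, query: str, answer: str) -> None:
        self._store[self.key(client_id, query)] = (self.clock() + self.ttl, answer)
```

On a miss, the caller runs the normal retrieval and generation path and then calls `put` with the answer.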
4. CloudWatch Log Retention
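Current: 7-day retention
Strategy: Reduce to 3 days for non-critical logs, and cut log volume (e.g. a lower log level) where possible
Calculation:
CloudWatch Logs charges mainly for ingestion ($0.50/GB). Retention only affects stored-log charges ($0.03/GB-month), which are negligible at 7 days.
Current: 5GB/month ingested × $0.50/GB = $2.50
Optimized: 2GB/month ingested × $0.50/GB = $1.00
Savings: $1.50/month, which comes from ingesting less rather than from the shorter retention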
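Retention is set per log group. This sketch uses boto3's CloudWatch Logs `put_retention_policy` call. The log group names are hypothetical placeholders, and 3 is one of the fixed retention values the API accepts. Because the infrastructure is managed by Terraform, the same setting would more naturally live in the `retention_in_days` argument of each `aws_cloudwatch_log_group`, which avoids drift.

```python
# Hypothetical log group names; replace with the project's actual Lambda log groups.
NON_CRITICAL_LOG_GROUPS = [
    "/aws/lambda/example-classify",
    "/aws/lambda/example-ingest",
]


def retention_requests(log_groups: list[str], days: int = 3) -> list[dict]:
    """Build keyword arguments for one put_retention_policy call per log group."""
    return [{"logGroupName": name, "retentionInDays": days} for name in log_groups]


def apply_retention(requests: list[dict]) -> None:
    import boto3  # imported lazily so the request builder works without AWS credentials

    logs = boto3.client("logs")
    for kwargs in requests:
        logs.put_retention_policy(**kwargs)


if __name__ == "__main__":
    for req in retention_requests(NON_CRITICAL_LOG_GROUPS):
        print(req)  # call apply_retention(...) to actually change the policies
```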
Medium-Term Optimizations (Architecture Changes)
1. Bedrock Model Selection
Current: Claude Sonnet 4.5 ($3 input / $15 output per M tokens)
Alternatives:
- Claude Haiku: $0.25 input / $1.25 output (12× cheaper)
- Claude 3.5 Sonnet: $3 input / $15 output (same price, older model)
Use Case: Switch to Haiku for simple classification tasks
Savings: ~$9/month for 1000 classification queries
2. Hybrid Approach for Embeddings
Strategy: Use a smaller or cheaper embedding setup for less critical content
Options:
- Titan Text Embeddings V2 at 512 or 256 dimensions: same per-token price, smaller vectors to store (Titan Text Lite is a text-generation model, not an embedding model)
- Cohere embed-english-light-v3.0: $0.00001/1K tokens (10× cheaper)
Calculation (Cohere option):
Current: 5M tokens/month × $0.0001/1K = $0.50
Optimized: 5M tokens/month × $0.00001/1K = $0.05
Savings: $0.45/month per 5M tokens
Tradeoffs: Lower-quality embeddings may reduce retrieval accuracy
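If the goal is smaller vectors rather than a cheaper model, Titan Text Embeddings V2 takes the output size as a request parameter (256, 512, or 1024 dimensions) at the same per-token price. This sketch builds the `InvokeModel` request and computes the raw float32 storage it implies. The model ID is Bedrock's published ID for Titan V2. The storage helper ignores metadata and index overhead. In a Knowledge Base the dimension is fixed by the vector index, so changing it means re-creating the index and re-ingesting.

```python
import json

TITAN_V2_MODEL_ID = "amazon.titan-embed-text-v2:0"
SUPPORTED_DIMENSIONS = (256, 512, 1024)


def titan_v2_request(text: str, dimensions: int = 512, normalize: bool = True) -> dict:
    """Keyword arguments for bedrock-runtime invoke_model with Titan Text Embeddings V2."""
    if dimensions not in SUPPORTED_DIMENSIONS:
        raise ValueError(f"dimensions must be one of {SUPPORTED_DIMENSIONS}")
    body = {"inputText": text, "dimensions": dimensions, "normalize": normalize}
    return {
        "modelId": TITAN_V2_MODEL_ID,
        "contentType": "application/json",
        "accept": "application/json",
        "body": json.dumps(body),
    }


def raw_vector_bytes(dimensions: int, vectors: int) -> int:
    """float32 storage for the vectors alone (4 bytes per dimension)."""
    return dimensions * vectors * 4


if __name__ == "__main__":
    # boto3.client("bedrock-runtime").invoke_model(**titan_v2_request("hello"))
    for dims in SUPPORTED_DIMENSIONS:
        print(dims, raw_vector_bytes(dims, 1_000_000) / 1e9, "GB per 1M vectors")
```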
Long-Term Optimizations (Major Changes)
1. Reserved Capacity
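When: Consistent high usage (>10K queries/month)
Strategy: Compute Savings Plans (Lambda reserved concurrency only caps a function's scaling and carries no discount)
Compute Savings Plans:
- Commit to a consistent $/hour of compute spend for 1 or 3 years; AWS lists discounts of up to 17% on Lambda
- Requires predictable workload patterns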
2. Multi-Tenancy
When: Multiple clients
Strategy: Share one Bedrock Knowledge Base, with client-level metadata filtering
The system supports multi-tenancy via client-level metadata filtering on the S3 Vectors storage. Each document is tagged with client metadata; project metadata is stored for display but not used for filtering, allowing all documents from a client’s projects to contribute to RAG context.
Benefits:
- Single Knowledge Base for all tenants
- Client-level metadata filtering at query time
- All projects under a client accessible for richer context
- No additional infrastructure cost per tenant
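At query time, the client filter is passed to the Knowledge Base `Retrieve` API as a metadata filter. This sketch builds the request for boto3's `bedrock-agent-runtime` client. The metadata key `client` and the example IDs are assumptions; use whatever key the ingestion pipeline actually writes. Filtering on the client only (not the project) matches the behavior described above.

```python
def retrieve_request(knowledge_base_id: str, query: str, client_id: str,
                     top_k: int = 5) -> dict:
    """Keyword arguments for bedrock-agent-runtime retrieve(), scoped to one client."""
    return {
        "knowledgeBaseId": knowledge_base_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {
                "numberOfResults": top_k,  # 5 matches the context size used in this doc
                # Assumed metadata key; only documents tagged with this client match.
                "filter": {"equals": {"key": "client", "value": client_id}},
            }
        },
    }


# Usage (requires AWS credentials and an existing Knowledge Base):
#   runtime = boto3.client("bedrock-agent-runtime")
#   resp = runtime.retrieve(**retrieve_request("EXAMPLEKBID", "open risks?", "acme"))
#   texts = [r["content"]["text"] for r in resp["retrievalResults"]]
```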
3. On-Premises Hybrid
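When: Very high volume (>100K queries/month)
Strategy: Run embeddings on self-hosted hardware (on-premises or a GPU instance); use Bedrock for generation only
Estimate:
- Self-hosted embeddings: $50/month (GPU instance)
- Bedrock generation at 100K queries: ~$1,065/month for Sonnet plus ~$25 for Haiku query understanding, at the per-query rates above
- Embedding spend avoided: ~$0.50/month for query embeddings at 100K queries, plus $0.50 per 1,000 ingested documents
Savings: Self-hosting pays for the $50 instance only once embedding volume exceeds ~500M tokens/month ($50 ÷ $0.0001/1K), roughly 100K documents/month at 5,000 tokens each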
Cost Comparison with Alternatives
Alternative 1: OpenAI + Third-Party Vector DB
| Component | OpenAI Stack | AWS Bedrock Stack | Difference |
|---|---|---|---|
| LLM (1K queries) | GPT-4: $30 | Claude Sonnet 4.5: $10.66 | -$19.34 |
| Embeddings (1K docs) | text-embedding-3: $0.02 | Titan V2: $0.50 | +$0.48 |
| Vector DB | Pinecone: $70 | S3 Vectors: ~$1 | -$69 |
| Hosting | Vercel: $20 | CloudFront+S3: $1 | -$19 |
| Auth | Auth0: $35 | Cognito: $0 | -$35 |
| Total | $155 | $13 | -$142/month (92% cheaper) |
Alternative 2: Fully Managed (e.g., Mendable, ChatBase)
| Component | Managed SaaS | Self-Hosted AWS | Difference |
|---|---|---|---|
| Platform Fee | $99-399/month | $0 | -$99 to -$399 |
| Infrastructure | Included | $82/month | +$82 |
| Customization | Limited | Full control | N/A |
| Data Privacy | Shared | Isolated | N/A |
| Total | $99-399 | $82 | -$17 to -$317/month |
Alternative 3: Self-Hosted Open Source
| Component | Open Source | AWS Bedrock | Difference |
|---|---|---|---|
| LLM | Llama 3 (g5.2xlarge EC2) | Bedrock: $10.66 | N/A (different model) |
| EC2 Cost | $730/month | $0 | -$730 |
| Embeddings | Self-hosted (included in EC2) | Bedrock: $0.50 | +$0.50 |
| Vector DB | Qdrant (t3.medium + EBS): $38 | S3 Vectors: ~$1 | -$37 |
| Total | $768 | $13 | -$755/month (open source ~5,800% more expensive) |
Note: Open source is expensive due to GPU costs for hosting LLMs. Only cost-effective at massive scale (>1M queries/month).
Cost Monitoring
CloudWatch Billing Alarms
Recommended Alarms:
- Total monthly cost > $200
- Bedrock cost > $50
- Lambda cost > $15 (detect runaway invocations)
- S3 storage cost > $5 (detect unexpected growth)
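Billing alarms watch the `EstimatedCharges` metric in the `AWS/Billing` namespace. That metric is month-to-date, published only in us-east-1, and available only after billing alerts are enabled for the account. This sketch builds `put_metric_alarm` arguments for the thresholds above. The SNS topic ARN is a placeholder. For the per-service alarms, copy the `ServiceName` dimension value from the metric as it appears in the CloudWatch console rather than guessing it.

```python
from typing import Optional


def billing_alarm_params(name: str, threshold_usd: float, sns_topic_arn: str,
                         service_name: Optional[str] = None) -> dict:
    """Keyword arguments for cloudwatch.put_metric_alarm on estimated charges."""
    dimensions = [{"Name": "Currency", "Value": "USD"}]
    if service_name is not None:
        # Per-service alarm; the value must match the ServiceName shown in CloudWatch.
        dimensions.append({"Name": "ServiceName", "Value": service_name})
    return {
        "AlarmName": name,
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",  # month-to-date estimate
        "Dimensions": dimensions,
        "Statistic": "Maximum",
        "Period": 21600,  # 6 hours; billing data is updated a few times a day
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_usd),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(params: dict) -> None:
    import boto3  # lazy import: only needed when actually creating the alarm

    boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)


# Placeholder ARN; e.g. create_alarm(billing_alarm_params("rag-total-over-200", 200, TOPIC))
TOPIC = "arn:aws:sns:us-east-1:123456789012:billing-alerts"
```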
Cost Allocation Tags
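Apply these tags, then activate them as cost allocation tags in the Billing console so they appear in Cost Explorer:
```hcl
tags = {
  Project     = "NorthBuilt-RAG"
  Environment = "production"
  ManagedBy   = "terraform"
  CostCenter  = "engineering"
  Component   = "api" # one of: api, web, storage, compute
}
```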
Monthly Review Checklist
- Review CloudWatch billing dashboard
- Check Bedrock token usage (input/output ratio)
- Verify S3 Vectors storage usage
- Audit Lambda memory usage (can optimize?)
- Review CloudWatch log retention (can reduce?)
- Check for idle resources (unused secrets, old S3 versions)
Break-Even Analysis
When to Self-Host vs Managed
S3 Vectors vs Self-Hosted Vector DB:
- S3 Vectors: ~$1/month (pay-per-use)
- Self-hosted (t3.medium + EBS): $38/month
- Verdict: S3 Vectors is significantly cheaper with zero operational overhead
Bedrock Break-Even:
- Bedrock: $10.66/1K queries
- Self-hosted Llama (g5.2xlarge): $730/month base
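- Break-even: When queries > 68K/month (730 / 10.66 × 1000), assuming one instance can serve that volume
Verdict: On cost alone, Bedrock is better until ~70K queries/month; a self-hosted model also differs in quality and adds operational overhead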
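The break-even rule generalizes to any fixed-cost alternative: divide the fixed monthly cost by the per-query saving. This sketch assumes costs scale linearly with queries and that the fixed instance can absorb the whole load. The optional self-hosted marginal cost is a hypothetical extra term (zero in the estimate above).

```python
import math


def break_even_queries(fixed_monthly: float, managed_per_1k: float,
                       self_hosted_per_1k: float = 0.0) -> float:
    """Monthly queries above which a fixed-cost deployment beats pay-per-use."""
    saving_per_1k = managed_per_1k - self_hosted_per_1k
    if saving_per_1k <= 0:
        return math.inf  # self-hosting never catches up
    return fixed_monthly / saving_per_1k * 1000


if __name__ == "__main__":
    print(round(break_even_queries(730, 10.66)))  # g5.2xlarge vs Bedrock; prints 68480
```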
Cost Forecasting
Growth Projections
Projected cost = $21.26 base (always-on infrastructure excluding Bedrock) + $10.905 per 1K queries + $0.50 per 1K documents ingested.
| Timeframe | Est. Queries/Month | Est. Docs | Projected Cost | Notes |
|---|---|---|---|---|
| Month 1-3 | 500 | 500 | $27 | Initial launch |
| Month 4-6 | 2,000 | 2,000 | $44 | Growing adoption |
| Month 7-12 | 5,000 | 5,000 | $78 | Steady state |
| Year 2 | 10,000 | 10,000 | $135 | Mature product |
| Year 3 | 25,000 | 25,000 | $306 | Scale phase |
Optimization Roadmap
Now (Cost: $66):
- Using cost-effective serverless architecture with S3 Vectors
- On-demand pricing for variable workloads
- Multi-tenancy via metadata filtering
Month 6 (Cost: $55):
- Implement query caching (20% cache hit rate)
- Reduce context window (5 -> 3 documents)
- Optimize Lambda memory
Year 2 (Cost: $100):
- Reserved capacity for Lambda
- Bedrock provisioned throughput for consistent high usage
- Advanced caching strategies
Year 3 (Cost: $200):
- At 25K queries/month, consider hybrid approach
- Self-host embeddings, keep Bedrock for generation
- Multi-region deployment for latency optimization
Last updated: 2026-01-01