Cost Analysis

Comprehensive breakdown of infrastructure costs for the NorthBuilt RAG System.

Note: This system uses S3 Vectors with Bedrock Knowledge Base for vector storage (pay-per-use). See ADR-010 for migration details.

Monthly Cost Breakdown

Base Infrastructure (Always On)

| Service | Component | Monthly Cost | Notes |
| --- | --- | --- | --- |
| S3 Vectors | Knowledge Base storage | Variable | Pay-per-use (~$0.10/1M vectors stored) |
| S3 | Documents bucket | $0.25 | ~10GB documents |
| S3 | Terraform state storage | $0.10 | ~1GB data |
| DynamoDB | Terraform state locks | $0.01 | On-demand, minimal usage |
| DynamoDB | Classify table | $0.25 | On-demand, ~1000 writes/month |
| Secrets Manager | 4 secrets | $1.60 | $0.40/secret/month |
| CloudFront | Distribution | $1.00 | First 1TB free, minimal overage |
| S3 | Web hosting | $0.05 | ~500MB static assets |
| Cognito | User pool | $0.00 | First 50K MAU free |
| API Gateway | HTTP API | $1.00 | $1/million requests (est. 10K/month) |
| Lambda | Chat function (reserved) | $10.00 | Reserved concurrency |
| Lambda | Other functions | $5.00 | On-demand |
| CloudWatch | Logs (7-day retention) | $2.00 | ~5GB/month |
| Bedrock | Inference + embeddings | $45.00 | See usage breakdown below |
| TOTAL BASE | | ~$66/month | Pay-per-use model |

Usage-Based Costs

Bedrock Inference Costs

Claude Sonnet 4.5 Pricing

  • Input: $3.00 per million tokens
  • Output: $15.00 per million tokens

Claude Haiku Pricing

  • Input: $0.25 per million tokens
  • Output: $1.25 per million tokens

Titan Embeddings V2 Pricing

  • Input: $0.0001 per 1000 tokens

Example Monthly Usage (1000 queries)

Queries: 1000/month
Average query: 50 tokens
Average context: 2000 tokens (5 documents × 400 tokens each)
Average response: 300 tokens

Claude Sonnet Costs (response generation):
- Input: (50 + 2000) × 1000 = 2.05M tokens × $3/M = $6.15
- Output: 300 × 1000 = 0.3M tokens × $15/M = $4.50
- Total Claude Sonnet: $10.65/month

Claude Haiku Costs (query understanding):
- Input: 500 tokens × 1000 = 0.5M tokens × $0.25/M = $0.125
- Output: 100 tokens × 1000 = 0.1M tokens × $1.25/M = $0.125
- Total Claude Haiku: $0.25/month

Titan Costs (for retrieval):
- Embeddings: 50 tokens × 1000 queries = 50K tokens × $0.0001/1K = $0.005
- Total Titan: $0.01/month (negligible)

Total Bedrock: $10.91/month for 1000 queries
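
The arithmetic in this example can be reproduced with a small calculator. This is a minimal sketch rather than part of the deployed system: `UsageProfile` and `bedrock_monthly_cost` are hypothetical names, and the rates are the ones quoted in this section.

```python
from dataclasses import dataclass

# USD per million tokens, as quoted in this section.
SONNET_INPUT_PER_M = 3.00
SONNET_OUTPUT_PER_M = 15.00
HAIKU_INPUT_PER_M = 0.25
HAIKU_OUTPUT_PER_M = 1.25
TITAN_PER_M = 0.10  # equivalent to $0.0001 per 1K tokens


@dataclass
class UsageProfile:
    """Hypothetical container for the per-query token assumptions."""
    queries: int = 1000
    query_tokens: int = 50
    context_tokens: int = 2000
    response_tokens: int = 300
    qu_input_tokens: int = 500   # query understanding prompt
    qu_output_tokens: int = 100  # query understanding JSON


def bedrock_monthly_cost(p: UsageProfile) -> dict:
    """Return unrounded monthly Bedrock cost per model, in USD."""
    per_m = 1_000_000
    sonnet = (
        (p.query_tokens + p.context_tokens) * p.queries / per_m * SONNET_INPUT_PER_M
        + p.response_tokens * p.queries / per_m * SONNET_OUTPUT_PER_M
    )
    haiku = (
        p.qu_input_tokens * p.queries / per_m * HAIKU_INPUT_PER_M
        + p.qu_output_tokens * p.queries / per_m * HAIKU_OUTPUT_PER_M
    )
    titan = p.query_tokens * p.queries / per_m * TITAN_PER_M
    return {"sonnet": sonnet, "haiku": haiku, "titan": titan,
            "total": sonnet + haiku + titan}


if __name__ == "__main__":
    for name, usd in bedrock_monthly_cost(UsageProfile()).items():
        print(f"{name}: ${usd:.3f}")
```

With the default profile the unrounded total is about $10.905. The $10.91 figure comes from rounding the Titan line up to $0.01 before summing.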

Query Understanding Cost Breakdown

Query understanding extracts client filters from natural language queries using Claude Haiku for cost efficiency.

Per Query Cost

Input tokens: ~500 (query + entity list + prompt)
Output tokens: ~100 (structured JSON response)
Cost per query: $0.00025 (Claude Haiku)

Monthly Cost by Volume

| Monthly Queries | Haiku Input | Haiku Output | Total QU Cost |
| --- | --- | --- | --- |
| 100 | $0.01 | $0.01 | $0.02 |
| 1,000 | $0.13 | $0.13 | $0.25 |
| 5,000 | $0.63 | $0.63 | $1.25 |
| 10,000 | $1.25 | $1.25 | $2.50 |
| 50,000 | $6.25 | $6.25 | $12.50 |

Why Claude Haiku?

  • 12x cheaper than Claude Sonnet for entity extraction
  • Fast response time (~200ms)
  • Structured JSON output is reliable
  • Entity extraction doesn’t require Sonnet’s reasoning capabilities

Scaling Examples (including Query Understanding)

| Monthly Queries | Sonnet (Generation) | Haiku (QU) | Titan | Total Bedrock |
| --- | --- | --- | --- | --- |
| 100 | $1.07 | $0.02 | $0.00 | $1.09 |
| 1,000 | $10.65 | $0.25 | $0.01 | $10.91 |
| 5,000 | $53.25 | $1.25 | $0.03 | $54.53 |
| 10,000 | $106.50 | $2.50 | $0.05 | $109.05 |
| 50,000 | $532.50 | $12.50 | $0.25 | $545.25 |

Document Ingestion Costs

Per Document

Document size: 5000 tokens (typical)
Chunks: 10 chunks × 500 tokens each

Titan Embedding Costs:
- 10 embeddings × 500 tokens = 5000 tokens
- Cost: 5000 tokens × $0.0001/1K = $0.0005

S3 Vectors Storage:
- Minimal storage cost (~$0.10/1M vectors)

Total per document: $0.0005 (negligible)

Monthly Ingestion Examples

| Documents/Month | Embeddings | Total Cost |
| --- | --- | --- |
| 100 | 500K tokens | $0.05 |
| 1,000 | 5M tokens | $0.50 |
| 10,000 | 50M tokens | $5.00 |
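
The ingestion figures follow directly from the per-document estimate. A minimal sketch, assuming the 5000-token typical document and the Titan rate quoted above; `ingestion_cost` is a hypothetical helper name, and S3 Vectors storage is left out because the text treats it as negligible.

```python
TITAN_USD_PER_1K_TOKENS = 0.0001  # rate quoted in this document


def ingestion_cost(documents: int, tokens_per_document: int = 5000) -> float:
    """Estimate monthly Titan embedding cost (USD) for newly ingested documents.

    Chunking does not change the total: 10 chunks x 500 tokens is still
    5000 embedded tokens per document.
    """
    total_tokens = documents * tokens_per_document
    return total_tokens / 1000 * TITAN_USD_PER_1K_TOKENS


if __name__ == "__main__":
    for docs in (100, 1_000, 10_000):
        print(f"{docs:>6} docs/month: ${ingestion_cost(docs):.2f}")
```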

Combined Monthly Estimates

| Usage Profile | Base | Bedrock (Sonnet + Titan, excl. Haiku QU) | Ingestion | Total |
| --- | --- | --- | --- | --- |
| Development (100 queries, 100 docs) | $136.41 | $1.07 | $0.05 | $137.53 |
| Light Production (1K queries, 1K docs) | $136.41 | $10.66 | $0.50 | $147.57 |
| Medium Production (5K queries, 5K docs) | $136.41 | $53.28 | $2.50 | $192.19 |
| Heavy Production (10K queries, 10K docs) | $136.41 | $106.55 | $5.00 | $247.96 |
| Enterprise (50K queries, 50K docs) | $136.41 | $532.75 | $25.00 | $694.16 |

Cost Optimization Strategies

Short-Term Optimizations (No Architecture Changes)

1. Optimize Lambda Memory

Current: Chat Lambda = 1024MB

Strategy:

  • Profile memory usage via CloudWatch
  • Reduce to 512MB if under-utilized
  • Lambda pricing: $0.0000166667 per GB-second

Calculation:

Current: 1024MB (1 GB) × 3s × 1000 invocations = 3000 GB-seconds = $0.05
Optimized: 512MB (0.5 GB) × 3s × 1000 invocations = 1500 GB-seconds = $0.025

Savings: $0.025/1000 invocations (~$0.25/month at 10K queries)
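
Lambda compute is billed in GB-seconds, so memory in MB is divided by 1024 before multiplying by duration and invocation count. The sketch below applies the rate quoted above; `lambda_compute_cost` is a hypothetical helper, the 3-second duration is the document's assumption, and per-request charges are not included.

```python
LAMBDA_USD_PER_GB_SECOND = 0.0000166667  # rate quoted in this document


def lambda_compute_cost(memory_mb: int, duration_s: float, invocations: int) -> float:
    """Return Lambda compute cost in USD (duration charges only)."""
    gb_seconds = (memory_mb / 1024) * duration_s * invocations
    return gb_seconds * LAMBDA_USD_PER_GB_SECOND


if __name__ == "__main__":
    current = lambda_compute_cost(1024, 3, 1000)
    optimized = lambda_compute_cost(512, 3, 1000)
    print(f"current:   ${current:.4f} per 1000 invocations")
    print(f"optimized: ${optimized:.4f} per 1000 invocations")
    print(f"savings at 10K queries: ${(current - optimized) * 10:.2f}/month")
```

Halving memory only halves cost if duration stays the same. A function that runs slower at 512MB will recover less than this estimate.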

2. Reduce Context Window

Current: 5 documents × 400 tokens = 2000 tokens

Strategy:

  • Reduce to 3 documents = 1200 tokens
  • Improves response time
  • Reduces token costs

Calculation:

Current: 2050 tokens input × $3/M × 1000 = $6.15
Optimized: 1250 tokens input × $3/M × 1000 = $3.75

Savings: $2.40/1000 queries (~$2.40/month at 1K queries)

3. Implement Response Caching

Strategy:

  • Cache identical queries for 1 hour
  • Estimate 20% cache hit rate
  • Store in DynamoDB or ElastiCache

Calculation:

Queries: 1000/month
Cache hits: 200/month (20%)
Saved Bedrock calls: 200 × $0.01 = $2.00

DynamoDB storage cost: 1MB × $0.25/GB = negligible

Savings: ~$2/month at 1K queries, scales linearly
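
The one-hour cache for identical queries could look roughly like the following. This is an in-memory sketch only: `QueryCache` is a hypothetical class standing in for the DynamoDB or ElastiCache store, keys are exact-match hashes of the query string, and the `clock` parameter exists so expiry can be exercised without waiting.

```python
import hashlib
import time
from typing import Callable, Dict, Tuple


class QueryCache:
    """Exact-match response cache with a fixed time-to-live."""

    def __init__(self, ttl_seconds: float = 3600,
                 clock: Callable[[], float] = time.time) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query: str) -> str:
        # Hash the exact query text; only identical queries share an entry.
        return hashlib.sha256(query.encode("utf-8")).hexdigest()

    def get_or_compute(self, query: str, compute: Callable[[str], str]) -> str:
        """Return a cached answer if still fresh, otherwise call `compute`."""
        key = self._key(query)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        self.misses += 1
        answer = compute(query)  # e.g. the Bedrock call in the real system
        self._entries[key] = (now, answer)
        return answer


if __name__ == "__main__":
    cache = QueryCache()
    slow_answer = lambda q: f"answer to {q!r}"
    cache.get_or_compute("What is the project status?", slow_answer)
    cache.get_or_compute("What is the project status?", slow_answer)
    print(f"hits={cache.hits} misses={cache.misses}")
```

Each hit avoids one full Bedrock round trip, which is where the estimated saving per cached query comes from. In a multi-tenant deployment the cache key would also need to include the client filter so that answers are not shared across tenants.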

4. CloudWatch Log Retention

Current: 7-day retention

Strategy: Reduce to 3 days for non-critical logs

Calculation:

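Current: 5GB/month × $0.50/GB = $2.50
Optimized: 2GB/month × $0.50/GB = $1.00

Savings: $1.50/month, assuming logged volume also drops to ~2GB/month ($0.50/GB is the ingestion rate, which shorter retention alone does not reduce)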

Medium-Term Optimizations (Architecture Changes)

1. Bedrock Model Selection

Current: Claude Sonnet 4.5 ($3 input / $15 output per M tokens)

Alternatives:

  • Claude Haiku: $0.25 input / $1.25 output (12× cheaper)
  • Claude 3.5 Sonnet: $3 input / $15 output (same price, older model)

Use Case: Switch to Haiku for simple classification tasks

Savings: ~$9/month for 1000 classification queries

2. Hybrid Approach for Embeddings

Strategy: Use a smaller embedding model for less critical content

Options:

  • Titan Text Lite (512-dim): Cheaper (hypothetical)
  • Cohere embed-english-light-v3.0: $0.00001/1K tokens (10× cheaper)

Calculation:

Current: 5M tokens/month × $0.0001/1K = $0.50
Optimized: 5M tokens/month × $0.00001/1K = $0.05

Savings: $0.45/month per 5M tokens

Tradeoffs: Lower-quality embeddings may reduce retrieval accuracy

Long-Term Optimizations (Major Changes)

1. Reserved Capacity

When: Consistent high usage (>10K queries/month)

Strategy: Reserved Lambda concurrency

Compute Savings Plans (which cover Lambda):

  • Commit to consistent usage for a discount on compute (up to 17% on Lambda under AWS's published terms)
  • Requires predictable workload patterns

2. Multi-Tenancy

When: Multiple clients

Strategy: Share Bedrock Knowledge Base with client-level metadata filtering

The system supports multi-tenancy via client-level metadata filtering on the S3 Vectors storage. Each document is tagged with client metadata; project metadata is stored for display but not used for filtering, allowing all documents from a client’s projects to contribute to RAG context.

Benefits:

  • Single Knowledge Base for all tenants
  • Client-level metadata filtering at query time
  • All projects under a client accessible for richer context
  • No additional infrastructure cost per tenant
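
Client-level filtering at query time can be expressed as a metadata filter on the Knowledge Base retrieve call. The sketch below only builds the request payload for the `bedrock-agent-runtime` `retrieve` operation. The metadata key name `client`, the helper name `build_retrieve_request`, and the example IDs are assumptions, since the actual key used by this system is not stated here.

```python
def build_retrieve_request(knowledge_base_id: str, query: str, client_id: str,
                           top_k: int = 5, metadata_key: str = "client") -> dict:
    """Build keyword arguments for a client-scoped Knowledge Base retrieval.

    `metadata_key` is an assumed name for the client tag on each document.
    Project metadata is deliberately not part of the filter, so every
    project under the client can contribute context.
    """
    return {
        "knowledgeBaseId": knowledge_base_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {
                "numberOfResults": top_k,
                "filter": {"equals": {"key": metadata_key, "value": client_id}},
            }
        },
    }


if __name__ == "__main__":
    request = build_retrieve_request("KB_ID_PLACEHOLDER", "What is the rollout plan?", "acme")
    print(request)
    # To send it (requires AWS credentials and boto3):
    #   import boto3
    #   boto3.client("bedrock-agent-runtime").retrieve(**request)
```

Because the filter is applied per request, adding a tenant means tagging its documents rather than provisioning another Knowledge Base.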

3. On-Premises Hybrid

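When: Very high volume (>100K queries/month)

Strategy: Run embeddings on-premises, Bedrock for generation only

Estimate:

  • Self-hosted embeddings: $50/month (GPU instance)
  • Bedrock generation only: ~$1,065/month at 100K queries (100 × $10.65 at the Sonnet rates above)

Savings: limited to the Titan embedding spend this replaces, which is about $0.50/month for 100K query embeddings plus $0.0005 per ingested document. At these rates the $50/month instance pays for itself only above roughly 500M embedded tokens per month.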

Cost Comparison with Alternatives

Alternative 1: OpenAI + Third-Party Vector DB

| Component | OpenAI Stack | AWS Bedrock Stack | Difference |
| --- | --- | --- | --- |
| LLM (1K queries) | GPT-4: $30 | Claude Sonnet 4.5: $10.66 | -$19.34 |
| Embeddings (1K docs) | text-embedding-3: $0.02 | Titan V2: $0.50 | +$0.48 |
| Vector DB | Pinecone: $70 | S3 Vectors: ~$1 | -$69 |
| Hosting | Vercel: $20 | CloudFront+S3: $1 | -$19 |
| Auth | Auth0: $35 | Cognito: $0 | -$35 |
| Total | $155 | $13 | -$142/month (92% cheaper) |

Alternative 2: Fully Managed (e.g., Mendable, ChatBase)

| Component | Managed SaaS | Self-Hosted AWS | Difference |
| --- | --- | --- | --- |
| Platform Fee | $99-399/month | $0 | -$99 to -$399 |
| Infrastructure | Included | $82/month | +$82 |
| Customization | Limited | Full control | N/A |
| Data Privacy | Shared | Isolated | N/A |
| Total | $99-399 | $82 | -$17 to -$317/month |

Alternative 3: Self-Hosted Open Source

| Component | Open Source | AWS Bedrock | Difference |
| --- | --- | --- | --- |
| LLM | Llama 3 (g5.2xlarge EC2) | Bedrock: $10.66 | N/A (different model) |
| EC2 Cost | $730/month | $0 | -$730 |
| Embeddings | Self-hosted | Bedrock: $0.50 | +$0.50 |
| Vector DB | Qdrant (t3.medium) | S3 Vectors: ~$1 | -$37 |
| Total | $768 | $13 | -$755/month (open source is ~5,800% more expensive) |

Note: Open source is expensive due to GPU costs for hosting LLMs. Only cost-effective at massive scale (>1M queries/month).

Cost Monitoring

CloudWatch Billing Alarms

Recommended Alarms:

  1. Total monthly cost > $200
  2. Bedrock cost > $50
  3. Lambda cost > $15 (detect runaway invocations)
  4. S3 storage cost > $5 (detect unexpected growth)
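
These alarms map onto the `AWS/Billing` `EstimatedCharges` metric, which CloudWatch publishes in us-east-1 once billing alerts are enabled on the account. The sketch below builds parameter sets for CloudWatch's `put_metric_alarm`. The helper name, the alarm names, and the `ServiceName` dimension values (especially the Bedrock one) are assumptions to verify against the metrics actually present in the account.

```python
from typing import Optional


def billing_alarm_params(name: str, threshold_usd: float,
                         service_name: Optional[str] = None) -> dict:
    """Build put_metric_alarm keyword arguments for an estimated-charges alarm."""
    dimensions = [{"Name": "Currency", "Value": "USD"}]
    if service_name is not None:
        # Per-service charges are published under a ServiceName dimension.
        dimensions.append({"Name": "ServiceName", "Value": service_name})
    return {
        "AlarmName": name,
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": dimensions,
        "Statistic": "Maximum",
        "Period": 21600,  # 6 hours; billing metrics update a few times a day
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_usd),
        "ComparisonOperator": "GreaterThanThreshold",
    }


# Alarm names and ServiceName values below are illustrative assumptions.
ALARMS = [
    billing_alarm_params("rag-total-monthly-cost", 200),
    billing_alarm_params("rag-bedrock-cost", 50, service_name="AmazonBedrock"),
    billing_alarm_params("rag-lambda-cost", 15, service_name="AWSLambda"),
    billing_alarm_params("rag-s3-cost", 5, service_name="AmazonS3"),
]

if __name__ == "__main__":
    for alarm in ALARMS:
        print(alarm["AlarmName"], alarm["Threshold"])
    # To create them (requires AWS credentials and boto3):
    #   import boto3
    #   cw = boto3.client("cloudwatch", region_name="us-east-1")
    #   for alarm in ALARMS:
    #       cw.put_metric_alarm(**alarm)
```

`EstimatedCharges` is account-wide, so in a shared account these thresholds cover more than this project. The cost allocation tags are what allow a per-project view.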

Cost Allocation Tags

Apply These Tags:

tags = {
  Project     = "NorthBuilt-RAG"
  Environment = "production"
  ManagedBy   = "terraform"
  CostCenter  = "engineering"
  Component   = "api" # set per resource, one of: api, web, storage, compute
}

Monthly Review Checklist

  • Review CloudWatch billing dashboard
  • Check Bedrock token usage (input/output ratio)
  • Verify S3 Vectors storage usage
  • Audit Lambda memory usage (can optimize?)
  • Review CloudWatch log retention (can reduce?)
  • Check for idle resources (unused secrets, old S3 versions)

Break-Even Analysis

When to Self-Host vs Managed

S3 Vectors vs Self-Hosted Vector DB:

  • S3 Vectors: ~$1/month (pay-per-use)
  • Self-hosted (t3.medium + EBS): $38/month
  • Verdict: S3 Vectors is significantly cheaper with zero operational overhead

Bedrock Break-Even:

  • Bedrock: $10.66/1K queries
  • Self-hosted Llama (g5.2xlarge): $730/month base
  • Break-even: When queries > 68K/month (730 / 10.66 * 1000)

Verdict: Bedrock is better until ~70K queries/month
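
The break-even point is the query volume at which usage-based Bedrock spend equals the fixed instance cost. A minimal sketch using the two figures above; `break_even_queries` is a hypothetical helper, and it ignores operational overhead and any throughput limit of a single instance.

```python
def break_even_queries(fixed_monthly_usd: float,
                       bedrock_usd_per_1k_queries: float) -> float:
    """Monthly queries at which Bedrock spend equals the fixed self-hosting cost."""
    return fixed_monthly_usd / bedrock_usd_per_1k_queries * 1000


def cheaper_option(monthly_queries: int, fixed_monthly_usd: float = 730.0,
                   bedrock_usd_per_1k_queries: float = 10.66) -> str:
    """Return which option costs less at a given volume, on raw compute alone."""
    bedrock = monthly_queries / 1000 * bedrock_usd_per_1k_queries
    return "bedrock" if bedrock <= fixed_monthly_usd else "self-hosted"


if __name__ == "__main__":
    print(f"break-even: {break_even_queries(730, 10.66):,.0f} queries/month")
    for volume in (10_000, 70_000, 100_000):
        print(volume, cheaper_option(volume))
```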

Cost Forecasting

Growth Projections

| Timeframe | Est. Queries/Month | Est. Docs | Projected Cost | Notes |
| --- | --- | --- | --- | --- |
| Month 1-3 | 500 | 500 | $141 | Initial launch |
| Month 4-6 | 2,000 | 2,000 | $161 | Growing adoption |
| Month 7-12 | 5,000 | 5,000 | $192 | Steady state |
| Year 2 | 10,000 | 10,000 | $248 | Mature product |
| Year 3 | 25,000 | 25,000 | $401 | Scale phase |

Optimization Roadmap

Now (Cost: $66):

  • Using cost-effective serverless architecture with S3 Vectors
  • On-demand pricing for variable workloads
  • Multi-tenancy via metadata filtering

Month 6 (Cost: $55):

  • Implement query caching (20% cache hit rate)
  • Reduce context window (5 -> 3 documents)
  • Optimize Lambda memory

Year 2 (Cost: $100):

  • Reserved capacity for Lambda
  • Bedrock provisioned throughput for consistent high usage
  • Advanced caching strategies

Year 3 (Cost: $200):

  • At 25K queries/month, consider hybrid approach
  • Self-host embeddings, keep Bedrock for generation
  • Multi-region deployment for latency optimization

Last updated: 2026-01-01