Cost Analysis

Comprehensive breakdown of infrastructure costs for the NorthBuilt RAG System.

Note: This system uses S3 Vectors with Bedrock Knowledge Base for vector storage (pay-per-use). See ADR-010 for migration details.

Monthly Cost Breakdown

Base Infrastructure (Always On)

Service	Component	Monthly Cost	Notes
S3 Vectors	Knowledge Base storage	Variable	Pay-per-use (~$0.10/1M vectors stored)
S3	Documents bucket	$0.25	~10GB documents
S3	Terraform state storage	$0.10	~1GB data
DynamoDB	Terraform state locks	$0.01	On-demand, minimal usage
DynamoDB	Main table (nb-rag-sys)	$0.25	On-demand, ~1000 writes/month
Secrets Manager	4 secrets	$1.60	$0.40/secret/month
CloudFront	Distribution	$1.00	First 1TB free, minimal overage
S3	Web hosting	$0.05	~500MB static assets
Cognito	User pool	$0.00	First 50K MAU free
API Gateway	HTTP API	$1.00	$1/million requests (est. 10K/month)
Lambda	Chat function (reserved)	$10.00	Reserved concurrency
Lambda	Other functions	$5.00	On-demand
CloudWatch	Logs (7-day retention)	$2.00	~5GB/month
Bedrock	Inference + embeddings	$45.00	See usage breakdown below

TOTAL BASE		~$66/month	Pay-per-use model; sum of the rows above ($66.26), including the $45.00 Bedrock estimate

Usage-Based Costs

Bedrock Inference Costs

Claude Sonnet 4.5 Pricing

Input: $3.00 per million tokens
Output: $15.00 per million tokens

Titan Embeddings V2 Pricing

Input: $0.0001 per 1000 tokens

Example Monthly Usage (1000 queries)

Queries: 1000/month
Average query: 50 tokens
Average context: 2000 tokens (5 documents × 400 tokens each)
Average response: 300 tokens

Claude Sonnet Costs (response generation):
- Input: (50 + 2000) × 1000 = 2.05M tokens × $3/M = $6.15
- Output: 300 × 1000 = 0.3M tokens × $15/M = $4.50
- Total Claude Sonnet: $10.65/month

Claude Haiku Costs (query understanding):
- Input: 500 tokens × 1000 = 0.5M tokens × $0.25/M = $0.125
- Output: 100 tokens × 1000 = 0.1M tokens × $1.25/M = $0.125
- Total Claude Haiku: $0.25/month

Titan Costs (for retrieval):
- Embeddings: 50 tokens × 1000 queries = 50K tokens × $0.0001/1K = $0.005
- Total Titan: ~$0.01/month ($0.005 rounded up; negligible)

Total Bedrock: $10.91/month for 1000 queries
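The arithmetic above can be sketched as a small cost model. This is a minimal illustration that uses only the prices and average token counts quoted in this section; the names `QueryProfile` and `bedrock_monthly_cost` are hypothetical and not part of the system. It returns unrounded values, so 1000 queries come out at $10.905, which this section rounds to $10.91.

```python
from dataclasses import dataclass

# Prices in USD per million tokens, copied from the figures in this section.
SONNET_IN, SONNET_OUT = 3.00, 15.00
HAIKU_IN, HAIKU_OUT = 0.25, 1.25
TITAN_IN = 0.10  # $0.0001 per 1K tokens is $0.10 per 1M tokens


@dataclass(frozen=True)
class QueryProfile:
    """Average token counts per query (hypothetical container for this sketch)."""
    query_tokens: int = 50
    context_tokens: int = 2000      # 5 documents x 400 tokens
    response_tokens: int = 300
    qu_input_tokens: int = 500      # query understanding (Haiku) input
    qu_output_tokens: int = 100     # query understanding (Haiku) output


def bedrock_monthly_cost(queries: int, p: QueryProfile = QueryProfile()) -> dict:
    """Return the unrounded monthly Bedrock cost per model, in USD."""
    millions_of_queries = queries / 1_000_000
    sonnet = millions_of_queries * (
        (p.query_tokens + p.context_tokens) * SONNET_IN
        + p.response_tokens * SONNET_OUT
    )
    haiku = millions_of_queries * (
        p.qu_input_tokens * HAIKU_IN + p.qu_output_tokens * HAIKU_OUT
    )
    titan = millions_of_queries * p.query_tokens * TITAN_IN
    return {"sonnet": sonnet, "haiku": haiku, "titan": titan,
            "total": sonnet + haiku + titan}


# Print the unrounded breakdown for a few monthly volumes.
for n in (100, 1_000, 10_000):
    print(n, bedrock_monthly_cost(n))
```

Passing a different `QueryProfile` (for example `context_tokens=1200` for three documents) shows how a smaller context changes the Sonnet input cost without touching the other terms.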

Query Understanding Cost Breakdown

Query understanding extracts client filters from natural language queries using Claude Haiku for cost efficiency.

Per Query Cost

Input tokens: ~500 (query + entity list + prompt)
Output tokens: ~100 (structured JSON response)
Cost per query: $0.00025 (Claude Haiku)

Monthly Cost by Volume

Monthly Queries	Haiku Input	Haiku Output	Total QU Cost
100	$0.01	$0.01	$0.02
1,000	$0.13	$0.13	$0.25
5,000	$0.63	$0.63	$1.25
10,000	$1.25	$1.25	$2.50
50,000	$6.25	$6.25	$12.50

Why Claude Haiku?

12x cheaper than Claude Sonnet for entity extraction
Fast response time (~200ms)
Structured JSON output is reliable
Entity extraction doesn’t require Sonnet’s reasoning capabilities

Scaling Examples (including Query Understanding)

Monthly Queries	Sonnet (Generation)	Haiku (QU)	Titan	Total Bedrock
100	$1.07	$0.02	$0.00	$1.09
1,000	$10.65	$0.25	$0.01	$10.91
5,000	$53.25	$1.25	$0.03	$54.53
10,000	$106.50	$2.50	$0.05	$109.05
50,000	$532.50	$12.50	$0.25	$545.25

Document Ingestion Costs

Per Document

Document size: 5000 tokens (typical)
Chunks: 10 chunks × 500 tokens each

Titan Embedding Costs:
- 10 embeddings × 500 tokens = 5000 tokens
- Cost: 5000 tokens × $0.0001/1K = $0.0005

S3 Vectors Storage:
- Minimal storage cost (~$0.10/1M vectors)

Total per document: $0.0005 (negligible)

Monthly Ingestion Examples

Documents/Month	Embeddings	Total Cost
100	500K tokens	$0.05
1,000	5M tokens	$0.50
10,000	50M tokens	$5.00
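The ingestion figures follow directly from the token count per document. A minimal sketch, assuming non-overlapping chunks and the Titan rate quoted in this section; `ingestion_cost` is a hypothetical helper name, and S3 Vectors storage is left out because it is negligible at these volumes.

```python
import math

TITAN_USD_PER_1K_TOKENS = 0.0001  # Titan Embeddings V2 rate quoted in this section


def ingestion_cost(documents: int, tokens_per_doc: int = 5000,
                   chunk_tokens: int = 500,
                   usd_per_1k: float = TITAN_USD_PER_1K_TOKENS) -> dict:
    """Estimate embedding volume and cost for a batch of documents.

    Assumes chunks do not overlap, so embedded tokens equal document tokens.
    """
    chunks_per_doc = math.ceil(tokens_per_doc / chunk_tokens)
    total_tokens = documents * tokens_per_doc
    return {
        "chunks": documents * chunks_per_doc,
        "tokens": total_tokens,
        "usd": total_tokens / 1000 * usd_per_1k,
    }


# Print the estimate for the three monthly volumes in the table above.
for docs in (100, 1_000, 10_000):
    print(docs, ingestion_cost(docs))
```

If chunks overlap, the embedded token count (and the cost) grows by roughly the overlap fraction, which this sketch does not model.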

Combined Monthly Estimates

Usage Profile	Base	Bedrock (Sonnet + Titan, excl. Haiku QU)	Ingestion	Total
Development (100 queries, 100 docs)	$136.41	$1.07	$0.05	$137.53
Light Production (1K queries, 1K docs)	$136.41	$10.66	$0.50	$147.57
Medium Production (5K queries, 5K docs)	$136.41	$53.28	$2.50	$192.19
Heavy Production (10K queries, 10K docs)	$136.41	$106.55	$5.00	$247.96
Enterprise (50K queries, 50K docs)	$136.41	$532.75	$25.00	$694.16

Cost Optimization Strategies

Short-Term Optimizations (No Architecture Changes)

1. Optimize Lambda Memory

Current: Chat Lambda = 1024MB

Strategy:

Profile memory usage via CloudWatch
Reduce to 512MB if under-utilized
Lambda pricing: $0.0000166667 per GB-second

Calculation:

Current: 1 GB (1024MB) × 3s × 1000 invocations = 3,000 GB-seconds ≈ $0.05
Optimized: 0.5 GB (512MB) × 3s × 1000 invocations = 1,500 GB-seconds ≈ $0.025

Savings: ~$0.025/1000 invocations (~$0.25/month at 10K queries)
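The GB-second arithmetic is easy to get wrong because Lambda bills in GB, not MB: 1024MB is exactly 1 GB, so 1024MB × 3s × 1000 invocations is 3,000 GB-seconds. A minimal sketch, assuming the duration stays at 3 seconds after the memory change (in practice a smaller allocation can run slower) and ignoring the separate per-request charge; `lambda_compute_cost` is a hypothetical helper name.

```python
LAMBDA_USD_PER_GB_SECOND = 0.0000166667  # rate quoted in this section


def lambda_compute_cost(memory_mb: int, duration_s: float, invocations: int):
    """Return (gb_seconds, usd) for the compute portion of a Lambda bill."""
    gb_seconds = (memory_mb / 1024) * duration_s * invocations
    return gb_seconds, gb_seconds * LAMBDA_USD_PER_GB_SECOND


current_gbs, current_usd = lambda_compute_cost(1024, 3, 1000)
optimized_gbs, optimized_usd = lambda_compute_cost(512, 3, 1000)
print(current_gbs, optimized_gbs, current_usd - optimized_usd)
```

Because duration often rises when memory (and with it CPU share) is reduced, it is worth measuring the real duration at 512MB before assuming the full saving.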

2. Reduce Context Window

Current: 5 documents × 400 tokens = 2000 tokens

Strategy:

Reduce to 3 documents = 1200 tokens
Improves response time
Reduces token costs

Calculation:

Current: 2050 tokens input × $3/M × 1000 = $6.15
Optimized: 1250 tokens input × $3/M × 1000 = $3.75

Savings: $2.40/1000 queries (~$2.40/month at 1K queries)

3. Implement Response Caching

Strategy:

Cache identical queries for 1 hour
Estimate 20% cache hit rate
Store in DynamoDB or ElastiCache

Calculation:

Queries: 1000/month
Cache hits: 200/month (20%)
Saved Bedrock calls: 200 × ~$0.01 per query ($10.91 / 1000 queries) ≈ $2.00

DynamoDB storage cost: 1MB × $0.25/GB = negligible

Savings: ~$2/month at 1K queries, scales linearly
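The caching strategy can be sketched as a keyed store with a one-hour expiry. This is an in-memory stand-in, not the DynamoDB or ElastiCache implementation; the class name `ResponseCache`, the case and whitespace normalization, and the inclusion of a client identifier in the key are all assumptions made for illustration. Keying on the client is suggested because the system filters documents per client, so an answer cached for one client should not be served to another.

```python
import hashlib
import time
from typing import Callable, Optional


class ResponseCache:
    """In-memory TTL cache for generated answers (illustrative stand-in)."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (expires_at, response)
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(client_id: str, query: str) -> str:
        # Assumption: queries differing only in case or spacing are "identical".
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(f"{client_id}\n{normalized}".encode("utf-8")).hexdigest()

    def get(self, client_id: str, query: str) -> Optional[str]:
        k = self.key(client_id, query)
        entry = self._store.get(k)
        if entry is not None and entry[0] > self._clock():
            self.hits += 1
            return entry[1]
        self._store.pop(k, None)  # drop an expired entry, if any
        self.misses += 1
        return None

    def put(self, client_id: str, query: str, response: str) -> None:
        self._store[self.key(client_id, query)] = (self._clock() + self._ttl, response)
```

A caller would check `get` before invoking Bedrock and call `put` with the generated answer on a miss; the `hits` and `misses` counters give the observed hit rate to compare against the 20% estimate.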

4. CloudWatch Log Retention

Current: 7-day retention

Strategy: Reduce to 3 days for non-critical logs

Calculation:

Current: 5GB/month × $0.50/GB = $2.50
Optimized: 2GB/month × $0.50/GB = $1.00

Savings: $1.50/month

Medium-Term Optimizations (Architecture Changes)

1. Bedrock Model Selection

Current: Claude Sonnet 4.5 ($3 input / $15 output per M tokens)

Alternatives:

Claude Haiku: $0.25 input / $1.25 output (12× cheaper)
Claude Sonnet 3.5: $3 input / $15 output (same price, older model)

Use Case: Switch to Haiku for simple classification tasks

Savings: ~$9/month for 1000 classification queries

2. Hybrid Approach for Embeddings

Strategy: Use a smaller embedding model for less critical content

Options:

Titan Text Lite (512-dim): Cheaper (hypothetical)
Cohere embed-english-light-v3.0: $0.00001/1K tokens (10× cheaper)

Calculation:

Current: 5M tokens/month × $0.0001/1K = $0.50
Optimized: 5M tokens/month × $0.00001/1K = $0.05

Savings: $0.45/month per 5M tokens

Tradeoffs: Lower-quality embeddings may reduce retrieval accuracy

Long-Term Optimizations (Major Changes)

1. Reserved Capacity

When: Consistent high usage (>10K queries/month)

Strategy: Reserved Lambda concurrency

Compute Savings Plans (cover Lambda usage):

Commit to consistent usage for a compute discount (up to about 17% on Lambda under a Compute Savings Plan)
Requires predictable workload patterns

2. Multi-Tenancy

When: Multiple clients

Strategy: Share Bedrock Knowledge Base with client-level metadata filtering

The system supports multi-tenancy via client-level metadata filtering on the S3 Vectors storage. Each document is tagged with client metadata; project metadata is stored for display but not used for filtering, allowing all documents from a client’s projects to contribute to RAG context.

Benefits:

Single Knowledge Base for all tenants
Client-level metadata filtering at query time
All projects under a client accessible for richer context
No additional infrastructure cost per tenant

3. On-Premises Hybrid

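When: Very high volume (>100K queries/month)

Strategy: Run embeddings on-premises, Bedrock for generation only

Estimate:

Self-hosted embeddings: $50/month (GPU instance)
Bedrock generation only: ~$1,065/month at 100K queries (Sonnet, at the per-query rates above)

Savings: limited to the Titan embedding spend, which is about $0.50/month for 100K queries plus $0.0005 per ingested document, so this only pays off once embedding volume alone costs more than the ~$50/month instance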

Cost Comparison with Alternatives

Alternative 1: OpenAI + Third-Party Vector DB

Component	OpenAI Stack	AWS Bedrock Stack	Difference
LLM (1K queries)	GPT-4: $30	Claude Sonnet 4.5: $10.66	-$19.34
Embeddings (1K docs)	text-embedding-3: $0.02	Titan V2: $0.50	+$0.48
Vector DB	Pinecone: $70	S3 Vectors: ~$1	-$69
Hosting	Vercel: $20	CloudFront+S3: $1	-$19
Auth	Auth0: $35	Cognito: $0	-$35
Total	$155	$13	-$142/month (92% cheaper)

Alternative 2: Fully Managed (e.g., Mendable, ChatBase)

Component	Managed SaaS	Self-Hosted AWS	Difference
Platform Fee	$99-399/month	$0	-$99 to -$399
Infrastructure	Included	$82/month	+$82
Customization	Limited	Full control	N/A
Data Privacy	Shared	Isolated	N/A
Total	$99-399	$82	-$17 to -$317/month

Alternative 3: Self-Hosted Open Source

Component	Open Source	AWS Bedrock	Difference
LLM	Llama 3 (g5.2xlarge EC2)	Bedrock: $10.66	N/A (different model)
EC2 Cost	$730/month	$0	-$730
Embeddings	Self-hosted	Bedrock: $0.50	+$0.50
Vector DB	Qdrant (t3.medium + EBS): $38	S3 Vectors: ~$1	-$37
Total	$768	$13	-$755/month (open source is ~5,800% more expensive)

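Note: Open source is expensive because of the GPU cost of hosting LLMs. On compute cost alone it breaks even only at high volume (roughly 70K queries/month; see Break-Even Analysis), so it is cost-effective only at large scale.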

Cost Monitoring

CloudWatch Billing Alarms

Recommended Alarms:

Total monthly cost > $200
Bedrock cost > $50
Lambda cost > $15 (detect runaway invocations)
S3 storage cost > $5 (detect unexpected growth)
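One way to express these alarms is as parameter sets for CloudWatch's PutMetricAlarm call. The sketch below only builds the parameters and does not call AWS. The alarm names are hypothetical, and the `ServiceName` values (in particular the one for Bedrock) are assumptions that should be checked against the dimension values that actually appear under the `AWS/Billing` namespace in the account. Billing metrics are published in us-east-1 and require billing alerts to be enabled.

```python
from typing import Optional


def billing_alarm_params(name: str, threshold_usd: float,
                         service_name: Optional[str] = None) -> dict:
    """Build keyword arguments for a CloudWatch billing alarm."""
    dimensions = [{"Name": "Currency", "Value": "USD"}]
    if service_name is not None:
        # Per-service alarm; omit the dimension for the account-wide total.
        dimensions.append({"Name": "ServiceName", "Value": service_name})
    return {
        "AlarmName": name,
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",  # month-to-date estimated charges
        "Dimensions": dimensions,
        "Statistic": "Maximum",
        "Period": 21600,  # six hours, in seconds
        "EvaluationPeriods": 1,
        "Threshold": threshold_usd,
        "ComparisonOperator": "GreaterThanThreshold",
    }


# Thresholds from the list above; names and ServiceName values are assumptions.
ALARMS = [
    billing_alarm_params("nb-rag-total-cost", 200.0),
    billing_alarm_params("nb-rag-bedrock-cost", 50.0, "AmazonBedrock"),
    billing_alarm_params("nb-rag-lambda-cost", 15.0, "AWSLambda"),
    billing_alarm_params("nb-rag-s3-cost", 5.0, "AmazonS3"),
]

# To create them (requires boto3 and credentials):
#   import boto3
#   cw = boto3.client("cloudwatch", region_name="us-east-1")
#   for params in ALARMS:
#       cw.put_metric_alarm(**params)
```

Since the rest of the stack is managed with Terraform, the same parameters would more likely live in an alarm resource there; the Python form is only meant to make the metric, dimensions, and thresholds explicit.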

Cost Allocation Tags

Apply These Tags:

tags = {
  Project     = "NorthBuilt-RAG"
  Environment = "production"
  ManagedBy   = "terraform"
  CostCenter  = "engineering"
  Component   = "api|web|storage|compute" # placeholder: set exactly one of these values per resource
}

Monthly Review Checklist

Review CloudWatch billing dashboard
Check Bedrock token usage (input/output ratio)
Verify S3 Vectors storage usage
Audit Lambda memory usage (can optimize?)
Review CloudWatch log retention (can reduce?)
Check for idle resources (unused secrets, old S3 versions)

Break-Even Analysis

When to Self-Host vs Managed

S3 Vectors vs Self-Hosted Vector DB:

S3 Vectors: ~$1/month (pay-per-use)
Self-hosted (t3.medium + EBS): $38/month
Verdict: S3 Vectors is significantly cheaper with zero operational overhead

Bedrock Break-Even:

Bedrock: $10.66/1K queries
Self-hosted Llama (g5.2xlarge): $730/month base
Break-even: When queries > 68K/month (730 / 10.66 * 1000)

Verdict: Bedrock is better until ~70K queries/month
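The break-even point is the fixed monthly cost divided by the per-query Bedrock cost. A minimal sketch under strong simplifying assumptions: the self-hosted instance has zero marginal cost per query, a single g5.2xlarge can serve the whole load, and operational effort is not priced in. `break_even_queries` is a hypothetical helper name.

```python
def break_even_queries(fixed_monthly_usd: float,
                       bedrock_usd_per_1k_queries: float) -> float:
    """Monthly query volume at which a fixed-cost host matches Bedrock spend."""
    return fixed_monthly_usd / bedrock_usd_per_1k_queries * 1000


# Figures from this section: $730/month instance, $10.66 per 1K queries.
print(round(break_even_queries(730, 10.66)))
```

Any per-query cost on the self-hosted side, or a second instance for throughput or redundancy, pushes the break-even volume higher than this estimate.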

Cost Forecasting

Growth Projections

Timeframe	Est. Queries/Month	Est. Docs	Projected Cost	Notes
Month 1-3	500	500	$141	Initial launch
Month 4-6	2,000	2,000	$161	Growing adoption
Month 7-12	5,000	5,000	$192	Steady state
Year 2	10,000	10,000	$248	Mature product
Year 3	25,000	25,000	$401	Scale phase

Optimization Roadmap

Now (Cost: $66):

Using cost-effective serverless architecture with S3 Vectors
On-demand pricing for variable workloads
Multi-tenancy via metadata filtering

Month 6 (Cost: $55):

Implement query caching (20% cache hit rate)
Reduce context window (5 -> 3 documents)
Optimize Lambda memory

Year 2 (Cost: $100):

Reserved capacity for Lambda
Bedrock provisioned throughput for consistent high usage
Advanced caching strategies

Year 3 (Cost: $200):

At 25K queries/month, consider hybrid approach
Self-host embeddings, keep Bedrock for generation
Multi-region deployment for latency optimization

Last updated: 2026-01-01