Architecture Decision Records (ADRs)
Documentation of significant architectural decisions made during the development of the NorthBuilt RAG System.
ADR Format
Each ADR follows this structure:
- Status: Accepted / Superseded / Deprecated
- Date: Decision date
- Context: Problem and constraints
- Decision: What was decided
- Consequences: Impact of the decision
- Alternatives Considered: Other options evaluated
ADR-001: Serverless Architecture on AWS Lambda
Status: Accepted | Date: 2025-10-01
Context
Need to build a RAG system with:
- Variable workload (unpredictable query patterns)
- Minimal operational overhead
- Cost-effective at low scale
- Fast time to market
Constraints:
- Small team (no dedicated DevOps)
- Budget-conscious
- Need to be production-ready quickly
Decision
Use 100% serverless architecture on AWS:
- Lambda for compute
- API Gateway for HTTP endpoints
- S3 for storage
- DynamoDB for data
- Managed AI services (Bedrock)
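For illustration, a minimal query handler on this stack might look like the sketch below. The handler and request/response field names are assumptions, not the project's actual code; API Gateway HTTP API invokes the Lambda with a JSON body and expects a JSON response.

```python
"""Minimal sketch of a query Lambda behind API Gateway HTTP API.

All names (handler, request/response fields) are illustrative assumptions,
not the project's actual code.
"""
import json


def handler(event, context):
    # HTTP API (payload format 2.0) delivers the request body as a JSON string.
    body = json.loads(event.get("body") or "{}")
    question = body.get("question", "")

    # Retrieval and generation would happen here (see ADR-002/003/010).
    answer = f"Received question: {question}"  # placeholder response

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"answer": answer}),
    }
```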
Consequences
Positive:
- No server management or patching
- Auto-scaling built-in
- Pay only for usage
- High availability by default
- Fast deployment cycle
Negative:
- Cold start latency (~1s)
- Lambda timeout limits (15 min max)
- Vendor lock-in to AWS
- Debugging more complex than traditional servers
Cost Impact: ~$140/month at 1K queries/month
Alternatives Considered
- Container-based (ECS/EKS)
- Pros: More control, no cold starts, can run any code
- Cons: Higher baseline cost ($50/month min), requires container expertise, more operational overhead
- Rejected: Overkill for current scale
- EC2 Instances
- Pros: Full control, no timeouts, familiar
- Cons: Fixed cost ($30-100/month), manual scaling, patching required
- Rejected: Too much operational burden
- Managed Platform (Heroku, Render)
- Pros: Simple deployment, less AWS-specific
- Cons: Higher cost ($25-50/month), less control, still need to manage containers
- Rejected: More expensive, less flexibility
ADR-002: Pinecone for Vector Storage
Status: Superseded by ADR-010 | Date: 2025-11-01
Context
Initial implementation used OpenSearch Serverless for vector storage, but:
- High cost: ~$700/month minimum
- Slower queries: 200ms+ latency
- Complex networking (VPC, security groups)
- Frequent Terraform drift issues
Need a vector database that is:
- Cost-effective at small scale
- Fast (<50ms query latency)
- Simple to operate
- Reliable
Decision
Migrate to Pinecone managed vector database.
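For reference, the query path in the Query Lambda looked roughly like the sketch below. The index name and metadata fields are illustrative, and `query_embedding` is assumed to be a 1024-dimension vector produced upstream.

```python
# Hypothetical sketch of the Pinecone query path; index name, metadata fields,
# and the source of `query_embedding` are assumptions.
import os

from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("northbuilt-docs")  # illustrative index name


def search(query_embedding: list[float], top_k: int = 5) -> list[dict]:
    """Return the top_k most similar chunks for an embedded query."""
    result = index.query(vector=query_embedding, top_k=top_k, include_metadata=True)
    return [
        {"id": m.id, "score": m.score, "source": (m.metadata or {}).get("source")}
        for m in result.matches
    ]
```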
Consequences
Positive:
- Cost: $70/month (90% savings vs OpenSearch)
- Performance: <25ms query latency (8× faster)
- Simplicity: No VPC networking, no drift
- Reliability: 99.9% SLA, managed service
Negative:
- External dependency (not AWS)
- Additional API key to manage
- Data egress from AWS (minimal cost)
- Migration effort required
Cost Impact: $70/month (fixed) + $0 per query
Alternatives Considered
- OpenSearch Serverless (original)
- Pros: Fully AWS-native, powerful query language
- Cons: $700/month minimum, slow, complex
- Rejected: Too expensive
- Self-hosted Qdrant/Weaviate on EC2
- Pros: Full control, cheaper than OpenSearch (~$40/month)
- Cons: Operational burden, manual scaling, backups, patching
- Rejected: Too much maintenance
- DynamoDB + FAISS
- Pros: Fully AWS-native, very cheap
- Cons: Complex to implement, slower than specialized vector DB
- Rejected: Development time not worth savings
- Elasticsearch (self-managed)
- Pros: Mature, powerful, familiar
- Cons: Expensive to run (need 3 nodes), complex operations
- Rejected: Operational overhead
ADR-003: Claude Sonnet 4.5 for Response Generation
Status: Accepted | Date: 2025-10-15
Context
Need LLM for generating responses from retrieved context. Requirements:
- High quality responses
- Fast inference (<3s)
- Accurate citation handling
- Cost-effective
Decision
Use Claude Sonnet 4.5 via AWS Bedrock.
Pricing: $3/M input tokens, $15/M output tokens
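Invoking the model from the Query Lambda can go through Bedrock's Converse API, roughly as sketched below. The model ID and prompt wiring are assumptions; check the Bedrock console for the exact identifier.

```python
# Hedged sketch of calling Claude Sonnet 4.5 via Bedrock's Converse API.
# MODEL_ID is an assumed identifier; use the exact one from the Bedrock console.
import boto3

MODEL_ID = "anthropic.claude-sonnet-4-5-20250929-v1:0"  # assumption
bedrock = boto3.client("bedrock-runtime")


def generate_answer(question: str, context_chunks: list[str]) -> str:
    """Build a context-grounded prompt and return the model's text response."""
    prompt = "Context:\n" + "\n\n".join(context_chunks) + f"\n\nQuestion: {question}"
    response = bedrock.converse(
        modelId=MODEL_ID,
        system=[{"text": "Answer only from the provided context and cite your sources."}],
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]
```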
Consequences
Positive:
- Excellent response quality
- Strong instruction following (citations, formatting)
- Fast inference (~2s average)
- AWS Bedrock integration (no separate API)
- No model hosting required
Negative:
- More expensive than smaller models
- Token limits (200K context window, but we use ~2K)
- Vendor lock-in (Anthropic model)
Cost Impact: ~$11/month for 1000 queries
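A back-of-envelope check of that figure, assuming roughly 2K input and ~300 output tokens per query (the per-query token counts are assumptions; the prices are from above):

```python
# Rough monthly cost estimate for 1,000 queries; token counts per query are assumptions.
queries = 1_000
input_cost = queries * 2_000 / 1_000_000 * 3.00   # ~2K input tokens/query  -> $6.00
output_cost = queries * 300 / 1_000_000 * 15.00   # ~300 output tokens/query -> $4.50
print(f"${input_cost + output_cost:.2f}/month")   # ~$10.50, in line with ~$11/month
```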
Alternatives Considered
- Claude Haiku (cheaper)
- Pros: 12× cheaper ($0.25/M input, $1.25/M output tokens)
- Cons: Lower quality responses, less nuanced
- Future: May use for simple queries
- GPT-4 via OpenAI
- Pros: Comparable quality, more familiar to some
- Cons: More expensive ($30/M input, $60/M output tokens), separate API to manage
- Rejected: More expensive, one more service
- Self-hosted Llama 3
- Pros: Free inference (after setup)
- Cons: GPU required ($730/month for g5.2xlarge), complex deployment, lower quality
- Rejected: Not cost-effective until >70K queries/month
- Claude Opus (higher quality)
- Pros: Highest quality responses
- Cons: 5× more expensive
- Rejected: Quality difference not worth cost for most queries
ADR-004: Cognito + Google OAuth for Authentication
Status: Accepted | Date: 2025-10-10
Context
Need user authentication for web UI. Requirements:
- Secure (industry standard)
- Low maintenance
- Familiar UX (social login)
- Cost-effective
Decision
Use AWS Cognito with Google OAuth federation.
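Downstream code can verify Cognito-issued JWTs against the user pool's JWKS endpoint; a minimal sketch with PyJWT is shown below. The region, pool ID, and app client ID are placeholders, and this assumes an ID token (which carries an `aud` claim).

```python
# Hedged sketch: verify a Cognito ID token against the user pool's JWKS.
# Region, pool ID, and client ID are placeholders.
import jwt  # PyJWT
from jwt import PyJWKClient

REGION = "us-east-1"
USER_POOL_ID = "us-east-1_EXAMPLE"
APP_CLIENT_ID = "example-client-id"

JWKS_URL = f"https://cognito-idp.{REGION}.amazonaws.com/{USER_POOL_ID}/.well-known/jwks.json"
jwks_client = PyJWKClient(JWKS_URL)


def verify_id_token(token: str) -> dict:
    """Return the verified claims, or raise jwt.InvalidTokenError."""
    signing_key = jwks_client.get_signing_key_from_jwt(token)
    return jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        audience=APP_CLIENT_ID,  # ID tokens carry the app client ID in `aud`
    )
```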
Consequences
Positive:
- Free for first 50K monthly active users
- Fully managed (no password storage, MFA, etc.)
- Standard OAuth 2.0 flow
- JWT tokens for API authorization
- No additional auth service needed
Negative:
- AWS lock-in
- Limited customization of UI
- Redirect-based flow (not SPA-native)
- Google API setup required
Cost Impact: $0/month (under 50K MAU)
Alternatives Considered
- Auth0
- Pros: Better UX, more identity providers, more features
- Cons: $35/month minimum, external service
- Rejected: Unnecessary cost
- Firebase Authentication
- Pros: Good Google integration, free tier
- Cons: Ties to Google Cloud, harder to integrate with AWS
- Rejected: Prefer AWS-native
- Custom JWT implementation
- Pros: Full control, no cost
- Cons: Security risk, maintenance burden, password management
- Rejected: Not worth security risk
- No authentication
- Pros: Simplest
- Cons: No user tracking, no access control
- Rejected: Need to track usage per user
ADR-005: HTTP API (not REST API) for API Gateway
Status: Accepted | Date: 2025-10-12
Context
API Gateway offers two options:
- REST API: Full-featured, more expensive
- HTTP API: Simpler, cheaper, faster
Decision
Use HTTP API for lower cost and better performance.
Consequences
Positive:
- Cost: 70% cheaper ($1/M vs $3.50/M requests)
- Latency: ~10ms lower latency
- Simpler: Fewer features to configure
Negative:
- No resource policies or usage plans
- No API key authentication (use JWT instead)
- Limited request validation
- No caching (need external cache)
Cost Impact: $1/month vs $3.50/month at 1M requests
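Because the HTTP API drops API keys in favor of a JWT authorizer (noted under Consequences above), the Lambda reads the caller's identity from the authorizer claims in the event. A minimal sketch, assuming the default 2.0 payload format and Cognito claim names:

```python
# Sketch of reading verified JWT claims inside a Lambda behind an HTTP API
# JWT authorizer (payload format 2.0). Claim names follow Cognito defaults.
def handler(event, context):
    claims = event["requestContext"]["authorizer"]["jwt"]["claims"]
    user_id = claims.get("sub")      # stable Cognito user identifier
    email = claims.get("email")      # present on ID tokens
    # use user_id for per-user usage tracking and access control
    return {"statusCode": 200, "body": f"hello {email or user_id}"}
```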
Alternatives Considered
- REST API
- Pros: More features (caching, usage plans, resource policies)
- Cons: More expensive, slower
- Rejected: Don’t need extra features
- Application Load Balancer
- Pros: Lower cost at high scale (>10M requests/month)
- Cons: Fixed cost (~$20/month), need to run targets (Lambda or containers)
- Rejected: Current scale doesn’t justify
ADR-006: Migrate from OpenSearch to Pinecone
Status: Superseded by ADR-010 | Date: 2025-11-01
Context
OpenSearch Serverless had multiple issues:
- High cost: $700/month minimum (OCU pricing)
- Performance: 200ms+ query latency (p95)
- Complexity: VPC, security groups, collection policies
- Terraform drift: Constant drift with policies
- Overkill: Full-text search features unused
Decision
Migrate to Pinecone as the primary vector store.
Migration approach:
- Create Pinecone index
- Re-ingest all documents
- Update Query Lambda to use Pinecone
- Destroy OpenSearch collection
- Remove VPC resources
Consequences
Positive:
- Cost savings: $630/month (90% reduction)
- Performance: <25ms latency (8× faster)
- Simpler architecture: Removed the VPC and 50+ Terraform resources
- No drift: Pinecone doesn’t use complex IAM policies
- Better docs: Pinecone docs > AWS OpenSearch docs
Negative:
- External dependency: Data stored outside AWS
- Migration downtime: 2 hours to re-ingest
- Lost features: No full-text search (only vector similarity)
Cost Impact: $70/month fixed (was $700/month)
Migration Steps
- Created Pinecone index (1024-dim, cosine similarity)
- Wrote a migration script to fetch from OpenSearch and upsert to Pinecone (sketched below)
- Updated Lambda environment variables (API key, index name)
- Tested query performance (validated <25ms latency)
- Destroyed OpenSearch resources via Terraform
- Removed VPC, subnets, security groups, collection policies
Rollback plan: Keep the OpenSearch collection for 7 days before destroying it so we can roll back if issues arise.
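The migration script from step 2 above amounted to paging vectors out of OpenSearch and upserting them into Pinecone in batches; a hedged sketch is below. The collection endpoint, index names, field names, and secret name are assumptions, and a real run would paginate (e.g. with `search_after`) rather than fetch a single page.

```python
# Hypothetical one-off migration sketch: OpenSearch Serverless -> Pinecone.
# Collection endpoint, index names, field names, and secret name are assumptions.
import boto3
from opensearchpy import AWSV4SignerAuth, OpenSearch, RequestsHttpConnection
from pinecone import Pinecone

REGION = "us-east-1"
OS_HOST = "example.us-east-1.aoss.amazonaws.com"  # placeholder collection endpoint

auth = AWSV4SignerAuth(boto3.Session().get_credentials(), REGION, "aoss")
os_client = OpenSearch(
    hosts=[{"host": OS_HOST, "port": 443}],
    http_auth=auth,
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

secret = boto3.client("secretsmanager").get_secret_value(SecretId="pinecone-api-key")
pc = Pinecone(api_key=secret["SecretString"])
pinecone_index = pc.Index("northbuilt-docs")  # illustrative index name

# Fetch one page of documents and upsert it; a real run would loop with
# search_after until the whole index is copied.
page = os_client.search(index="documents",
                        body={"size": 100, "query": {"match_all": {}}})
vectors = [
    {
        "id": hit["_id"],
        "values": hit["_source"]["embedding"],             # assumed vector field
        "metadata": {"source": hit["_source"].get("source", "")},
    }
    for hit in page["hits"]["hits"]
]
if vectors:
    pinecone_index.upsert(vectors=vectors)
```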
ADR-007: Terraform for Infrastructure as Code
Status: Accepted | Date: 2025-10-01
Context
Need to manage AWS infrastructure. Requirements:
- Repeatable deployments
- Version control for infrastructure
- Multiple environments (dev, prod)
- Team collaboration
Decision
Use Terraform for infrastructure as code.
Consequences
Positive:
- Declarative: Describe desired state, Terraform handles changes
- State management: S3 + DynamoDB for locking
- Modules: Reusable components
- Plan before apply: Review changes before executing
- Industry standard: Well-documented, large community
Negative:
- State management: Need to protect state file
- Learning curve: HCL syntax, Terraform concepts
- Drift: Manual changes cause drift (must avoid)
Cost Impact: $0.10/month (S3 state storage)
Alternatives Considered
- AWS CloudFormation
- Pros: AWS-native, no state file, free
- Cons: YAML/JSON verbose, slower, AWS-only
- Rejected: Prefer Terraform’s cleaner syntax
- AWS CDK
- Pros: Real programming language (Python, TypeScript)
- Cons: Less mature, generates CloudFormation (slower), more complex
- Rejected: Overkill for our needs
- Pulumi
- Pros: Real programming language, good UX
- Cons: Less mature, smaller community, managed state
- Rejected: Prefer Terraform’s larger ecosystem
- Manual (ClickOps)
- Pros: Fastest initially, familiar
- Cons: Not repeatable, no version control, error-prone
- Rejected: Not sustainable
ADR-008: GitHub Actions for CI/CD
Status: Accepted | Date: 2025-10-05
Context
Need CI/CD pipeline for deploying infrastructure and code. Requirements:
- Automated deployments on push
- Secure (no long-lived credentials)
- Easy to configure
- Free or cheap
Decision
Use GitHub Actions with OIDC authentication to AWS.
Consequences
Positive:
- Free: Unlimited minutes for public repos, 2000 min/month for private
- Integrated: Lives with code in `.github/workflows`
- Secure: OIDC eliminates long-lived AWS keys
- Flexible: Can run any bash command, install any tool
Negative:
- GitHub lock-in: Workflow syntax specific to GitHub Actions
- Limited debugging: Can’t SSH into runners
- Cold start: Runners start fresh each time (must install tools)
Cost Impact: $0/month (free tier)
Alternatives Considered
- AWS CodePipeline
- Pros: AWS-native, integrates with CodeBuild
- Cons: $1/pipeline/month, more complex setup
- Rejected: GitHub Actions simpler and free
- GitLab CI
- Pros: Similar to GitHub Actions, good UX
- Cons: Need to migrate repo, learning curve
- Rejected: Already using GitHub
- Jenkins
- Pros: Full control, highly customizable
- Cons: Need to host ($30/month EC2), complex setup, maintenance
- Rejected: Too much overhead
- CircleCI / Travis CI
- Pros: Good UX, mature
- Cons: Cost ($30+/month for private), not as integrated
- Rejected: GitHub Actions more convenient
ADR-009: Python 3.13 for Lambda Runtime
Status: Accepted | Date: 2025-10-15
Context
Need to choose a Lambda runtime. Python is preferred for:
- Team expertise
- AWS SDK (boto3) built-in
- AI/ML libraries (LangChain, etc.)
Decision
Use Python 3.13 runtime (latest available).
Consequences
Positive:
- Performance: Faster than Python 3.11/3.12
- Features: Latest Python features
- Support: Will be supported for ~5 years
- Libraries: All major libraries compatible
Negative:
- Newer runtime: Less battle-tested (3.12 more stable)
- Dependencies: Some libraries may lag
Cost Impact: None
Alternatives Considered
- Python 3.12
- Pros: More stable, better tested
- Cons: Slightly slower, older features
- May switch if stability issues arise
- Node.js
- Pros: Faster cold starts, async-native
- Cons: Different language, less AI/ML ecosystem
- Rejected: Team less familiar
- Go
- Pros: Fastest cold starts, compiled
- Cons: Verbose, harder to write, less AWS library support
- Rejected: Development speed more important
- Java
- Pros: Enterprise-grade, good AWS support
- Cons: Slow cold starts (5s+), verbose
- Rejected: Cold starts unacceptable
ADR-010: Migrate from Pinecone to S3 Vectors
Status: Accepted | Date: 2025-12-15
Context
Pinecone worked well but had limitations for our use case:
- External dependency: Data stored outside AWS ecosystem
- API key management: Additional secret to manage and rotate
- Cost structure: Fixed monthly cost regardless of usage
- Integration complexity: Separate service from Bedrock Knowledge Base
AWS announced S3 Vectors, a purpose-built vector storage service that integrates natively with Bedrock Knowledge Bases.
Decision
Migrate from Pinecone to AWS S3 Vectors for vector storage.
Key benefits of S3 Vectors:
- Native Bedrock Knowledge Base integration
- Fully managed within AWS ecosystem
- Pay-per-use pricing model
- No external API keys required
- Automatic scaling and high availability
Consequences
Positive:
- Simplified architecture: Single AWS ecosystem, no external dependencies
- Native integration: Direct integration with Bedrock Knowledge Base
- Security: No external API keys, uses IAM for access control
- Cost model: Pay only for storage and queries used
- Compliance: Data stays within AWS, simplifies compliance
Negative:
- Migration effort: Required re-ingestion of all documents
- Feature differences: S3 Vectors has different metadata limits (1KB with Bedrock KB)
- Newer service: Less battle-tested than Pinecone
Cost Impact: Variable based on usage (vs $70/month fixed with Pinecone)
Migration Steps
- Created S3 Vectors bucket and index (1024-dim, cosine similarity)
- Updated Terraform to use `aws_bedrockagent_knowledge_base` with S3 Vectors storage
- Configured IAM policies for Bedrock to access S3 Vectors
- Re-ingested all documents via Bedrock Knowledge Base sync
- Updated Lambda handlers to use the Bedrock Knowledge Base Retrieve API (see the retrieval sketch below)
- Removed Pinecone provider and related Terraform resources
- Deleted Pinecone API key from Secrets Manager
Configuration Details
```hcl
# S3 Vectors storage configuration
storage_configuration {
  type = "S3_VECTORS"

  s3_vectors_configuration {
    index_arn = var.s3_vectors_index_arn
  }
}

# Fixed-size chunking (512 tokens, 20% overlap)
chunking_configuration {
  chunking_strategy = "FIXED_SIZE"

  fixed_size_chunking_configuration {
    max_tokens         = 512
    overlap_percentage = 20
  }
}
```
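With this configuration in place, the Query Lambda retrieves chunks through the `bedrock-agent-runtime` Retrieve API instead of calling a vector store directly. A minimal sketch follows; the Knowledge Base ID is a placeholder and would normally come from Terraform outputs via a Lambda environment variable.

```python
# Sketch of retrieving chunks from the Bedrock Knowledge Base backed by S3 Vectors.
# KB_ID is a placeholder; in practice it is injected via a Lambda environment variable.
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")
KB_ID = "EXAMPLEKBID"


def retrieve_chunks(question: str, top_k: int = 5) -> list[dict]:
    """Return the top_k chunks (text, score, source location) for a question."""
    resp = agent_runtime.retrieve(
        knowledgeBaseId=KB_ID,
        retrievalQuery={"text": question},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": top_k}},
    )
    return [
        {
            "text": r["content"]["text"],
            "score": r.get("score"),
            "location": r.get("location", {}),
        }
        for r in resp["retrievalResults"]
    ]
```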
Summary
| ADR | Decision | Status | Impact |
|---|---|---|---|
| 001 | Serverless on Lambda | Accepted | ~$140/month, no ops |
| 002 | Pinecone for vectors | Superseded | Replaced by S3 Vectors |
| 003 | Claude Sonnet 4.5 | Accepted | $11/month per 1K queries |
| 004 | Cognito + Google OAuth | Accepted | Free, managed auth |
| 005 | HTTP API (not REST) | Accepted | 70% cost savings |
| 006 | Migrate OpenSearch → Pinecone | Superseded | Replaced by S3 Vectors |
| 007 | Terraform for IaC | Accepted | Repeatable deployments |
| 008 | GitHub Actions for CI/CD | Accepted | Free, secure OIDC |
| 009 | Python 3.13 runtime | Accepted | Latest features, good perf |
| 010 | Migrate Pinecone → S3 Vectors | Accepted | Native AWS, pay-per-use |
Last updated: 2025-12-29