Operations Runbook
Day-to-day operational procedures for the NorthBuilt RAG System.
Daily Operations
Morning Health Check
#!/bin/bash
# Save as: scripts/morning-health-check.sh
echo "=== NorthBuilt RAG System Health Check ==="
echo "Date: $(date)"
echo
# Check API Gateway
echo "1. API Gateway Status:"
API_ID=$(aws apigatewayv2 get-apis --query 'Items[?Name==`nb-rag-sys-api`].ApiId' --output text)
if [ -n "$API_ID" ]; then
echo "[OK] API Gateway: $API_ID"
else
echo "[ERROR] API Gateway: NOT FOUND"
fi
# Check Lambda functions
echo
echo "2. Lambda Functions:"
for func in chat query classification webhook-fathom webhook-helpscout; do
if aws lambda get-function --function-name "nb-rag-sys-$func" >/dev/null 2>&1; then
echo "[OK] nb-rag-sys-$func"
else
echo "[ERROR] nb-rag-sys-$func: not found or not accessible"
fi
done
# Check recent errors (last 24 hours)
echo
echo "3. Recent Errors (last 24 hours):"
ERROR_COUNT=$(aws logs filter-log-events \
--log-group-name /aws/lambda/nb-rag-sys-chat \
--filter-pattern "ERROR" \
--start-time $(date -u -d '24 hours ago' +%s)000 \
--query 'length(events)' --output text)
echo "Chat Lambda errors: $ERROR_COUNT"
# Check Bedrock usage
echo
echo "4. Bedrock Usage (last 24 hours):"
BEDROCK_INVOCATIONS=$(aws cloudwatch get-metric-statistics \
--namespace AWS/Bedrock \
--metric-name Invocations \
--dimensions Name=ModelId,Value=us.anthropic.claude-sonnet-4-5-20250929-v1:0 \
--start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 86400 \
--statistics Sum \
--query 'Datapoints[0].Sum' --output text)
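# The CLI prints "None" when CloudWatch returns no datapoints, so normalize it
[ "$BEDROCK_INVOCATIONS" = "None" ] && BEDROCK_INVOCATIONS=0
echo "Model invocations: ${BEDROCK_INVOCATIONS:-0}"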
# Check estimated cost (AWS/Billing metrics exist only in us-east-1 and
# require billing alerts to be enabled on the account)
echo
echo "5. Estimated Monthly Cost:"
COST=$(aws cloudwatch get-metric-statistics \
--region us-east-1 \
--namespace AWS/Billing \
--metric-name EstimatedCharges \
--dimensions Name=Currency,Value=USD \
--start-time $(date -u -d '6 hours ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 21600 \
--statistics Maximum \
--query 'Datapoints[0].Maximum' --output text)
# The CLI prints "None" when no datapoint is available
[ "$COST" = "None" ] && COST=""
echo "Current month: \$${COST:-N/A}"
echo
echo "=== Health Check Complete ==="
Run daily:
chmod +x scripts/morning-health-check.sh
./scripts/morning-health-check.sh
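If the check runs unattended (for example from a scheduler), it can help to turn its output into an exit status. The sketch below relies only on the `[OK]`/`[ERROR]` line prefixes the script prints; the function names `count_errors` and `run_health_check` and the default log path are illustrative assumptions, not part of the project.

```shell
#!/bin/sh
# Count lines on stdin that start with the "[ERROR]" prefix used by the health check.
count_errors() {
    # grep -c prints 0 but exits 1 when nothing matches, so ignore its status.
    grep -c '^\[ERROR\]' || true
}

# Run the health check, keep a log, and fail if any check reported an error.
# The default log path is a placeholder; pass another path as the first argument.
run_health_check() {
    log_file="${1:-/tmp/nb-rag-sys-health.log}"
    ./scripts/morning-health-check.sh > "$log_file" 2>&1
    errors=$(count_errors < "$log_file")
    echo "Health check finished with $errors error(s); log: $log_file"
    [ "$errors" -eq 0 ]
}
```

Calling `run_health_check` from a scheduled job returns a non-zero status when any check fails, which most schedulers can alert on.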
Alternative: In-App Dashboard
For a visual health check, use the System Dashboard in the web application:
- Navigate to the application and sign in
- Click “Dashboard” in the sidebar
- Review metrics cards for query volume, latency, and error rates
- Check ingestion metrics for document sync status
- View recent logs for any errors
The dashboard provides real-time metrics with 24h/7d/30d time ranges.
Common Tasks
Add New User
Users are authenticated via Google OAuth, so no manual user creation is needed. To grant admin access:
# Look up the user's Cognito username by email
# (federated Google users get a generated username, not their email)
aws cognito-idp list-users \
--user-pool-id [user-pool-id] \
--filter 'email = "[email]"' \
--query 'Users[0].Username' --output text
# Add to admin group (if implemented)
aws cognito-idp admin-add-user-to-group \
--user-pool-id [user-pool-id] \
--username [username] \
--group-name admins
Rotate API Keys
Webhook API Keys
# 1. Generate new API key in external service
# 2. Update secret
aws secretsmanager update-secret \
--secret-id nb-rag-sys-fathom-api-key \
--secret-string '{"api_key": "NEW_KEY"}'
# 3. Update webhook configuration in external service
# (Fathom or HelpScout)
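# 4. Force new execution environments so the Lambda reloads the secret.
# --environment replaces the whole variable map, so merge with the existing variables.
ENV=$(aws lambda get-function-configuration \
--function-name nb-rag-sys-webhook-fathom \
--query 'Environment.Variables' --output json \
| jq -c --arg ts "$(date +%s)" '{Variables: (. + {FORCE_UPDATE: $ts})}')
aws lambda update-function-configuration \
--function-name nb-rag-sys-webhook-fathom \
--environment "$ENV"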
Update Lambda Function Code
# 1. Make code changes
cd lambda/chat
vim handler.py
# 2. Package function
zip -r function.zip . -x "*.pyc" "__pycache__/*" "venv/*" ".venv/*" "tests/*"
# 3. Update function
aws lambda update-function-code \
--function-name nb-rag-sys-chat \
--zip-file fileb://function.zip
# 4. Wait for update to complete
aws lambda wait function-updated \
--function-name nb-rag-sys-chat
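# 5. Test function (AWS CLI v2 needs --cli-binary-format to accept a raw JSON payload)
aws lambda invoke \
--function-name nb-rag-sys-chat \
--cli-binary-format raw-in-base64-out \
--payload '{"body": "{\"query\":\"test\"}"}' \
/tmp/response.json
cat /tmp/response.json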
Recommended: Use Terraform for code updates:
cd terraform
terraform apply -target=module.lambda.aws_lambda_function.chat
Invalidate CloudFront Cache
After updating web assets:
# Get CloudFront distribution ID
DIST_ID=$(aws cloudfront list-distributions \
--query 'DistributionList.Items[?Comment==`nb-rag-sys-web`].Id' --output text)
# Create invalidation
aws cloudfront create-invalidation \
--distribution-id $DIST_ID \
--paths "/*"
# Check invalidation status
aws cloudfront get-invalidation \
--distribution-id $DIST_ID \
--id [invalidation-id]
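The invalidation ID does not have to be copied by hand: `create-invalidation` returns it, and the CLI ships an `invalidation-completed` waiter. A minimal sketch, where the function name `invalidate_and_wait` is an illustrative assumption:

```shell
#!/bin/sh
# Create an invalidation and block until CloudFront reports it complete.
# Usage: invalidate_and_wait <distribution-id> [path-pattern]
invalidate_and_wait() {
    dist_id="$1"
    paths="${2:-/*}"
    inv_id=$(aws cloudfront create-invalidation \
        --distribution-id "$dist_id" \
        --paths "$paths" \
        --query 'Invalidation.Id' --output text) || return 1
    echo "Created invalidation $inv_id"
    # Built-in waiter: polls get-invalidation until the status is Completed.
    aws cloudfront wait invalidation-completed \
        --distribution-id "$dist_id" \
        --id "$inv_id"
}
```

For example, `invalidate_and_wait "$DIST_ID"` invalidates every path and returns once the caches have been cleared.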
Scale Lambda Concurrency
Increase reserved concurrency during high traffic:
# Increase reserved concurrency
aws lambda put-function-concurrency \
--function-name nb-rag-sys-chat \
--reserved-concurrent-executions 20
# Or remove limit (unreserved)
aws lambda delete-function-concurrency \
--function-name nb-rag-sys-chat
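Lambda keeps at least 100 concurrent executions unreserved in each account, so a reservation that is too large is rejected. The sketch below checks the headroom before calling `put-function-concurrency`. The function names are illustrative assumptions, and the check treats the requested value as additional to whatever the function already has reserved.

```shell
#!/bin/sh
# Succeed if reserving <requested> more executions would still leave
# at least 100 unreserved executions in the account.
can_reserve() {
    requested="$1"
    unreserved="$2"
    [ $((unreserved - requested)) -ge 100 ]
}

# Look up the account's current unreserved concurrency and check a request.
check_reservation() {
    requested="$1"
    unreserved=$(aws lambda get-account-settings \
        --query 'AccountLimit.UnreservedConcurrentExecutions' --output text) || return 1
    if can_reserve "$requested" "$unreserved"; then
        echo "OK: $requested of $unreserved unreserved executions can be reserved"
    else
        echo "Too high: only $unreserved unreserved executions in the account"
        return 1
    fi
}
```

Running `check_reservation 20` before the command above gives an early warning instead of a failed API call.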
Enable/Disable Webhooks
Temporarily disable webhook processing:
# Disable by removing the API Gateway integration
# (destructive: it must be recreated, for example with terraform apply, to re-enable)
aws apigatewayv2 delete-integration \
--api-id [api-id] \
--integration-id [integration-id]
# Or set reserved concurrency to 0 (every invocation is throttled, so senders see errors and may retry)
aws lambda put-function-concurrency \
--function-name nb-rag-sys-webhook-fathom \
--reserved-concurrent-executions 0
# Re-enable
aws lambda delete-function-concurrency \
--function-name nb-rag-sys-webhook-fathom
Backup & Recovery
Backup S3 Documents
# Backup documents to a separate bucket
aws s3 sync s3://nb-rag-sys-documents/ s3://nb-rag-sys-backups/documents/
# Or download locally
aws s3 sync s3://nb-rag-sys-documents/ ./backups/documents/
# List current documents
aws s3 ls s3://nb-rag-sys-documents/ --recursive --summarize
Run weekly:
aws s3 sync s3://nb-rag-sys-documents/ s3://nb-rag-sys-backups/documents-$(date +%Y%m%d)/
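A quick sanity check after a backup is to compare object counts between the source and the backup prefix. The sketch below parses the `Total Objects:` line that `--summarize` prints. The function names are illustrative assumptions, and matching counts do not prove the contents are identical.

```shell
#!/bin/sh
# Print the number of objects under an S3 URI, parsed from the
# "Total Objects: N" line that --summarize appends to the listing.
object_count() {
    aws s3 ls "$1" --recursive --summarize | awk '/Total Objects:/ {print $3}'
}

# Compare object counts between the live bucket and a backup prefix.
# Usage: verify_backup <source-s3-uri> <backup-s3-uri>
verify_backup() {
    src_count=$(object_count "$1")
    dst_count=$(object_count "$2")
    echo "source=$src_count backup=$dst_count"
    [ -n "$src_count" ] && [ "$src_count" = "$dst_count" ]
}
```

For example: `verify_backup s3://nb-rag-sys-documents/ "s3://nb-rag-sys-backups/documents-$(date +%Y%m%d)/"`.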
Restore Documents and Rebuild Vectors
# Restore documents from backup
aws s3 sync s3://nb-rag-sys-backups/documents-YYYYMMDD/ s3://nb-rag-sys-documents/
# Trigger Knowledge Base sync to rebuild vectors
aws bedrock-agent start-ingestion-job \
--knowledge-base-id [kb-id] \
--data-source-id [ds-id]
# Monitor sync progress
aws bedrock-agent get-ingestion-job \
--knowledge-base-id [kb-id] \
--data-source-id [ds-id] \
--ingestion-job-id [job-id]
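Rather than re-running `get-ingestion-job` by hand, the status can be polled until the job reaches a terminal state. A minimal sketch, assuming the function name `wait_for_ingestion` and a 30-second default poll interval (both illustrative):

```shell
#!/bin/sh
# Poll an ingestion job until it reaches a terminal status.
# Usage: wait_for_ingestion <kb-id> <data-source-id> <job-id> [poll-seconds]
wait_for_ingestion() {
    kb_id="$1"; ds_id="$2"; job_id="$3"; interval="${4:-30}"
    while :; do
        status=$(aws bedrock-agent get-ingestion-job \
            --knowledge-base-id "$kb_id" \
            --data-source-id "$ds_id" \
            --ingestion-job-id "$job_id" \
            --query 'ingestionJob.status' --output text) || return 1
        echo "Ingestion job $job_id: $status"
        case "$status" in
            COMPLETE) return 0 ;;
            FAILED|STOPPED) return 1 ;;
        esac
        sleep "$interval"
    done
}
```

The job ID can be captured when the job is started by adding `--query 'ingestionJob.ingestionJobId' --output text` to `start-ingestion-job`.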
Backup DynamoDB Table
# Enable Point-in-Time Recovery (already enabled)
aws dynamodb update-continuous-backups \
--table-name nb-rag-sys \
--point-in-time-recovery-specification PointInTimeRecoveryEnabled=true
# Create on-demand backup
aws dynamodb create-backup \
--table-name nb-rag-sys \
--backup-name "classification-$(date +%Y%m%d)"
# List backups
aws dynamodb list-backups --table-name nb-rag-sys
Restore DynamoDB Table
# Restore from backup
aws dynamodb restore-table-from-backup \
--target-table-name nb-rag-sys-restored \
--backup-arn arn:aws:dynamodb:us-east-1:ACCOUNT:table/nb-rag-sys/backup/BACKUP_ID
# Or restore from point-in-time
aws dynamodb restore-table-to-point-in-time \
--source-table-name nb-rag-sys \
--target-table-name nb-rag-sys-restored \
--restore-date-time "2025-11-08T12:00:00Z"
Backup Terraform State
# Download current state
aws s3 cp s3://nb-rag-sys-terraform-state/terraform.tfstate ./terraform-state-backup-$(date +%Y%m%d).tfstate
# S3 versioning is already enabled, so a previous version can be recovered:
aws s3api list-object-versions \
--bucket nb-rag-sys-terraform-state \
--prefix terraform.tfstate
# Download a specific version (copy it back to the bucket to restore it)
aws s3api get-object \
--bucket nb-rag-sys-terraform-state \
--key terraform.tfstate \
--version-id [version-id] \
terraform.tfstate
Monitoring
View Real-Time Logs
# Tail Lambda logs
aws logs tail /aws/lambda/nb-rag-sys-chat --follow
# Tail API Gateway logs
aws logs tail /aws/apigateway/nb-rag-sys --follow
# Filter for errors only
aws logs tail /aws/lambda/nb-rag-sys-chat --follow --filter-pattern "ERROR"
Check System Metrics
# Lambda invocations (last hour)
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Invocations \
--dimensions Name=FunctionName,Value=nb-rag-sys-chat \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600 \
--statistics Sum
# API Gateway requests (last hour)
# HTTP APIs (apigatewayv2) publish metrics under the ApiId dimension, not ApiName
aws cloudwatch get-metric-statistics \
--namespace AWS/ApiGateway \
--metric-name Count \
--dimensions Name=ApiId,Value=[api-id] \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600 \
--statistics Sum
View CloudWatch Dashboard
# Open dashboard in browser (`open` is macOS; use xdg-open on Linux)
open "https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=nb-rag-sys-main"
S3 Vectors Debugging
Commands for debugging vector search and Knowledge Base issues. Useful when:
- Documents are not appearing in search results
- Metadata filtering isn’t working as expected
- Ingestion jobs are failing
- Retrieval returns unexpected results
Direct Retrieval Testing
Test Knowledge Base retrieval without going through the Lambda:
# Get Knowledge Base ID
KB_ID=$(aws bedrock-agent list-knowledge-bases \
--query 'knowledgeBaseSummaries[?name==`nb-rag-sys-knowledge-base`].knowledgeBaseId' \
--output text)
# Basic retrieval test
aws bedrock-agent-runtime retrieve \
--knowledge-base-id $KB_ID \
--retrieval-query '{"text": "your test query here"}' \
--retrieval-configuration '{
"vectorSearchConfiguration": {
"numberOfResults": 5
}
}'
# Test with client filter
aws bedrock-agent-runtime retrieve \
--knowledge-base-id $KB_ID \
--retrieval-query '{"text": "your test query here"}' \
--retrieval-configuration '{
"vectorSearchConfiguration": {
"numberOfResults": 5,
"filter": {
"equals": {
"key": "client",
"value": "YourClient"
}
}
}
}'
Check Ingestion Job Status
# Get data source ID
DS_ID=$(aws bedrock-agent list-data-sources \
--knowledge-base-id $KB_ID \
--query 'dataSourceSummaries[0].dataSourceId' \
--output text)
# List recent ingestion jobs
aws bedrock-agent list-ingestion-jobs \
--knowledge-base-id $KB_ID \
--data-source-id $DS_ID \
--max-results 5
# Get details for a specific job
aws bedrock-agent get-ingestion-job \
--knowledge-base-id $KB_ID \
--data-source-id $DS_ID \
--ingestion-job-id [job-id]
# Start a new ingestion job
aws bedrock-agent start-ingestion-job \
--knowledge-base-id $KB_ID \
--data-source-id $DS_ID
Inspect S3 Metadata
Documents require a companion .metadata.json file for Bedrock KB filtering:
# List documents in S3
aws s3 ls s3://nb-rag-sys-docs-[suffix]/fathom/ --recursive
# Check if metadata file exists
DOC="fathom/client/project/document.md"
aws s3 ls "s3://nb-rag-sys-docs-[suffix]/${DOC}.metadata.json"
# View metadata contents
aws s3 cp "s3://nb-rag-sys-docs-[suffix]/${DOC}.metadata.json" - | jq .
# Expected format:
# {
# "metadataAttributes": {
# "client": {"value": {"type": "STRING", "stringValue": "ClientName"}},
# "source": {"value": {"type": "STRING", "stringValue": "fathom"}},
# ...
# }
# }
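When a sidecar is missing, it can be regenerated from the command line. The sketch below emits only the two attributes visible in the expected format above (`client` and `source`); any further attributes the ingestion pipeline writes are not reproduced here. The function names are illustrative assumptions.

```shell
#!/bin/sh
# Print a metadata sidecar with the client and source attributes.
# Assumes the values contain no double quotes or backslashes (no JSON escaping is done).
make_metadata() {
    client="$1"
    source_name="$2"
    printf '{"metadataAttributes": {"client": {"value": {"type": "STRING", "stringValue": "%s"}}, "source": {"value": {"type": "STRING", "stringValue": "%s"}}}}\n' \
        "$client" "$source_name"
}

# Write the sidecar next to a document in S3 (aws s3 cp reads stdin when given "-").
# Usage: upload_metadata <s3-uri-of-document> <client> <source>
upload_metadata() {
    make_metadata "$2" "$3" | aws s3 cp - "$1.metadata.json"
}
```

After uploading a sidecar, a new ingestion job is needed before the metadata becomes filterable.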
Check S3 Vectors Configuration
# Get the index associated with the Knowledge Base
aws bedrock-agent get-knowledge-base \
--knowledge-base-id $KB_ID \
--query 'knowledgeBase.storageConfiguration'
# List S3 Vectors buckets and their indexes (if using S3 Vectors)
aws s3vectors list-vector-buckets
aws s3vectors list-indexes --vector-bucket-name [vector-bucket-name]
Debug Missing Documents
When a document doesn’t appear in search results:
# 1. Verify document exists in S3
aws s3 ls "s3://nb-rag-sys-docs-[suffix]/path/to/document.md"
# 2. Verify metadata sidecar exists
aws s3 ls "s3://nb-rag-sys-docs-[suffix]/path/to/document.md.metadata.json"
# 3. Check metadata content is valid
aws s3 cp "s3://nb-rag-sys-docs-[suffix]/path/to/document.md.metadata.json" - | jq .
# 4. Check last ingestion job included this document
aws bedrock-agent get-ingestion-job \
--knowledge-base-id $KB_ID \
--data-source-id $DS_ID \
--ingestion-job-id [most-recent-job-id]
# 5. Re-run ingestion if needed
aws bedrock-agent start-ingestion-job \
--knowledge-base-id $KB_ID \
--data-source-id $DS_ID
# 6. Wait for ingestion to complete (typically 5-10 minutes)
watch -n 30 "aws bedrock-agent get-ingestion-job \
--knowledge-base-id $KB_ID \
--data-source-id $DS_ID \
--ingestion-job-id [job-id] \
--query 'ingestionJob.status' --output text"
Debug Metadata Filtering
When client filtering isn’t working:
# 1. Retrieve WITHOUT filter to see what's indexed
aws bedrock-agent-runtime retrieve \
--knowledge-base-id $KB_ID \
--retrieval-query '{"text": "test query"}' \
--retrieval-configuration '{"vectorSearchConfiguration": {"numberOfResults": 10}}'
# 2. Check metadata in results
# Look for: retrievalResults[].metadata.client
# 3. Retrieve WITH filter
aws bedrock-agent-runtime retrieve \
--knowledge-base-id $KB_ID \
--retrieval-query '{"text": "test query"}' \
--retrieval-configuration '{
"vectorSearchConfiguration": {
"numberOfResults": 10,
"filter": {"equals": {"key": "client", "value": "YourClient"}}
}
}'
# 4. If no results with filter, verify the metadata key is filterable
# Check the Knowledge Base configuration - filterable keys are set in Terraform
Monitor Custom RAG Metrics
After implementing custom metrics, query them in CloudWatch:
# Get retrieval latency metrics (last hour)
aws cloudwatch get-metric-statistics \
--namespace RAG/Retrieval \
--metric-name RetrievalLatencyMs \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average,Maximum,Minimum
# Get filter effectiveness (ratio of filtered results)
aws cloudwatch get-metric-statistics \
--namespace RAG/Retrieval \
--metric-name FilterEffectiveness \
--dimensions Name=HasFilter,Value=true \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average
# Get error counts
aws cloudwatch get-metric-statistics \
--namespace RAG/Retrieval \
--metric-name Errors \
--start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600 \
--statistics Sum
Common S3 Vectors Issues
| Issue | Likely Cause | Solution |
|---|---|---|
| Document not in results | Missing .metadata.json sidecar | Create sidecar file and re-ingest |
| Filter returns no results | Metadata key not filterable | Check KB config, ensure key is in filterable list |
| Ingestion fails | Invalid metadata format | Validate JSON structure matches Bedrock KB spec |
| Inconsistent results | Stale vectors | Re-run ingestion job |
| High latency | Too many candidates | Reduce numberOfResults or add filters |
Incident Response
High Error Rate
Alert: Lambda error rate > 1%
Response:
# 1. Check recent errors
aws logs filter-log-events \
--log-group-name /aws/lambda/nb-rag-sys-chat \
--filter-pattern "ERROR" \
--start-time $(date -u -d '15 minutes ago' +%s)000 \
--limit 10
# 2. Identify pattern
# - Same error repeated? → Code bug
# - Different errors? → Dependency issue (Bedrock, Knowledge Base)
# 3. Check dependencies
aws bedrock list-foundation-models --region us-east-1 # Should succeed
aws bedrock-agent get-knowledge-base --knowledge-base-id [kb-id] # Should succeed
# 4. Rollback if recent deployment
cd terraform
git log -1 # Check last commit
git revert HEAD
terraform apply
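# 5. Increase logging temporarily
# --environment replaces the whole variable map, so merge LOG_LEVEL into the existing variables
ENV=$(aws lambda get-function-configuration \
--function-name nb-rag-sys-chat \
--query 'Environment.Variables' --output json \
| jq -c '{Variables: (. + {LOG_LEVEL: "DEBUG"})}')
aws lambda update-function-configuration \
--function-name nb-rag-sys-chat \
--environment "$ENV"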
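To confirm whether the 1% threshold is actually breached, the error rate can be computed from the Lambda `Errors` and `Invocations` metrics. A minimal sketch, assuming GNU `date` as elsewhere in this runbook; the function names are illustrative.

```shell
#!/bin/sh
# Print errors as a percentage of invocations, to two decimal places.
error_rate_pct() {
    awk -v e="$1" -v n="$2" 'BEGIN { if (n > 0) printf "%.2f\n", e * 100 / n; else print "0.00" }'
}

# Sum of a Lambda metric over the last 15 minutes.
# Usage: lambda_metric_sum <function-name> <metric-name>
lambda_metric_sum() {
    aws cloudwatch get-metric-statistics \
        --namespace AWS/Lambda \
        --metric-name "$2" \
        --dimensions Name=FunctionName,Value="$1" \
        --start-time "$(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%S)" \
        --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
        --period 900 \
        --statistics Sum \
        --query 'sum(Datapoints[].Sum)' --output text
}
```

Usage: `error_rate_pct "$(lambda_metric_sum nb-rag-sys-chat Errors)" "$(lambda_metric_sum nb-rag-sys-chat Invocations)"`.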
High Latency
Alert: API Gateway latency > 5 seconds
Response:
# 1. Check Lambda duration
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Duration \
--dimensions Name=FunctionName,Value=nb-rag-sys-chat \
--start-time $(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 900 \
--statistics Average,Maximum
# 2. Check for cold starts
aws logs filter-log-events \
--log-group-name /aws/lambda/nb-rag-sys-chat \
--filter-pattern "INIT_START" \
--start-time $(date -u -d '15 minutes ago' +%s)000
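# 3. Check Bedrock latency (the metric is InvocationLatency, published per ModelId)
aws cloudwatch get-metric-statistics \
--namespace AWS/Bedrock \
--metric-name InvocationLatency \
--dimensions Name=ModelId,Value=us.anthropic.claude-sonnet-4-5-20250929-v1:0 \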
--start-time $(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 900 \
--statistics Average
# 4. Increase Lambda memory (improves CPU)
aws lambda update-function-configuration \
--function-name nb-rag-sys-chat \
--memory-size 1536
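# 5. Enable provisioned concurrency (reduces cold starts)
# Provisioned concurrency cannot target $LATEST; publish a version (or use an alias) first.
# API Gateway must invoke that version or alias for the setting to take effect.
VERSION=$(aws lambda publish-version \
--function-name nb-rag-sys-chat \
--query 'Version' --output text)
aws lambda put-provisioned-concurrency-config \
--function-name nb-rag-sys-chat \
--provisioned-concurrent-executions 2 \
--qualifier "$VERSION"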
Service Outage
Alert: API Gateway 5xx error rate > 10%
Response:
# 1. Check AWS service health (`open` is macOS; use xdg-open on Linux)
open "https://health.aws.amazon.com/health/status"
# 2. Check Lambda function status
aws lambda get-function --function-name nb-rag-sys-chat
# 3. Check API Gateway
aws apigatewayv2 get-api --api-id [api-id]
# 4. Check recent deployments
gh run list --limit 5
# 5. If recent deployment, rollback
gh run view [run-id]
cd terraform
git revert HEAD
terraform apply
# 6. Notify users (if applicable)
# Post to status page or send email
Maintenance Windows
Scheduled Maintenance
Plan maintenance during low-traffic hours (e.g., weekends, nights).
Pre-Maintenance Checklist:
- Announce maintenance window to users
- Backup all data (S3 documents, DynamoDB, Terraform state)
- Test changes in staging environment (if available)
- Prepare rollback plan
- Have team members on standby
During Maintenance:
# 1. Enable maintenance mode (optional - return 503 from Lambda)
# 2. Perform changes
# 3. Run smoke tests
# 4. Monitor for errors
# 5. Disable maintenance mode
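Step 3 (smoke tests) can be as small as one HTTP status check per endpoint. The sketch below is an assumption-heavy example: the function name `smoke_test` is illustrative, and the runbook does not name a health endpoint, so the URL must be supplied by the operator.

```shell
#!/bin/sh
# Succeed if the URL answers with the expected HTTP status.
# Usage: smoke_test <url> [expected-status]
smoke_test() {
    url="$1"
    expected="${2:-200}"
    # -w prints only the status code; the response body is discarded.
    actual=$(curl -s -o /dev/null -w '%{http_code}' --max-time 30 "$url")
    echo "$url -> $actual (expected $expected)"
    [ "$actual" = "$expected" ]
}
```

While maintenance mode is on, the same function can confirm the 503 response by passing `503` as the second argument.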
Post-Maintenance Checklist:
- Verify all services operational
- Check error rates
- Monitor latency
- Send “all clear” notification
Disaster Recovery
Complete System Failure
Scenario: All infrastructure destroyed
Recovery Steps:
# 1. Verify Terraform state backup exists
aws s3 ls s3://nb-rag-sys-terraform-state/terraform.tfstate
# 2. Re-run bootstrap (if needed)
./.github/setup-oidc.sh
# 3. Deploy infrastructure
cd terraform
terraform init
terraform apply
# 4. Restore S3 documents from backup
aws s3 sync s3://nb-rag-sys-backups/documents/ s3://nb-rag-sys-documents/
# 5. Trigger Knowledge Base sync to rebuild vectors
aws bedrock-agent start-ingestion-job \
--knowledge-base-id [kb-id] \
--data-source-id [ds-id]
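# 6. Restore DynamoDB data (if needed)
# A restore always creates a new table: if Terraform recreated nb-rag-sys in step 3,
# restore under a different name and copy the items across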
aws dynamodb restore-table-from-backup \
--target-table-name nb-rag-sys \
--backup-arn [backup-arn]
# 7. Deploy web assets (step 3 left the shell in terraform/)
cd ../web
npm ci
npm run build
aws s3 sync dist/ s3://nb-rag-sys-web/
aws cloudfront create-invalidation --distribution-id [dist-id] --paths "/*"
# 8. Verify system operational
../scripts/morning-health-check.sh
RTO: ~30 minutes
RPO: Near-zero (S3 durability)
Last updated: 2026-01-07