Operations Runbook
Day-to-day operational procedures for the NorthBuilt RAG System.
Daily Operations
Morning Health Check
#!/bin/bash
# Save as: scripts/morning-health-check.sh
echo "=== NorthBuilt RAG System Health Check ==="
echo "Date: $(date)"
echo
# Check API Gateway
echo "1. API Gateway Status:"
API_ID=$(aws apigatewayv2 get-apis --query 'Items[?Name==`nb-rag-sys-api`].ApiId' --output text)
if [ -n "$API_ID" ]; then
echo "[OK] API Gateway: $API_ID"
else
echo "[ERROR] API Gateway: NOT FOUND"
fi
# Check Lambda functions
echo
echo "2. Lambda Functions:"
for func in chat query classify webhook-fathom webhook-helpscout webhook-linear; do
# Test the command directly instead of capturing unused output and checking $?
if aws lambda get-function --function-name "nb-rag-sys-$func" >/dev/null 2>&1; then
echo "[OK] nb-rag-sys-$func"
else
echo "[ERROR] nb-rag-sys-$func: NOT FOUND OR INACCESSIBLE"
fi
done
# Check recent errors (last 24 hours)
echo
echo "3. Recent Errors (last 24 hours):"
ERROR_COUNT=$(aws logs filter-log-events \
--log-group-name /aws/lambda/nb-rag-sys-chat \
--filter-pattern "ERROR" \
--start-time $(date -u -d '24 hours ago' +%s)000 \
--query 'length(events)' --output text)
echo "Chat Lambda errors: $ERROR_COUNT"
# Check Bedrock usage
echo
echo "4. Bedrock Usage (last 24 hours):"
BEDROCK_INVOCATIONS=$(aws cloudwatch get-metric-statistics \
--namespace AWS/Bedrock \
--metric-name Invocations \
--dimensions Name=ModelId,Value=us.anthropic.claude-sonnet-4-5-20250929-v1:0 \
--start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 86400 \
--statistics Sum \
--query 'Datapoints[0].Sum' --output text)
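# With --output text the CLI prints "None" when no datapoint is returned
[ "$BEDROCK_INVOCATIONS" = "None" ] && BEDROCK_INVOCATIONS=0
echo "Model invocations: ${BEDROCK_INVOCATIONS:-0}"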
# Check estimated cost
echo
echo "5. Estimated Monthly Cost:"
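# Billing metrics are published only in us-east-1 (billing alerts must be enabled)
COST=$(aws cloudwatch get-metric-statistics \
--region us-east-1 \
--namespace AWS/Billing \
--metric-name EstimatedCharges \
--dimensions Name=Currency,Value=USD \
--start-time $(date -u -d '6 hours ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 21600 \
--statistics Maximum \
--query 'Datapoints[0].Maximum' --output text)
# With --output text the CLI prints "None" when no datapoint is returned
[ "$COST" = "None" ] && COST=""
echo "Current month: \$${COST:-N/A}"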
echo
echo "=== Health Check Complete ==="
Run daily:
chmod +x scripts/morning-health-check.sh
./scripts/morning-health-check.sh
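One caveat with the error count in step 3: `filter-log-events` is paginated, and with `--output text` the AWS CLI evaluates `--query` once per page, so a busy log group can return several numbers instead of one. A minimal sketch of a helper that adds them up follows; `sum_counts` is a hypothetical name, not part of the original script.

```shell
# Sum every integer found on stdin, whether separated by spaces, tabs, or newlines.
# Non-numeric tokens such as "None" are ignored; empty input yields 0.
sum_counts() {
  awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^[0-9]+$/) total += $i }
       END { print total + 0 }'
}

# Possible use inside the health check (not run here):
#   ERROR_COUNT=$(aws logs filter-log-events \
#     --log-group-name /aws/lambda/nb-rag-sys-chat \
#     --filter-pattern "ERROR" \
#     --start-time $(date -u -d '24 hours ago' +%s)000 \
#     --query 'length(events)' --output text | sum_counts)
```

Piping through the helper leaves the single-page case unchanged and keeps the reported count a single number when the log group is busy.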
Common Tasks
Add New User
Users authenticate via Google OAuth, so no manual user creation is needed. To grant admin access:
# Look up the user's Cognito username by email
aws cognito-idp list-users \
--user-pool-id [user-pool-id] \
--filter 'email = "[email]"' \
--query 'Users[].Username' --output text
# Add to admin group (if implemented)
aws cognito-idp admin-add-user-to-group \
--user-pool-id [user-pool-id] \
--username [username] \
--group-name admins
Rotate API Keys
Webhook API Keys
# 1. Generate new API key in external service
# 2. Update secret
aws secretsmanager update-secret \
--secret-id nb-rag-sys-fathom-api-key \
--secret-string '{"api_key": "NEW_KEY"}'
# 3. Update webhook configuration in external service
# (Fathom, HelpScout, or Linear)
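# 4. Restart webhook Lambda (a configuration change forces new execution environments)
# WARNING: --environment replaces the function's entire variable map.
# List every existing variable alongside FORCE_UPDATE or it will be removed.
aws lambda update-function-configuration \
--function-name nb-rag-sys-webhook-fathom \
--environment Variables="{[existing-vars],FORCE_UPDATE=$(date +%s)}"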
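Step 2 is the same for all three webhook services, and a key containing a quote or backslash would break the hand-written JSON. The sketch below wraps it under two assumptions: that the HelpScout and Linear secrets follow the same `nb-rag-sys-<service>-api-key` naming as the Fathom one (only Fathom is shown above), and that a dry-run switch is wanted. `json_escape`, `update_webhook_secret`, and `AWS_CMD` are hypothetical names.

```shell
# Escape backslashes and double quotes so a key is safe inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Update the API key secret for one webhook service.
# ASSUMPTION: secrets are named nb-rag-sys-<service>-api-key for every service.
# Set AWS_CMD=echo to print the command instead of running it.
update_webhook_secret() {
  local service="$1" new_key="$2"
  case "$service" in
    fathom|helpscout|linear) ;;
    *) echo "unknown service: $service" >&2; return 1 ;;
  esac
  ${AWS_CMD:-aws} secretsmanager update-secret \
    --secret-id "nb-rag-sys-${service}-api-key" \
    --secret-string "{\"api_key\": \"$(json_escape "$new_key")\"}"
}
```

For example, `AWS_CMD=echo; update_webhook_secret fathom "NEW_KEY"` previews the call; unset `AWS_CMD` to run it for real.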
Update Lambda Function Code
# 1. Make code changes
cd lambda/chat
vim handler.py
# 2. Package function
zip -r function.zip . -x "*.pyc" "__pycache__/*" "venv/*" ".venv/*" "tests/*"
# 3. Update function
aws lambda update-function-code \
--function-name nb-rag-sys-chat \
--zip-file fileb://function.zip
# 4. Wait for update to complete
aws lambda wait function-updated \
--function-name nb-rag-sys-chat
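# 5. Test function
# AWS CLI v2 expects base64 payloads by default; this flag allows raw JSON
aws lambda invoke \
--function-name nb-rag-sys-chat \
--cli-binary-format raw-in-base64-out \
--payload '{"body": "{\"query\":\"test\"}"}' \
/tmp/response.json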
cat /tmp/response.json
Recommended: Use Terraform for code updates:
cd terraform
terraform apply -target=module.lambda.aws_lambda_function.chat
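The test invocation in step 5 only prints the response, so a failure is easy to miss. The following is a sketch of a pass/fail check, assuming the chat handler returns an API Gateway proxy-style response with `"statusCode": 200` on success (the source does not show the response shape). `response_ok` is a hypothetical helper.

```shell
# Return 0 only if the invocation raised no error and the payload reports 200.
#   $1: JSON printed by `aws lambda invoke` itself (invocation metadata)
#   $2: path to the response payload file
# ASSUMPTION: a successful payload contains "statusCode": 200.
response_ok() {
  local invoke_meta="$1" response_file="$2"
  # Unhandled handler exceptions appear as FunctionError in the invoke metadata.
  if printf '%s' "$invoke_meta" | grep -q '"FunctionError"'; then
    echo "FAIL: function raised an error" >&2
    return 1
  fi
  grep -Eq '"statusCode"[[:space:]]*:[[:space:]]*200' "$response_file"
}

# Possible use after a deployment (not run here):
#   META=$(aws lambda invoke --function-name nb-rag-sys-chat \
#     --cli-binary-format raw-in-base64-out \
#     --payload '{"body": "{\"query\":\"test\"}"}' /tmp/response.json)
#   response_ok "$META" /tmp/response.json && echo "[OK]" || echo "[ERROR]"
```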
Invalidate CloudFront Cache
After updating web assets:
# Get CloudFront distribution ID
DIST_ID=$(aws cloudfront list-distributions \
--query 'DistributionList.Items[?Comment==`nb-rag-sys-web`].Id' --output text)
# Create invalidation
aws cloudfront create-invalidation \
--distribution-id $DIST_ID \
--paths "/*"
# Check invalidation status
aws cloudfront get-invalidation \
--distribution-id $DIST_ID \
--id [invalidation-id]
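The commands above leave the invalidation ID to be copied by hand, and the `list-distributions` query can silently match zero or several distributions. A sketch that captures the ID and blocks until the invalidation finishes, using the CLI's `cloudfront wait invalidation-completed` waiter; `invalidate_and_wait`, `count_words`, and the `AWS_CMD` override are hypothetical names.

```shell
# Count whitespace-separated words in a string.
count_words() { set -- $1; echo $#; }

# Create an invalidation for all paths and wait for it to complete.
# Prints the invalidation ID on success. Set AWS_CMD to substitute the CLI.
invalidate_and_wait() {
  local dist_id="$1" inv_id
  # Refuse an empty result, "None", or several tab-separated IDs.
  if [ "$dist_id" = "None" ] || [ "$(count_words "$dist_id")" -ne 1 ]; then
    echo "expected exactly one distribution ID, got: '$dist_id'" >&2
    return 1
  fi
  inv_id=$(${AWS_CMD:-aws} cloudfront create-invalidation \
    --distribution-id "$dist_id" --paths "/*" \
    --query 'Invalidation.Id' --output text) || return 1
  ${AWS_CMD:-aws} cloudfront wait invalidation-completed \
    --distribution-id "$dist_id" --id "$inv_id" || return 1
  echo "$inv_id"
}
```

Called as `invalidate_and_wait "$DIST_ID"`, it returns once CloudFront reports the invalidation as completed.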
Scale Lambda Concurrency
Increase reserved concurrency during high traffic:
# Increase reserved concurrency
aws lambda put-function-concurrency \
--function-name nb-rag-sys-chat \
--reserved-concurrent-executions 20
# Or remove limit (unreserved)
aws lambda delete-function-concurrency \
--function-name nb-rag-sys-chat
Enable/Disable Webhooks
Temporarily disable webhook processing:
# Disable by deleting the API Gateway integration
# (destructive: the integration must be recreated to re-enable)
aws apigatewayv2 delete-integration \
--api-id [api-id] \
--integration-id [integration-id]
# Or set reserved concurrency to 0 (every invocation is throttled; easily reversed)
aws lambda put-function-concurrency \
--function-name nb-rag-sys-webhook-fathom \
--reserved-concurrent-executions 0
# Re-enable
aws lambda delete-function-concurrency \
--function-name nb-rag-sys-webhook-fathom
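The reserved-concurrency approach extends naturally to all three webhook functions named in the morning health check. A minimal sketch, with `pause_webhooks`, `resume_webhooks`, and the `AWS_CMD` dry-run override as hypothetical names:

```shell
# Webhook function suffixes, as listed in the morning health check.
WEBHOOK_FUNCTIONS="webhook-fathom webhook-helpscout webhook-linear"

# Throttle every webhook Lambda by reserving zero concurrency.
pause_webhooks() {
  local func
  for func in $WEBHOOK_FUNCTIONS; do
    ${AWS_CMD:-aws} lambda put-function-concurrency \
      --function-name "nb-rag-sys-$func" \
      --reserved-concurrent-executions 0 || return 1
  done
}

# Remove the reservation again so the functions use unreserved concurrency.
resume_webhooks() {
  local func
  for func in $WEBHOOK_FUNCTIONS; do
    ${AWS_CMD:-aws} lambda delete-function-concurrency \
      --function-name "nb-rag-sys-$func" || return 1
  done
}
```

`resume_webhooks` removes the reservation entirely, so a function that had its own reserved value before the pause needs that value set again afterwards.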
Backup & Recovery
Backup S3 Documents
# Backup documents to a separate bucket
aws s3 sync s3://nb-rag-sys-documents/ s3://nb-rag-sys-backups/documents/
# Or download locally
aws s3 sync s3://nb-rag-sys-documents/ ./backups/documents/
# List current documents
aws s3 ls s3://nb-rag-sys-documents/ --recursive --summarize
Run weekly:
aws s3 sync s3://nb-rag-sys-documents/ s3://nb-rag-sys-backups/documents-$(date +%Y%m%d)/
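A weekly backup is only useful if it is complete. One rough check, sketched below, compares the `Total Objects` figure that `aws s3 ls --recursive --summarize` prints for the source and for the dated backup prefix. `total_objects`, `verify_backup`, and `AWS_CMD` are hypothetical names; matching counts are a sanity check, not proof that contents are identical.

```shell
# Extract the object count from `aws s3 ls --recursive --summarize` output on stdin.
total_objects() {
  awk '/Total Objects:/ { print $3; found = 1 } END { if (!found) print 0 }'
}

# Compare object counts between a source and a backup location.
verify_backup() {
  local src dst
  src=$(${AWS_CMD:-aws} s3 ls "$1" --recursive --summarize | total_objects)
  dst=$(${AWS_CMD:-aws} s3 ls "$2" --recursive --summarize | total_objects)
  if [ "$src" -eq "$dst" ]; then
    echo "[OK] $dst objects in backup"
  else
    echo "[ERROR] source has $src objects, backup has $dst" >&2
    return 1
  fi
}

# Possible use after the weekly sync (not run here):
#   verify_backup s3://nb-rag-sys-documents/ \
#     "s3://nb-rag-sys-backups/documents-$(date +%Y%m%d)/"
```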
Restore Documents and Rebuild Vectors
# Restore documents from backup
aws s3 sync s3://nb-rag-sys-backups/documents-YYYYMMDD/ s3://nb-rag-sys-documents/
# Trigger Knowledge Base sync to rebuild vectors
aws bedrock-agent start-ingestion-job \
--knowledge-base-id [kb-id] \
--data-source-id [ds-id]
# Monitor sync progress
aws bedrock-agent get-ingestion-job \
--knowledge-base-id [kb-id] \
--data-source-id [ds-id] \
--ingestion-job-id [job-id]
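Monitoring the sync by hand means re-running `get-ingestion-job` until the status changes. A sketch of a polling loop follows; `wait_for_ingestion`, `POLL_SECONDS`, and `AWS_CMD` are hypothetical names, and the terminal status values (`COMPLETE`, `FAILED`, `STOPPED`) are taken from the Bedrock ingestion job API as I understand it, so verify them against your CLI version.

```shell
# Poll an ingestion job until it reaches a terminal state.
#   $1 knowledge base ID, $2 data source ID, $3 ingestion job ID
# POLL_SECONDS sets the delay between checks (default 30).
wait_for_ingestion() {
  local kb_id="$1" ds_id="$2" job_id="$3" status
  while :; do
    status=$(${AWS_CMD:-aws} bedrock-agent get-ingestion-job \
      --knowledge-base-id "$kb_id" --data-source-id "$ds_id" \
      --ingestion-job-id "$job_id" \
      --query 'ingestionJob.status' --output text) || return 1
    case "$status" in
      COMPLETE) echo "ingestion $job_id complete"; return 0 ;;
      FAILED|STOPPED) echo "ingestion $job_id ended with status $status" >&2; return 1 ;;
    esac
    # Any other status (for example STARTING or IN_PROGRESS): keep waiting.
    sleep "${POLL_SECONDS:-30}"
  done
}
```

The job ID can be captured when the job is started by adding `--query 'ingestionJob.ingestionJobId' --output text` to `start-ingestion-job`.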
Backup DynamoDB Table
# Enable Point-in-Time Recovery (already enabled; re-running is harmless)
aws dynamodb update-continuous-backups \
--table-name nb-rag-sys-classify \
--point-in-time-recovery-specification PointInTimeRecoveryEnabled=true
# Create on-demand backup
aws dynamodb create-backup \
--table-name nb-rag-sys-classify \
--backup-name "classify-$(date +%Y%m%d)"
# List backups
aws dynamodb list-backups --table-name nb-rag-sys-classify
Restore DynamoDB Table
# Restore from backup
aws dynamodb restore-table-from-backup \
--target-table-name nb-rag-sys-classify-restored \
--backup-arn arn:aws:dynamodb:us-east-1:ACCOUNT:table/nb-rag-sys-classify/backup/BACKUP_ID
# Or restore from point-in-time
# (the timestamp must fall inside the PITR window, at most 35 days back)
aws dynamodb restore-table-to-point-in-time \
--source-table-name nb-rag-sys-classify \
--target-table-name nb-rag-sys-classify-restored \
--restore-date-time "2025-11-08T12:00:00Z"
Backup Terraform State
# Download current state
aws s3 cp s3://nb-rag-sys-terraform-state/terraform.tfstate ./terraform-state-backup-$(date +%Y%m%d).tfstate
# S3 versioning is already enabled, so previous versions can be listed:
aws s3api list-object-versions \
--bucket nb-rag-sys-terraform-state \
--prefix terraform.tfstate
# Download a specific version (written to ./terraform.tfstate)
aws s3api get-object \
--bucket nb-rag-sys-terraform-state \
--key terraform.tfstate \
--version-id [version-id] \
terraform.tfstate
Monitoring
View Real-Time Logs
# Tail Lambda logs
aws logs tail /aws/lambda/nb-rag-sys-chat --follow
# Tail API Gateway logs
aws logs tail /aws/apigateway/nb-rag-sys --follow
# Filter for errors only
aws logs tail /aws/lambda/nb-rag-sys-chat --follow --filter-pattern "ERROR"
Check System Metrics
# Lambda invocations (last hour)
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Invocations \
--dimensions Name=FunctionName,Value=nb-rag-sys-chat \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600 \
--statistics Sum
# API Gateway requests (last hour)
# HTTP APIs (apigatewayv2) publish metrics under the ApiId dimension, not ApiName
aws cloudwatch get-metric-statistics \
--namespace AWS/ApiGateway \
--metric-name Count \
--dimensions Name=ApiId,Value=[api-id] \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600 \
--statistics Sum
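The `date -u -d '... ago'` form used for `--start-time` throughout this runbook is GNU-specific and fails with the BSD `date` shipped on macOS. A minimal sketch of a helper that works on both; `iso_minutes_ago` is a hypothetical name.

```shell
# Print a UTC timestamp (YYYY-MM-DDTHH:MM:SS) for N minutes in the past.
# Tries the GNU date syntax first, then falls back to the BSD/macOS syntax.
iso_minutes_ago() {
  date -u -d "$1 minutes ago" +%Y-%m-%dT%H:%M:%S 2>/dev/null ||
    date -u -v-"$1"M +%Y-%m-%dT%H:%M:%S
}

# Possible use (not run here):
#   aws cloudwatch get-metric-statistics \
#     --namespace AWS/Lambda --metric-name Invocations \
#     --dimensions Name=FunctionName,Value=nb-rag-sys-chat \
#     --start-time "$(iso_minutes_ago 60)" --end-time "$(iso_minutes_ago 0)" \
#     --period 3600 --statistics Sum
```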
View CloudWatch Dashboard
# Open dashboard in browser (`open` is macOS; use `xdg-open` on Linux)
open "https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=nb-rag-sys-main"
Incident Response
High Error Rate
Alert: Lambda error rate > 1%
Response:
# 1. Check recent errors
aws logs filter-log-events \
--log-group-name /aws/lambda/nb-rag-sys-chat \
--filter-pattern "ERROR" \
--start-time $(date -u -d '15 minutes ago' +%s)000 \
--limit 10
# 2. Identify pattern
# - Same error repeated? → Code bug
# - Different errors? → Dependency issue (Bedrock, Knowledge Base)
# 3. Check dependencies
aws bedrock list-foundation-models --region us-east-1 # Should succeed
aws bedrock-agent get-knowledge-base --knowledge-base-id [kb-id] # Should succeed
# 4. Rollback if recent deployment
cd terraform
git log -1 # Check last commit
git revert HEAD
terraform apply
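# 5. Increase logging temporarily
# WARNING: --environment replaces the function's entire variable map.
# List every existing variable alongside LOG_LEVEL or it will be removed.
aws lambda update-function-configuration \
--function-name nb-rag-sys-chat \
--environment Variables="{[existing-vars],LOG_LEVEL=DEBUG}"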
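Step 2 asks whether the same error repeats or many different ones appear. Grouping log lines after masking timestamps and request IDs makes that visible at a glance. A sketch, assuming log lines carry ISO-8601 timestamps and UUID-style request IDs; `error_histogram` is a hypothetical helper.

```shell
# Group log lines by message after masking the parts that differ per request.
# Reads lines on stdin; prints "count<TAB>message", most frequent first.
error_histogram() {
  sed -E \
    -e 's/[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:.]+(Z|[+-][0-9:]+)?/<time>/g' \
    -e 's/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/<id>/g' |
    sort | uniq -c | sort -k1,1nr -k2 |
    awk '{ n = $1; sub(/^ *[0-9]+ /, ""); printf "%d\t%s\n", n, $0 }'
}

# Possible use (not run here):
#   aws logs tail /aws/lambda/nb-rag-sys-chat --since 15m \
#     --filter-pattern "ERROR" --format short | error_histogram
```

One dominant line points toward a code bug; a long tail of distinct messages points toward a dependency problem, as described in step 2.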
High Latency
Alert: API Gateway latency > 5 seconds
Response:
# 1. Check Lambda duration
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Duration \
--dimensions Name=FunctionName,Value=nb-rag-sys-chat \
--start-time $(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 900 \
--statistics Average,Maximum
# 2. Check for cold starts
aws logs filter-log-events \
--log-group-name /aws/lambda/nb-rag-sys-chat \
--filter-pattern "INIT_START" \
--start-time $(date -u -d '15 minutes ago' +%s)000
# 3. Check Bedrock latency (metric is InvocationLatency, reported per ModelId)
aws cloudwatch get-metric-statistics \
--namespace AWS/Bedrock \
--metric-name InvocationLatency \
--dimensions Name=ModelId,Value=us.anthropic.claude-sonnet-4-5-20250929-v1:0 \
--start-time $(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 900 \
--statistics Average
# 4. Increase Lambda memory (improves CPU)
aws lambda update-function-configuration \
--function-name nb-rag-sys-chat \
--memory-size 1536
# 5. Enable provisioned concurrency (eliminates cold starts)
# Requires a published version or alias; $LATEST is not supported
aws lambda put-provisioned-concurrency-config \
--function-name nb-rag-sys-chat \
--provisioned-concurrent-executions 2 \
--qualifier [alias-or-version]
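Provisioned concurrency attaches to a published version or an alias rather than to `$LATEST`, so step 5 needs one to exist first. A sketch of the full sequence: publish a version, point an alias at it, then configure provisioned concurrency on the alias. The function name `enable_provisioned_concurrency`, the alias name in the example, and `AWS_CMD` are hypothetical.

```shell
# Publish a version, point an alias at it, and provision concurrency on the alias.
#   $1 function name, $2 alias name, $3 number of provisioned environments
enable_provisioned_concurrency() {
  local func="$1" alias_name="$2" count="$3" version
  version=$(${AWS_CMD:-aws} lambda publish-version \
    --function-name "$func" --query 'Version' --output text) || return 1
  # Move the alias to the new version, creating it on first use.
  ${AWS_CMD:-aws} lambda update-alias --function-name "$func" \
    --name "$alias_name" --function-version "$version" >/dev/null 2>&1 ||
  ${AWS_CMD:-aws} lambda create-alias --function-name "$func" \
    --name "$alias_name" --function-version "$version" >/dev/null || return 1
  ${AWS_CMD:-aws} lambda put-provisioned-concurrency-config \
    --function-name "$func" --qualifier "$alias_name" \
    --provisioned-concurrent-executions "$count" >/dev/null || return 1
  echo "$func:$alias_name -> version $version ($count provisioned)"
}

# Possible use (not run here; "live" is an example alias name):
#   enable_provisioned_concurrency nb-rag-sys-chat live 2
```

Only requests routed to that alias use the warm environments, so the API Gateway integration has to invoke the alias ARN for this to reduce cold starts.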
Service Outage
Alert: API Gateway 5xx error rate > 10%
Response:
# 1. Check AWS service health (`open` is macOS; use `xdg-open` on Linux)
open "https://health.aws.amazon.com/health/status"
# 2. Check Lambda function status
aws lambda get-function --function-name nb-rag-sys-chat
# 3. Check API Gateway
aws apigatewayv2 get-api --api-id [api-id]
# 4. Check recent deployments
gh run list --limit 5
# 5. If recent deployment, rollback
gh run view [run-id]
cd terraform
git revert HEAD
terraform apply
# 6. Notify users (if applicable)
# Post to status page or send email
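The alert thresholds in this section (Lambda errors above 1%, API Gateway 5xx above 10%) are ratios, while CloudWatch returns raw sums. A small sketch for turning two sums into a percentage and comparing it with a threshold; `error_rate_pct` and `rate_exceeds` are hypothetical helpers.

```shell
# Percentage of failed requests, to one decimal place.
# Prints 0.0 when there was no traffic, avoiding a division by zero.
error_rate_pct() {
  awk -v e="$1" -v t="$2" \
    'BEGIN { if (t + 0 == 0) print "0.0"; else printf "%.1f\n", 100 * e / t }'
}

# Exit status 0 when the rate is strictly above the threshold (both in percent).
rate_exceeds() {
  awk -v r="$1" -v max="$2" 'BEGIN { exit !(r + 0 > max + 0) }'
}

# Possible use, given ERRORS and TOTAL from get-metric-statistics sums:
#   RATE=$(error_rate_pct "$ERRORS" "$TOTAL")
#   rate_exceeds "$RATE" 10 && echo "[ERROR] 5xx rate ${RATE}%"
```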
Maintenance Windows
Scheduled Maintenance
Plan maintenance during low-traffic hours (e.g., weekends, nights).
Pre-Maintenance Checklist:
- Announce maintenance window to users
- Backup all data (S3 documents, DynamoDB, Terraform state)
- Test changes in staging environment (if available)
- Prepare rollback plan
- Have team members on standby
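The "backup all data" item in the checklist above can be run as one step by chaining the commands from Backup & Recovery. A sketch under that assumption; `pre_maintenance_backup` and the `AWS_CMD` dry-run override are hypothetical names.

```shell
# Back up S3 documents, the DynamoDB table, and the Terraform state in one pass.
# Set AWS_CMD=echo to preview the three commands without running them.
pre_maintenance_backup() {
  local stamp
  stamp=$(date +%Y%m%d)
  ${AWS_CMD:-aws} s3 sync s3://nb-rag-sys-documents/ \
    "s3://nb-rag-sys-backups/documents-$stamp/" || return 1
  ${AWS_CMD:-aws} dynamodb create-backup \
    --table-name nb-rag-sys-classify \
    --backup-name "classify-$stamp" || return 1
  ${AWS_CMD:-aws} s3 cp s3://nb-rag-sys-terraform-state/terraform.tfstate \
    "./terraform-state-backup-$stamp.tfstate" || return 1
}
```

The function stops at the first failing command, so a non-zero exit status means the backup set is incomplete and maintenance should not start.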
During Maintenance:
# 1. Enable maintenance mode (optional - return 503 from Lambda)
# 2. Perform changes
# 3. Run smoke tests
# 4. Monitor for errors
# 5. Disable maintenance mode
Post-Maintenance Checklist:
- Verify all services operational
- Check error rates
- Monitor latency
- Send “all clear” notification
Disaster Recovery
Complete System Failure
Scenario: All infrastructure destroyed
Recovery Steps:
# 1. Verify Terraform state backup exists
aws s3 ls s3://nb-rag-sys-terraform-state/terraform.tfstate
# 2. Re-run bootstrap (if needed)
./.github/setup-oidc.sh
# 3. Deploy infrastructure (subshell keeps the working directory at the repo root)
(cd terraform && terraform init && terraform apply)
# 4. Restore S3 documents from backup
aws s3 sync s3://nb-rag-sys-backups/documents/ s3://nb-rag-sys-documents/
# 5. Trigger Knowledge Base sync to rebuild vectors
aws bedrock-agent start-ingestion-job \
--knowledge-base-id [kb-id] \
--data-source-id [ds-id]
# 6. Restore DynamoDB data (if needed)
# The restore fails if the target table already exists, so delete the empty
# table created in step 3 first or restore under a different name.
aws dynamodb restore-table-from-backup \
--target-table-name nb-rag-sys-classify \
--backup-arn [backup-arn]
# 7. Deploy web assets (subshell again, so step 8 runs from the repo root)
(cd web && npm ci && npm run build && aws s3 sync dist/ s3://nb-rag-sys-web/)
aws cloudfront create-invalidation --distribution-id [dist-id] --paths "/*"
# 8. Verify system operational
./scripts/morning-health-check.sh
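RTO: ~30 minutes
RPO: Near-zero for data that survives in S3 (S3 durability); up to one week for documents that must be restored from the weekly backup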
Last updated: 2026-01-01