Operations Runbook

Day-to-day operational procedures for the NorthBuilt RAG System.

Daily Operations

Morning Health Check

#!/bin/bash
# Save as: scripts/morning-health-check.sh

echo "=== NorthBuilt RAG System Health Check ==="
echo "Date: $(date)"
echo

# Check API Gateway
echo "1. API Gateway Status:"
API_ID=$(aws apigatewayv2 get-apis --query 'Items[?Name==`nb-rag-sys-api`].ApiId' --output text)
if [ -n "$API_ID" ]; then
  echo "[OK] API Gateway: $API_ID"
else
  echo "[ERROR] API Gateway: NOT FOUND"
fi

# Check Lambda functions
echo
echo "2. Lambda Functions:"
for func in chat query classify webhook-fathom webhook-helpscout webhook-linear; do
  # Test the command's exit status directly; the output itself is not needed
  if aws lambda get-function --function-name "nb-rag-sys-$func" >/dev/null 2>&1; then
    echo "[OK] nb-rag-sys-$func"
  else
    echo "[ERROR] nb-rag-sys-$func: NOT FOUND or not accessible"
  fi
done

# Check recent errors (last 24 hours)
echo
echo "3. Recent Errors (last 24 hours):"
ERROR_COUNT=$(aws logs filter-log-events \
  --log-group-name /aws/lambda/nb-rag-sys-chat \
  --filter-pattern "ERROR" \
  --start-time $(date -u -d '24 hours ago' +%s)000 \
  --query 'length(events)' --output text)
echo "Chat Lambda errors: $ERROR_COUNT"

# Check Bedrock usage
echo
echo "4. Bedrock Usage (last 24 hours):"
BEDROCK_INVOCATIONS=$(aws cloudwatch get-metric-statistics \
  --namespace AWS/Bedrock \
  --metric-name Invocations \
  --dimensions Name=ModelId,Value=us.anthropic.claude-sonnet-4-5-20250929-v1:0 \
  --start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 86400 \
  --statistics Sum \
  --query 'Datapoints[0].Sum' --output text)
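# --output text prints "None" when no datapoints exist
if [ "$BEDROCK_INVOCATIONS" = "None" ]; then BEDROCK_INVOCATIONS=0; fi
echo "Model invocations: ${BEDROCK_INVOCATIONS:-0}"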

# Check estimated cost
echo
echo "5. Estimated Monthly Cost:"
# Billing metrics are published only in us-east-1
COST=$(aws cloudwatch get-metric-statistics \
  --region us-east-1 \
  --namespace AWS/Billing \
  --metric-name EstimatedCharges \
  --dimensions Name=Currency,Value=USD \
  --start-time $(date -u -d '6 hours ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 21600 \
  --statistics Maximum \
  --query 'Datapoints[0].Maximum' --output text)
# --output text prints "None" when no datapoint falls inside the window
if [ "$COST" = "None" ]; then COST=""; fi
echo "Current month: \$${COST:-N/A}"

echo
echo "=== Health Check Complete ==="

Run daily:

chmod +x scripts/morning-health-check.sh
./scripts/morning-health-check.sh
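
The health check relies on GNU date (the -d flag) and on --output text, which prints "None" for a missing datapoint. The helpers below are a minimal sketch of how both could be handled portably. The function names are hypothetical and are not part of the original script.

```shell
#!/bin/bash
# Hypothetical helpers; names are illustrative and not part of the original script.

# Print the epoch time in milliseconds for N seconds ago
# (the format expected by `aws logs filter-log-events --start-time`).
epoch_ms_ago() {
  local seconds_ago=$1
  echo $(( ($(date +%s) - seconds_ago) * 1000 ))
}

# Print a UTC timestamp for N seconds ago
# (the format used with `aws cloudwatch get-metric-statistics --start-time`).
iso_utc_ago() {
  local target=$(( $(date +%s) - $1 ))
  # GNU date accepts -d @EPOCH; BSD/macOS date accepts -r EPOCH.
  date -u -d "@$target" +%Y-%m-%dT%H:%M:%S 2>/dev/null \
    || date -u -r "$target" +%Y-%m-%dT%H:%M:%S
}

# Substitute a default when the CLI returned nothing or the text "None".
value_or_default() {
  case "$1" in
    ""|None|null) echo "$2" ;;
    *) echo "$1" ;;
  esac
}
```

With these sourced, a call such as --start-time "$(epoch_ms_ago 86400)" would replace the date -d expression for the 24-hour log window, and value_or_default "$COST" "N/A" would replace the ${COST:-N/A} expansion.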

Common Tasks

Add New User

Users authenticate through Google OAuth, so no manual user creation is needed. To grant admin access, however:

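# Look up the user's Cognito username by email
# (for Google-federated users this is usually a provider-prefixed ID, not the email itself)
aws cognito-idp list-users \
  --user-pool-id [user-pool-id] \
  --filter 'email = "[email]"' \
  --query 'Users[0].Username' --output text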

# Add to admin group (if implemented)
aws cognito-idp admin-add-user-to-group \
  --user-pool-id [user-pool-id] \
  --username [user-sub] \
  --group-name admins

Rotate API Keys

Webhook API Keys

# 1. Generate new API key in external service
# 2. Update secret
aws secretsmanager update-secret \
  --secret-id nb-rag-sys-fathom-api-key \
  --secret-string '{"api_key": "NEW_KEY"}'

# 3. Update webhook configuration in external service
# (Fathom, HelpScout, or Linear)

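# 4. Restart webhook Lambda (a configuration change forces new execution environments)
# WARNING: --environment replaces the function's entire variable set. List every
# existing variable alongside FORCE_UPDATE, or the others will be removed.
aws lambda update-function-configuration \
  --function-name nb-rag-sys-webhook-fathom \
  --environment Variables="{FORCE_UPDATE=$(date +%s)}"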

Update Lambda Function Code

# 1. Make code changes
cd lambda/chat
vim handler.py

# 2. Package function
zip -r function.zip . -x "*.pyc" "__pycache__/*" "venv/*" ".venv/*" "tests/*"

# 3. Update function
aws lambda update-function-code \
  --function-name nb-rag-sys-chat \
  --zip-file fileb://function.zip

# 4. Wait for update to complete
aws lambda wait function-updated \
  --function-name nb-rag-sys-chat

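# 5. Test function
# AWS CLI v2 treats --payload as base64 by default; this flag allows raw JSON
aws lambda invoke \
  --function-name nb-rag-sys-chat \
  --cli-binary-format raw-in-base64-out \
  --payload '{"body": "{\"query\":\"test\"}"}' \
  /tmp/response.json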

cat /tmp/response.json
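
A successful aws lambda invoke call does not mean the handler succeeded: when the function raises an error, the CLI still exits 0 and reports it in the FunctionError field of its output. The sketch below wraps the test step so that a handler error fails the check. The function name smoke_test_lambda is hypothetical.

```shell
#!/bin/bash
# Hypothetical wrapper; the name smoke_test_lambda is illustrative.
# Usage: smoke_test_lambda FUNCTION_NAME JSON_PAYLOAD [OUTPUT_FILE]
smoke_test_lambda() {
  local func=$1 payload=$2 out="${3:-/tmp/response.json}" fn_error
  # --query extracts FunctionError from the CLI's own JSON output;
  # with --output text it prints "None" when the field is absent.
  fn_error=$(aws lambda invoke \
    --function-name "$func" \
    --cli-binary-format raw-in-base64-out \
    --payload "$payload" \
    --query 'FunctionError' --output text \
    "$out") || return 1
  if [ -n "$fn_error" ] && [ "$fn_error" != "None" ]; then
    echo "Invocation failed ($fn_error); see $out" >&2
    return 1
  fi
  echo "Invocation succeeded; response written to $out" >&2
}
```

For example, smoke_test_lambda nb-rag-sys-chat '{"body": "{\"query\":\"test\"}"}' returns non-zero if either the CLI call or the handler fails, which makes it usable as a gate in a deploy script.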

Recommended: Use Terraform for code updates:

cd terraform
terraform apply -target=module.lambda.aws_lambda_function.chat

Invalidate CloudFront Cache

After updating web assets:

# Get CloudFront distribution ID
DIST_ID=$(aws cloudfront list-distributions \
  --query 'DistributionList.Items[?Comment==`nb-rag-sys-web`].Id' --output text)

# Create invalidation
aws cloudfront create-invalidation \
  --distribution-id $DIST_ID \
  --paths "/*"

# Check invalidation status
aws cloudfront get-invalidation \
  --distribution-id $DIST_ID \
  --id [invalidation-id]
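
The status check above needs the invalidation ID that create-invalidation returns. As a sketch, the ID can be captured with --query and passed to the CLI's invalidation-completed waiter, so no manual copy step is needed. The function name invalidate_and_wait is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper; the name invalidate_and_wait is illustrative.
# Usage: invalidate_and_wait DISTRIBUTION_ID [PATHS]
invalidate_and_wait() {
  local dist_id=$1 paths="${2:-/*}" inv_id
  # Capture the new invalidation's ID from the create call.
  inv_id=$(aws cloudfront create-invalidation \
    --distribution-id "$dist_id" \
    --paths "$paths" \
    --query 'Invalidation.Id' --output text) || return 1
  echo "Created invalidation $inv_id; waiting for completion" >&2
  # Blocks until the invalidation reports Completed or the waiter times out.
  aws cloudfront wait invalidation-completed \
    --distribution-id "$dist_id" \
    --id "$inv_id" || return 1
  echo "$inv_id"
}
```

Calling invalidate_and_wait "$DIST_ID" prints the invalidation ID on stdout once the cache has been cleared, and progress messages on stderr.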

Scale Lambda Concurrency

Increase reserved concurrency during high traffic:

# Increase reserved concurrency
aws lambda put-function-concurrency \
  --function-name nb-rag-sys-chat \
  --reserved-concurrent-executions 20

# Or remove limit (unreserved)
aws lambda delete-function-concurrency \
  --function-name nb-rag-sys-chat

Enable/Disable Webhooks

Temporarily disable webhook processing:

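# Option 1: remove the API Gateway integration
# Not easily reversible: the integration must be recreated to re-enable
# (for example with terraform apply), and API Gateway may reject the
# delete while a route still references the integration.
aws apigatewayv2 delete-integration \
  --api-id [api-id] \
  --integration-id [integration-id]

# Option 2 (easier to undo): set reserved concurrency to 0 so every invocation is throttled
aws lambda put-function-concurrency \
  --function-name nb-rag-sys-webhook-fathom \
  --reserved-concurrent-executions 0

# Re-enable after Option 2
aws lambda delete-function-concurrency \
  --function-name nb-rag-sys-webhook-fathom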

Backup & Recovery

Backup S3 Documents

# Backup documents to a separate bucket
aws s3 sync s3://nb-rag-sys-documents/ s3://nb-rag-sys-backups/documents/

# Or download locally
aws s3 sync s3://nb-rag-sys-documents/ ./backups/documents/

# List current documents
aws s3 ls s3://nb-rag-sys-documents/ --recursive --summarize

Run weekly:

aws s3 sync s3://nb-rag-sys-documents/ s3://nb-rag-sys-backups/documents-$(date +%Y%m%d)/
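
One way to run the weekly sync unattended is to wrap it in a small script and schedule it. The sketch below assumes cron is the scheduler; the function names, the script path, and the schedule are all hypothetical.

```shell
#!/bin/bash
# Hypothetical wrapper around the weekly sync; function names are illustrative.

# Print the date-stamped destination prefix used by the weekly backup.
backup_destination() {
  echo "s3://nb-rag-sys-backups/documents-$(date +%Y%m%d)/"
}

# Sync the documents bucket into today's backup prefix.
weekly_backup() {
  local dest
  dest=$(backup_destination)
  aws s3 sync s3://nb-rag-sys-documents/ "$dest" || return 1
  echo "Backup written to $dest"
}

# Example crontab entry (Sundays at 03:00); the script path is an assumption.
# A literal % must be escaped as \% inside a crontab line, so keeping the
# date call inside a script avoids that pitfall.
# 0 3 * * 0 /opt/nb-rag-sys/scripts/weekly-backup.sh
```

The date-stamped prefix matches the documents-YYYYMMDD naming that the restore procedure expects.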

Restore Documents and Rebuild Vectors

# Restore documents from backup
aws s3 sync s3://nb-rag-sys-backups/documents-YYYYMMDD/ s3://nb-rag-sys-documents/

# Trigger Knowledge Base sync to rebuild vectors
aws bedrock-agent start-ingestion-job \
  --knowledge-base-id [kb-id] \
  --data-source-id [ds-id]

# Monitor sync progress
aws bedrock-agent get-ingestion-job \
  --knowledge-base-id [kb-id] \
  --data-source-id [ds-id] \
  --ingestion-job-id [job-id]
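
Monitoring the sync usually means repeating get-ingestion-job until the job reaches a terminal state. The loop below is a minimal sketch of that polling. The function name wait_for_ingestion and the POLL_SECONDS variable are hypothetical, and the 15-second default interval is an arbitrary choice.

```shell
#!/bin/bash
# Hypothetical helper; wait_for_ingestion and POLL_SECONDS are illustrative names.
# Usage: wait_for_ingestion KB_ID DATA_SOURCE_ID JOB_ID
# Returns 0 on COMPLETE, 1 on FAILED or STOPPED, 2 if the CLI call itself fails.
wait_for_ingestion() {
  local kb_id=$1 ds_id=$2 job_id=$3
  local interval="${POLL_SECONDS:-15}" status
  while true; do
    status=$(aws bedrock-agent get-ingestion-job \
      --knowledge-base-id "$kb_id" \
      --data-source-id "$ds_id" \
      --ingestion-job-id "$job_id" \
      --query 'ingestionJob.status' --output text) || return 2
    echo "Ingestion job $job_id: $status" >&2
    case "$status" in
      COMPLETE) return 0 ;;
      FAILED|STOPPED) return 1 ;;
    esac
    # Any other status (for example STARTING or IN_PROGRESS) means keep waiting.
    sleep "$interval"
  done
}
```

The job ID can be captured from the start call by adding --query 'ingestionJob.ingestionJobId' --output text to start-ingestion-job.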

Backup DynamoDB Table

# Enable Point-in-Time Recovery (already enabled)
aws dynamodb update-continuous-backups \
  --table-name nb-rag-sys-classify \
  --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true

# Create on-demand backup
aws dynamodb create-backup \
  --table-name nb-rag-sys-classify \
  --backup-name "classify-$(date +%Y%m%d)"

# List backups
aws dynamodb list-backups --table-name nb-rag-sys-classify

Restore DynamoDB Table

# Restore from backup
aws dynamodb restore-table-from-backup \
  --target-table-name nb-rag-sys-classify-restored \
  --backup-arn arn:aws:dynamodb:us-east-1:ACCOUNT:table/nb-rag-sys-classify/backup/BACKUP_ID

# Or restore from point-in-time
aws dynamodb restore-table-to-point-in-time \
  --source-table-name nb-rag-sys-classify \
  --target-table-name nb-rag-sys-classify-restored \
  --restore-date-time "2025-11-08T12:00:00Z"
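
A restore returns immediately while the new table is still being created, so it helps to wait until the table is usable before pointing anything at it. The sketch below chains the point-in-time restore with the CLI's table-exists waiter. The function name restore_table_pitr is hypothetical, and for a large table the waiter may time out before the restore finishes.

```shell
#!/bin/bash
# Hypothetical helper; the name restore_table_pitr is illustrative.
# Usage: restore_table_pitr SOURCE_TABLE TARGET_TABLE RESTORE_TIME
restore_table_pitr() {
  local source_table=$1 target_table=$2 restore_time=$3
  # Start the restore; the target table must not already exist.
  aws dynamodb restore-table-to-point-in-time \
    --source-table-name "$source_table" \
    --target-table-name "$target_table" \
    --restore-date-time "$restore_time" >/dev/null || return 1
  # Poll until the restored table is available.
  aws dynamodb wait table-exists --table-name "$target_table" || return 1
  # Print the final status for confirmation.
  aws dynamodb describe-table \
    --table-name "$target_table" \
    --query 'Table.TableStatus' --output text
}
```

For example, restore_table_pitr nb-rag-sys-classify nb-rag-sys-classify-restored "2025-11-08T12:00:00Z" prints the table status once the waiter returns.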

Backup Terraform State

# Download current state
aws s3 cp s3://nb-rag-sys-terraform-state/terraform.tfstate ./terraform-state-backup-$(date +%Y%m%d).tfstate

# S3 versioning is already enabled; list versions to find one to restore:
aws s3api list-object-versions \
  --bucket nb-rag-sys-terraform-state \
  --prefix terraform.tfstate

# Restore specific version
aws s3api get-object \
  --bucket nb-rag-sys-terraform-state \
  --key terraform.tfstate \
  --version-id [version-id] \
  terraform.tfstate

Monitoring

View Real-Time Logs

# Tail Lambda logs
aws logs tail /aws/lambda/nb-rag-sys-chat --follow

# Tail API Gateway logs
aws logs tail /aws/apigateway/nb-rag-sys --follow

# Filter for errors only
aws logs tail /aws/lambda/nb-rag-sys-chat --follow --filter-pattern "ERROR"

Check System Metrics

# Lambda invocations (last hour)
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Invocations \
  --dimensions Name=FunctionName,Value=nb-rag-sys-chat \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Sum

# API Gateway requests (last hour)
# HTTP APIs (apigatewayv2) publish metrics under the ApiId dimension
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApiGateway \
  --metric-name Count \
  --dimensions Name=ApiId,Value=[api-id] \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Sum

View CloudWatch Dashboard

# Open dashboard in browser (open is the macOS command; use xdg-open on Linux)
open "https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=nb-rag-sys-main"

Incident Response

High Error Rate

Alert: Lambda error rate > 1%

Response:

# 1. Check recent errors
aws logs filter-log-events \
  --log-group-name /aws/lambda/nb-rag-sys-chat \
  --filter-pattern "ERROR" \
  --start-time $(date -u -d '15 minutes ago' +%s)000 \
  --limit 10

# 2. Identify pattern
# - Same error repeated? → Code bug
# - Different errors? → Dependency issue (Bedrock, Knowledge Base)

# 3. Check dependencies
aws bedrock list-foundation-models --region us-east-1  # Should succeed
aws bedrock-agent get-knowledge-base --knowledge-base-id [kb-id]  # Should succeed

# 4. Rollback if recent deployment
cd terraform
git log -1  # Check last commit
git revert HEAD
terraform apply

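# 5. Increase logging temporarily
# WARNING: --environment replaces the function's entire variable set. List every
# existing variable alongside LOG_LEVEL, or the others will be removed.
aws lambda update-function-configuration \
  --function-name nb-rag-sys-chat \
  --environment Variables="{LOG_LEVEL=DEBUG}"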
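
To confirm that the 1% alert threshold is actually exceeded, the error rate can be computed from the Lambda Errors and Invocations metrics. The functions below are a minimal sketch; their names are hypothetical, and the 60-second period is an arbitrary choice.

```shell
#!/bin/bash
# Hypothetical helpers; function names are illustrative.

# Print errors as a percentage of invocations, with two decimals.
# Usage: error_rate_pct ERRORS INVOCATIONS
error_rate_pct() {
  awk -v e="$1" -v i="$2" 'BEGIN {
    if (i > 0) printf "%.2f\n", 100 * e / i
    else print "0.00"
  }'
}

# Print the total of a Lambda metric over a time window.
# Usage: lambda_metric_sum FUNCTION_NAME METRIC_NAME START_TIME END_TIME
lambda_metric_sum() {
  local func=$1 metric=$2 start=$3 end=$4
  # sum() adds up every returned datapoint and yields 0 for an empty list.
  aws cloudwatch get-metric-statistics \
    --namespace AWS/Lambda \
    --metric-name "$metric" \
    --dimensions Name=FunctionName,Value="$func" \
    --start-time "$start" \
    --end-time "$end" \
    --period 60 \
    --statistics Sum \
    --query 'sum(Datapoints[].Sum)' --output text
}
```

With start and end timestamps in hand, passing the Errors total and the Invocations total from lambda_metric_sum to error_rate_pct gives the percentage to compare against the alert threshold.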

High Latency

Alert: API Gateway latency > 5 seconds

Response:

# 1. Check Lambda duration
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Duration \
  --dimensions Name=FunctionName,Value=nb-rag-sys-chat \
  --start-time $(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 900 \
  --statistics Average Maximum

# 2. Check for cold starts
aws logs filter-log-events \
  --log-group-name /aws/lambda/nb-rag-sys-chat \
  --filter-pattern "INIT_START" \
  --start-time $(date -u -d '15 minutes ago' +%s)000

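# 3. Check Bedrock latency (the AWS/Bedrock metric is InvocationLatency, reported per ModelId)
aws cloudwatch get-metric-statistics \
  --namespace AWS/Bedrock \
  --metric-name InvocationLatency \
  --dimensions Name=ModelId,Value=us.anthropic.claude-sonnet-4-5-20250929-v1:0 \
  --start-time $(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 900 \
  --statistics Average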

# 4. Increase Lambda memory (improves CPU)
aws lambda update-function-configuration \
  --function-name nb-rag-sys-chat \
  --memory-size 1536

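# 5. Enable provisioned concurrency (reduces cold starts)
# Provisioned concurrency cannot target $LATEST: publish a version (or use an alias),
# and make sure API Gateway invokes that version or alias.
VERSION=$(aws lambda publish-version \
  --function-name nb-rag-sys-chat \
  --query 'Version' --output text)
aws lambda put-provisioned-concurrency-config \
  --function-name nb-rag-sys-chat \
  --provisioned-concurrent-executions 2 \
  --qualifier "$VERSION"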

Service Outage

Alert: API Gateway 5xx error rate > 10%

Response:

# 1. Check AWS service health
open "https://health.aws.amazon.com/health/status"

# 2. Check Lambda function status
aws lambda get-function --function-name nb-rag-sys-chat

# 3. Check API Gateway
aws apigatewayv2 get-api --api-id [api-id]

# 4. Check recent deployments
gh run list --limit 5

# 5. If recent deployment, rollback
gh run view [run-id]
cd terraform
git revert HEAD
terraform apply

# 6. Notify users (if applicable)
# Post to status page or send email

Maintenance Windows

Scheduled Maintenance

Plan maintenance during low-traffic hours (e.g., weekends, nights).

Pre-Maintenance Checklist:

  • Announce maintenance window to users
  • Backup all data (S3 documents, DynamoDB, Terraform state)
  • Test changes in staging environment (if available)
  • Prepare rollback plan
  • Have team members on standby

During Maintenance:

# 1. Enable maintenance mode (optional - return 503 from Lambda)
# 2. Perform changes
# 3. Run smoke tests
# 4. Monitor for errors
# 5. Disable maintenance mode

Post-Maintenance Checklist:

  • Verify all services operational
  • Check error rates
  • Monitor latency
  • Send “all clear” notification

Disaster Recovery

Complete System Failure

Scenario: All infrastructure destroyed

Recovery Steps:

# 1. Verify Terraform state backup exists
aws s3 ls s3://nb-rag-sys-terraform-state/terraform.tfstate

# 2. Re-run bootstrap (if needed)
./.github/setup-oidc.sh

# 3. Deploy infrastructure
cd terraform
terraform init
terraform apply

# 4. Restore S3 documents from backup
aws s3 sync s3://nb-rag-sys-backups/documents/ s3://nb-rag-sys-documents/

# 5. Trigger Knowledge Base sync to rebuild vectors
aws bedrock-agent start-ingestion-job \
  --knowledge-base-id [kb-id] \
  --data-source-id [ds-id]

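# 6. Restore DynamoDB data (if needed)
# The target table must not already exist. If terraform apply has recreated
# nb-rag-sys-classify, restore to a different table name instead.
aws dynamodb restore-table-from-backup \
  --target-table-name nb-rag-sys-classify \
  --backup-arn [backup-arn]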

# 7. Deploy web assets
cd ../web  # step 3 left the shell in terraform/
npm ci
npm run build
aws s3 sync dist/ s3://nb-rag-sys-web/
aws cloudfront create-invalidation --distribution-id [dist-id] --paths "/*"

# 8. Verify system operational
./scripts/morning-health-check.sh

RTO: ~30 minutes
RPO: Near-zero (S3 durability)


Last updated: 2026-01-01