Operations Runbook

Day-to-day operational procedures for the NorthBuilt RAG System.

Daily Operations

Morning Health Check

#!/bin/bash
# Save as: scripts/morning-health-check.sh

echo "=== NorthBuilt RAG System Health Check ==="
echo "Date: $(date)"
echo

# Check API Gateway
echo "1. API Gateway Status:"
API_ID=$(aws apigatewayv2 get-apis --query 'Items[?Name==`nb-rag-sys-api`].ApiId' --output text)
if [ -n "$API_ID" ]; then
  echo "[OK] API Gateway: $API_ID"
else
  echo "[ERROR] API Gateway: NOT FOUND"
fi

# Check Lambda functions
echo
echo "2. Lambda Functions:"
for func in chat query classification webhook-fathom webhook-helpscout; do
  status=$(aws lambda get-function --function-name "nb-rag-sys-$func" 2>&1)
  if [ $? -eq 0 ]; then
    echo "[OK] nb-rag-sys-$func"
  else
    echo "[ERROR] nb-rag-sys-$func: not found or not accessible"
  fi
done

# Check recent errors (last 24 hours)
echo
echo "3. Recent Errors (last 24 hours):"
ERROR_COUNT=$(aws logs filter-log-events \
  --log-group-name /aws/lambda/nb-rag-sys-chat \
  --filter-pattern "ERROR" \
  --start-time $(date -u -d '24 hours ago' +%s)000 \
  --query 'length(events)' --output text)
echo "Chat Lambda errors: $ERROR_COUNT"

# Check Bedrock usage
echo
echo "4. Bedrock Usage (last 24 hours):"
BEDROCK_INVOCATIONS=$(aws cloudwatch get-metric-statistics \
  --namespace AWS/Bedrock \
  --metric-name Invocations \
  --dimensions Name=ModelId,Value=us.anthropic.claude-sonnet-4-5-20250929-v1:0 \
  --start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 86400 \
  --statistics Sum \
  --query 'Datapoints[0].Sum' --output text)
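# --output text prints "None" when CloudWatch returns no datapoints
[ "$BEDROCK_INVOCATIONS" = "None" ] && BEDROCK_INVOCATIONS=0
echo "Model invocations: ${BEDROCK_INVOCATIONS:-0}"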

# Check estimated cost
echo
echo "5. Estimated Monthly Cost:"
# Billing metrics are published only in us-east-1
COST=$(aws cloudwatch get-metric-statistics \
  --region us-east-1 \
  --namespace AWS/Billing \
  --metric-name EstimatedCharges \
  --dimensions Name=Currency,Value=USD \
  --start-time $(date -u -d '6 hours ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 21600 \
  --statistics Maximum \
  --query 'Datapoints[0].Maximum' --output text)
# --output text prints "None" when no datapoint exists yet
[ "$COST" = "None" ] && COST=""
echo "Current month: \$${COST:-N/A}"

echo
echo "=== Health Check Complete ==="

Run daily:

chmod +x scripts/morning-health-check.sh
./scripts/morning-health-check.sh
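
The script above relies on GNU `date -d`, which BSD/macOS `date` does not accept. A minimal sketch of portable replacements, assuming only POSIX shell arithmetic; the function names `epoch_ms_hours_ago` and `iso_hours_ago` are hypothetical and not part of the repository:

```shell
# Hypothetical helpers (not part of the repository).

# Print epoch milliseconds for N hours ago, e.g. for `aws logs --start-time`.
epoch_ms_hours_ago() {
  now=$(date -u +%s)
  echo "$(( (now - $1 * 3600) * 1000 ))"
}

# Print an ISO 8601 UTC timestamp for N hours ago, e.g. for CloudWatch queries.
iso_hours_ago() {
  secs=$(( $(date -u +%s) - $1 * 3600 ))
  # GNU date accepts "-d @epoch"; BSD/macOS date accepts "-r epoch".
  date -u -d "@${secs}" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null ||
    date -u -r "${secs}" +%Y-%m-%dT%H:%M:%SZ
}
```

With these defined, `--start-time "$(epoch_ms_hours_ago 24)"` could stand in for the `date -d` expression in the log query, and `"$(iso_hours_ago 24)"` in the CloudWatch queries.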

Alternative: In-App Dashboard

For a visual health check, use the System Dashboard in the web application:

  1. Navigate to the application and sign in
  2. Click “Dashboard” in the sidebar
  3. Review metrics cards for query volume, latency, and error rates
  4. Check ingestion metrics for document sync status
  5. View recent logs for any errors

The dashboard provides real-time metrics with 24h/7d/30d time ranges.

Common Tasks

Add New User

Users are authenticated via Google OAuth, so no manual user creation is needed. To grant admin access, however:

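# Look up the user's Cognito username by email
# (federated Google users typically have a username like Google_<id>, not the email itself)
aws cognito-idp list-users \
  --user-pool-id [user-pool-id] \
  --filter 'email = "[email]"' \
  --query 'Users[0].Username' --output text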

# Add to admin group (if implemented)
aws cognito-idp admin-add-user-to-group \
  --user-pool-id [user-pool-id] \
  --username [user-sub] \
  --group-name admins

Rotate API Keys

Webhook API Keys

# 1. Generate new API key in external service
# 2. Update secret
aws secretsmanager update-secret \
  --secret-id nb-rag-sys-fathom-api-key \
  --secret-string '{"api_key": "NEW_KEY"}'

# 3. Update webhook configuration in external service
# (Fathom or HelpScout)

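# 4. Force the webhook Lambda to start fresh execution environments.
# --environment replaces the whole variable map, so merge the marker into
# the existing variables instead of passing it alone.
ENV_JSON=$(aws lambda get-function-configuration \
  --function-name nb-rag-sys-webhook-fathom \
  --query 'Environment.Variables' --output json \
  | jq -c --arg ts "$(date +%s)" '{Variables: ((. // {}) + {FORCE_UPDATE: $ts})}')
aws lambda update-function-configuration \
  --function-name nb-rag-sys-webhook-fathom \
  --environment "$ENV_JSON"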

Update Lambda Function Code

# 1. Make code changes
cd lambda/chat
vim handler.py

# 2. Package function
zip -r function.zip . -x "*.pyc" "__pycache__/*" "venv/*" ".venv/*" "tests/*"

# 3. Update function
aws lambda update-function-code \
  --function-name nb-rag-sys-chat \
  --zip-file fileb://function.zip

# 4. Wait for update to complete
aws lambda wait function-updated \
  --function-name nb-rag-sys-chat

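# 5. Test function
# (AWS CLI v2 expects base64 payloads by default; this flag allows raw JSON)
aws lambda invoke \
  --function-name nb-rag-sys-chat \
  --cli-binary-format raw-in-base64-out \
  --payload '{"body": "{\"query\":\"test\"}"}' \
  /tmp/response.json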

cat /tmp/response.json
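
Reading the response by eye is easy to get wrong under pressure. The following is a rough sketch of an automated check. It assumes the handler returns an API Gateway proxy-style object with a `statusCode` field, which the source does not confirm, and the function name `check_lambda_response` is hypothetical:

```shell
# Hypothetical helper: sanity-check a saved `aws lambda invoke` response file.
# Assumes the handler returns an API Gateway proxy-style object with "statusCode".
check_lambda_response() {
  file="$1"
  if [ ! -s "$file" ]; then
    echo "[ERROR] $file is missing or empty"
    return 1
  fi
  # Unhandled exceptions come back as a payload containing "errorMessage".
  if grep -q '"errorMessage"' "$file"; then
    echo "[ERROR] function raised an error"
    return 1
  fi
  if grep -Eq '"statusCode"[[:space:]]*:[[:space:]]*200([^0-9]|$)' "$file"; then
    echo "[OK] statusCode 200"
    return 0
  fi
  echo "[ERROR] statusCode 200 not found"
  return 1
}
```

Usage: `check_lambda_response /tmp/response.json`. The check is text-based, so it only catches obvious failures; a handler with a different response shape would need a different pattern.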

Recommended: Use Terraform for code updates:

cd terraform
terraform apply -target=module.lambda.aws_lambda_function.chat

Invalidate CloudFront Cache

After updating web assets:

# Get CloudFront distribution ID
DIST_ID=$(aws cloudfront list-distributions \
  --query 'DistributionList.Items[?Comment==`nb-rag-sys-web`].Id' --output text)

# Create invalidation
aws cloudfront create-invalidation \
  --distribution-id $DIST_ID \
  --paths "/*"

# Check invalidation status
aws cloudfront get-invalidation \
  --distribution-id $DIST_ID \
  --id [invalidation-id]
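
The create and status steps above can be combined so the invalidation ID never has to be copied by hand. A minimal sketch, using the CLI's built-in `invalidation-completed` waiter; the function name `invalidate_and_wait` is hypothetical:

```shell
# Hypothetical helper: invalidate all paths and block until CloudFront finishes.
invalidate_and_wait() {
  dist_id="$1"
  inv_id=$(aws cloudfront create-invalidation \
    --distribution-id "$dist_id" \
    --paths "/*" \
    --query 'Invalidation.Id' --output text) || return 1
  echo "Created invalidation $inv_id"
  # The waiter polls get-invalidation until the status is Completed.
  aws cloudfront wait invalidation-completed \
    --distribution-id "$dist_id" \
    --id "$inv_id"
}
```

Usage: `invalidate_and_wait "$DIST_ID"`.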

Scale Lambda Concurrency

Increase reserved concurrency during high traffic:

# Increase reserved concurrency
aws lambda put-function-concurrency \
  --function-name nb-rag-sys-chat \
  --reserved-concurrent-executions 20

# Or remove limit (unreserved)
aws lambda delete-function-concurrency \
  --function-name nb-rag-sys-chat

Enable/Disable Webhooks

Temporarily disable webhook processing:

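# Disable by removing the API Gateway integration
# (destructive: the integration must be recreated, e.g. with terraform apply, to re-enable)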
aws apigatewayv2 delete-integration \
  --api-id [api-id] \
  --integration-id [integration-id]

# Or set reserved concurrency to 0 (no invocations)
aws lambda put-function-concurrency \
  --function-name nb-rag-sys-webhook-fathom \
  --reserved-concurrent-executions 0

# Re-enable
aws lambda delete-function-concurrency \
  --function-name nb-rag-sys-webhook-fathom
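
The reserved-concurrency approach is the easier one to reverse, so it may be worth wrapping in a small toggle. A sketch under the assumption that both webhook functions follow the `nb-rag-sys-webhook-<name>` naming used in the health check; the function name `set_webhook_state` is hypothetical:

```shell
# Hypothetical helper: pause or resume a webhook Lambda via reserved concurrency.
# Usage: set_webhook_state fathom off
#        set_webhook_state helpscout on
set_webhook_state() {
  func="nb-rag-sys-webhook-$1"
  case "$2" in
    off)
      # Reserved concurrency of 0 throttles every invocation.
      aws lambda put-function-concurrency \
        --function-name "$func" \
        --reserved-concurrent-executions 0
      ;;
    on)
      # Removing the setting returns the function to the unreserved pool.
      aws lambda delete-function-concurrency --function-name "$func"
      ;;
    *)
      echo "usage: set_webhook_state <fathom|helpscout> <on|off>" >&2
      return 2
      ;;
  esac
}
```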

Backup & Recovery

Backup S3 Documents

# Backup documents to a separate bucket
aws s3 sync s3://nb-rag-sys-documents/ s3://nb-rag-sys-backups/documents/

# Or download locally
aws s3 sync s3://nb-rag-sys-documents/ ./backups/documents/

# List current documents
aws s3 ls s3://nb-rag-sys-documents/ --recursive --summarize

Run weekly:

aws s3 sync s3://nb-rag-sys-documents/ s3://nb-rag-sys-backups/documents-$(date +%Y%m%d)/
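
A sync that exits cleanly can still leave a backup incomplete, so a coarse verification step may be useful. The sketch below compares object counts parsed from the `--summarize` output; the function names `s3_object_count` and `verify_backup` are hypothetical, and matching counts do not prove the contents are identical:

```shell
# Hypothetical helper: print the object count under an S3 prefix,
# parsed from the "Total Objects: N" line that --summarize appends.
s3_object_count() {
  aws s3 ls "$1" --recursive --summarize |
    awk '/Total Objects:/ {print $3}'
}

# Hypothetical helper: compare object counts between source and backup.
verify_backup() {
  src=$(s3_object_count "$1")
  dst=$(s3_object_count "$2")
  if [ -n "$src" ] && [ "$src" = "$dst" ]; then
    echo "[OK] $src objects in source and backup"
  else
    echo "[ERROR] source has ${src:-?} objects, backup has ${dst:-?}"
    return 1
  fi
}
```

Usage: `verify_backup s3://nb-rag-sys-documents/ "s3://nb-rag-sys-backups/documents-$(date +%Y%m%d)/"`.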

Restore Documents and Rebuild Vectors

# Restore documents from backup
aws s3 sync s3://nb-rag-sys-backups/documents-YYYYMMDD/ s3://nb-rag-sys-documents/

# Trigger Knowledge Base sync to rebuild vectors
aws bedrock-agent start-ingestion-job \
  --knowledge-base-id [kb-id] \
  --data-source-id [ds-id]

# Monitor sync progress
aws bedrock-agent get-ingestion-job \
  --knowledge-base-id [kb-id] \
  --data-source-id [ds-id] \
  --ingestion-job-id [job-id]
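
Monitoring by re-running `get-ingestion-job` can be scripted as a polling loop. A minimal sketch that stops on a terminal status; the function name `wait_for_ingestion` and the 30-second default interval are assumptions, not part of the repository:

```shell
# Hypothetical helper: poll an ingestion job until it reaches a terminal state.
# Usage: wait_for_ingestion <kb-id> <ds-id> <job-id> [poll-seconds]
wait_for_ingestion() {
  interval="${4:-30}"
  while :; do
    status=$(aws bedrock-agent get-ingestion-job \
      --knowledge-base-id "$1" \
      --data-source-id "$2" \
      --ingestion-job-id "$3" \
      --query 'ingestionJob.status' --output text) || return 1
    echo "$(date -u +%H:%M:%S) status: $status"
    case "$status" in
      COMPLETE) return 0 ;;
      FAILED|STOPPED) return 1 ;;
    esac
    sleep "$interval"
  done
}
```

The function returns 0 only when the job reports `COMPLETE`, so it can gate a follow-up step such as a retrieval test.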

Backup DynamoDB Table

# Enable Point-in-Time Recovery (already enabled)
aws dynamodb update-continuous-backups \
  --table-name nb-rag-sys \
  --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true

# Create on-demand backup
aws dynamodb create-backup \
  --table-name nb-rag-sys \
  --backup-name "classification-$(date +%Y%m%d)"

# List backups
aws dynamodb list-backups --table-name nb-rag-sys

Restore DynamoDB Table

# Restore from backup
aws dynamodb restore-table-from-backup \
  --target-table-name nb-rag-sys-restored \
  --backup-arn arn:aws:dynamodb:us-east-1:ACCOUNT:table/nb-rag-sys/backup/BACKUP_ID

# Or restore from point-in-time
aws dynamodb restore-table-to-point-in-time \
  --source-table-name nb-rag-sys \
  --target-table-name nb-rag-sys-restored \
  --restore-date-time "2025-11-08T12:00:00Z"

Backup Terraform State

# Download current state
aws s3 cp s3://nb-rag-sys-terraform-state/terraform.tfstate ./terraform-state-backup-$(date +%Y%m%d).tfstate

# S3 versioning is already enabled, so previous versions can be listed:
aws s3api list-object-versions \
  --bucket nb-rag-sys-terraform-state \
  --prefix terraform.tfstate

# Download a specific version (upload it back to the same key to restore it)
aws s3api get-object \
  --bucket nb-rag-sys-terraform-state \
  --key terraform.tfstate \
  --version-id [version-id] \
  terraform.tfstate

Monitoring

View Real-Time Logs

# Tail Lambda logs
aws logs tail /aws/lambda/nb-rag-sys-chat --follow

# Tail API Gateway logs
aws logs tail /aws/apigateway/nb-rag-sys --follow

# Filter for errors only
aws logs tail /aws/lambda/nb-rag-sys-chat --follow --filter-pattern "ERROR"

Check System Metrics

# Lambda invocations (last hour)
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Invocations \
  --dimensions Name=FunctionName,Value=nb-rag-sys-chat \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Sum

# API Gateway requests (last hour)
# APIs managed through apigatewayv2 publish metrics under the ApiId dimension, not ApiName
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApiGateway \
  --metric-name Count \
  --dimensions Name=ApiId,Value=[api-id] \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Sum

View CloudWatch Dashboard

# Open dashboard in browser (open is macOS-specific; use xdg-open on Linux)
open "https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=nb-rag-sys-main"

S3 Vectors Debugging

Commands for debugging vector search and Knowledge Base issues. Useful when:

  • Documents are not appearing in search results
  • Metadata filtering isn’t working as expected
  • Ingestion jobs are failing
  • Retrieval returns unexpected results

Direct Retrieval Testing

Test Knowledge Base retrieval without going through the Lambda:

# Get Knowledge Base ID
KB_ID=$(aws bedrock-agent list-knowledge-bases \
  --query 'knowledgeBaseSummaries[?name==`nb-rag-sys-knowledge-base`].knowledgeBaseId' \
  --output text)

# Basic retrieval test
aws bedrock-agent-runtime retrieve \
  --knowledge-base-id $KB_ID \
  --retrieval-query '{"text": "your test query here"}' \
  --retrieval-configuration '{
    "vectorSearchConfiguration": {
      "numberOfResults": 5
    }
  }'

# Test with client filter
aws bedrock-agent-runtime retrieve \
  --knowledge-base-id $KB_ID \
  --retrieval-query '{"text": "your test query here"}' \
  --retrieval-configuration '{
    "vectorSearchConfiguration": {
      "numberOfResults": 5,
      "filter": {
        "equals": {
          "key": "client",
          "value": "YourClient"
        }
      }
    }
  }'
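
Hand-editing the nested filter JSON for each test is error-prone. The sketch below generates the same `equals` filter shape from a key and value; the function name `kb_filter_config` is hypothetical, and no JSON escaping is performed, so values containing quotes or backslashes are not supported:

```shell
# Hypothetical helper: print a --retrieval-configuration document that
# filters on one metadata key. Arguments: <key> <value> [numberOfResults].
# Values must not contain quotes or backslashes (no JSON escaping is done).
kb_filter_config() {
  printf '{"vectorSearchConfiguration":{"numberOfResults":%d,"filter":{"equals":{"key":"%s","value":"%s"}}}}\n' \
    "${3:-5}" "$1" "$2"
}

# Example (not executed here):
#   aws bedrock-agent-runtime retrieve \
#     --knowledge-base-id "$KB_ID" \
#     --retrieval-query '{"text": "your test query here"}' \
#     --retrieval-configuration "$(kb_filter_config client YourClient)"
```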

Check Ingestion Job Status

# Get data source ID
DS_ID=$(aws bedrock-agent list-data-sources \
  --knowledge-base-id $KB_ID \
  --query 'dataSourceSummaries[0].dataSourceId' \
  --output text)

# List recent ingestion jobs
aws bedrock-agent list-ingestion-jobs \
  --knowledge-base-id $KB_ID \
  --data-source-id $DS_ID \
  --max-results 5

# Get details for a specific job
aws bedrock-agent get-ingestion-job \
  --knowledge-base-id $KB_ID \
  --data-source-id $DS_ID \
  --ingestion-job-id [job-id]

# Start a new ingestion job
aws bedrock-agent start-ingestion-job \
  --knowledge-base-id $KB_ID \
  --data-source-id $DS_ID

Inspect S3 Metadata

Documents require a companion .metadata.json file for Bedrock KB filtering:

# List documents in S3
aws s3 ls s3://nb-rag-sys-docs-[suffix]/fathom/ --recursive

# Check if metadata file exists
DOC="fathom/client/project/document.md"
aws s3 ls "s3://nb-rag-sys-docs-[suffix]/${DOC}.metadata.json"

# View metadata contents
aws s3 cp "s3://nb-rag-sys-docs-[suffix]/${DOC}.metadata.json" - | jq .

# Expected format:
# {
#   "metadataAttributes": {
#     "client": {"value": {"type": "STRING", "stringValue": "ClientName"}},
#     "source": {"value": {"type": "STRING", "stringValue": "fathom"}},
#     ...
#   }
# }
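
When a sidecar is missing, it can be generated locally and uploaded next to the document. A minimal sketch that writes only the `client` and `source` attributes shown in the expected format; any further attributes the pipeline sets are not reproduced here, and the function name `write_metadata_sidecar` is hypothetical:

```shell
# Hypothetical helper: write <document>.metadata.json next to a local file.
# Only the "client" and "source" attributes are emitted; values are not
# JSON-escaped, so they must not contain quotes or backslashes.
write_metadata_sidecar() {
  doc="$1"
  client="$2"
  src="$3"
  cat > "${doc}.metadata.json" <<EOF
{
  "metadataAttributes": {
    "client": {"value": {"type": "STRING", "stringValue": "${client}"}},
    "source": {"value": {"type": "STRING", "stringValue": "${src}"}}
  }
}
EOF
}
```

Usage: `write_metadata_sidecar document.md ClientName fathom`, followed by an `aws s3 cp` of the generated file to the same prefix as the document.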

Check S3 Vectors Configuration

# Get the index associated with the Knowledge Base
aws bedrock-agent get-knowledge-base \
  --knowledge-base-id $KB_ID \
  --query 'knowledgeBase.storageConfiguration'

# List S3 Vectors indexes (if using S3 Vectors); the vector bucket must be specified
aws s3vectors list-indexes --vector-bucket-name [vector-bucket-name]

Debug Missing Documents

When a document doesn’t appear in search results:

# 1. Verify document exists in S3
aws s3 ls "s3://nb-rag-sys-docs-[suffix]/path/to/document.md"

# 2. Verify metadata sidecar exists
aws s3 ls "s3://nb-rag-sys-docs-[suffix]/path/to/document.md.metadata.json"

# 3. Check metadata content is valid
aws s3 cp "s3://nb-rag-sys-docs-[suffix]/path/to/document.md.metadata.json" - | jq .

# 4. Check last ingestion job included this document
aws bedrock-agent get-ingestion-job \
  --knowledge-base-id $KB_ID \
  --data-source-id $DS_ID \
  --ingestion-job-id [most-recent-job-id]

# 5. Re-run ingestion if needed
aws bedrock-agent start-ingestion-job \
  --knowledge-base-id $KB_ID \
  --data-source-id $DS_ID

# 6. Wait for ingestion to complete (typically 5-10 minutes)
watch -n 30 "aws bedrock-agent get-ingestion-job \
  --knowledge-base-id $KB_ID \
  --data-source-id $DS_ID \
  --ingestion-job-id [job-id] \
  --query 'ingestionJob.status' --output text"

Debug Metadata Filtering

When client filtering isn’t working:

# 1. Retrieve WITHOUT filter to see what's indexed
aws bedrock-agent-runtime retrieve \
  --knowledge-base-id $KB_ID \
  --retrieval-query '{"text": "test query"}' \
  --retrieval-configuration '{"vectorSearchConfiguration": {"numberOfResults": 10}}'

# 2. Check metadata in results
# Look for: retrievalResults[].metadata.client

# 3. Retrieve WITH filter
aws bedrock-agent-runtime retrieve \
  --knowledge-base-id $KB_ID \
  --retrieval-query '{"text": "test query"}' \
  --retrieval-configuration '{
    "vectorSearchConfiguration": {
      "numberOfResults": 10,
      "filter": {"equals": {"key": "client", "value": "YourClient"}}
    }
  }'

# 4. If no results with filter, verify the metadata key is filterable
# Check the Knowledge Base configuration - filterable keys are set in Terraform

Monitor Custom RAG Metrics

After implementing custom metrics, query them in CloudWatch:

# Get retrieval latency metrics (last hour)
aws cloudwatch get-metric-statistics \
  --namespace RAG/Retrieval \
  --metric-name RetrievalLatencyMs \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Average,Maximum,Minimum

# Get filter effectiveness (ratio of filtered results)
aws cloudwatch get-metric-statistics \
  --namespace RAG/Retrieval \
  --metric-name FilterEffectiveness \
  --dimensions Name=HasFilter,Value=true \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Average

# Get error counts
aws cloudwatch get-metric-statistics \
  --namespace RAG/Retrieval \
  --metric-name Errors \
  --start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Sum

Common S3 Vectors Issues

Issue                      Likely Cause                    Solution
-------------------------  ------------------------------  -------------------------------------------------
Document not in results    Missing .metadata.json sidecar  Create sidecar file and re-ingest
Filter returns no results  Metadata key not filterable     Check KB config, ensure key is in filterable list
Ingestion fails            Invalid metadata format         Validate JSON structure matches Bedrock KB spec
Inconsistent results       Stale vectors                   Re-run ingestion job
High latency               Too many candidates             Reduce numberOfResults or add filters

Incident Response

High Error Rate

Alert: Lambda error rate > 1%

Response:

# 1. Check recent errors
aws logs filter-log-events \
  --log-group-name /aws/lambda/nb-rag-sys-chat \
  --filter-pattern "ERROR" \
  --start-time $(date -u -d '15 minutes ago' +%s)000 \
  --limit 10

# 2. Identify pattern
# - Same error repeated? → Code bug
# - Different errors? → Dependency issue (Bedrock, Knowledge Base)

# 3. Check dependencies
aws bedrock list-foundation-models --region us-east-1  # Should succeed
aws bedrock-agent get-knowledge-base --knowledge-base-id [kb-id]  # Should succeed

# 4. Rollback if recent deployment
cd terraform
git log -1  # Check last commit
git revert HEAD
terraform apply

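# 5. Increase logging temporarily
# --environment replaces the whole variable map, so merge LOG_LEVEL into
# the existing variables instead of passing it alone.
ENV_JSON=$(aws lambda get-function-configuration \
  --function-name nb-rag-sys-chat \
  --query 'Environment.Variables' --output json \
  | jq -c '{Variables: ((. // {}) + {LOG_LEVEL: "DEBUG"})}')
aws lambda update-function-configuration \
  --function-name nb-rag-sys-chat \
  --environment "$ENV_JSON"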
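
To confirm whether the 1% alert threshold is actually breached, the rate can be computed from the `Errors` and `Invocations` sums that CloudWatch reports for the function. A small sketch of that arithmetic; the function name `error_rate_check` is hypothetical:

```shell
# Hypothetical helper: print errors as a percentage of invocations and
# return non-zero when the rate exceeds the threshold (default 1 percent).
# Arguments: <errors> <invocations> [threshold-percent]
error_rate_check() {
  awk -v e="$1" -v n="$2" -v t="${3:-1}" 'BEGIN {
    e = e + 0; n = n + 0; t = t + 0   # force numeric; "None" becomes 0
    if (n <= 0) { print "no invocations"; exit 0 }
    rate = 100 * e / n
    printf "%.2f%%\n", rate
    exit (rate > t) ? 1 : 0
  }'
}
```

For example, `error_rate_check 5 1000` prints `0.50%` and succeeds, while `error_rate_check 20 1000` prints `2.00%` and returns 1.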

High Latency

Alert: API Gateway latency > 5 seconds

Response:

# 1. Check Lambda duration
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Duration \
  --dimensions Name=FunctionName,Value=nb-rag-sys-chat \
  --start-time $(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 900 \
  --statistics Average,Maximum

# 2. Check for cold starts
aws logs filter-log-events \
  --log-group-name /aws/lambda/nb-rag-sys-chat \
  --filter-pattern "INIT_START" \
  --start-time $(date -u -d '15 minutes ago' +%s)000

# 3. Check Bedrock latency (the metric is InvocationLatency, keyed by ModelId)
aws cloudwatch get-metric-statistics \
  --namespace AWS/Bedrock \
  --metric-name InvocationLatency \
  --dimensions Name=ModelId,Value=us.anthropic.claude-sonnet-4-5-20250929-v1:0 \
  --start-time $(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 900 \
  --statistics Average

# 4. Increase Lambda memory (improves CPU)
aws lambda update-function-configuration \
  --function-name nb-rag-sys-chat \
  --memory-size 1536

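# 5. Enable provisioned concurrency (reduces cold starts)
# Provisioned concurrency cannot target $LATEST; publish a version (or use an alias).
# The API Gateway integration must invoke that version or alias to benefit.
VERSION=$(aws lambda publish-version \
  --function-name nb-rag-sys-chat \
  --query 'Version' --output text)
aws lambda put-provisioned-concurrency-config \
  --function-name nb-rag-sys-chat \
  --provisioned-concurrent-executions 2 \
  --qualifier "$VERSION"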

Service Outage

Alert: API Gateway 5xx error rate > 10%

Response:

# 1. Check AWS service health
open "https://health.aws.amazon.com/health/status"

# 2. Check Lambda function status
aws lambda get-function --function-name nb-rag-sys-chat

# 3. Check API Gateway
aws apigatewayv2 get-api --api-id [api-id]

# 4. Check recent deployments
gh run list --limit 5

# 5. If recent deployment, rollback
gh run view [run-id]
cd terraform
git revert HEAD
terraform apply

# 6. Notify users (if applicable)
# Post to status page or send email

Maintenance Windows

Scheduled Maintenance

Plan maintenance during low-traffic hours (e.g., weekends, nights).

Pre-Maintenance Checklist:

  • Announce maintenance window to users
  • Backup all data (S3 documents, DynamoDB, Terraform state)
  • Test changes in staging environment (if available)
  • Prepare rollback plan
  • Have team members on standby

During Maintenance:

# 1. Enable maintenance mode (optional - return 503 from Lambda)
# 2. Perform changes
# 3. Run smoke tests
# 4. Monitor for errors
# 5. Disable maintenance mode

Post-Maintenance Checklist:

  • Verify all services operational
  • Check error rates
  • Monitor latency
  • Send “all clear” notification

Disaster Recovery

Complete System Failure

Scenario: All infrastructure destroyed

Recovery Steps:

# 1. Verify Terraform state backup exists
aws s3 ls s3://nb-rag-sys-terraform-state/terraform.tfstate

# 2. Re-run bootstrap (if needed)
./.github/setup-oidc.sh

# 3. Deploy infrastructure
cd terraform
terraform init
terraform apply

# 4. Restore S3 documents from backup
aws s3 sync s3://nb-rag-sys-backups/documents/ s3://nb-rag-sys-documents/

# 5. Trigger Knowledge Base sync to rebuild vectors
aws bedrock-agent start-ingestion-job \
  --knowledge-base-id [kb-id] \
  --data-source-id [ds-id]

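# 6. Restore DynamoDB data (if needed)
# The target table must not already exist. If terraform apply in step 3
# recreated nb-rag-sys, restore to a new table name or remove the empty table first.
aws dynamodb restore-table-from-backup \
  --target-table-name nb-rag-sys \
  --backup-arn [backup-arn]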

# 7. Deploy web assets
cd ../web  # leave terraform/ first (assumes web/ sits beside it at the repo root)
npm ci
npm run build
aws s3 sync dist/ s3://nb-rag-sys-web/
aws cloudfront create-invalidation --distribution-id [dist-id] --paths "/*"

# 8. Verify system operational
./scripts/morning-health-check.sh

RTO: ~30 minutes
RPO: Near-zero (S3 durability)


Last updated: 2026-01-07