Troubleshooting Guide

Comprehensive guide to diagnosing and resolving common issues in the NorthBuilt RAG System.

Quick Diagnosis

System Health Check

# Check API Gateway
curl -I https://[api-gateway-url]/chat

# Check Lambda functions
aws lambda list-functions --query 'Functions[?starts_with(FunctionName, `nb-rag-sys`)].FunctionName'

# Check recent Lambda errors
aws logs filter-log-events \
  --log-group-name /aws/lambda/nb-rag-sys-chat \
  --filter-pattern "ERROR" \
  --start-time $(date -u -d '1 hour ago' +%s)000

# Check Bedrock throttling
aws cloudwatch get-metric-statistics \
  --namespace AWS/Bedrock \
  --metric-name InvocationThrottles \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Sum
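
To check all project Lambdas at once, a small loop over the function list works; this is a sketch that reports the last hour's error count per function (it assumes the nb-rag-sys prefix used throughout this guide):

for fn in $(aws lambda list-functions \
    --query 'Functions[?starts_with(FunctionName, `nb-rag-sys`)].FunctionName' --output text); do
  errors=$(aws cloudwatch get-metric-statistics \
    --namespace AWS/Lambda --metric-name Errors \
    --dimensions Name=FunctionName,Value="$fn" \
    --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
    --period 3600 --statistics Sum \
    --query 'Datapoints[0].Sum' --output text)
  # "None" means no error datapoints were recorded in the window
  [ "$errors" = "None" ] && errors=0
  echo "$fn: $errors errors in the last hour"
done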

Deployment Issues

Issue: Terraform Apply Fails with State Lock

Symptoms:

Error: Error acquiring the state lock
Lock Info:
  ID: abc123...
  Path: nb-rag-sys-terraform-state/terraform.tfstate

Causes:

  • Previous Terraform run crashed
  • Concurrent Terraform runs
  • Manual interruption (Ctrl+C)

Solution:

# 1. Verify no other Terraform processes are running
ps aux | grep terraform

# 2. Check DynamoDB for lock
aws dynamodb scan --table-name nb-rag-sys-terraform-locks

# 3. Force unlock (use LockID from error message)
terraform force-unlock abc123...

# 4. If unlock fails, manually delete from DynamoDB
aws dynamodb delete-item \
  --table-name nb-rag-sys-terraform-locks \
  --key '{"LockID": {"S": "nb-rag-sys-terraform-state/terraform.tfstate-md5"}}'

Issue: Terraform Backend Configuration Error

Symptoms:

Error: Backend initialization required

Causes:

  • First time running Terraform
  • Backend configuration changed
  • State bucket doesn’t exist

Solution:

# 1. Verify S3 bucket exists
aws s3 ls s3://nb-rag-sys-terraform-state

# 2. Initialize backend
cd terraform
terraform init

# 3. If bucket missing, run bootstrap
./.github/setup-oidc.sh

Issue: Resource Already Exists

Symptoms:

Error: Error creating Lambda Function: ResourceAlreadyExistsException

Causes:

  • Manual resource creation
  • Previous incomplete Terraform apply
  • Resource not in Terraform state

Solution:

# 1. Import existing resource into state
terraform import 'aws_lambda_function.chat' 'nb-rag-sys-chat'

# 2. Verify import
terraform plan
# Should show no changes for imported resource

# 3. If resource is wrong, delete and recreate
aws lambda delete-function --function-name nb-rag-sys-chat
terraform apply

Issue: GitHub Actions Authentication Failed

Symptoms:

Error: User: arn:aws:sts::ACCOUNT_ID:assumed-role/GitHubActionsOIDCRole/GitHubActions is not authorized

Causes:

  • IAM role permissions insufficient
  • OIDC provider misconfigured
  • GitHub secret incorrect

Solution:

# 1. Update IAM role permissions
./.github/setup-oidc.sh

# 2. Verify GitHub secret
gh secret list
gh secret set AWS_ROLE_ARN --body "arn:aws:iam::ACCOUNT_ID:role/GitHubActionsOIDCRole"

# 3. Check OIDC provider exists
aws iam list-open-id-connect-providers

# 4. Verify role trust policy (expected shape shown after this block)
aws iam get-role --role-name GitHubActionsOIDCRole --query 'Role.AssumeRolePolicyDocument'

# 5. Re-run GitHub Actions workflow
gh run rerun --failed
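
For reference, the trust policy returned by step 4 should look roughly like the following; the account ID and org/repo path are placeholders, and your condition keys may differ slightly:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::ACCOUNT_ID:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:[org]/[repo]:*"
        }
      }
    }
  ]
}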

Runtime Issues

Issue: Chat API Returns 401 Unauthorized

Symptoms:

  • Web UI shows “Unauthorized” error
  • API returns {"message": "Unauthorized"}

Causes:

  • Expired JWT token
  • Invalid JWT token
  • Cognito misconfiguration

Diagnosis:

# Check Cognito user pool
aws cognito-idp describe-user-pool --user-pool-id [pool-id]

# Check API Gateway authorizer
aws apigatewayv2 get-authorizer --api-id [api-id] --authorizer-id [authorizer-id]

# Test JWT token (decode)
echo "[jwt-token]" | cut -d. -f2 | base64 -d | jq .

Solution:

# 1. Refresh token in web UI (log out and log back in)

# 2. Verify Cognito issuer matches authorizer configuration
# Issuer should be: https://cognito-idp.us-east-1.amazonaws.com/[user-pool-id]

# 3. Check token expiration (exp claim)
# Tokens expire after 1 hour by default

# 4. Verify audience (aud claim) matches client ID
aws cognito-idp describe-user-pool-client \
  --user-pool-id [pool-id] \
  --client-id [client-id]
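
A quick way to check steps 3 and 4 locally is to decode the token payload and compare the claims. This is a sketch that assumes the token is in $TOKEN and jq is installed (JWT payloads are base64url-encoded and may need padding):

# Convert base64url to base64 and pad to a multiple of 4 so base64 -d accepts it
PAYLOAD=$(echo "$TOKEN" | cut -d. -f2 | tr '_-' '/+')
while [ $(( ${#PAYLOAD} % 4 )) -ne 0 ]; do PAYLOAD="${PAYLOAD}="; done

# Show issuer, audience, expiry, and whether the token has already expired
# Note: Cognito ID tokens carry aud; access tokens carry client_id instead
echo "$PAYLOAD" | base64 -d | jq '{iss, aud, exp, expired: (.exp < now)}'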

Issue: Chat API Returns 500 Internal Server Error

Symptoms:

  • Web UI shows generic error
  • API returns {"message": "Internal server error"}

Causes:

  • Lambda function error
  • Bedrock throttling
  • Knowledge Base retrieval issue

Diagnosis:

# 1. Check Lambda logs
aws logs tail /aws/lambda/nb-rag-sys-chat --follow

# 2. Get recent errors
aws logs filter-log-events \
  --log-group-name /aws/lambda/nb-rag-sys-chat \
  --filter-pattern "ERROR" \
  --start-time $(date -u -d '15 minutes ago' +%s)000

# 3. Check Lambda metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Errors \
  --dimensions Name=FunctionName,Value=nb-rag-sys-chat \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Sum

Solution - Lambda Error:

# View specific error from logs
aws logs get-log-events \
  --log-group-name /aws/lambda/nb-rag-sys-chat \
  --log-stream-name [latest-stream] \
  --limit 50

# Common fixes:
# - Increase Lambda timeout (in terraform/modules/lambda/main.tf)
# - Increase Lambda memory (may improve performance)
# - Check environment variables are set correctly (command after this block)
# - Verify IAM permissions

# Update Lambda configuration
aws lambda update-function-configuration \
  --function-name nb-rag-sys-chat \
  --timeout 90 \
  --memory-size 1536
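
To review the environment variables mentioned above without opening the console (the function name follows the naming used elsewhere in this guide):

aws lambda get-function-configuration \
  --function-name nb-rag-sys-chat \
  --query 'Environment.Variables'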

Solution - Bedrock Throttling:

# Check Bedrock throttling
aws cloudwatch get-metric-statistics \
  --namespace AWS/Bedrock \
  --metric-name InvocationThrottles \
  --dimensions Name=ModelId,Value=us.anthropic.claude-sonnet-4-5-20250929-v1:0 \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Sum

# Request service quota increase
aws service-quotas request-service-quota-increase \
  --service-code bedrock \
  --quota-code [quota-code] \
  --desired-value 100
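
The [quota-code] placeholder can be looked up from the quota list; this sketch filters on the model name (adjust the match string to the model you are hitting limits on):

aws service-quotas list-service-quotas \
  --service-code bedrock \
  --query "Quotas[?contains(QuotaName, 'Claude')].[QuotaCode, QuotaName, Value]" \
  --output table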

Issue: Lambda Function Timeout

Symptoms:

Task timed out after 60.00 seconds

Causes:

  • Slow Bedrock inference
  • Slow Knowledge Base retrieval
  • Large context window
  • Network latency

Diagnosis:

# Check Lambda duration metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Duration \
  --dimensions Name=FunctionName,Value=nb-rag-sys-chat \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Average,Maximum

# Review X-Ray traces (if enabled)
aws xray get-trace-summaries \
  --start-time $(date -u -d '1 hour ago' +%s) \
  --end-time $(date -u +%s) \
  --filter-expression 'duration > 50'

Solution:

# 1. Increase Lambda timeout
aws lambda update-function-configuration \
  --function-name nb-rag-sys-chat \
  --timeout 90

# 2. Optimize query
# - Reduce number of retrieved documents (max_results; see the retrieve example after this block)
# - Reduce chunk size
# - Add caching for common queries

# 3. Enable provisioned concurrency (reduces cold starts)
# Note: provisioned concurrency requires a published version or alias, not $LATEST
aws lambda put-provisioned-concurrency-config \
  --function-name nb-rag-sys-chat \
  --provisioned-concurrent-executions 2 \
  --qualifier [alias-or-version]
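
For step 2, the number of retrieved documents can be tested directly against the Knowledge Base before changing the Lambda; a sketch, with placeholders as elsewhere in this guide:

aws bedrock-agent-runtime retrieve \
  --knowledge-base-id [kb-id] \
  --retrieval-query '{"text": "test query"}' \
  --retrieval-configuration '{"vectorSearchConfiguration": {"numberOfResults": 3}}'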

Issue: Knowledge Base Returns No Results

Symptoms:

  • Chat returns “I don’t have enough information”
  • Empty sources array in response

Causes:

  • No documents in S3 documents bucket
  • Knowledge Base not synced
  • Query embedding failed
  • Low similarity threshold

Diagnosis:

# 1. Check S3 documents bucket
aws s3 ls s3://nb-rag-sys-documents/ --recursive

# 2. Check Knowledge Base sync status
aws bedrock-agent list-ingestion-jobs \
  --knowledge-base-id [kb-id] \
  --data-source-id [ds-id]

# 3. Check Chat Lambda logs
aws logs tail /aws/lambda/nb-rag-sys-chat --follow

# 4. Test Knowledge Base retrieval via AWS CLI
aws bedrock-agent-runtime retrieve \
  --knowledge-base-id [kb-id] \
  --retrieval-query '{"text": "test query"}'

Solution:

# 1. Verify documents exist in S3
aws s3 ls s3://nb-rag-sys-documents/ --summarize

# 2. If empty, ingest sample documents
# See docs/operations/data-ingestion.md

# 3. Trigger Knowledge Base sync
aws bedrock-agent start-ingestion-job \
  --knowledge-base-id [kb-id] \
  --data-source-id [ds-id]

# 4. Verify embedding model is working
# Check Bedrock Titan logs for errors
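
Retrieval scores help distinguish "nothing relevant is indexed" from "the threshold is too strict"; this sketch pipes the retrieve response through jq (field names follow the Bedrock Retrieve API response):

aws bedrock-agent-runtime retrieve \
  --knowledge-base-id [kb-id] \
  --retrieval-query '{"text": "test query"}' \
  | jq '.retrievalResults[] | {score: .score, source: .location.s3Location.uri}'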

Issue: Webhook Not Receiving Events

Symptoms:

  • Fathom/HelpScout/Linear webhook events not processed
  • No new documents appearing in search results

Causes:

  • Webhook URL incorrect
  • API key validation failed
  • Lambda function error
  • API Gateway route misconfigured

Diagnosis:

# 1. Check webhook Lambda logs
aws logs tail /aws/lambda/nb-rag-sys-webhook-fathom --follow

# 2. Check API Gateway logs
aws logs tail /aws/apigateway/nb-rag-sys --follow

# 3. Test webhook endpoint
curl -X POST https://[api-gateway-url]/webhooks/fathom \
  -H "Content-Type: application/json" \
  -H "x-api-key: [api-key]" \
  -d '{"event": "test", "data": {}}'

# 4. Check API Gateway routes
aws apigatewayv2 get-routes --api-id [api-id]
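
It is also worth checking whether requests are reaching the API at all; for HTTP APIs, the 4xx count (which includes API key and auth rejections) is published under AWS/ApiGateway with an ApiId dimension:

aws cloudwatch get-metric-statistics \
  --namespace AWS/ApiGateway \
  --metric-name 4xx \
  --dimensions Name=ApiId,Value=[api-id] \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Sum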

Solution:

# 1. Verify webhook URL in external service
# Should be: https://[api-gateway-url]/webhooks/[service]

# 2. Check API key in Secrets Manager matches webhook configuration
aws secretsmanager get-secret-value --secret-id nb-rag-sys-fathom-api-key

# 3. Review Lambda logs for validation errors
# Look for "Invalid API key" or "Missing API key"

# 4. Manually invoke Lambda to test (AWS CLI v2 needs raw-in-base64-out for a JSON payload)
aws lambda invoke \
  --function-name nb-rag-sys-webhook-fathom \
  --cli-binary-format raw-in-base64-out \
  --payload '{"body": "{\"event\":\"test\"}"}' \
  /tmp/response.json

cat /tmp/response.json

Performance Issues

Issue: Slow Query Response Time

Symptoms:

  • Chat takes >5 seconds to respond
  • Poor user experience

Causes:

  • Lambda cold start
  • Slow Knowledge Base retrieval
  • Large context window
  • Slow Bedrock inference

Diagnosis:

# 1. Check Lambda duration breakdown via X-Ray
aws xray get-trace-summaries \
  --start-time $(date -u -d '1 hour ago' +%s) \
  --end-time $(date -u +%s)

# 2. Check Lambda cold starts
aws logs filter-log-events \
  --log-group-name /aws/lambda/nb-rag-sys-chat \
  --filter-pattern "INIT_START" \
  --start-time $(date -u -d '1 hour ago' +%s)000

# 3. Check Bedrock latency
aws cloudwatch get-metric-statistics \
  --namespace AWS/Bedrock \
  --metric-name InvocationLatency \
  --dimensions Name=ModelId,Value=us.anthropic.claude-sonnet-4-5-20250929-v1:0 \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Average

Solution:

# 1. Enable provisioned concurrency (eliminates cold starts on provisioned instances)
# Note: provisioned concurrency requires a published version or alias, not $LATEST
aws lambda put-provisioned-concurrency-config \
  --function-name nb-rag-sys-chat \
  --provisioned-concurrent-executions 2 \
  --qualifier [alias-or-version]

# 2. Increase Lambda memory (improves CPU performance)
aws lambda update-function-configuration \
  --function-name nb-rag-sys-chat \
  --memory-size 1536

# 3. Reduce context window
# Change max_results from 5 to 3 in Query Lambda

# 4. Implement response caching
# Cache frequent queries in DynamoDB or ElastiCache (see the sketch after this block)

# 5. Optimize retrieval
# Adjust adaptive retrieval multiplier or reranking settings
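
For step 4, a minimal caching pattern keys responses on a hash of the normalized query and expires them with a TTL attribute. The table name nb-rag-sys-query-cache below is hypothetical and would need to be created (with TTL enabled on the ttl attribute) first:

QUERY="how do I configure webhooks"
KEY=$(printf '%s' "$QUERY" | tr '[:upper:]' '[:lower:]' | sha256sum | cut -d' ' -f1)

# Look for a cached answer
aws dynamodb get-item \
  --table-name nb-rag-sys-query-cache \
  --key "{\"query_hash\": {\"S\": \"$KEY\"}}"

# On a miss, store the generated answer with a one-hour TTL
aws dynamodb put-item \
  --table-name nb-rag-sys-query-cache \
  --item "{\"query_hash\": {\"S\": \"$KEY\"}, \"answer\": {\"S\": \"...\"}, \"ttl\": {\"N\": \"$(date -u -d '+1 hour' +%s)\"}}"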

Issue: High Lambda Costs

Symptoms:

  • AWS bill higher than expected
  • Lambda invocations spiking

Causes:

  • Runaway Lambda invocations
  • No concurrency limits
  • Infinite retry loops
  • Memory overprovisioning

Diagnosis:

# 1. Check Lambda invocation count
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Invocations \
  --dimensions Name=FunctionName,Value=nb-rag-sys-chat \
  --start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Sum

# 2. Check concurrent executions
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name ConcurrentExecutions \
  --dimensions Name=FunctionName,Value=nb-rag-sys-chat \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Maximum

# 3. Review Lambda configuration
aws lambda get-function-configuration --function-name nb-rag-sys-chat

Solution:

# 1. Set reserved concurrency limit
aws lambda put-function-concurrency \
  --function-name nb-rag-sys-chat \
  --reserved-concurrent-executions 10

# 2. Review and reduce memory if overprovisioned
# Check memory usage in CloudWatch Logs (see the query sketch after this block)
aws lambda update-function-configuration \
  --function-name nb-rag-sys-chat \
  --memory-size 512

# 3. Implement exponential backoff for retries
# Add to Lambda code or SQS DLQ

# 4. Enable cost allocation tags
aws lambda tag-resource \
  --resource arn:aws:lambda:us-east-1:ACCOUNT_ID:function:nb-rag-sys-chat \
  --tags CostCenter=engineering,Project=rag-system

# 5. Set up billing alarms (billing metrics are published in us-east-1 only)
aws cloudwatch put-metric-alarm \
  --alarm-name lambda-high-cost \
  --alarm-description "Alert on high estimated charges" \
  --metric-name EstimatedCharges \
  --namespace AWS/Billing \
  --dimensions Name=Currency,Value=USD \
  --statistic Maximum \
  --period 21600 \
  --threshold 50 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions [sns-topic-arn]
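
For step 2, actual memory usage is reported in each invocation's REPORT line and can be aggregated with a CloudWatch Logs Insights query; a sketch (start and end times are epoch seconds):

QUERY_ID=$(aws logs start-query \
  --log-group-name /aws/lambda/nb-rag-sys-chat \
  --start-time $(date -u -d '24 hours ago' +%s) \
  --end-time $(date -u +%s) \
  --query-string 'filter @type = "REPORT" | stats max(@maxMemoryUsed / 1000 / 1000) as maxMemoryUsedMB, avg(@maxMemoryUsed / 1000 / 1000) as avgMemoryUsedMB' \
  --query 'queryId' --output text)

# Give the query a few seconds to complete, then fetch results
sleep 5
aws logs get-query-results --query-id "$QUERY_ID"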

Data Issues

Issue: Documents Not Searchable After Ingestion

Symptoms:

  • Documents uploaded but not returned in search results
  • Webhook processed successfully but no results

Causes:

  • Knowledge Base sync not triggered
  • Embedding generation failed
  • Document not in correct S3 location
  • Metadata missing

Diagnosis:

# 1. Check webhook Lambda logs
aws logs filter-log-events \
  --log-group-name /aws/lambda/nb-rag-sys-webhook-fathom \
  --filter-pattern "Document" \
  --start-time $(date -u -d '1 hour ago' +%s)000

# 2. Verify document exists in S3
aws s3 ls s3://nb-rag-sys-documents/ --recursive

# 3. Check Knowledge Base sync status
aws bedrock-agent list-ingestion-jobs \
  --knowledge-base-id [kb-id] \
  --data-source-id [ds-id]

# 4. Check for errors in webhook processing
aws logs filter-log-events \
  --log-group-name /aws/lambda/nb-rag-sys-webhook-fathom \
  --filter-pattern "ERROR" \
  --start-time $(date -u -d '1 hour ago' +%s)000

Solution:

# 1. Manually test embedding generation
aws bedrock-runtime invoke-model \
  --model-id amazon.titan-embed-text-v2:0 \
  --body '{"inputText": "test document"}' \
  --cli-binary-format raw-in-base64-out \
  /tmp/embedding.json

# 2. Trigger Knowledge Base sync
aws bedrock-agent start-ingestion-job \
  --knowledge-base-id [kb-id] \
  --data-source-id [ds-id]

# 3. Re-process webhook event
# Resend webhook from external service or manually invoke Lambda

# 4. Check Knowledge Base status
aws bedrock-agent get-knowledge-base --knowledge-base-id [kb-id]
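
If the sync status shows a failed or partially failed job, the job details include failure reasons; the ingestion-job-id comes from the list-ingestion-jobs output above:

aws bedrock-agent get-ingestion-job \
  --knowledge-base-id [kb-id] \
  --data-source-id [ds-id] \
  --ingestion-job-id [job-id]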

Issue: Classification Results Missing

Symptoms:

  • Documents ingested but no classification in DynamoDB
  • Classify Lambda not being invoked

Causes:

  • Classify Lambda not triggered
  • DynamoDB write permissions missing
  • Classify Lambda error

Diagnosis:

# 1. Check Classify Lambda logs
aws logs tail /aws/lambda/nb-rag-sys-classify --follow

# 2. Check DynamoDB table
aws dynamodb scan --table-name nb-rag-sys-classify --max-items 10

# 3. Check IAM permissions
aws iam get-role-policy \
  --role-name nb-rag-sys-classify-lambda-role \
  --policy-name nb-rag-sys-classify-lambda-policy

Solution:

# 1. Manually invoke Classify Lambda (AWS CLI v2 needs raw-in-base64-out for a JSON payload)
aws lambda invoke \
  --function-name nb-rag-sys-classify \
  --cli-binary-format raw-in-base64-out \
  --payload '{"document_id": "test", "content": "Test content"}' \
  /tmp/response.json

# 2. Verify DynamoDB write permissions
# Should have dynamodb:PutItem permission

# 3. Check webhook Lambda is invoking Classify Lambda
# Review webhook Lambda code for invoke call

# 4. Manually insert classification result to test
aws dynamodb put-item \
  --table-name nb-rag-sys-classify \
  --item '{
    "document_id": {"S": "test-123"},
    "timestamp": {"S": "2025-11-08T12:00:00Z"},
    "categories": {"L": [{"S": "support"}, {"S": "technical"}]},
    "sentiment": {"S": "neutral"}
  }'
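
Reading the test item back confirms both the write path and the key schema; this assumes the composite key (document_id, timestamp) used in the put-item above:

aws dynamodb get-item \
  --table-name nb-rag-sys-classify \
  --key '{"document_id": {"S": "test-123"}, "timestamp": {"S": "2025-11-08T12:00:00Z"}}'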

Security Issues

Issue: Secrets Manager Access Denied

Symptoms:

User: arn:aws:sts::ACCOUNT_ID:assumed-role/nb-rag-sys-query-lambda-role/nb-rag-sys-query is not authorized to perform: secretsmanager:GetSecretValue

Causes:

  • Lambda IAM role missing permissions
  • Secret ARN incorrect
  • Secret deleted

Solution:

# 1. Verify secrets exist
aws secretsmanager list-secrets | grep nb-rag-sys

# 2. Check the Lambda IAM policy (use the role named in the error message)
aws iam get-role-policy \
  --role-name nb-rag-sys-chat-lambda-role \
  --policy-name nb-rag-sys-chat-lambda-policy

# 3. Update IAM policy
# Add secretsmanager:GetSecretValue permission (example statement after this block)

# 4. Redeploy via Terraform
cd terraform
terraform apply
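
For step 3, the statement to look for (or add via the Terraform IAM configuration) looks roughly like this; the secret ARN pattern is an assumption based on the naming used in this guide:

{
  "Effect": "Allow",
  "Action": "secretsmanager:GetSecretValue",
  "Resource": "arn:aws:secretsmanager:us-east-1:ACCOUNT_ID:secret:nb-rag-sys-*"
}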

Issue: Cognito Google OAuth Not Working

Symptoms:

  • “Invalid redirect URI” error
  • Login button doesn’t redirect
  • OAuth consent screen shows error

Causes:

  • Redirect URI mismatch
  • Google client secret incorrect
  • Cognito identity provider misconfigured

Solution:

# 1. Get Cognito domain
aws cognito-idp describe-user-pool --user-pool-id [pool-id] | jq -r '.UserPool.Domain'

# 2. Verify redirect URI in Google Console
# Should be: https://[cognito-domain].auth.us-east-1.amazoncognito.com/oauth2/idpresponse

# 3. Check Cognito identity provider
aws cognito-idp describe-identity-provider \
  --user-pool-id [pool-id] \
  --provider-name Google

# 4. Update Google client secret in Secrets Manager
aws secretsmanager update-secret \
  --secret-id nb-rag-sys-google-client-secret \
  --secret-string '{"client_secret": "GOCSPX-..."}'

# 5. Update Cognito identity provider
aws cognito-idp update-identity-provider \
  --user-pool-id [pool-id] \
  --provider-name Google \
  --provider-details client_id=[client-id],client_secret=[client-secret],authorize_scopes="openid email profile"
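
A mismatch in the app client's callback URLs produces the same "Invalid redirect URI" symptom, so they are worth checking alongside the Google-side redirect URI:

aws cognito-idp describe-user-pool-client \
  --user-pool-id [pool-id] \
  --client-id [client-id] \
  --query 'UserPoolClient.CallbackURLs'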

Monitoring & Alerting

Issue: CloudWatch Alarms Not Firing

Symptoms:

  • No SNS notifications received
  • Alarms stuck in “Insufficient data” state

Causes:

  • SNS subscription not confirmed
  • Alarm threshold too high
  • Metric data not being published

Solution:

# 1. Check alarm state
aws cloudwatch describe-alarms --alarm-names nb-rag-sys-lambda-errors

# 2. Confirm SNS subscription
aws sns list-subscriptions-by-topic --topic-arn [topic-arn]
# If "PendingConfirmation", check email and confirm

# 3. Check metric data exists
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Errors \
  --dimensions Name=FunctionName,Value=nb-rag-sys-chat \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Sum

# 4. Update alarm threshold
aws cloudwatch put-metric-alarm \
  --alarm-name nb-rag-sys-lambda-errors \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --metric-name Errors \
  --namespace AWS/Lambda \
  --period 300 \
  --statistic Sum \
  --threshold 5 \
  --alarm-actions [sns-topic-arn]
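
Once the subscription is confirmed, the notification path can be exercised end to end by forcing the alarm into the ALARM state; it returns to its real state on the next evaluation:

aws cloudwatch set-alarm-state \
  --alarm-name nb-rag-sys-lambda-errors \
  --state-value ALARM \
  --state-reason "Testing notification path"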

Getting Help

Support Channels

  1. Check Documentation: https://craftcodery.github.io/compass
  2. Review Logs: CloudWatch Logs for detailed error messages
  3. GitHub Issues: https://github.com/craftcodery/compass/issues
  4. AWS Support: For AWS service issues (requires support plan)

Collecting Debug Information

When reporting issues, include:

# 1. System info
terraform output
aws sts get-caller-identity

# 2. Recent errors
aws logs filter-log-events \
  --log-group-name /aws/lambda/nb-rag-sys-chat \
  --filter-pattern "ERROR" \
  --start-time $(date -u -d '1 hour ago' +%s)000 \
  --limit 20

# 3. Resource status
aws lambda get-function --function-name nb-rag-sys-chat
aws apigatewayv2 get-api --api-id [api-id]

# 4. Metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Errors \
  --dimensions Name=FunctionName,Value=nb-rag-sys-chat \
  --start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Sum

Last updated: 2025-12-30