Troubleshooting Guide
Comprehensive guide to diagnosing and resolving common issues in the NorthBuilt RAG System.
Quick Diagnosis
System Health Check
# Check API Gateway
curl -I https://[api-gateway-url]/chat
# Check Lambda functions
aws lambda list-functions --query 'Functions[?starts_with(FunctionName, `nb-rag-sys`)].FunctionName'
# Check recent Lambda errors
aws logs filter-log-events \
--log-group-name /aws/lambda/nb-rag-sys-chat \
--filter-pattern "ERROR" \
--start-time $(date -u -d '1 hour ago' +%s)000
# Check Bedrock throttling
aws cloudwatch get-metric-statistics \
--namespace AWS/Bedrock \
--metric-name InvocationThrottles \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Sum
Deployment Issues
Issue: Terraform Apply Fails with State Lock
Symptoms:
Error: Error acquiring the state lock
Lock Info:
ID: abc123...
Path: nb-rag-sys-terraform-state/terraform.tfstate
Causes:
- Previous Terraform run crashed
- Concurrent Terraform runs
- Manual interruption (Ctrl+C)
Solution:
# 1. Verify no other Terraform processes are running
ps aux | grep terraform
# 2. Check DynamoDB for lock
aws dynamodb scan --table-name nb-rag-sys-terraform-locks
# 3. Force unlock (use LockID from error message)
terraform force-unlock abc123...
# 4. If unlock fails, manually delete from DynamoDB
aws dynamodb delete-item \
--table-name nb-rag-sys-terraform-locks \
--key '{"LockID": {"S": "nb-rag-sys-terraform-state/terraform.tfstate-md5"}}'
Issue: Terraform Backend Configuration Error
Symptoms:
Error: Backend initialization required
Causes:
- First time running Terraform
- Backend configuration changed
- State bucket doesn’t exist
Solution:
# 1. Verify S3 bucket exists
aws s3 ls s3://nb-rag-sys-terraform-state
# 2. Initialize backend
cd terraform
terraform init
# 3. If bucket missing, run bootstrap
./.github/setup-oidc.sh
Issue: Resource Already Exists
Symptoms:
Error: Error creating Lambda Function: ResourceAlreadyExistsException
Causes:
- Manual resource creation
- Previous incomplete Terraform apply
- Resource not in Terraform state
Solution:
# 1. Import existing resource into state
terraform import 'aws_lambda_function.chat' 'nb-rag-sys-chat'
# 2. Verify import
terraform plan
# Should show no changes for imported resource
# 3. If resource is wrong, delete and recreate
aws lambda delete-function --function-name nb-rag-sys-chat
terraform apply
Issue: GitHub Actions Authentication Failed
Symptoms:
Error: User: arn:aws:sts::ACCOUNT_ID:assumed-role/GitHubActionsOIDCRole/GitHubActions is not authorized
Causes:
- IAM role permissions insufficient
- OIDC provider misconfigured
- GitHub secret incorrect
Solution:
# 1. Update IAM role permissions
./.github/setup-oidc.sh
# 2. Verify GitHub secret
gh secret list
gh secret set AWS_ROLE_ARN --body "arn:aws:iam::ACCOUNT_ID:role/GitHubActionsOIDCRole"
# 3. Check OIDC provider exists
aws iam list-open-id-connect-providers
# 4. Verify role trust policy
aws iam get-role --role-name GitHubActionsOIDCRole --query 'Role.AssumeRolePolicyDocument'
# 5. Re-run GitHub Actions workflow
gh run rerun --failed
Runtime Issues
Issue: Chat API Returns 401 Unauthorized
Symptoms:
- Web UI shows “Unauthorized” error
- API returns
{"message": "Unauthorized"}
Causes:
- Expired JWT token
- Invalid JWT token
- Cognito misconfiguration
Diagnosis:
# Check Cognito user pool
aws cognito-idp describe-user-pool --user-pool-id [pool-id]
# Check API Gateway authorizer
aws apigatewayv2 get-authorizer --api-id [api-id] --authorizer-id [authorizer-id]
# Test JWT token (decode)
echo "[jwt-token]" | cut -d. -f2 | base64 -d | jq .
Solution:
# 1. Refresh token in web UI (log out and log back in)
# 2. Verify Cognito issuer matches authorizer configuration
# Issuer should be: https://cognito-idp.us-east-1.amazonaws.com/[user-pool-id]
# 3. Check token expiration (exp claim)
# Tokens expire after 1 hour by default
# 4. Verify audience (aud claim) matches client ID
aws cognito-idp describe-user-pool-client \
--user-pool-id [pool-id] \
--client-id [client-id]
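If the quick decode above trips on base64url padding, a slightly more careful sketch (assumes GNU coreutils and jq; [jwt-token] is your ID or access token) inspects the exp and aud claims from steps 3 and 4 directly:
# Extract the payload, convert base64url to base64, and pad to a multiple of 4
PAYLOAD=$(echo "[jwt-token]" | cut -d. -f2 | tr '_-' '/+')
while [ $((${#PAYLOAD} % 4)) -ne 0 ]; do PAYLOAD="${PAYLOAD}="; done
CLAIMS=$(echo "$PAYLOAD" | base64 -d)
# Show issuer, audience (ID tokens use aud, access tokens use client_id), and expiration
echo "$CLAIMS" | jq '{iss, aud, client_id, exp}'
# Report whether the token has already expired
echo "$CLAIMS" | jq --argjson now "$(date -u +%s)" 'if .exp < $now then "expired" else "still valid" end'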
Issue: Chat API Returns 500 Internal Server Error
Symptoms:
- Web UI shows generic error
- API returns
{"message": "Internal server error"}
Causes:
- Lambda function error
- Bedrock throttling
- Knowledge Base retrieval issue
Diagnosis:
# 1. Check Lambda logs
aws logs tail /aws/lambda/nb-rag-sys-chat --follow
# 2. Get recent errors
aws logs filter-log-events \
--log-group-name /aws/lambda/nb-rag-sys-chat \
--filter-pattern "ERROR" \
--start-time $(date -u -d '15 minutes ago' +%s)000
# 3. Check Lambda metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Errors \
--dimensions Name=FunctionName,Value=nb-rag-sys-chat \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Sum
Solution - Lambda Error:
# View specific error from logs
aws logs get-log-events \
--log-group-name /aws/lambda/nb-rag-sys-chat \
--log-stream-name [latest-stream] \
--limit 50
# Common fixes:
# - Increase Lambda timeout (in terraform/modules/lambda/main.tf)
# - Increase Lambda memory (may improve performance)
# - Check environment variables are set correctly
# - Verify IAM permissions
# Update Lambda configuration
aws lambda update-function-configuration \
--function-name nb-rag-sys-chat \
--timeout 90 \
--memory-size 1536
Solution - Bedrock Throttling:
# Check Bedrock throttling
aws cloudwatch get-metric-statistics \
--namespace AWS/Bedrock \
--metric-name InvocationThrottles \
--dimensions Name=ModelId,Value=us.anthropic.claude-sonnet-4-5-20250929-v1:0 \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Sum
# Request service quota increase
aws service-quotas request-service-quota-increase \
--service-code bedrock \
--quota-code [quota-code] \
--desired-value 100
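If the quota code is unknown, the available Bedrock quotas can be listed first and the relevant code copied into the request above (the JMESPath filter on `Claude` is illustrative; adjust it for the model in use):
aws service-quotas list-service-quotas \
--service-code bedrock \
--query 'Quotas[?contains(QuotaName, `Claude`)].[QuotaName,QuotaCode]' \
--output table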
Issue: Lambda Function Timeout
Symptoms:
Task timed out after 60.00 seconds
Causes:
- Slow Bedrock inference
- Slow Knowledge Base retrieval
- Large context window
- Network latency
Diagnosis:
# Check Lambda duration metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Duration \
--dimensions Name=FunctionName,Value=nb-rag-sys-chat \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average,Maximum
# Review X-Ray traces (if enabled)
aws xray get-trace-summaries \
--start-time $(date -u -d '1 hour ago' +%s) \
--end-time $(date -u +%s) \
--filter-expression 'duration > 50'
Solution:
# 1. Increase Lambda timeout
aws lambda update-function-configuration \
--function-name nb-rag-sys-chat \
--timeout 90
# 2. Optimize query
# - Reduce number of retrieved documents (max_results)
# - Reduce chunk size
# - Add caching for common queries
# 3. Enable provisioned concurrency (reduces cold starts)
# Note: provisioned concurrency requires a published version or alias, not $LATEST
aws lambda publish-version --function-name nb-rag-sys-chat
aws lambda put-provisioned-concurrency-config \
--function-name nb-rag-sys-chat \
--provisioned-concurrent-executions 2 \
--qualifier [version-or-alias]
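To gauge how much retrieval size contributes to the timeout (step 2), the Knowledge Base can be queried directly with a smaller result count; this only tests retrieval from the CLI, and the production max_results value still lives in the Lambda/Terraform configuration:
aws bedrock-agent-runtime retrieve \
--knowledge-base-id [kb-id] \
--retrieval-query '{"text": "test query"}' \
--retrieval-configuration '{"vectorSearchConfiguration": {"numberOfResults": 3}}'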
Issue: No Results from Vector Search
Symptoms:
- Chat returns “I don’t have enough information”
- Empty sources array in response
Causes:
- No documents in S3 documents bucket
- Knowledge Base not synced
- Query embedding failed
- Low similarity threshold
Diagnosis:
# 1. Check S3 documents bucket
aws s3 ls s3://nb-rag-sys-documents/ --recursive
# 2. Check Knowledge Base sync status
aws bedrock-agent list-ingestion-jobs \
--knowledge-base-id [kb-id] \
--data-source-id [ds-id]
# 3. Check Chat Lambda logs
aws logs tail /aws/lambda/nb-rag-sys-chat --follow
# 4. Test Knowledge Base retrieval via AWS CLI
aws bedrock-agent-runtime retrieve \
--knowledge-base-id [kb-id] \
--retrieval-query '{"text": "test query"}'
Solution:
# 1. Verify documents exist in S3
aws s3 ls s3://nb-rag-sys-documents/ --summarize
# 2. If empty, ingest sample documents
# See docs/operations/data-ingestion.md
# 3. Trigger Knowledge Base sync
aws bedrock-agent start-ingestion-job \
--knowledge-base-id [kb-id] \
--data-source-id [ds-id]
# 4. Verify embedding model is working
# Check Bedrock Titan logs for errors
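For step 4, the embedding model can also be exercised directly (same invocation used later in this guide); a non-empty embedding array confirms the model itself is healthy:
aws bedrock-runtime invoke-model \
--model-id amazon.titan-embed-text-v2:0 \
--body '{"inputText": "test query"}' \
--cli-binary-format raw-in-base64-out \
/tmp/embedding.json
# Titan v2 returns an "embedding" array (1024 dimensions by default)
jq '.embedding | length' /tmp/embedding.json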
Issue: Webhook Not Receiving Events
Symptoms:
- Fathom/HelpScout/Linear webhook events not processed
- No new documents appearing in search results
Causes:
- Webhook URL incorrect
- API key validation failed
- Lambda function error
- API Gateway route misconfigured
Diagnosis:
# 1. Check webhook Lambda logs
aws logs tail /aws/lambda/nb-rag-sys-webhook-fathom --follow
# 2. Check API Gateway logs
aws logs tail /aws/apigateway/nb-rag-sys --follow
# 3. Test webhook endpoint
curl -X POST https://[api-gateway-url]/webhooks/fathom \
-H "Content-Type: application/json" \
-H "x-api-key: [api-key]" \
-d '{"event": "test", "data": {}}'
# 4. Check API Gateway routes
aws apigatewayv2 get-routes --api-id [api-id]
Solution:
# 1. Verify webhook URL in external service
# Should be: https://[api-gateway-url]/webhooks/[service]
# 2. Check API key in Secrets Manager matches webhook configuration
aws secretsmanager get-secret-value --secret-id nb-rag-sys-fathom-api-key
# 3. Review Lambda logs for validation errors
# Look for "Invalid API key" or "Missing API key"
# 4. Manually invoke Lambda to test
aws lambda invoke \
--function-name nb-rag-sys-webhook-fathom \
--cli-binary-format raw-in-base64-out \
--payload '{"body": "{\"event\":\"test\"}"}' \
/tmp/response.json
cat /tmp/response.json
Performance Issues
Issue: Slow Query Response Time
Symptoms:
- Chat takes >5 seconds to respond
- Poor user experience
Causes:
- Lambda cold start
- Slow Knowledge Base retrieval
- Large context window
- Slow Bedrock inference
Diagnosis:
# 1. Check Lambda duration breakdown via X-Ray
aws xray get-trace-summaries \
--start-time $(date -u -d '1 hour ago' +%s) \
--end-time $(date -u +%s)
# 2. Check Lambda cold starts
aws logs filter-log-events \
--log-group-name /aws/lambda/nb-rag-sys-chat \
--filter-pattern "INIT_START" \
--start-time $(date -u -d '1 hour ago' +%s)000
# 3. Check Bedrock latency
aws cloudwatch get-metric-statistics \
--namespace AWS/Bedrock \
--metric-name InvocationLatency \
--dimensions Name=ModelId,Value=us.anthropic.claude-sonnet-4-5-20250929-v1:0 \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average
Solution:
# 1. Enable provisioned concurrency (eliminates cold starts)
# Note: requires a published version or alias as the qualifier, not $LATEST
aws lambda publish-version --function-name nb-rag-sys-chat
aws lambda put-provisioned-concurrency-config \
--function-name nb-rag-sys-chat \
--provisioned-concurrent-executions 2 \
--qualifier [version-or-alias]
# 2. Increase Lambda memory (improves CPU performance)
aws lambda update-function-configuration \
--function-name nb-rag-sys-chat \
--memory-size 1536
# 3. Reduce context window
# Change max_results from 5 to 3 in Query Lambda
# 4. Implement response caching
# Cache frequent queries in DynamoDB or ElastiCache
# 5. Optimize retrieval
# Adjust adaptive retrieval multiplier or reranking settings
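For the caching idea in step 4, a rough sketch of a DynamoDB-backed cache keyed on a hash of the query; the nb-rag-sys-query-cache table is hypothetical and would need to be created (e.g. via Terraform) with TTL enabled on the ttl attribute:
QUERY="how do I reset my password"
CACHE_KEY=$(printf '%s' "$QUERY" | sha256sum | cut -d' ' -f1)
# Check for a cached answer before invoking Bedrock
aws dynamodb get-item \
--table-name nb-rag-sys-query-cache \
--key "{\"query_hash\": {\"S\": \"$CACHE_KEY\"}}"
# On a miss, store the generated answer with a one-day expiry
aws dynamodb put-item \
--table-name nb-rag-sys-query-cache \
--item "{\"query_hash\": {\"S\": \"$CACHE_KEY\"}, \"answer\": {\"S\": \"[generated-answer]\"}, \"ttl\": {\"N\": \"$(date -u -d '+1 day' +%s)\"}}"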
Issue: High Lambda Costs
Symptoms:
- AWS bill higher than expected
- Lambda invocations spiking
Causes:
- Runaway Lambda invocations
- No concurrency limits
- Infinite retry loops
- Memory overprovisioning
Diagnosis:
# 1. Check Lambda invocation count
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Invocations \
--dimensions Name=FunctionName,Value=nb-rag-sys-chat \
--start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600 \
--statistics Sum
# 2. Check concurrent executions
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name ConcurrentExecutions \
--dimensions Name=FunctionName,Value=nb-rag-sys-chat \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Maximum
# 3. Review Lambda configuration
aws lambda get-function-configuration --function-name nb-rag-sys-chat
Solution:
# 1. Set reserved concurrency limit
aws lambda put-function-concurrency \
--function-name nb-rag-sys-chat \
--reserved-concurrent-executions 10
# 2. Review and reduce memory if overprovisioned
# Check memory usage in CloudWatch Logs
aws lambda update-function-configuration \
--function-name nb-rag-sys-chat \
--memory-size 512
# 3. Implement exponential backoff for retries
# Add to Lambda code or SQS DLQ
# 4. Enable cost allocation tags
aws lambda tag-resource \
--resource arn:aws:lambda:us-east-1:ACCOUNT_ID:function:nb-rag-sys-chat \
--tags CostCenter=engineering,Project=rag-system
# 5. Set up billing alarms
# Note: requires "Receive Billing Alerts" to be enabled in the account billing
# preferences; the EstimatedCharges metric is published only in us-east-1
aws cloudwatch put-metric-alarm \
--alarm-name lambda-high-cost \
--alarm-description "Alert on high Lambda cost" \
--metric-name EstimatedCharges \
--namespace AWS/Billing \
--dimensions Name=Currency,Value=USD \
--statistic Maximum \
--period 21600 \
--threshold 50 \
--comparison-operator GreaterThanThreshold \
--alarm-actions [sns-topic-arn]
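Before lowering memory in step 2, it helps to confirm actual usage from Lambda's own REPORT log lines, which record peak memory per invocation:
aws logs filter-log-events \
--log-group-name /aws/lambda/nb-rag-sys-chat \
--filter-pattern "REPORT" \
--start-time $(date -u -d '1 hour ago' +%s)000 \
--query 'events[].message' \
--output text | grep -o 'Max Memory Used: [0-9]* MB' | sort | uniq -c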
Data Issues
Issue: Documents Not Searchable After Ingestion
Symptoms:
- Documents uploaded but not returned in search results
- Webhook processed successfully but no results
Causes:
- Knowledge Base sync not triggered
- Embedding generation failed
- Document not in correct S3 location
- Metadata missing
Diagnosis:
# 1. Check webhook Lambda logs
aws logs filter-log-events \
--log-group-name /aws/lambda/nb-rag-sys-webhook-fathom \
--filter-pattern "Document" \
--start-time $(date -u -d '1 hour ago' +%s)000
# 2. Verify document exists in S3
aws s3 ls s3://nb-rag-sys-documents/ --recursive
# 3. Check Knowledge Base sync status
aws bedrock-agent list-ingestion-jobs \
--knowledge-base-id [kb-id] \
--data-source-id [ds-id]
# 4. Check for errors in webhook processing
aws logs filter-log-events \
--log-group-name /aws/lambda/nb-rag-sys-webhook-fathom \
--filter-pattern "ERROR" \
--start-time $(date -u -d '1 hour ago' +%s)000
Solution:
# 1. Manually test embedding generation
aws bedrock-runtime invoke-model \
--model-id amazon.titan-embed-text-v2:0 \
--body '{"inputText": "test document"}' \
--cli-binary-format raw-in-base64-out \
/tmp/embedding.json
# 2. Trigger Knowledge Base sync
aws bedrock-agent start-ingestion-job \
--knowledge-base-id [kb-id] \
--data-source-id [ds-id]
# 3. Re-process webhook event
# Resend webhook from external service or manually invoke Lambda
# 4. Check Knowledge Base status
aws bedrock-agent get-knowledge-base --knowledge-base-id [kb-id]
Issue: Classification Results Missing
Symptoms:
- Documents ingested but no classification in DynamoDB
- Classify Lambda not being invoked
Causes:
- Classify Lambda not triggered
- DynamoDB write permissions missing
- Classify Lambda error
Diagnosis:
# 1. Check Classify Lambda logs
aws logs tail /aws/lambda/nb-rag-sys-classify --follow
# 2. Check DynamoDB table
aws dynamodb scan --table-name nb-rag-sys-classify --max-items 10
# 3. Check IAM permissions
aws iam get-role-policy \
--role-name nb-rag-sys-classify-lambda-role \
--policy-name nb-rag-sys-classify-lambda-policy
Solution:
# 1. Manually invoke Classify Lambda
aws lambda invoke \
--function-name nb-rag-sys-classify \
--cli-binary-format raw-in-base64-out \
--payload '{"document_id": "test", "content": "Test content"}' \
/tmp/response.json
# 2. Verify DynamoDB write permissions
# Should have dynamodb:PutItem permission
# 3. Check webhook Lambda is invoking Classify Lambda
# Review webhook Lambda code for invoke call
# 4. Manually insert classification result to test
aws dynamodb put-item \
--table-name nb-rag-sys-classify \
--item '{
"document_id": {"S": "test-123"},
"timestamp": {"S": "2025-11-08T12:00:00Z"},
"categories": {"L": [{"S": "support"}, {"S": "technical"}]},
"sentiment": {"S": "neutral"}
}'
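To confirm the write actually landed (and that reads work end to end), the test item can be queried back; this assumes document_id is the table's partition key, which may differ in your schema:
aws dynamodb query \
--table-name nb-rag-sys-classify \
--key-condition-expression 'document_id = :id' \
--expression-attribute-values '{":id": {"S": "test-123"}}'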
Security Issues
Issue: Secrets Manager Access Denied
Symptoms:
User: arn:aws:sts::ACCOUNT_ID:assumed-role/nb-rag-sys-query-lambda-role/nb-rag-sys-query is not authorized to perform: secretsmanager:GetSecretValue
Causes:
- Lambda IAM role missing permissions
- Secret ARN incorrect
- Secret deleted
Solution:
# 1. Verify secrets exist
aws secretsmanager list-secrets | grep nb-rag-sys
# 2. Check the IAM policy attached to the Lambda role named in the error (chat Lambda shown as an example)
aws iam get-role-policy \
--role-name nb-rag-sys-chat-lambda-role \
--policy-name nb-rag-sys-chat-lambda-policy
# 3. Update IAM policy
# Add secretsmanager:GetSecretValue permission
# 4. Redeploy via Terraform
cd terraform
terraform apply
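If the policy in step 2 is missing the permission, the durable fix is the Terraform change in steps 3-4; as a stopgap, an inline policy can be attached from the CLI (the policy name and resource ARN below are illustrative):
aws iam put-role-policy \
--role-name nb-rag-sys-chat-lambda-role \
--policy-name nb-rag-sys-chat-secrets-access \
--policy-document '{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Action": "secretsmanager:GetSecretValue",
"Resource": "arn:aws:secretsmanager:us-east-1:ACCOUNT_ID:secret:nb-rag-sys-*"
}]
}'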
Issue: Cognito Google OAuth Not Working
Symptoms:
- “Invalid redirect URI” error
- Login button doesn’t redirect
- OAuth consent screen shows error
Causes:
- Redirect URI mismatch
- Google client secret incorrect
- Cognito identity provider misconfigured
Solution:
# 1. Get Cognito domain
aws cognito-idp describe-user-pool --user-pool-id [pool-id] | jq -r '.UserPool.Domain'
# 2. Verify redirect URI in Google Console
# Should be: https://[cognito-domain].auth.us-east-1.amazoncognito.com/oauth2/idpresponse
# 3. Check Cognito identity provider
aws cognito-idp describe-identity-provider \
--user-pool-id [pool-id] \
--provider-name Google
# 4. Update Google client secret in Secrets Manager
aws secretsmanager update-secret \
--secret-id nb-rag-sys-google-client-secret \
--secret-string '{"client_secret": "GOCSPX-..."}'
# 5. Update Cognito identity provider
aws cognito-idp update-identity-provider \
--user-pool-id [pool-id] \
--provider-name Google \
--provider-details 'client_id=[client-id],client_secret=[client-secret],authorize_scopes="openid email profile"'
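After updating, the hosted UI authorization endpoint can be hit directly to confirm the Google provider is wired up; a 302 response pointing at accounts.google.com indicates the provider and redirect URI are configured correctly (placeholders as above):
curl -sI "https://[cognito-domain].auth.us-east-1.amazoncognito.com/oauth2/authorize?client_id=[client-id]&response_type=code&scope=openid+email+profile&redirect_uri=[callback-url]&identity_provider=Google" | head -5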
Monitoring & Alerting
Issue: CloudWatch Alarms Not Firing
Symptoms:
- No SNS notifications received
- Alarms stuck in “Insufficient data” state
Causes:
- SNS subscription not confirmed
- Alarm threshold too high
- Metric data not being published
Solution:
# 1. Check alarm state
aws cloudwatch describe-alarms --alarm-names nb-rag-sys-lambda-errors
# 2. Confirm SNS subscription
aws sns list-subscriptions-by-topic --topic-arn [topic-arn]
# If "PendingConfirmation", check email and confirm
# 3. Check metric data exists
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Errors \
--dimensions Name=FunctionName,Value=nb-rag-sys-chat \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Sum
# 4. Update alarm threshold
aws cloudwatch put-metric-alarm \
--alarm-name nb-rag-sys-lambda-errors \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 1 \
--metric-name Errors \
--namespace AWS/Lambda \
--dimensions Name=FunctionName,Value=nb-rag-sys-chat \
--period 300 \
--statistic Sum \
--threshold 5 \
--alarm-actions [sns-topic-arn]
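Once the subscription is confirmed, the notification path can be tested without waiting for a real error by forcing the alarm into the ALARM state:
aws cloudwatch set-alarm-state \
--alarm-name nb-rag-sys-lambda-errors \
--state-value ALARM \
--state-reason "Manual test of SNS notification"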
Getting Help
Support Channels
- Check Documentation: https://craftcodery.github.io/compass
- Review Logs: CloudWatch Logs for detailed error messages
- GitHub Issues: https://github.com/craftcodery/compass/issues
- AWS Support: For AWS service issues (requires support plan)
Collecting Debug Information
When reporting issues, include:
# 1. System info
terraform output
aws sts get-caller-identity
# 2. Recent errors
aws logs filter-log-events \
--log-group-name /aws/lambda/nb-rag-sys-chat \
--filter-pattern "ERROR" \
--start-time $(date -u -d '1 hour ago' +%s)000 \
--limit 20
# 3. Resource status
aws lambda get-function --function-name nb-rag-sys-chat
aws apigatewayv2 get-api --api-id [api-id]
# 4. Metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Errors \
--dimensions Name=FunctionName,Value=nb-rag-sys-chat \
--start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600 \
--statistics Sum
Last updated: 2025-12-30