Codebase Audit Findings
Comprehensive security, code quality, and architecture audit of the NorthBuilt RAG System. This document tracks issues identified during code reviews and their resolution status.
Table of Contents
- Audit Overview
- Critical Issues
- CRIT-001: Hardcoded AWS Account ID in Repository
- CRIT-002: Missing DynamoDB State Locking for Terraform
- CRIT-003: Overly Permissive CloudWatch IAM Policy
- CRIT-004: Bare Exception Handlers Silently Suppress Errors
- CRIT-005: Missing JSON Decode Error Handling
- CRIT-006: Missing Request Timeouts in HelpScout Connection
- CRIT-007: Hardcoded Dates in Integration Tests
- CRIT-008: Untested Classification Strategies
- High Severity Issues
- HIGH-001: Cognito Security Features Disabled
- HIGH-002: CORS Wildcard Default in API Gateway
- HIGH-003: Missing DynamoDB Encryption Configuration
- HIGH-004: S3 Buckets Using AWS-Managed Encryption Keys
- HIGH-005: CloudFront Missing Security Headers
- HIGH-006: No Lambda Reserved Concurrency Limits
- HIGH-007: Missing Lambda Dead Letter Queues
- HIGH-008: Short Log Retention (7 Days)
- HIGH-009: No Retry Logic for External API Calls
- HIGH-010: Tokens Stored in localStorage (Web)
- HIGH-011: Missing Content Security Policy (Web)
- HIGH-012: Missing Connection Tests (Fathom, HelpScout)
- Medium Severity Issues
- MED-001: Cognito Password Temporary Validity Too Long
- MED-002: No MFA Configuration
- MED-003: S3 Backup Immediate Glacier Transition
- MED-004: Unimplemented Strategies Return Null Silently
- MED-005: DynamoDB Client Created Per Request
- MED-006: Hardcoded Rate Limiting Sleep Values
- MED-007: DEBUG Mode Logs Sensitive Data (Web)
- MED-008: Error Messages Expose Backend Details (Web)
- MED-009: Incomplete Test Body
- MED-010 through MED-025: Additional Medium Issues
- Low Severity Issues
- Positive Findings
- Resolution Tracking
- Appendix: Related Documents
- Audit Findings - 2025-12-30
- NEW-001: Frontend Debug Mode Defaults to True
- NEW-002: Missing EntityType in Classify Lambda Response
- NEW-003: Query Understanding Not Documented in CLAUDE.md
- NEW-004: CLASSIFY_TABLE Environment Variable Undocumented
- NEW-005: System Architecture Doc Says “Vanilla JS”
- NEW-006: No Entity Loader Tests
- NEW-007: Integration Tests Not Run on PRs
- NEW-008: No Post-Deployment Smoke Tests
- NEW-009: Security Gates Don’t Fail Builds
- NEW-010: No Token Refresh Logic in Frontend
Audit Overview
Audit Date: 2025-12-29
Scope: Full codebase including Lambda functions, Terraform infrastructure, React web application, shared utilities, and test suite.
Summary Statistics
| Severity | Count | Status |
|---|---|---|
| Critical | 8 | Open |
| High | 17 | 1 Resolved, 16 Open |
| Medium | 25 | Open |
| Low | 20+ | Open |
Critical Issues
Issues that pose significant security risks, could cause data loss, or will cause system failures.
CRIT-001: Hardcoded AWS Account ID in Repository
Severity: Critical
Component: Terraform
File: terraform/variables.tf (lines 196-197)
Status: Open
Description
The AWS account ID and ACM certificate ARN are hardcoded as default values in the Terraform variables file. This is a security anti-pattern because:
- Account ID Exposure: While not a secret per se, the AWS account ID provides attackers with targeting information for social engineering, phishing, or crafting valid-looking IAM principal ARNs.
- Certificate Tracking: The ACM certificate ARN reveals the exact certificate being used, which could be useful for reconnaissance.
- Version Control Persistence: Once committed, these values exist in Git history permanently, even if removed later.
Current Code
variable "acm_certificate_arn" {
description = "ACM certificate ARN for CloudFront distribution"
type = string
default = "arn:aws:acm:us-east-1:554192034098:certificate/883924c2-0006-476f-bce6-05e25ebcb9e7"
}
Recommended Fix
Remove the hardcoded default and require explicit configuration:
variable "acm_certificate_arn" {
description = "ACM certificate ARN for CloudFront distribution. Must be in us-east-1 for CloudFront."
type = string
# No default - must be provided via terraform.tfvars or environment variable
}
Then create a terraform.tfvars.example file (committed) and terraform.tfvars (gitignored):
# terraform.tfvars.example
acm_certificate_arn = "arn:aws:acm:us-east-1:YOUR_ACCOUNT_ID:certificate/YOUR_CERT_ID"
Impact
- Security: Reduces attack surface
- Operations: Requires explicit configuration per environment
Resolution Steps
- Create terraform.tfvars.example with placeholder values
- Add terraform.tfvars to .gitignore
- Remove default values from variables.tf
- Update the deployment documentation
- Consider using AWS Secrets Manager or Parameter Store for sensitive configuration
CRIT-002: Missing DynamoDB State Locking for Terraform
Severity: Critical
Component: Terraform
File: terraform/backend.tf (lines 12-19)
Status: Open
Description
The Terraform backend configuration uses S3 native locking (use_lockfile = true) instead of DynamoDB state locking. S3 native locking is:
- Less Reliable: S3’s conditional writes are eventually consistent in some edge cases
- Not Recommended: AWS and HashiCorp recommend DynamoDB for production state locking
- Race Condition Risk: Concurrent terraform apply operations could corrupt state
Current Code
terraform {
backend "s3" {
bucket = "nb-rag-sys-terraform-state"
key = "terraform.tfstate"
region = "us-east-1"
encrypt = true
use_lockfile = true # S3 native locking - NOT RECOMMENDED
}
}
Recommended Fix
Add DynamoDB state locking:
terraform {
backend "s3" {
bucket = "nb-rag-sys-terraform-state"
key = "terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "nb-rag-sys-terraform-lock"
# Remove use_lockfile when using DynamoDB
}
}
Create the DynamoDB table:
resource "aws_dynamodb_table" "terraform_lock" {
name = "nb-rag-sys-terraform-lock"
billing_mode = "PAY_PER_REQUEST"
hash_key = "LockID"
attribute {
name = "LockID"
type = "S"
}
tags = {
Name = "Terraform State Lock"
Purpose = "terraform-state-locking"
}
}
Impact
- Reliability: Prevents state corruption from concurrent operations
- Safety: CI/CD pipelines and multiple developers can safely run Terraform
Resolution Steps
- Create DynamoDB table for locking (can be done manually or via bootstrap script)
- Update backend configuration
- Run terraform init -migrate-state to apply the new backend configuration
- Test with concurrent plan operations to verify locking works
CRIT-003: Overly Permissive CloudWatch IAM Policy
Severity: Critical
Component: Terraform
File: terraform/modules/lambda/main.tf (lines 536-542)
Status: Open
Description
Lambda functions are granted cloudwatch:PutMetricData permission with a wildcard resource (Resource = "*"). This violates the principle of least privilege because:
- Metric Pollution: Lambda could write metrics to any namespace, including AWS-reserved namespaces
- Cost Implications: Malicious or buggy code could create thousands of custom metrics (each costing $0.30/month)
- Monitoring Confusion: Metrics could be written to incorrect namespaces, confusing dashboards
Current Code
{
Effect = "Allow"
Action = [
"cloudwatch:PutMetricData"
]
Resource = "*" # Too permissive
}
Recommended Fix
Unfortunately, CloudWatch PutMetricData does not support resource-level permissions. The mitigation is to add a condition limiting to specific namespaces:
{
Effect = "Allow"
Action = [
"cloudwatch:PutMetricData"
]
Resource = "*"
Condition = {
StringEquals = {
"cloudwatch:namespace" = [
"NorthBuilt/RAG",
"NorthBuilt/Sync",
"NorthBuilt/Webhooks"
]
}
}
}
Update Lambda code to use only these approved namespaces:
cloudwatch.put_metric_data(
Namespace='NorthBuilt/RAG', # Must match IAM condition
MetricData=[...]
)
Impact
- Security: Limits metric write scope to approved namespaces
- Cost: Prevents accidental metric creation in wrong namespaces
Resolution Steps
- Audit all Lambda functions for current namespace usage
- Define approved namespace list
- Update IAM policy with namespace condition
- Update Lambda code to use approved namespaces
- Test metric publishing still works
CRIT-004: Bare Exception Handlers Silently Suppress Errors
Severity: Critical
Component: Lambda Shared Utilities
Files: lambda/shared/connections/fathom.py (line 260), lambda/shared/connections/helpscout.py (line 536)
Status: Open
Description
Bare except: clauses (without specifying exception types) catch ALL exceptions, including critical system exceptions like SystemExit, KeyboardInterrupt, and MemoryError. This can:
- Mask Real Bugs: Programming errors like NameError, TypeError, or AttributeError are silently swallowed
- Hide Memory Issues: MemoryError exceptions go unnoticed, leading to silent failures
- Prevent Debugging: No logging or stack traces for unexpected exceptions
- Break Signal Handling: KeyboardInterrupt and SystemExit are caught, preventing graceful shutdown
Current Code (Fathom)
# lambda/shared/connections/fathom.py:260
try:
# Date calculation logic
from datetime import datetime
if isinstance(recording_start, str):
recording_start = datetime.fromisoformat(recording_start.replace('Z', '+00:00'))
if isinstance(recording_end, str):
recording_end = datetime.fromisoformat(recording_end.replace('Z', '+00:00'))
duration = int((recording_end - recording_start).total_seconds() / 60)
except: # DANGEROUS: Catches everything including MemoryError, SystemExit
duration = 0
Current Code (HelpScout)
# lambda/shared/connections/helpscout.py:536
try:
dt = datetime.fromisoformat(dt_string.replace('Z', '+00:00'))
return dt.strftime('%B %d, %Y %I:%M %p')
except: # DANGEROUS: Silently returns original string for ANY error
return dt_string
Recommended Fix
Catch only the specific exceptions that are expected:
# Fathom duration calculation
try:
from datetime import datetime
if isinstance(recording_start, str):
recording_start = datetime.fromisoformat(recording_start.replace('Z', '+00:00'))
if isinstance(recording_end, str):
recording_end = datetime.fromisoformat(recording_end.replace('Z', '+00:00'))
duration = int((recording_end - recording_start).total_seconds() / 60)
except (ValueError, TypeError, AttributeError) as e:
logger.warning(f"Failed to calculate meeting duration: {e}")
duration = 0
# HelpScout datetime formatting
try:
dt = datetime.fromisoformat(dt_string.replace('Z', '+00:00'))
return dt.strftime('%B %d, %Y %I:%M %p')
except (ValueError, TypeError) as e:
logger.warning(f"Failed to parse datetime '{dt_string}': {e}")
return dt_string
Impact
- Reliability: Unexpected errors are properly logged and can be debugged
- Observability: CloudWatch logs will show warnings for parsing failures
- Safety: System exceptions (MemoryError, SystemExit) propagate correctly
Resolution Steps
- Replace all bare except: clauses in the codebase
- Identify the specific exception types that should be caught
- Add logging for caught exceptions
- Run linting with pylint or flake8 to catch future occurrences
- Add a bare-except check to the CI/CD pipeline
CRIT-005: Missing JSON Decode Error Handling
Severity: Critical
Component: Lambda Shared Utilities
Files: lambda/shared/utils/utils.py (line 48), lambda/shared/common_utils.py (line 48)
Status: Open
Description
The parse_event_body() function calls json.loads() without catching json.JSONDecodeError. When API Gateway passes a malformed JSON body, the Lambda crashes with an unhandled exception instead of returning a proper 400 Bad Request response.
This is a critical issue because:
- Poor User Experience: Users see a generic 500 error instead of a helpful “Invalid JSON” message
- Confusing Error Logs: Stack traces for invalid JSON look like bugs
- Security Information Leak: Some JSON parsers include the malformed input in error messages
Current Code
def parse_event_body(event: Dict[str, Any]) -> Dict[str, Any]:
"""Parse the event body from API Gateway."""
if 'body' in event:
body = event['body']
if isinstance(body, str):
return json.loads(body) # CRASHES on invalid JSON!
return body
return event
Example Failure
Request with invalid JSON:
curl -X POST https://api.example.com/chat \
-H "Content-Type: application/json" \
-d '{"message": "hello' # Missing closing brace and quote
Result: Lambda crashes with json.JSONDecodeError, API Gateway returns 500.
Recommended Fix
def parse_event_body(event: Dict[str, Any]) -> Dict[str, Any]:
"""
Parse the event body from API Gateway.
Args:
event: API Gateway event dictionary
Returns:
Parsed body as dictionary
Raises:
ValueError: If body contains invalid JSON
"""
if 'body' in event:
body = event['body']
if isinstance(body, str):
try:
return json.loads(body)
except json.JSONDecodeError as e:
# Log the error but don't include the malformed input (security)
logger.warning(f"Invalid JSON in request body at position {e.pos}")
raise ValueError(f"Invalid JSON in request body: {e.msg}") from e
return body
return event
Then update Lambda handlers to catch ValueError and return 400:
try:
body = parse_event_body(event)
except ValueError as e:
return {
'statusCode': 400,
'headers': {'Content-Type': 'application/json'},
'body': json.dumps({
'error': 'Invalid request',
'message': str(e)
})
}
Impact
- User Experience: Clear error messages for malformed requests
- Security: Prevents information leakage in error messages
- Operations: Cleaner logs without stack traces for user errors
Resolution Steps
- Update parse_event_body() to catch and re-raise as ValueError
- Update all Lambda handlers to catch ValueError and return 400
- Add unit tests for malformed JSON handling (see the sketch after this list)
- Test with various malformed payloads
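A sketch of the suggested unit tests (pytest; the import path is an assumption and should be adjusted to the project's actual layout):
import pytest
from shared.utils.utils import parse_event_body  # import path assumed

def test_parse_event_body_raises_value_error_on_invalid_json():
    """Malformed JSON bodies should surface as ValueError, not JSONDecodeError."""
    event = {'body': '{"message": "hello'}  # missing closing brace and quote
    with pytest.raises(ValueError):
        parse_event_body(event)

def test_parse_event_body_passes_through_dict_body():
    """Already-parsed bodies should be returned unchanged."""
    event = {'body': {'message': 'hello'}}
    assert parse_event_body(event) == {'message': 'hello'}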
CRIT-006: Missing Request Timeouts in HelpScout Connection
Severity: Critical
Component: Lambda Shared Connections
File: lambda/shared/connections/helpscout.py (lines 82, 181, 227)
Status: Open
Description
Three HTTP requests in the HelpScout connection class are missing the timeout parameter. Without explicit timeouts, requests can hang indefinitely, causing:
- Lambda Timeout: The Lambda function hits its 15-minute maximum and is killed
- Resource Exhaustion: Connection pools are exhausted by hanging requests
- Cost Increase: Lambda is billed for the full hanging duration
- No Error Recovery: No opportunity for retry logic to kick in
Current Code
# Line 82 - authenticate() method
response = requests.get(f'{self.base_url}/users/me', headers=headers)
# NO TIMEOUT - can hang forever
# Line 181 - fetch_all_items() method
response = requests.get(url, headers=self._get_headers(), params=params)
# NO TIMEOUT - can hang forever
# Line 227 - fetch_conversation() method
response = requests.get(url, headers=self._get_headers(), params=params)
# NO TIMEOUT - can hang forever
Note: The token request (line 103) correctly has timeout=10, but other requests are inconsistent.
Recommended Fix
Add explicit timeouts to all HTTP requests:
# Define timeout constants at module level
HELPSCOUT_CONNECT_TIMEOUT = 5 # seconds to establish connection
HELPSCOUT_READ_TIMEOUT = 30 # seconds to receive response
HELPSCOUT_TIMEOUT = (HELPSCOUT_CONNECT_TIMEOUT, HELPSCOUT_READ_TIMEOUT)
# In authenticate() method
response = requests.get(
f'{self.base_url}/users/me',
headers=headers,
timeout=HELPSCOUT_TIMEOUT
)
# In fetch_all_items() method
response = requests.get(
url,
headers=self._get_headers(),
params=params,
timeout=HELPSCOUT_TIMEOUT
)
# In fetch_conversation() method
response = requests.get(
url,
headers=self._get_headers(),
params=params,
timeout=HELPSCOUT_TIMEOUT
)
Impact
- Reliability: Failed requests timeout quickly instead of hanging
- Cost: Prevents Lambda from running for 15 minutes on a single hung request
- Recovery: Allows retry logic to attempt alternative actions
Resolution Steps
- Define timeout constants in the module
- Add a timeout to all requests.get() and requests.post() calls
- Add similar timeouts to the fathom.py and linear.py connections
- Consider adding exponential backoff retry logic
- Add unit tests for timeout behavior (see the sketch after this list)
CRIT-007: Hardcoded Dates in Integration Tests
Severity: Critical
Component: Test Suite
Files: tests/test_fathom_webhook.py (lines 30-31), tests/test_helpscout_webhook.py (lines 41, 58, 75, 92), tests/test_linear_webhook.py (lines 35-36, 50-51, 57)
Status: Open
Description
Integration tests use hardcoded future dates like “2025-11-05” and “2025-12-06”. These tests will start failing when the current date passes these hardcoded values, because:
- Date Validation Logic: Some code validates that dates are not in the past
- TTL Calculations: DynamoDB TTL values may become negative
- Sorting Logic: Tests expecting chronological order may fail
This is a time bomb that will cause CI/CD failures on or after November 5, 2025.
Current Code
# test_fathom_webhook.py:30-31
"recording_start_time": "2025-11-05T10:00:00Z", # WILL FAIL after Nov 5, 2025
"recording_end_time": "2025-11-05T11:00:00Z",
# test_helpscout_webhook.py:41
"createdAt": "2025-11-05T10:00:00Z", # WILL FAIL after Nov 5, 2025
# test_linear_webhook.py:35-36
"createdAt": "2025-12-06T14:00:00Z", # WILL FAIL after Dec 6, 2025
"updatedAt": "2025-12-06T14:30:00Z",
Recommended Fix
Use dynamic date generation:
from datetime import datetime, timedelta, timezone
def get_future_timestamp(days_ahead: int = 30) -> str:
"""Generate an ISO timestamp N days in the future."""
future_date = datetime.now(timezone.utc) + timedelta(days=days_ahead)
return future_date.strftime("%Y-%m-%dT%H:%M:%SZ")
# In test fixtures
@pytest.fixture
def sample_fathom_event():
return {
"recording_start_time": get_future_timestamp(30),
"recording_end_time": get_future_timestamp(30), # or add 1 hour
# ...
}
Or use freezegun to freeze time during tests:
from freezegun import freeze_time
@freeze_time("2025-06-15T12:00:00Z")
def test_webhook_processing():
event = {
"recording_start_time": "2025-11-05T10:00:00Z", # Always in the future
# ...
}
# Test logic
Impact
- CI/CD Reliability: Tests won’t randomly fail based on current date
- Maintainability: No need to update tests periodically
Resolution Steps
- Install freezegun as a dev dependency: pip install freezegun
- Update all test files that contain hardcoded dates
- Either use dynamic dates or freeze time with @freeze_time
- Add a CI/CD check for hardcoded date patterns (a sketch follows this list)
- Document the chosen approach in test conventions
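One way to implement the CI check for hardcoded dates is a small standalone script (a sketch; the tests/ directory and file glob are assumptions based on the paths above):
import re
import sys
from pathlib import Path

# Flags quoted ISO-8601 timestamps like "2025-11-05T10:00:00Z"
ISO_TIMESTAMP = re.compile(r'"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z"')

def main() -> int:
    offenders = []
    for path in Path('tests').rglob('test_*.py'):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if ISO_TIMESTAMP.search(line):
                offenders.append(f'{path}:{lineno}: {line.strip()}')
    if offenders:
        print('Hardcoded ISO timestamps found in tests:')
        print('\n'.join(offenders))
        return 1
    return 0

if __name__ == '__main__':
    sys.exit(main())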
CRIT-008: Untested Classification Strategies
Severity: Critical
Component: Test Suite
Files: lambda/classify/strategies/slack.py (0% test coverage), lambda/classify/strategies/basecamp.py (0% test coverage), lambda/classify/strategies/googledrive.py (0% test coverage)
Status: Open
Description
Three classification strategy classes are registered in the StrategyFactory but have absolutely no unit tests:
- SlackStrategy - returns null_result() silently
- BasecampStrategy - returns null_result() silently
- GoogleDriveStrategy - returns null_result() silently
The test file lambda/classify/tests/test_strategies.py only tests FathomStrategy and HelpScoutStrategy.
This is critical because:
- Silent Failures: If someone tries to use these integrations, classification silently returns null
- No Error Indication: No exception, no log, just empty results
- Factory Mismatch: test_handler.py line 229 loops through all 6 sources, but only 2 have actual test coverage
- Future Bug Risk: When these strategies are implemented, no tests will catch regressions
Current Code
# lambda/classify/strategies/slack.py:12
class SlackStrategy(BaseStrategy):
def classify(self, data: Any, table_name: str) -> Dict[str, Optional[str]]:
"""Classify Slack data."""
# TODO: Implement when Slack integration is ready.
return self.null_result() # Silent failure!
Recommended Fix
Option A: Remove from factory until implemented
# lambda/classify/factory.py
STRATEGIES = {
'fathom': FathomStrategy,
'helpscout': HelpScoutStrategy,
'linear': LinearStrategy,
# Removed until implemented:
# 'slack': SlackStrategy,
# 'basecamp': BasecampStrategy,
# 'googledrive': GoogleDriveStrategy,
}
Option B: Raise NotImplementedError
class SlackStrategy(BaseStrategy):
def classify(self, data: Any, table_name: str) -> Dict[str, Optional[str]]:
"""Classify Slack data."""
raise NotImplementedError(
"Slack classification strategy is not yet implemented. "
"See docs/roadmap.md for planned integrations."
)
Option C: Add placeholder tests
# lambda/classify/tests/test_strategies.py
import pytest
from strategies.slack import SlackStrategy  # import path assumed; adjust to project layout

class TestSlackStrategy:
"""Tests for SlackStrategy - currently a placeholder."""
def test_slack_returns_null_result_when_not_implemented(self):
"""Verify Slack classification returns null until implemented."""
strategy = SlackStrategy()
result = strategy.classify({'channel': 'general'}, 'test-table')
# Document expected behavior
assert result == {'client_name': None, 'project_name': None}
def test_slack_raises_when_implemented(self):
"""This test should fail when Slack is implemented, prompting real tests."""
pytest.skip("Update this test when Slack integration is implemented")
Impact
- Reliability: Clear indication when unimplemented features are used
- Maintainability: Tests document expected behavior
- Developer Experience: New developers understand feature status
Resolution Steps
- Decide on approach (A, B, or C above)
- If Option A: Remove from factory, update handler to return helpful error
- If Option B: Add NotImplementedError and update the handler to catch it and return 501 (see the sketch after this list)
- If Option C: Add placeholder tests documenting current behavior
- Update documentation about integration status
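If Option B is chosen, the handler change could look like this sketch (classify_with_strategy is a hypothetical wrapper; the real handler structure may differ):
import json

def classify_with_strategy(strategy, data, table_name):
    """Run a strategy and map NotImplementedError to a 501 response."""
    try:
        result = strategy.classify(data, table_name)
    except NotImplementedError as e:
        return {
            'statusCode': 501,
            'headers': {'Content-Type': 'application/json'},
            'body': json.dumps({'error': 'Not implemented', 'message': str(e)})
        }
    return {
        'statusCode': 200,
        'headers': {'Content-Type': 'application/json'},
        'body': json.dumps(result)
    }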
High Severity Issues
Issues that affect security, reliability, or code quality but have workarounds or lower immediate impact.
HIGH-001: Cognito Security Features Disabled
Severity: High
Component: Terraform
File: terraform/main.tf (lines 152-153)
Status: Open
Description
AWS Cognito advanced security features and deletion protection are explicitly disabled with comments indicating they should be enabled for production:
enable_advanced_security = false # Set to true for production (has additional costs)
enable_deletion_protection = false # Set to true for production
Advanced Security provides:
- Compromised credentials detection
- Adaptive authentication based on risk
- Account takeover protection
- Bot detection
Deletion Protection prevents:
- Accidental deletion of user pool
- Loss of all user data and sessions
- Authentication outage
Impact
- Security: Missing protection against credential stuffing and account takeover
- Reliability: User pool could be accidentally deleted
Resolution Steps
- Enable deletion protection immediately (no cost)
- Evaluate cost of advanced security ($0.05 per MAU)
- Enable advanced security if budget allows
- Add environment-specific configuration
HIGH-002: CORS Wildcard Default in API Gateway
Severity: High
Component: Terraform
File: terraform/modules/api-gateway/variables.tf (lines 8-12)
Status: Open
Description
The API Gateway CORS configuration defaults to allowing requests from any origin:
variable "cors_allow_origins" {
description = "CORS allowed origins"
type = list(string)
default = ["*"] # Allows all origins
}
While the API uses Cognito authentication, CORS wildcard:
- Allows any website to make authenticated requests
- Increases attack surface for XSS-based attacks
- May not meet security compliance requirements
Recommended Fix
variable "cors_allow_origins" {
description = "CORS allowed origins"
type = list(string)
default = [] # Force explicit configuration
}
# In main.tf - compact() drops the empty string so list(string) stays valid
cors_allow_origins = compact([
  "https://compass.northbuilt.com",
  "https://www.compass.northbuilt.com",
  var.environment == "development" ? "http://localhost:5173" : "",
])
HIGH-003: Missing DynamoDB Encryption Configuration
Severity: High
Component: Terraform
File: terraform/modules/dynamodb/main.tf (lines 10-72)
Status: Open
Description
DynamoDB tables do not explicitly configure server-side encryption with customer-managed KMS keys. While AWS enables encryption by default with AWS-managed keys, customer-managed keys provide:
- Key rotation control
- Key usage auditing via CloudTrail
- Cross-account access control
- Compliance with security standards (SOC2, HIPAA)
Recommended Fix
resource "aws_dynamodb_table" "main" {
# ... existing config ...
server_side_encryption {
enabled = true
kms_key_arn = aws_kms_key.dynamodb.arn
}
}
resource "aws_kms_key" "dynamodb" {
description = "KMS key for DynamoDB encryption"
deletion_window_in_days = 30
enable_key_rotation = true
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "Enable IAM policies"
Effect = "Allow"
Principal = { AWS = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root" }
Action = "kms:*"
Resource = "*"
}
]
})
}
HIGH-004: S3 Buckets Using AWS-Managed Encryption Keys
Severity: High
Component: Terraform
File: terraform/modules/s3/main.tf (lines 34-41, 116-124, 169-176)
Status: Open
Description
All S3 buckets use AES256 encryption (AWS-managed keys) instead of customer-managed KMS keys:
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "AES256" # AWS-managed
}
}
Customer-managed KMS keys provide:
- Fine-grained access control via key policies
- Key usage auditing
- Cross-account sharing capabilities
- Compliance with security frameworks
Recommended Fix
resource "aws_s3_bucket_server_side_encryption_configuration" "documents" {
bucket = aws_s3_bucket.documents.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "aws:kms"
kms_master_key_id = aws_kms_key.s3.arn
}
bucket_key_enabled = true # Reduces KMS API costs
}
}
HIGH-005: CloudFront Missing Security Headers
Severity: High
Component: Terraform
File: terraform/modules/cloudfront/main.tf
Status: Open
Description
The CloudFront distribution does not configure security headers. Missing headers leave the application vulnerable to:
- XSS Attacks: No Content-Security-Policy (CSP)
- Clickjacking: No X-Frame-Options
- MIME Sniffing: No X-Content-Type-Options
- Downgrade Attacks: No Strict-Transport-Security (HSTS)
Recommended Fix
Create a response headers policy:
resource "aws_cloudfront_response_headers_policy" "security_headers" {
name = "${var.resource_prefix}-security-headers"
security_headers_config {
strict_transport_security {
access_control_max_age_sec = 31536000 # 1 year
include_subdomains = true
override = true
preload = true
}
content_type_options {
override = true # Adds X-Content-Type-Options: nosniff
}
frame_options {
frame_option = "DENY"
override = true
}
referrer_policy {
referrer_policy = "strict-origin-when-cross-origin"
override = true
}
content_security_policy {
content_security_policy = "default-src 'self'; script-src 'self' 'wasm-unsafe-eval'; style-src 'self' 'unsafe-inline' https://fonts.googleapis.com; font-src 'self' https://fonts.gstatic.com; img-src 'self' https: data:; connect-src 'self' ${var.api_endpoint} ${var.cognito_domain};"
override = true
}
}
}
HIGH-006: No Lambda Reserved Concurrency Limits
Severity: High
Component: Terraform
File: terraform/modules/lambda/main.tf (lines 866-1333)
Status: Open
Description
No Lambda functions have reserved_concurrent_executions configured. All functions share the default AWS account limit (typically 1000 concurrent executions). This means:
- One function could monopolize all concurrency during traffic spike
- Other critical functions could be throttled
- No isolation between function workloads
Recommended Fix
Add reserved concurrency to each function based on expected traffic:
resource "aws_lambda_function" "chat" {
# ... existing config ...
reserved_concurrent_executions = 100 # Chat gets 100 concurrent
}
resource "aws_lambda_function" "webhook_fathom" {
# ... existing config ...
reserved_concurrent_executions = 10 # Webhooks get 10 each
}
resource "aws_lambda_function" "sync_worker" {
# ... existing config ...
reserved_concurrent_executions = 5 # Sync workers get 5 each
}
HIGH-007: Missing Lambda Dead Letter Queues
Severity: High
Component: Terraform
File: terraform/modules/lambda/main.tf (lines 866-1333)
Status: Open
Description
Async/scheduled Lambda functions don’t have Dead Letter Queue (DLQ) configuration. When these functions fail, the events are silently dropped:
- ingest - EventBridge triggered
- fathom_sync_scheduler - Scheduled
- helpscout_sync_scheduler - Scheduled
- fathom_sync_worker - Async invocation
Without DLQs:
- Failed events are lost forever
- No visibility into failure patterns
- No ability to replay failed events
Recommended Fix
resource "aws_sqs_queue" "lambda_dlq" {
name = "${var.resource_prefix}-lambda-dlq"
message_retention_seconds = 1209600 # 14 days
visibility_timeout_seconds = 300
tags = var.tags
}
resource "aws_lambda_function" "sync_worker" {
# ... existing config ...
dead_letter_config {
target_arn = aws_sqs_queue.lambda_dlq.arn
}
}
Also create CloudWatch alarm for DLQ messages:
resource "aws_cloudwatch_metric_alarm" "dlq_messages" {
alarm_name = "${var.resource_prefix}-dlq-messages"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 1
metric_name = "ApproximateNumberOfMessagesVisible"
namespace = "AWS/SQS"
period = 300
statistic = "Sum"
threshold = 0
alarm_description = "Lambda DLQ has messages - investigate failures"
dimensions = {
QueueName = aws_sqs_queue.lambda_dlq.name
}
alarm_actions = [aws_sns_topic.alerts.arn]
}
HIGH-008: Short Log Retention (7 Days)
Severity: High
Component: Terraform
File: terraform/modules/lambda/main.tf (line 229)
Status: Open
Description
CloudWatch log retention is set to only 7 days:
log_retention_days = 7
This is insufficient for:
- Security incident investigation (often discovered weeks later)
- Compliance audits (typically require 90+ days)
- Performance trend analysis
- Debugging intermittent issues
Recommended Fix
# For development environments
log_retention_days = 14
# For production environments
log_retention_days = 90 # Or 365 for compliance
Alternatively, archive logs to S3 for longer retention at lower cost:
resource "aws_cloudwatch_log_subscription_filter" "to_s3" {
name = "${var.resource_prefix}-log-archive"
log_group_name = aws_cloudwatch_log_group.lambda.name
filter_pattern = "" # All logs
destination_arn = aws_kinesis_firehose_delivery_stream.log_archive.arn
}
HIGH-009: No Retry Logic for External API Calls
Severity: High
Component: Lambda Connections
Files: lambda/shared/connections/helpscout.py, lambda/shared/connections/linear.py, lambda/shared/connections/fathom.py
Status: Open
Description
API calls to external services use basic error handling without retry logic. A single transient failure (network timeout, 429 rate limit, 503 service unavailable) causes the entire operation to fail:
response = requests.get(url, headers=self._get_headers(), params=params)
response.raise_for_status() # Fails immediately on 429, 503, etc.
In production, transient failures are common:
- 429 Too Many Requests (rate limiting)
- 503 Service Unavailable (maintenance)
- Connection timeouts (network issues)
- 500 Internal Server Error (temporary backend issues)
Recommended Fix
Implement retry with exponential backoff:
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
import requests
RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}
class RetryableHTTPError(Exception):
"""HTTP error that should be retried."""
pass
def raise_for_retryable_status(response):
"""Raise RetryableHTTPError for status codes that warrant retry."""
if response.status_code in RETRYABLE_STATUS_CODES:
raise RetryableHTTPError(f"HTTP {response.status_code}: {response.reason}")
response.raise_for_status()
@retry(
stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=2, max=30),
retry=retry_if_exception_type((RetryableHTTPError, requests.exceptions.Timeout))
)
def fetch_with_retry(url, headers, params=None, timeout=30):
"""Fetch URL with automatic retry on transient failures."""
response = requests.get(url, headers=headers, params=params, timeout=timeout)
raise_for_retryable_status(response)
return response
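Example usage inside a connection method (the endpoint and header values below are illustrative only):
response = fetch_with_retry(
    'https://api.helpscout.net/v2/users/me',  # example endpoint
    headers={'Authorization': 'Bearer <token>'},
    timeout=(5, 30),
)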
HIGH-010: Tokens Stored in localStorage (Web)
Severity: High
Component: React Web App
File: web/src/hooks/useAuth.ts (lines 26-28, 51-55, 65-67)
Status: Open
Description
JWT tokens (id_token, access_token, refresh_token) are stored in localStorage:
const idToken = localStorage.getItem('id_token');
const accessToken = localStorage.getItem('access_token');
const refreshToken = localStorage.getItem('refresh_token');
localStorage is vulnerable to XSS attacks. If an attacker injects JavaScript (through any XSS vulnerability), they can:
- Read all tokens from localStorage
- Exfiltrate tokens to attacker-controlled server
- Impersonate the user indefinitely (until token expiry)
Recommended Fix
Migrate to HTTP-only cookies:
- Backend Changes:
# In Cognito callback handler
response = make_response(redirect('/'))
response.set_cookie(
    'access_token',
    access_token,
    httponly=True,      # Can't be read by JavaScript
    secure=True,        # HTTPS only
    samesite='Strict',  # CSRF protection
    max_age=3600        # 1 hour
)
- Frontend Changes:
// Remove localStorage token management
// Cookies are automatically sent with requests
const response = await fetch(endpoint, {
  credentials: 'include',  // Send cookies
  // No Authorization header needed
});
Note: This requires coordination with backend/Cognito configuration.
HIGH-011: Missing Content Security Policy (Web)
Severity: High
Component: React Web App
File: Application-wide (CloudFront configuration needed)
Status: Open
Description
No Content Security Policy (CSP) headers are configured. CSP prevents:
- Inline script injection (XSS)
- Loading scripts from unauthorized domains
- Unauthorized form submissions
- Clickjacking via framing
Recommended Fix
Configure CSP at the CloudFront level (see HIGH-005) with the policy:
Content-Security-Policy: default-src 'self'; script-src 'self' 'wasm-unsafe-eval'; style-src 'self' 'unsafe-inline' https://fonts.googleapis.com; font-src 'self' https://fonts.gstatic.com; img-src 'self' https: data:; connect-src 'self' https://s0vt4k7o47.execute-api.us-east-1.amazonaws.com https://nb-rag-sys-auth.auth.us-east-1.amazoncognito.com; frame-ancestors 'none'; base-uri 'self'; form-action 'self';
HIGH-012: Missing Connection Tests (Fathom, HelpScout)
Severity: High
Component: Test Suite
Files: lambda/shared/connections/fathom.py, lambda/shared/connections/helpscout.py
Status: Resolved
Description
The FathomConnection and HelpScoutConnection classes contain several hundred lines of code each with multiple API calls. Unit tests have been added.
| Connection | Lines of Code | Test Coverage |
|---|---|---|
| Base | ~100 | 100% (test_base.py) |
| Linear | ~300 | ~100% (test_linear.py) |
| Fathom | ~350 | ~100% (test_fathom.py) |
| HelpScout | ~550 | ~100% (test_helpscout.py) |
Resolution
Tests were added in lambda/shared/tests/connections/:
- test_fathom.py - 42 tests covering initialization, auth, fetch, S3 keys, transforms
- test_helpscout.py - 45 tests covering OAuth2, pagination, conversation fetch, transforms
Medium Severity Issues
Issues that affect maintainability, performance, or developer experience.
MED-001: Cognito Password Temporary Validity Too Long
File: terraform/modules/cognito/main.tf (lines 20-27)
Temporary passwords are valid for 7 days, which provides a longer window for brute-force attacks on new accounts. Reduce to 3-5 days:
temporary_password_validity_days = 3
MED-002: No MFA Configuration
File: terraform/modules/cognito/main.tf
No Multi-Factor Authentication is configured for Cognito. Add optional MFA:
mfa_configuration = "OPTIONAL"
software_token_mfa_configuration {
enabled = true
}
MED-003: S3 Backup Immediate Glacier Transition
File: terraform/modules/s3/main.tf (lines 222-226)
Backups transition to Glacier immediately (day 0), making quick recovery expensive:
transition {
days = 0 # Immediate - expensive to retrieve
storage_class = "GLACIER_IR"
}
Change to:
transition {
days = 30 # Keep accessible for 30 days
storage_class = "GLACIER_IR"
}
MED-004: Unimplemented Strategies Return Null Silently
Files: lambda/classify/strategies/slack.py, basecamp.py, googledrive.py
These strategies are registered in the factory but return null results silently:
def classify(self, data: Any, table_name: str) -> Dict[str, Optional[str]]:
return self.null_result() # No error, no log
Should either:
- Raise NotImplementedError
- Log a warning
- Be removed from the factory until implemented
MED-005: DynamoDB Client Created Per Request
File: lambda/chat/handler.py (lines 128-147)
Each Lambda invocation creates new boto3 clients:
def create_aws_clients(config: Config) -> AWSClients:
dynamodb = boto3.resource('dynamodb', region_name=config.aws_region) # New instance
Should create clients at module level for reuse across warm invocations:
_clients = None
def get_clients(config):
global _clients
if _clients is None:
_clients = create_aws_clients(config)
return _clients
MED-006: Hardcoded Rate Limiting Sleep Values
Files:
lambda/shared/connections/helpscout.py(line 200)lambda/shared/connections/fathom.py(lines 173, 184)
Fixed time.sleep() values that can’t be tuned:
time.sleep(0.5) # Hardcoded
Should be configurable and respect API rate limit headers:
RATE_LIMIT_DELAY = float(os.environ.get('FATHOM_RATE_LIMIT_DELAY', '0.5'))
def respect_rate_limit(response):
"""Check response headers and sleep if rate limited."""
retry_after = response.headers.get('Retry-After')
if retry_after:
time.sleep(int(retry_after))
else:
time.sleep(RATE_LIMIT_DELAY)
MED-007: DEBUG Mode Logs Sensitive Data (Web)
File: web/src/config.ts (line 64), multiple hooks
DEBUG mode enabled in config logs sensitive information:
console.log('OAuth callback successful:', parsedUser);
Should gate by both DEBUG and development mode:
DEBUG: import.meta.env.VITE_DEBUG === 'true' && import.meta.env.MODE === 'development'
MED-008: Error Messages Expose Backend Details (Web)
File: web/src/components/ChatPage.tsx (line 142)
Error messages from API displayed directly to users:
<p>Error: {error}</p>
Should map to user-friendly messages:
const getUserFriendlyError = (error: string) => {
if (error.includes('Authentication')) return 'Please sign in again';
if (error.includes('Network')) return 'Connection lost. Please try again.';
return 'An error occurred. Please try again.';
};
MED-009: Incomplete Test Body
File: lambda/ingest/tests/test_handler.py (line 99)
Test with only pass statement:
def test_handles_metric_send_failure_gracefully(self):
"""Should not raise exception if metric send fails."""
mock_cloudwatch = Mock()
mock_cloudwatch.put_metric_data.side_effect = Exception("CloudWatch error")
handler.send_metric(mock_cloudwatch, 'TestMetric')
# NO ASSERTIONS!
Should assert the expected behavior.
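For example, the test could verify both that no exception escapes and that the call was attempted (a sketch reusing the test's existing names):
from unittest.mock import Mock

def test_handles_metric_send_failure_gracefully(self):
    """Should not raise an exception if the metric send fails."""
    mock_cloudwatch = Mock()
    mock_cloudwatch.put_metric_data.side_effect = Exception("CloudWatch error")
    # Must not raise despite the failing client
    handler.send_metric(mock_cloudwatch, 'TestMetric')
    # Assert the send was actually attempted and the error swallowed internally
    mock_cloudwatch.put_metric_data.assert_called_once()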
MED-010 through MED-025: Additional Medium Issues
| ID | Component | File | Issue |
|---|---|---|---|
| MED-010 | Terraform | api-gateway/variables.tf | API throttle defaults may be too high (10/sec) |
| MED-011 | Terraform | cloudfront/main.tf | CloudFront min TTL is 0 (no caching by default) |
| MED-012 | Terraform | cloudfront/main.tf | Price class limits to NA/Europe only |
| MED-013 | Terraform | cognito/main.tf | 30-day refresh token validity is long |
| MED-014 | Lambda | chat/handler.py | Async Lambda invocation not verified |
| MED-015 | Lambda | shared/connections | S3 permission errors treated as “not found” |
| MED-016 | Lambda | shared/connections/fathom.py | Type coercion issues in data extraction |
| MED-017 | Lambda | All handlers | Generic exception handling hides specific errors |
| MED-018 | Lambda | shared/utils | Overly permissive CORS * in responses |
| MED-019 | Lambda | connections | No logging context for multi-client scenarios |
| MED-020 | Web | hooks/useAuth.ts | OAuth tokens visible in URL hash |
| MED-021 | Web | components/Chat | Missing React.memo on ChatMessage |
| MED-022 | Web | lib/api.ts | No runtime validation of API responses |
| MED-023 | Tests | Multiple | Tests depend on real AWS infrastructure |
| MED-024 | Tests | Multiple | Timestamp assertions without tolerance |
| MED-025 | Tests | Multiple | Duplicate test utilities (common_utils.py) |
Low Severity Issues
Minor issues that affect code cleanliness and maintainability.
LOW-001 through LOW-020: Minor Issues
| ID | Component | Issue |
|---|---|---|
| LOW-001 | Terraform | Missing version constraints on modules |
| LOW-002 | Terraform | Missing README.md in module directories |
| LOW-003 | Terraform | Some variables lack descriptions |
| LOW-004 | Terraform | Unused local variables |
| LOW-005 | Terraform | EventBridge rules missing disable option |
| LOW-006 | Lambda | Unused sys import in chat handler |
| LOW-007 | Lambda | Inconsistent type hints (tuple vs Tuple) |
| LOW-008 | Lambda | Hardcoded Lambda function name fallback |
| LOW-009 | Lambda | Implicit None return in worker |
| LOW-010 | Lambda | Missing docstrings in some utilities |
| LOW-011 | Lambda | Information leakage in error messages |
| LOW-012 | Web | Composite key with index in MessageList |
| LOW-013 | Web | Missing image lazy loading |
| LOW-014 | Web | Gravatar URL generated on each render |
| LOW-015 | Tests | Missing test factories for varied data |
| LOW-016 | Tests | Tests that test multiple things |
| LOW-017 | Tests | Overly permissive assertions |
| LOW-018 | Tests | Missing test docstrings |
| LOW-019 | Tests | Scattered conftest files |
| LOW-020 | Docs | Missing inline code documentation |
Positive Findings
The audit also identified many security and quality strengths:
Security Strengths
- Comprehensive input validation and sanitization
- Proper use of AWS Secrets Manager (no hardcoded secrets)
- Webhook signature verification with HMAC and constant-time comparison
- Multi-tenant isolation at vector search level with defense-in-depth
- No eval(), exec(), or unsafe string formatting
- Proper CORS headers on API responses
Code Quality Strengths
- Extensive type hints throughout Python code
- Strict TypeScript with ~100% type coverage
- Good React patterns with proper error boundaries
- Well-structured test organization (2.6:1 test-to-code ratio)
- Consistent coding style with linting
Infrastructure Strengths
- S3 versioning enabled for state protection
- CloudFront HTTPS redirect configured
- Proper IAM role separation per Lambda
- Environment-specific tagging
Resolution Tracking
Sprint 1 (Immediate)
- CRIT-001: Remove hardcoded AWS account ID
- CRIT-002: Add DynamoDB state locking
- CRIT-004: Replace bare except clauses
- CRIT-005: Add JSON decode error handling
- CRIT-006: Add request timeouts
- CRIT-007: Fix hardcoded test dates
Sprint 2 (Short-Term)
- CRIT-003: Restrict CloudWatch IAM permissions
- CRIT-008: Add tests for unimplemented strategies
- HIGH-001: Enable Cognito deletion protection
- HIGH-002: Restrict CORS origins
- HIGH-005: Add CloudFront security headers
- HIGH-008: Increase log retention
Sprint 3 (Medium-Term)
- HIGH-003: Add DynamoDB KMS encryption
- HIGH-004: Add S3 KMS encryption
- HIGH-006: Configure Lambda concurrency limits
- HIGH-007: Add Lambda DLQs
- HIGH-009: Implement retry logic
- HIGH-012: Add missing connection tests
Future
- HIGH-010: Migrate tokens to HTTP-only cookies
- HIGH-011: Configure CSP headers
- All Medium and Low severity items
Appendix: Related Documents
- RAG System Improvements - RAG-specific optimizations
- Architecture Decision Records - Design decisions
- Security Documentation - Security architecture
- Operations Runbook - Operational procedures
Audit Findings - 2025-12-30
Additional findings from comprehensive RAG system audit.
NEW-001: Frontend Debug Mode Defaults to True
Severity: Medium
Component: Web Application
File: web/src/config.ts (line 64)
Status: Open
Description
DEBUG mode defaults to true which logs sensitive information (session IDs, query content, entity matches) in production.
DEBUG: import.meta.env.VITE_DEBUG === 'true' || true, // Defaults to true
Recommended Fix
Change default to false:
DEBUG: import.meta.env.VITE_DEBUG === 'true',
NEW-002: Missing EntityType in Classify Lambda Response
Severity: High
Component: Lambda
File: lambda/classify/handler.py
Status: Open
Description
The Classify Lambda returns classification results but doesn't include an entityType field. The query understanding entity loader requires an EntityType field for GSI queries, so when classify is used to update DynamoDB records, EntityType may not be set.
Recommended Fix
Add entityType to classification response to match what the entity loader expects.
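A minimal sketch of the shape change (field and value names follow the description above; the surrounding handler code is assumed):
def build_classification_response(client_name, project_name, entity_type):
    """Return a classification result carrying the EntityType the entity loader expects."""
    return {
        'client_name': client_name,
        'project_name': project_name,
        'entityType': entity_type,  # e.g. 'client' or 'project'; needed for GSI queries
    }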
NEW-003: Query Understanding Not Documented in CLAUDE.md
Severity: Medium
Component: Documentation
File: CLAUDE.md
Status: Open
Description
The query understanding feature (entity extraction, clarification prompts, metadata filtering) is implemented but not documented in CLAUDE.md. Developers won’t know about:
- Automatic entity extraction from natural language queries
- Clarification responses for ambiguous queries
- Adaptive retrieval multiplier based on filter count
Recommended Fix
Add Query Understanding section to CLAUDE.md documenting these features.
NEW-004: CLASSIFY_TABLE Environment Variable Undocumented
Severity: Low
Component: Documentation
File: CLAUDE.md (lines 81-82)
Status: Open
Description
Chat handler expects CLASSIFY_TABLE environment variable for entity loading, but it’s not documented in the Lambda environment variables section.
Recommended Fix
Add CLASSIFY_TABLE to the documented environment variables.
NEW-005: System Architecture Doc Says “Vanilla JS”
Severity: Medium
Component: Documentation
File: docs/architecture/system-architecture.md (line 68)
Status: Open
Description
System architecture documentation describes the web application as “Vanilla JavaScript (no framework dependencies)” but the actual implementation uses React 19 + TypeScript + Vite.
Recommended Fix
Update line 68 to accurately reflect the React implementation.
NEW-006: No Entity Loader Tests
Severity: High
Component: Test Suite
File: lambda/shared/utils/entities.py
Status: Open
Description
The entity loader module (entities.py) has zero test coverage. This is critical for verifying multi-tenant isolation works correctly.
Recommended Fix
Create comprehensive tests for load_entities_from_dynamodb() covering:
- DynamoDB load and pagination (see the sketch after this list)
- Entity validation and schema errors
- EntityType GSI queries
- Edge cases (empty results, missing fields)
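A starting-point sketch for the pagination case (the import path, function signature, and item shape are all assumptions to be adjusted against the real entities.py):
from unittest.mock import MagicMock
from shared.utils.entities import load_entities_from_dynamodb  # path assumed

def test_load_entities_follows_pagination():
    """The loader should keep scanning until LastEvaluatedKey is absent."""
    table = MagicMock()
    table.scan.side_effect = [
        {'Items': [{'EntityType': 'client', 'name': 'Acme'}],
         'LastEvaluatedKey': {'pk': '1'}},
        {'Items': [{'EntityType': 'client', 'name': 'Globex'}]},
    ]
    entities = load_entities_from_dynamodb(table)  # signature assumed
    assert len(entities) == 2
    assert table.scan.call_count == 2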
NEW-007: Integration Tests Not Run on PRs
Severity: Medium
Component: CI/CD
File: .github/workflows/test.yml
Status: Open
Description
Integration tests only run on main branch, not on PRs. Real AWS service issues won’t surface until after merge and deployment.
Recommended Fix
Add integration test stage for PRs against a staging environment, or run a subset of integration tests.
NEW-008: No Post-Deployment Smoke Tests
Severity: Medium
Component: CI/CD
File: .github/workflows/main.yml
Status: Open
Description
No validation runs after Lambda deployment. Bad deployments could go unnoticed.
Recommended Fix
Add health check endpoint and sanity tests after deployment completes.
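A post-deployment smoke test can be a very small script run at the end of the workflow (a sketch; the health endpoint is hypothetical and would need to be added per this recommendation):
import sys
import requests

HEALTH_URL = 'https://api.example.com/health'  # hypothetical endpoint

def main() -> int:
    """Fail the pipeline if the deployed API does not answer a basic health check."""
    try:
        response = requests.get(HEALTH_URL, timeout=10)
    except requests.exceptions.RequestException as e:
        print(f'Smoke test failed: {e}')
        return 1
    if response.status_code != 200:
        print(f'Smoke test failed: HTTP {response.status_code}')
        return 1
    print('Smoke test passed')
    return 0

if __name__ == '__main__':
    sys.exit(main())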
NEW-009: Security Gates Don’t Fail Builds
Severity: Medium
Component: CI/CD
File: .github/workflows/test.yml
Status: Open
Description
Snyk, Semgrep, and Checkov security scans all use continue-on-error: true. Security issues are logged but don’t block the build.
Recommended Fix
Change to fail on HIGH/CRITICAL findings to ensure security issues are addressed before merge.
NEW-010: No Token Refresh Logic in Frontend
Severity: High
Component: Web Application
File: web/src/hooks/useAuth.ts
Status: Open
Description
When Cognito tokens expire (1 hour default), users get authentication errors and must manually refresh the page. No automatic token refresh mechanism exists.
Recommended Fix
Implement token refresh on 401 responses or proactively before expiration.
Last updated: 2025-12-30