Codebase Audit Findings
Comprehensive security, code quality, and architecture audit of the NorthBuilt RAG System. This document tracks issues identified during code reviews and their resolution status.
Table of Contents
- Audit Overview
- Critical Issues
- CRIT-001: Hardcoded AWS Account ID in Repository
- CRIT-002: Missing DynamoDB State Locking for Terraform
- CRIT-003: Overly Permissive CloudWatch IAM Policy
- CRIT-004: Bare Exception Handlers Silently Suppress Errors
- CRIT-005: Missing JSON Decode Error Handling
- CRIT-006: Missing Request Timeouts in HelpScout Connection
- CRIT-007: Hardcoded Dates in Integration Tests
- CRIT-008: Untested Classification Strategies
- High Severity Issues
- HIGH-001: Cognito Security Features Disabled
- HIGH-002: CORS Wildcard Default in API Gateway
- HIGH-003: Missing DynamoDB Encryption Configuration
- HIGH-004: S3 Buckets Using AWS-Managed Encryption Keys
- HIGH-005: CloudFront Missing Security Headers
- HIGH-006: No Lambda Reserved Concurrency Limits
- HIGH-007: Missing Lambda Dead Letter Queues
- HIGH-008: Short Log Retention (7 Days)
- HIGH-009: No Retry Logic for External API Calls
- HIGH-010: Tokens Stored in localStorage (Web)
- HIGH-011: Missing Content Security Policy (Web)
- HIGH-012: Missing Connection Tests (Fathom, HelpScout)
- Medium Severity Issues
- MED-001: Cognito Password Temporary Validity Too Long
- MED-002: No MFA Configuration
- MED-003: S3 Backup Immediate Glacier Transition
- MED-004: Unimplemented Strategies Return Null Silently
- MED-005: DynamoDB Client Created Per Request
- MED-006: Hardcoded Rate Limiting Sleep Values
- MED-007: DEBUG Mode Logs Sensitive Data (Web)
- MED-008: Error Messages Expose Backend Details (Web)
- MED-009: Incomplete Test Body
- MED-010 through MED-025: Additional Medium Issues
- Low Severity Issues
- Positive Findings
- Resolution Tracking
- Appendix: Related Documents
- Audit Findings - 2025-12-30
- NEW-001: Frontend Debug Mode Defaults to True
- NEW-002: Missing EntityType in Classify Lambda Response
- NEW-003: Query Understanding Not Documented in CLAUDE.md
- NEW-004: CLASSIFY_TABLE Environment Variable Undocumented
- NEW-005: System Architecture Doc Says “Vanilla JS”
- NEW-006: No Entity Loader Tests
- NEW-007: Integration Tests Not Run on PRs
- NEW-008: No Post-Deployment Smoke Tests
- NEW-009: Security Gates Don’t Fail Builds
- NEW-010: No Token Refresh Logic in Frontend
Audit Overview
Audit Date: 2025-12-29
Scope: Full codebase including Lambda functions, Terraform infrastructure, React web application, shared utilities, and test suite.
Summary Statistics
| Severity | Count | Status |
|---|---|---|
| Critical | 8 | Open |
| High | 17 | 1 Resolved, 16 Open |
| Medium | 25 | Open |
| Low | 20+ | Open |
Critical Issues
Issues that pose significant security risks, could cause data loss, or will cause system failures.
CRIT-001: Hardcoded AWS Account ID in Repository
Severity: Critical
Component: Terraform
File: terraform/variables.tf (lines 196-197)
Status: Open
Description
The AWS account ID and ACM certificate ARN are hardcoded as default values in the Terraform variables file. This is a security anti-pattern because:
- Account ID Exposure: While not a secret per se, the AWS account ID provides attackers with targeting information for social engineering, phishing, or crafting valid-looking IAM principal ARNs.
- Certificate Tracking: The ACM certificate ARN reveals the exact certificate being used, which could be useful for reconnaissance.
- Version Control Persistence: Once committed, these values exist in Git history permanently, even if removed later.
Current Code
variable "acm_certificate_arn" {
description = "ACM certificate ARN for CloudFront distribution"
type = string
default = "arn:aws:acm:us-east-1:554192034098:certificate/883924c2-0006-476f-bce6-05e25ebcb9e7"
}
Recommended Fix
Remove the hardcoded default and require explicit configuration:
variable "acm_certificate_arn" {
description = "ACM certificate ARN for CloudFront distribution. Must be in us-east-1 for CloudFront."
type = string
# No default - must be provided via terraform.tfvars or environment variable
}
Then create a terraform.tfvars.example file (committed) and terraform.tfvars (gitignored):
# terraform.tfvars.example
acm_certificate_arn = "arn:aws:acm:us-east-1:YOUR_ACCOUNT_ID:certificate/YOUR_CERT_ID"
Impact
- Security: Reduces attack surface
- Operations: Requires explicit configuration per environment
Resolution Steps
- Create terraform.tfvars.example with placeholder values
- Add terraform.tfvars to .gitignore
- Remove default values from variables.tf
- Update the deployment documentation
- Consider using AWS Secrets Manager or Parameter Store for sensitive configuration
CRIT-002: Missing DynamoDB State Locking for Terraform
Severity: Critical
Component: Terraform
File: terraform/backend.tf (lines 12-19)
Status: Open
Description
The Terraform backend configuration uses S3 native locking (use_lockfile = true) instead of DynamoDB state locking. S3 native locking is:
- Less Reliable: S3’s conditional writes are eventually consistent in some edge cases
- Not Recommended: AWS and HashiCorp recommend DynamoDB for production state locking
- Race Condition Risk: Concurrent terraform apply operations could corrupt state
Current Code
terraform {
backend "s3" {
bucket = "nb-rag-sys-terraform-state"
key = "terraform.tfstate"
region = "us-east-1"
encrypt = true
use_lockfile = true # S3 native locking - NOT RECOMMENDED
}
}
Recommended Fix
Add DynamoDB state locking:
terraform {
backend "s3" {
bucket = "nb-rag-sys-terraform-state"
key = "terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "nb-rag-sys-terraform-lock"
# Remove use_lockfile when using DynamoDB
}
}
Create the DynamoDB table:
resource "aws_dynamodb_table" "terraform_lock" {
name = "nb-rag-sys-terraform-lock"
billing_mode = "PAY_PER_REQUEST"
hash_key = "LockID"
attribute {
name = "LockID"
type = "S"
}
tags = {
Name = "Terraform State Lock"
Purpose = "terraform-state-locking"
}
}
Impact
- Reliability: Prevents state corruption from concurrent operations
- Safety: CI/CD pipelines and multiple developers can safely run Terraform
Resolution Steps
- Create DynamoDB table for locking (can be done manually or via bootstrap script)
- Update backend configuration
- Run terraform init -migrate-state to apply the new backend configuration
- Test with concurrent plan operations to verify locking works
CRIT-003: Overly Permissive CloudWatch IAM Policy
Severity: Critical
Component: Terraform
File: terraform/modules/lambda/main.tf (lines 536-542)
Status: Open
Description
Lambda functions are granted cloudwatch:PutMetricData permission with a wildcard resource (Resource = "*"). This violates the principle of least privilege because:
- Metric Pollution: Lambda could write metrics to any namespace, including AWS-reserved namespaces
- Cost Implications: Malicious or buggy code could create thousands of custom metrics (each costing $0.30/month)
- Monitoring Confusion: Metrics could be written to incorrect namespaces, confusing dashboards
Current Code
{
Effect = "Allow"
Action = [
"cloudwatch:PutMetricData"
]
Resource = "*" # Too permissive
}
Recommended Fix
Unfortunately, CloudWatch PutMetricData does not support resource-level permissions. The mitigation is to add a condition limiting to specific namespaces:
{
Effect = "Allow"
Action = [
"cloudwatch:PutMetricData"
]
Resource = "*"
Condition = {
StringEquals = {
"cloudwatch:namespace" = [
"NorthBuilt/RAG",
"NorthBuilt/Sync",
"NorthBuilt/Webhooks"
]
}
}
}
Update Lambda code to use only these approved namespaces:
cloudwatch.put_metric_data(
Namespace='NorthBuilt/RAG', # Must match IAM condition
MetricData=[...]
)
Impact
- Security: Limits metric write scope to approved namespaces
- Cost: Prevents accidental metric creation in wrong namespaces
Resolution Steps
- Audit all Lambda functions for current namespace usage
- Define approved namespace list
- Update IAM policy with namespace condition
- Update Lambda code to use approved namespaces
- Test metric publishing still works
CRIT-004: Bare Exception Handlers Silently Suppress Errors
Severity: Critical
Component: Lambda Shared Utilities
Files: lambda/shared/connections/fathom.py (line 260), lambda/shared/connections/helpscout.py (line 536)
Status: Open
Description
Bare except: clauses (without specifying exception types) catch ALL exceptions, including critical system exceptions like SystemExit, KeyboardInterrupt, and MemoryError. This can:
- Mask Real Bugs: Programming errors like NameError, TypeError, or AttributeError are silently swallowed
- Hide Memory Issues: MemoryError exceptions go unnoticed, leading to silent failures
- Prevent Debugging: No logging or stack traces for unexpected exceptions
- Break Signal Handling: KeyboardInterrupt and SystemExit are caught, preventing graceful shutdown
Current Code (Fathom)
# lambda/shared/connections/fathom.py:260
try:
# Date calculation logic
from datetime import datetime
if isinstance(recording_start, str):
recording_start = datetime.fromisoformat(recording_start.replace('Z', '+00:00'))
if isinstance(recording_end, str):
recording_end = datetime.fromisoformat(recording_end.replace('Z', '+00:00'))
duration = int((recording_end - recording_start).total_seconds() / 60)
except: # DANGEROUS: Catches everything including MemoryError, SystemExit
duration = 0
Current Code (HelpScout)
# lambda/shared/connections/helpscout.py:536
try:
dt = datetime.fromisoformat(dt_string.replace('Z', '+00:00'))
return dt.strftime('%B %d, %Y %I:%M %p')
except: # DANGEROUS: Silently returns original string for ANY error
return dt_string
Recommended Fix
Catch only the specific exceptions that are expected:
# Fathom duration calculation
try:
from datetime import datetime
if isinstance(recording_start, str):
recording_start = datetime.fromisoformat(recording_start.replace('Z', '+00:00'))
if isinstance(recording_end, str):
recording_end = datetime.fromisoformat(recording_end.replace('Z', '+00:00'))
duration = int((recording_end - recording_start).total_seconds() / 60)
except (ValueError, TypeError, AttributeError) as e:
logger.warning(f"Failed to calculate meeting duration: {e}")
duration = 0
# HelpScout datetime formatting
try:
dt = datetime.fromisoformat(dt_string.replace('Z', '+00:00'))
return dt.strftime('%B %d, %Y %I:%M %p')
except (ValueError, TypeError) as e:
logger.warning(f"Failed to parse datetime '{dt_string}': {e}")
return dt_string
Impact
- Reliability: Unexpected errors are properly logged and can be debugged
- Observability: CloudWatch logs will show warnings for parsing failures
- Safety: System exceptions (MemoryError, SystemExit) propagate correctly
Resolution Steps
- Replace all bare except: clauses in the codebase
- Identify the specific exception types that should be caught
- Add logging for caught exceptions
- Run linting with pylint or flake8 to catch future occurrences
- Add a bare-except check to the CI/CD pipeline
CRIT-005: Missing JSON Decode Error Handling
Severity: Critical
Component: Lambda Shared Utilities
Files: lambda/shared/utils/utils.py (line 48), lambda/shared/common_utils.py (line 48)
Status: Open
Description
The parse_event_body() function calls json.loads() without catching json.JSONDecodeError. When API Gateway passes a malformed JSON body, the Lambda crashes with an unhandled exception instead of returning a proper 400 Bad Request response.
This is a critical issue because:
- Poor User Experience: Users see a generic 500 error instead of a helpful “Invalid JSON” message
- Confusing Error Logs: Stack traces for invalid JSON look like bugs
- Security Information Leak: Some JSON parsers include the malformed input in error messages
Current Code
def parse_event_body(event: Dict[str, Any]) -> Dict[str, Any]:
"""Parse the event body from API Gateway."""
if 'body' in event:
body = event['body']
if isinstance(body, str):
return json.loads(body) # CRASHES on invalid JSON!
return body
return event
Example Failure
Request with invalid JSON:
curl -X POST https://api.example.com/chat \
-H "Content-Type: application/json" \
-d '{"message": "hello' # Missing closing brace and quote
Result: Lambda crashes with json.JSONDecodeError, API Gateway returns 500.
Recommended Fix
def parse_event_body(event: Dict[str, Any]) -> Dict[str, Any]:
"""
Parse the event body from API Gateway.
Args:
event: API Gateway event dictionary
Returns:
Parsed body as dictionary
Raises:
ValueError: If body contains invalid JSON
"""
if 'body' in event:
body = event['body']
if isinstance(body, str):
try:
return json.loads(body)
except json.JSONDecodeError as e:
# Log the error but don't include the malformed input (security)
logger.warning(f"Invalid JSON in request body at position {e.pos}")
raise ValueError(f"Invalid JSON in request body: {e.msg}") from e
return body
return event
Then update Lambda handlers to catch ValueError and return 400:
try:
body = parse_event_body(event)
except ValueError as e:
return {
'statusCode': 400,
'headers': {'Content-Type': 'application/json'},
'body': json.dumps({
'error': 'Invalid request',
'message': str(e)
})
}
Impact
- User Experience: Clear error messages for malformed requests
- Security: Prevents information leakage in error messages
- Operations: Cleaner logs without stack traces for user errors
Resolution Steps
- Update parse_event_body() to catch and re-raise as ValueError
- Update all Lambda handlers to catch ValueError and return 400
- Add unit tests for malformed JSON handling (see the sketch after this list)
- Test with various malformed payloads
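A sketch of the suggested unit tests (pytest; the import path is an assumption and should be adjusted to the project's actual layout):
import pytest
from shared.utils.utils import parse_event_body  # import path assumed

def test_parse_event_body_raises_value_error_on_invalid_json():
    """Malformed JSON bodies should surface as ValueError, not JSONDecodeError."""
    event = {'body': '{"message": "hello'}  # missing closing brace and quote
    with pytest.raises(ValueError):
        parse_event_body(event)

def test_parse_event_body_passes_through_dict_body():
    """Already-parsed bodies should be returned unchanged."""
    event = {'body': {'message': 'hello'}}
    assert parse_event_body(event) == {'message': 'hello'}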
CRIT-006: Missing Request Timeouts in HelpScout Connection
Severity: Critical
Component: Lambda Shared Connections
File: lambda/shared/connections/helpscout.py (lines 82, 181, 227)
Status: Open
Description
Three HTTP requests in the HelpScout connection class are missing the timeout parameter. Without explicit timeouts, requests can hang indefinitely, causing:
- Lambda Timeout: The Lambda function hits its 15-minute maximum and is killed
- Resource Exhaustion: Connection pools are exhausted by hanging requests
- Cost Increase: Lambda is billed for the full hanging duration
- No Error Recovery: No opportunity for retry logic to kick in
Current Code
# Line 82 - authenticate() method
response = requests.get(f'{self.base_url}/users/me', headers=headers)
# NO TIMEOUT - can hang forever
# Line 181 - fetch_all_items() method
response = requests.get(url, headers=self._get_headers(), params=params)
# NO TIMEOUT - can hang forever
# Line 227 - fetch_conversation() method
response = requests.get(url, headers=self._get_headers(), params=params)
# NO TIMEOUT - can hang forever
Note: The token request (line 103) correctly has timeout=10, but other requests are inconsistent.
Recommended Fix
Add explicit timeouts to all HTTP requests:
# Define timeout constants at module level
HELPSCOUT_CONNECT_TIMEOUT = 5 # seconds to establish connection
HELPSCOUT_READ_TIMEOUT = 30 # seconds to receive response
HELPSCOUT_TIMEOUT = (HELPSCOUT_CONNECT_TIMEOUT, HELPSCOUT_READ_TIMEOUT)
# In authenticate() method
response = requests.get(
f'{self.base_url}/users/me',
headers=headers,
timeout=HELPSCOUT_TIMEOUT
)
# In fetch_all_items() method
response = requests.get(
url,
headers=self._get_headers(),
params=params,
timeout=HELPSCOUT_TIMEOUT
)
# In fetch_conversation() method
response = requests.get(
url,
headers=self._get_headers(),
params=params,
timeout=HELPSCOUT_TIMEOUT
)
Impact
- Reliability: Failed requests timeout quickly instead of hanging
- Cost: Prevents Lambda from running for 15 minutes on a single hung request
- Recovery: Allows retry logic to attempt alternative actions
Resolution Steps
- Define timeout constants in the module
- Add a timeout to all requests.get() and requests.post() calls
- Add similar timeouts to the fathom.py and linear.py connections
- Consider adding exponential backoff retry logic
- Add unit tests for timeout behavior (see the sketch after this list)
CRIT-007: Hardcoded Dates in Integration Tests
Severity: Critical
Component: Test Suite
Files: tests/test_fathom_webhook.py (lines 30-31), tests/test_helpscout_webhook.py (lines 41, 58, 75, 92), tests/test_linear_webhook.py (lines 35-36, 50-51, 57)
Status: Open
Description
Integration tests use hardcoded future dates like “2025-11-05” and “2025-12-06”. These tests will start failing when the current date passes these hardcoded values, because:
- Date Validation Logic: Some code validates that dates are not in the past
- TTL Calculations: DynamoDB TTL values may become negative
- Sorting Logic: Tests expecting chronological order may fail
This is a time bomb that will cause CI/CD failures on or after November 5, 2025.
Current Code
# test_fathom_webhook.py:30-31
"recording_start_time": "2025-11-05T10:00:00Z", # WILL FAIL after Nov 5, 2025
"recording_end_time": "2025-11-05T11:00:00Z",
# test_helpscout_webhook.py:41
"createdAt": "2025-11-05T10:00:00Z", # WILL FAIL after Nov 5, 2025
# test_linear_webhook.py:35-36
"createdAt": "2025-12-06T14:00:00Z", # WILL FAIL after Dec 6, 2025
"updatedAt": "2025-12-06T14:30:00Z",
Recommended Fix
Use dynamic date generation:
from datetime import datetime, timedelta, timezone
def get_future_timestamp(days_ahead: int = 30) -> str:
"""Generate an ISO timestamp N days in the future."""
future_date = datetime.now(timezone.utc) + timedelta(days=days_ahead)
return future_date.strftime("%Y-%m-%dT%H:%M:%SZ")
# In test fixtures
@pytest.fixture
def sample_fathom_event():
return {
"recording_start_time": get_future_timestamp(30),
"recording_end_time": get_future_timestamp(30), # or add 1 hour
# ...
}
Or use freezegun to freeze time during tests:
from freezegun import freeze_time
@freeze_time("2025-06-15T12:00:00Z")
def test_webhook_processing():
event = {
"recording_start_time": "2025-11-05T10:00:00Z", # Always in the future
# ...
}
# Test logic
Impact
- CI/CD Reliability: Tests won’t randomly fail based on current date
- Maintainability: No need to update tests periodically
Resolution Steps
- Install freezegun as a dev dependency: pip install freezegun
- Update all test files that contain hardcoded dates
- Either use dynamic dates or freeze time with @freeze_time
- Add a CI/CD check for hardcoded date patterns (a sketch follows this list)
- Document the chosen approach in test conventions
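One way to implement the CI check for hardcoded dates is a small standalone script (a sketch; the tests/ directory and file glob are assumptions based on the paths above):
import re
import sys
from pathlib import Path

# Flags quoted ISO-8601 timestamps like "2025-11-05T10:00:00Z"
ISO_TIMESTAMP = re.compile(r'"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z"')

def main() -> int:
    offenders = []
    for path in Path('tests').rglob('test_*.py'):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if ISO_TIMESTAMP.search(line):
                offenders.append(f'{path}:{lineno}: {line.strip()}')
    if offenders:
        print('Hardcoded ISO timestamps found in tests:')
        print('\n'.join(offenders))
        return 1
    return 0

if __name__ == '__main__':
    sys.exit(main())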
CRIT-008: Untested Classification Strategies
Severity: Critical
Component: Test Suite
Files: lambda/classify/strategies/slack.py (0% test coverage), lambda/classify/strategies/basecamp.py (0% test coverage), lambda/classify/strategies/googledrive.py (0% test coverage)
Status: Open
Description
Three classification strategy classes are registered in the StrategyFactory but have absolutely no unit tests:
- SlackStrategy - returns null_result() silently
- BasecampStrategy - returns null_result() silently
- GoogleDriveStrategy - returns null_result() silently
The test file lambda/classify/tests/test_strategies.py only tests FathomStrategy and HelpScoutStrategy.
This is critical because:
- Silent Failures: If someone tries to use these integrations, classification silently returns null
- No Error Indication: No exception, no log, just empty results
- Factory Mismatch: test_handler.py line 229 loops through all 6 sources, but only 2 have actual test coverage
- Future Bug Risk: When these strategies are implemented, no tests will catch regressions
Current Code
# lambda/classify/strategies/slack.py:12
class SlackStrategy(BaseStrategy):
def classify(self, data: Any, table_name: str) -> Dict[str, Optional[str]]:
"""Classify Slack data."""
# TODO: Implement when Slack integration is ready.
return self.null_result() # Silent failure!
Recommended Fix
Option A: Remove from factory until implemented
# lambda/classify/factory.py
STRATEGIES = {
'fathom': FathomStrategy,
'helpscout': HelpScoutStrategy,
'linear': LinearStrategy,
# Removed until implemented:
# 'slack': SlackStrategy,
# 'basecamp': BasecampStrategy,
# 'googledrive': GoogleDriveStrategy,
}
Option B: Raise NotImplementedError
class SlackStrategy(BaseStrategy):
def classify(self, data: Any, table_name: str) -> Dict[str, Optional[str]]:
"""Classify Slack data."""
raise NotImplementedError(
"Slack classification strategy is not yet implemented. "
"See docs/roadmap.md for planned integrations."
)
Option C: Add placeholder tests
# lambda/classify/tests/test_strategies.py
import pytest
from strategies.slack import SlackStrategy  # import path assumed; adjust to project layout

class TestSlackStrategy:
"""Tests for SlackStrategy - currently a placeholder."""
def test_slack_returns_null_result_when_not_implemented(self):
"""Verify Slack classification returns null until implemented."""
strategy = SlackStrategy()
result = strategy.classify({'channel': 'general'}, 'test-table')
# Document expected behavior
assert result == {'client_name': None, 'project_name': None}
def test_slack_raises_when_implemented(self):
"""This test should fail when Slack is implemented, prompting real tests."""
pytest.skip("Update this test when Slack integration is implemented")
Impact
- Reliability: Clear indication when unimplemented features are used
- Maintainability: Tests document expected behavior
- Developer Experience: New developers understand feature status
Resolution Steps
- Decide on approach (A, B, or C above)
- If Option A: Remove from factory, update handler to return helpful error
- If Option B: Add NotImplementedError and update the handler to catch it and return 501 (see the sketch after this list)
- If Option C: Add placeholder tests documenting current behavior
- Update documentation about integration status
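If Option B is chosen, the handler change could look like this sketch (classify_with_strategy is a hypothetical wrapper; the real handler structure may differ):
import json

def classify_with_strategy(strategy, data, table_name):
    """Run a strategy and map NotImplementedError to a 501 response."""
    try:
        result = strategy.classify(data, table_name)
    except NotImplementedError as e:
        return {
            'statusCode': 501,
            'headers': {'Content-Type': 'application/json'},
            'body': json.dumps({'error': 'Not implemented', 'message': str(e)})
        }
    return {
        'statusCode': 200,
        'headers': {'Content-Type': 'application/json'},
        'body': json.dumps(result)
    }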
High Severity Issues
Issues that affect security, reliability, or code quality but have workarounds or lower immediate impact.
HIGH-001: Cognito Security Features Disabled
Severity: High
Component: Terraform
File: terraform/main.tf (lines 152-153)
Status: Open
Description
AWS Cognito advanced security features and deletion protection are explicitly disabled with comments indicating they should be enabled for production:
enable_advanced_security = false # Set to true for production (has additional costs)
enable_deletion_protection = false # Set to true for production
Advanced Security provides:
- Compromised credentials detection
- Adaptive authentication based on risk
- Account takeover protection
- Bot detection
Deletion Protection prevents:
- Accidental deletion of user pool
- Loss of all user data and sessions
- Authentication outage
Impact
- Security: Missing protection against credential stuffing and account takeover
- Reliability: User pool could be accidentally deleted
Resolution Steps
- Enable deletion protection immediately (no cost)
- Evaluate cost of advanced security ($0.05 per MAU)
- Enable advanced security if budget allows
- Add environment-specific configuration
HIGH-002: CORS Wildcard Default in API Gateway
Severity: High
Component: Terraform
File: terraform/modules/api-gateway/variables.tf (lines 8-12)
Status: Open
Description
The API Gateway CORS configuration defaults to allowing requests from any origin:
variable "cors_allow_origins" {
description = "CORS allowed origins"
type = list(string)
default = ["*"] # Allows all origins
}
While the API uses Cognito authentication, CORS wildcard:
- Allows any website to make authenticated requests
- Increases attack surface for XSS-based attacks
- May not meet security compliance requirements
Recommended Fix
variable "cors_allow_origins" {
description = "CORS allowed origins"
type = list(string)
default = [] # Force explicit configuration
}
# In main.tf - compact() drops the empty string so list(string) stays valid
cors_allow_origins = compact([
  "https://compass.northbuilt.com",
  "https://www.compass.northbuilt.com",
  var.environment == "development" ? "http://localhost:5173" : "",
])
HIGH-003: Missing DynamoDB Encryption Configuration
Severity: High
Component: Terraform
File: terraform/modules/dynamodb/main.tf (lines 10-72)
Status: Open
Description
DynamoDB tables do not explicitly configure server-side encryption with customer-managed KMS keys. While AWS enables encryption by default with AWS-managed keys, customer-managed keys provide:
- Key rotation control
- Key usage auditing via CloudTrail
- Cross-account access control
- Compliance with security standards (SOC2, HIPAA)
Recommended Fix
resource "aws_dynamodb_table" "main" {
# ... existing config ...
server_side_encryption {
enabled = true
kms_key_arn = aws_kms_key.dynamodb.arn
}
}
resource "aws_kms_key" "dynamodb" {
description = "KMS key for DynamoDB encryption"
deletion_window_in_days = 30
enable_key_rotation = true
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "Enable IAM policies"
Effect = "Allow"
Principal = { AWS = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root" }
Action = "kms:*"
Resource = "*"
}
]
})
}
HIGH-004: S3 Buckets Using AWS-Managed Encryption Keys
Severity: High
Component: Terraform
File: terraform/modules/s3/main.tf (lines 34-41, 116-124, 169-176)
Status: Open
Description
All S3 buckets use AES256 encryption (AWS-managed keys) instead of customer-managed KMS keys:
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "AES256" # AWS-managed
}
}
Customer-managed KMS keys provide:
- Fine-grained access control via key policies
- Key usage auditing
- Cross-account sharing capabilities
- Compliance with security frameworks
Recommended Fix
resource "aws_s3_bucket_server_side_encryption_configuration" "documents" {
bucket = aws_s3_bucket.documents.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "aws:kms"
kms_master_key_id = aws_kms_key.s3.arn
}
bucket_key_enabled = true # Reduces KMS API costs
}
}
HIGH-005: CloudFront Missing Security Headers
Severity: High
Component: Terraform
File: terraform/modules/cloudfront/main.tf
Status: Open
Description
The CloudFront distribution does not configure security headers. Missing headers leave the application vulnerable to:
- XSS Attacks: No Content-Security-Policy (CSP)
- Clickjacking: No X-Frame-Options
- MIME Sniffing: No X-Content-Type-Options
- Downgrade Attacks: No Strict-Transport-Security (HSTS)
Recommended Fix
Create a response headers policy:
resource "aws_cloudfront_response_headers_policy" "security_headers" {
name = "${var.resource_prefix}-security-headers"
security_headers_config {
strict_transport_security {
access_control_max_age_sec = 31536000 # 1 year
include_subdomains = true
override = true
preload = true
}
content_type_options {
override = true # Adds X-Content-Type-Options: nosniff
}
frame_options {
frame_option = "DENY"
override = true
}
referrer_policy {
referrer_policy = "strict-origin-when-cross-origin"
override = true
}
content_security_policy {
content_security_policy = "default-src 'self'; script-src 'self' 'wasm-unsafe-eval'; style-src 'self' 'unsafe-inline' https://fonts.googleapis.com; font-src 'self' https://fonts.gstatic.com; img-src 'self' https: data:; connect-src 'self' ${var.api_endpoint} ${var.cognito_domain};"
override = true
}
}
}
HIGH-006: No Lambda Reserved Concurrency Limits
Severity: High
Component: Terraform
File: terraform/modules/lambda/main.tf (lines 866-1333)
Status: Open
Description
No Lambda functions have reserved_concurrent_executions configured. All functions share the default AWS account limit (typically 1000 concurrent executions). This means:
- One function could monopolize all concurrency during traffic spike
- Other critical functions could be throttled
- No isolation between function workloads
Recommended Fix
Add reserved concurrency to each function based on expected traffic:
resource "aws_lambda_function" "chat" {
# ... existing config ...
reserved_concurrent_executions = 100 # Chat gets 100 concurrent
}
resource "aws_lambda_function" "webhook_fathom" {
# ... existing config ...
reserved_concurrent_executions = 10 # Webhooks get 10 each
}
resource "aws_lambda_function" "sync_worker" {
# ... existing config ...
reserved_concurrent_executions = 5 # Sync workers get 5 each
}
HIGH-007: Missing Lambda Dead Letter Queues
Severity: High
Component: Terraform
File: terraform/modules/lambda/main.tf (lines 866-1333)
Status: Open
Description
Async/scheduled Lambda functions don’t have Dead Letter Queue (DLQ) configuration. When these functions fail, the events are silently dropped:
- ingest - EventBridge triggered
- fathom_sync_scheduler - Scheduled
- helpscout_sync_scheduler - Scheduled
- fathom_sync_worker - Async invocation
Without DLQs:
- Failed events are lost forever
- No visibility into failure patterns
- No ability to replay failed events
Recommended Fix
resource "aws_sqs_queue" "lambda_dlq" {
name = "${var.resource_prefix}-lambda-dlq"
message_retention_seconds = 1209600 # 14 days
visibility_timeout_seconds = 300
tags = var.tags
}
resource "aws_lambda_function" "sync_worker" {
# ... existing config ...
dead_letter_config {
target_arn = aws_sqs_queue.lambda_dlq.arn
}
}
Also create CloudWatch alarm for DLQ messages:
resource "aws_cloudwatch_metric_alarm" "dlq_messages" {
alarm_name = "${var.resource_prefix}-dlq-messages"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 1
metric_name = "ApproximateNumberOfMessagesVisible"
namespace = "AWS/SQS"
period = 300
statistic = "Sum"
threshold = 0
alarm_description = "Lambda DLQ has messages - investigate failures"
dimensions = {
QueueName = aws_sqs_queue.lambda_dlq.name
}
alarm_actions = [aws_sns_topic.alerts.arn]
}
HIGH-008: Short Log Retention (7 Days)
Severity: High
Component: Terraform
File: terraform/modules/lambda/main.tf (line 229)
Status: Open
Description
CloudWatch log retention is set to only 7 days:
log_retention_days = 7
This is insufficient for:
- Security incident investigation (often discovered weeks later)
- Compliance audits (typically require 90+ days)
- Performance trend analysis
- Debugging intermittent issues
Recommended Fix
# For development environments
log_retention_days = 14
# For production environments
log_retention_days = 90 # Or 365 for compliance
Alternatively, archive logs to S3 for longer retention at lower cost:
resource "aws_cloudwatch_log_subscription_filter" "to_s3" {
name = "${var.resource_prefix}-log-archive"
log_group_name = aws_cloudwatch_log_group.lambda.name
filter_pattern = "" # All logs
destination_arn = aws_kinesis_firehose_delivery_stream.log_archive.arn
}
HIGH-009: No Retry Logic for External API Calls
Severity: High
Component: Lambda Connections
Files: lambda/shared/connections/helpscout.py, lambda/shared/connections/linear.py, lambda/shared/connections/fathom.py
Status: Open
Description
API calls to external services use basic error handling without retry logic. A single transient failure (network timeout, 429 rate limit, 503 service unavailable) causes the entire operation to fail:
response = requests.get(url, headers=self._get_headers(), params=params)
response.raise_for_status() # Fails immediately on 429, 503, etc.
In production, transient failures are common:
- 429 Too Many Requests (rate limiting)
- 503 Service Unavailable (maintenance)
- Connection timeouts (network issues)
- 500 Internal Server Error (temporary backend issues)
Recommended Fix
Implement retry with exponential backoff:
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
import requests
RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}
class RetryableHTTPError(Exception):
"""HTTP error that should be retried."""
pass
def raise_for_retryable_status(response):
"""Raise RetryableHTTPError for status codes that warrant retry."""
if response.status_code in RETRYABLE_STATUS_CODES:
raise RetryableHTTPError(f"HTTP {response.status_code}: {response.reason}")
response.raise_for_status()
@retry(
stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=2, max=30),
retry=retry_if_exception_type((RetryableHTTPError, requests.exceptions.Timeout))
)
def fetch_with_retry(url, headers, params=None, timeout=30):
"""Fetch URL with automatic retry on transient failures."""
response = requests.get(url, headers=headers, params=params, timeout=timeout)
raise_for_retryable_status(response)
return response
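Example usage inside a connection method (the endpoint and header values below are illustrative only):
response = fetch_with_retry(
    'https://api.helpscout.net/v2/users/me',  # example endpoint
    headers={'Authorization': 'Bearer <token>'},
    timeout=(5, 30),
)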
HIGH-010: Tokens Stored in localStorage (Web)
Severity: High
Component: React Web App
File: web/src/hooks/useAuth.ts (lines 26-28, 51-55, 65-67)
Status: Open
Description
JWT tokens (id_token, access_token, refresh_token) are stored in localStorage:
const idToken = localStorage.getItem('id_token');
const accessToken = localStorage.getItem('access_token');
const refreshToken = localStorage.getItem('refresh_token');
localStorage is vulnerable to XSS attacks. If an attacker injects JavaScript (through any XSS vulnerability), they can:
- Read all tokens from localStorage
- Exfiltrate tokens to attacker-controlled server
- Impersonate the user indefinitely (until token expiry)
Recommended Fix
Migrate to HTTP-only cookies:
- Backend Changes:
# In Cognito callback handler
response = make_response(redirect('/'))
response.set_cookie(
    'access_token',
    access_token,
    httponly=True,      # Can't be read by JavaScript
    secure=True,        # HTTPS only
    samesite='Strict',  # CSRF protection
    max_age=3600        # 1 hour
)
- Frontend Changes:
// Remove localStorage token management
// Cookies are automatically sent with requests
const response = await fetch(endpoint, {
  credentials: 'include',  // Send cookies
  // No Authorization header needed
});
Note: This requires coordination with backend/Cognito configuration.
HIGH-011: Missing Content Security Policy (Web)
Severity: High
Component: React Web App
File: Application-wide (CloudFront configuration needed)
Status: Open
Description
No Content Security Policy (CSP) headers are configured. CSP prevents:
- Inline script injection (XSS)
- Loading scripts from unauthorized domains
- Unauthorized form submissions
- Clickjacking via framing
Recommended Fix
Configure CSP at the CloudFront level (see HIGH-005) with the policy:
Content-Security-Policy: default-src 'self'; script-src 'self' 'wasm-unsafe-eval'; style-src 'self' 'unsafe-inline' https://fonts.googleapis.com; font-src 'self' https://fonts.gstatic.com; img-src 'self' https: data:; connect-src 'self' https://s0vt4k7o47.execute-api.us-east-1.amazonaws.com https://nb-rag-sys-auth.auth.us-east-1.amazoncognito.com; frame-ancestors 'none'; base-uri 'self'; form-action 'self';
HIGH-012: Missing Connection Tests (Fathom, HelpScout)
Severity: High
Component: Test Suite
Files: lambda/shared/connections/fathom.py, lambda/shared/connections/helpscout.py
Status: Resolved
Description
The FathomConnection and HelpScoutConnection classes contain several hundred lines of code each with multiple API calls. Unit tests have been added.
| Connection | Lines of Code | Test Coverage |
|---|---|---|
| Base | ~100 | 100% (test_base.py) |
| Linear | ~300 | ~100% (test_linear.py) |
| Fathom | ~350 | ~100% (test_fathom.py) |
| HelpScout | ~550 | ~100% (test_helpscout.py) |
Resolution
Tests were added in lambda/shared/tests/connections/:
- test_fathom.py - 42 tests covering initialization, auth, fetch, S3 keys, transforms
- test_helpscout.py - 45 tests covering OAuth2, pagination, conversation fetch, transforms
Medium Severity Issues
Issues that affect maintainability, performance, or developer experience.
MED-001: Cognito Password Temporary Validity Too Long
File: terraform/modules/cognito/main.tf (lines 20-27)
Temporary passwords are valid for 7 days, which provides a longer window for brute-force attacks on new accounts. Reduce to 3-5 days:
temporary_password_validity_days = 3
MED-002: No MFA Configuration
File: terraform/modules/cognito/main.tf
No Multi-Factor Authentication is configured for Cognito. Add optional MFA:
mfa_configuration = "OPTIONAL"
software_token_mfa_configuration {
enabled = true
}
MED-003: S3 Backup Immediate Glacier Transition
File: terraform/modules/s3/main.tf (lines 222-226)
Backups transition to Glacier immediately (day 0), making quick recovery expensive:
transition {
days = 0 # Immediate - expensive to retrieve
storage_class = "GLACIER_IR"
}
Change to:
transition {
days = 30 # Keep accessible for 30 days
storage_class = "GLACIER_IR"
}
MED-004: Unimplemented Strategies Return Null Silently
Files: lambda/classify/strategies/slack.py, basecamp.py, googledrive.py
These strategies are registered in the factory but return null results silently:
def classify(self, data: Any, table_name: str) -> Dict[str, Optional[str]]:
return self.null_result() # No error, no log
Should either:
- Raise NotImplementedError
- Log a warning
- Be removed from the factory until implemented
MED-005: DynamoDB Client Created Per Request
File: lambda/chat/handler.py (lines 128-147)
Each Lambda invocation creates new boto3 clients:
def create_aws_clients(config: Config) -> AWSClients:
dynamodb = boto3.resource('dynamodb', region_name=config.aws_region) # New instance
Should create clients at module level for reuse across warm invocations:
_clients = None
def get_clients(config):
global _clients
if _clients is None:
_clients = create_aws_clients(config)
return _clients
MED-006: Hardcoded Rate Limiting Sleep Values
Files:
lambda/shared/connections/helpscout.py(line 200)lambda/shared/connections/fathom.py(lines 173, 184)
Fixed time.sleep() values that can’t be tuned:
time.sleep(0.5) # Hardcoded
Should be configurable and respect API rate limit headers:
RATE_LIMIT_DELAY = float(os.environ.get('FATHOM_RATE_LIMIT_DELAY', '0.5'))
def respect_rate_limit(response):
"""Check response headers and sleep if rate limited."""
retry_after = response.headers.get('Retry-After')
if retry_after:
time.sleep(int(retry_after))
else:
time.sleep(RATE_LIMIT_DELAY)
MED-007: DEBUG Mode Logs Sensitive Data (Web)
File: web/src/config.ts (line 64), multiple hooks
DEBUG mode enabled in config logs sensitive information:
console.log('OAuth callback successful:', parsedUser);
Should gate by both DEBUG and development mode:
DEBUG: import.meta.env.VITE_DEBUG === 'true' && import.meta.env.MODE === 'development'
MED-008: Error Messages Expose Backend Details (Web)
File: web/src/components/ChatPage.tsx (line 142)
Error messages from API displayed directly to users:
<p>Error: {error}</p>
Should map to user-friendly messages:
const getUserFriendlyError = (error: string) => {
if (error.includes('Authentication')) return 'Please sign in again';
if (error.includes('Network')) return 'Connection lost. Please try again.';
return 'An error occurred. Please try again.';
};
MED-009: Incomplete Test Body
File: lambda/ingest/tests/test_handler.py (line 99)
Test with only pass statement:
def test_handles_metric_send_failure_gracefully(self):
"""Should not raise exception if metric send fails."""
mock_cloudwatch = Mock()
mock_cloudwatch.put_metric_data.side_effect = Exception("CloudWatch error")
handler.send_metric(mock_cloudwatch, 'TestMetric')
# NO ASSERTIONS!
Should assert the expected behavior.
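For example, the test could verify both that no exception escapes and that the call was attempted (a sketch reusing the test's existing names):
from unittest.mock import Mock

def test_handles_metric_send_failure_gracefully(self):
    """Should not raise an exception if the metric send fails."""
    mock_cloudwatch = Mock()
    mock_cloudwatch.put_metric_data.side_effect = Exception("CloudWatch error")
    # Must not raise despite the failing client
    handler.send_metric(mock_cloudwatch, 'TestMetric')
    # Assert the send was actually attempted and the error swallowed internally
    mock_cloudwatch.put_metric_data.assert_called_once()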
MED-010 through MED-025: Additional Medium Issues
| ID | Component | File | Issue |
|---|---|---|---|
| MED-010 | Terraform | api-gateway/variables.tf | API throttle defaults may be too high (10/sec) |
| MED-011 | Terraform | cloudfront/main.tf | CloudFront min TTL is 0 (no caching by default) |
| MED-012 | Terraform | cloudfront/main.tf | Price class limits to NA/Europe only |
| MED-013 | Terraform | cognito/main.tf | 30-day refresh token validity is long |
| MED-014 | Lambda | chat/handler.py | Async Lambda invocation not verified |
| MED-015 | Lambda | shared/connections | S3 permission errors treated as “not found” |
| MED-016 | Lambda | shared/connections/fathom.py | Type coercion issues in data extraction |
| MED-017 | Lambda | All handlers | Generic exception handling hides specific errors |
| MED-018 | Lambda | shared/utils | Overly permissive CORS * in responses |
| MED-019 | Lambda | connections | No logging context for multi-client scenarios |
| MED-020 | Web | hooks/useAuth.ts | OAuth tokens visible in URL hash |
| MED-021 | Web | components/Chat | Missing React.memo on ChatMessage |
| MED-022 | Web | lib/api.ts | No runtime validation of API responses |
| MED-023 | Tests | Multiple | Tests depend on real AWS infrastructure |
| MED-024 | Tests | Multiple | Timestamp assertions without tolerance |
| MED-025 | Tests | Multiple | Duplicate test utilities (common_utils.py) |
Low Severity Issues
Minor issues that affect code cleanliness and maintainability.
LOW-001 through LOW-020: Minor Issues
| ID | Component | Issue |
|---|---|---|
| LOW-001 | Terraform | Missing version constraints on modules |
| LOW-002 | Terraform | Missing README.md in module directories |
| LOW-003 | Terraform | Some variables lack descriptions |
| LOW-004 | Terraform | Unused local variables |
| LOW-005 | Terraform | EventBridge rules missing disable option |
| LOW-006 | Lambda | Unused sys import in chat handler |
| LOW-007 | Lambda | Inconsistent type hints (tuple vs Tuple) |
| LOW-008 | Lambda | Hardcoded Lambda function name fallback |
| LOW-009 | Lambda | Implicit None return in worker |
| LOW-010 | Lambda | Missing docstrings in some utilities |
| LOW-011 | Lambda | Information leakage in error messages |
| LOW-012 | Web | Composite key with index in MessageList |
| LOW-013 | Web | Missing image lazy loading |
| LOW-014 | Web | Gravatar URL generated on each render |
| LOW-015 | Tests | Missing test factories for varied data |
| LOW-016 | Tests | Tests that test multiple things |
| LOW-017 | Tests | Overly permissive assertions |
| LOW-018 | Tests | Missing test docstrings |
| LOW-019 | Tests | Scattered conftest files |
| LOW-020 | Docs | Missing inline code documentation |
Positive Findings
The audit also identified many security and quality strengths:
Security Strengths
- Comprehensive input validation and sanitization
- Proper use of AWS Secrets Manager (no hardcoded secrets)
- Webhook signature verification with HMAC and constant-time comparison
- Multi-tenant isolation at vector search level with defense-in-depth
- No eval(), exec(), or unsafe string formatting
- Proper CORS headers on API responses
Code Quality Strengths
- Extensive type hints throughout Python code
- Strict TypeScript with ~100% type coverage
- Good React patterns with proper error boundaries
- Well-structured test organization (2.6:1 test-to-code ratio)
- Consistent coding style with linting
Infrastructure Strengths
- S3 versioning enabled for state protection
- CloudFront HTTPS redirect configured
- Proper IAM role separation per Lambda
- Environment-specific tagging
Resolution Tracking
Sprint 1 (Immediate)
- CRIT-001: Remove hardcoded AWS account ID
- CRIT-002: Add DynamoDB state locking
- CRIT-004: Replace bare except clauses
- CRIT-005: Add JSON decode error handling
- CRIT-006: Add request timeouts
- CRIT-007: Fix hardcoded test dates
Sprint 2 (Short-Term)
- CRIT-003: Restrict CloudWatch IAM permissions
- CRIT-008: Add tests for unimplemented strategies
- HIGH-001: Enable Cognito deletion protection
- HIGH-002: Restrict CORS origins
- HIGH-005: Add CloudFront security headers
- HIGH-008: Increase log retention
Sprint 3 (Medium-Term)
- HIGH-003: Add DynamoDB KMS encryption
- HIGH-004: Add S3 KMS encryption
- HIGH-006: Configure Lambda concurrency limits
- HIGH-007: Add Lambda DLQs
- HIGH-009: Implement retry logic
- HIGH-012: Add missing connection tests
Future
- HIGH-010: Migrate tokens to HTTP-only cookies
- HIGH-011: Configure CSP headers
- All Medium and Low severity items
Appendix: Related Documents
- RAG System Improvements - RAG-specific optimizations
- Architecture Decision Records - Design decisions
- Security Documentation - Security architecture
- Operations Runbook - Operational procedures
Audit Findings - 2025-12-30
Additional findings from comprehensive RAG system audit.
NEW-001: Frontend Debug Mode Defaults to True
Severity: Medium
Component: Web Application
File: web/src/config.ts (line 64)
Status: Open
Description
DEBUG mode defaults to true which logs sensitive information (session IDs, query content, entity matches) in production.
DEBUG: import.meta.env.VITE_DEBUG === 'true' || true, // Defaults to true
Recommended Fix
Change default to false:
DEBUG: import.meta.env.VITE_DEBUG === 'true',
NEW-002: Missing EntityType in Classify Lambda Response
Severity: High
Component: Lambda
File: lambda/classify/handler.py
Status: Open
Description
The Classify Lambda returns classification results but doesn't include an entityType field. The query understanding entity loader requires an EntityType field for GSI queries, so when classify is used to update DynamoDB records, EntityType may not be set.
Recommended Fix
Add entityType to classification response to match what the entity loader expects.
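A minimal sketch of the shape change (field and value names follow the description above; the surrounding handler code is assumed):
def build_classification_response(client_name, project_name, entity_type):
    """Return a classification result carrying the EntityType the entity loader expects."""
    return {
        'client_name': client_name,
        'project_name': project_name,
        'entityType': entity_type,  # e.g. 'client' or 'project'; needed for GSI queries
    }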
NEW-003: Query Understanding Not Documented in CLAUDE.md
Severity: Medium
Component: Documentation
File: CLAUDE.md
Status: Open
Description
The query understanding feature (entity extraction, clarification prompts, metadata filtering) is implemented but not documented in CLAUDE.md. Developers won’t know about:
- Automatic entity extraction from natural language queries
- Clarification responses for ambiguous queries
- Adaptive retrieval multiplier based on filter count
Recommended Fix
Add Query Understanding section to CLAUDE.md documenting these features.
NEW-004: CLASSIFY_TABLE Environment Variable Undocumented
Severity: Low
Component: Documentation
File: CLAUDE.md (lines 81-82)
Status: Open
Description
Chat handler expects CLASSIFY_TABLE environment variable for entity loading, but it’s not documented in the Lambda environment variables section.
Recommended Fix
Add CLASSIFY_TABLE to the documented environment variables.
NEW-005: System Architecture Doc Says “Vanilla JS”
Severity: Medium
Component: Documentation
File: docs/architecture/system-architecture.md (line 68)
Status: Open
Description
System architecture documentation describes the web application as “Vanilla JavaScript (no framework dependencies)” but the actual implementation uses React 19 + TypeScript + Vite.
Recommended Fix
Update line 68 to accurately reflect the React implementation.
NEW-006: No Entity Loader Tests
Severity: High
Component: Test Suite
File: lambda/shared/utils/entities.py
Status: Open
Description
The entity loader module (entities.py) has zero test coverage. This is critical for verifying multi-tenant isolation works correctly.
Recommended Fix
Create comprehensive tests for load_entities_from_dynamodb() covering:
- DynamoDB load and pagination (see the sketch after this list)
- Entity validation and schema errors
- EntityType GSI queries
- Edge cases (empty results, missing fields)
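A starting-point sketch for the pagination case (the import path, function signature, and item shape are all assumptions to be adjusted against the real entities.py):
from unittest.mock import MagicMock
from shared.utils.entities import load_entities_from_dynamodb  # path assumed

def test_load_entities_follows_pagination():
    """The loader should keep scanning until LastEvaluatedKey is absent."""
    table = MagicMock()
    table.scan.side_effect = [
        {'Items': [{'EntityType': 'client', 'name': 'Acme'}],
         'LastEvaluatedKey': {'pk': '1'}},
        {'Items': [{'EntityType': 'client', 'name': 'Globex'}]},
    ]
    entities = load_entities_from_dynamodb(table)  # signature assumed
    assert len(entities) == 2
    assert table.scan.call_count == 2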
NEW-007: Integration Tests Not Run on PRs
Severity: Medium
Component: CI/CD
File: .github/workflows/test.yml
Status: Open
Description
Integration tests only run on main branch, not on PRs. Real AWS service issues won’t surface until after merge and deployment.
Recommended Fix
Add integration test stage for PRs against a staging environment, or run a subset of integration tests.
NEW-008: No Post-Deployment Smoke Tests
Severity: Medium
Component: CI/CD
File: .github/workflows/main.yml
Status: Open
Description
No validation runs after Lambda deployment. Bad deployments could go unnoticed.
Recommended Fix
Add health check endpoint and sanity tests after deployment completes.
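A post-deployment smoke test can be a very small script run at the end of the workflow (a sketch; the health endpoint is hypothetical and would need to be added per this recommendation):
import sys
import requests

HEALTH_URL = 'https://api.example.com/health'  # hypothetical endpoint

def main() -> int:
    """Fail the pipeline if the deployed API does not answer a basic health check."""
    try:
        response = requests.get(HEALTH_URL, timeout=10)
    except requests.exceptions.RequestException as e:
        print(f'Smoke test failed: {e}')
        return 1
    if response.status_code != 200:
        print(f'Smoke test failed: HTTP {response.status_code}')
        return 1
    print('Smoke test passed')
    return 0

if __name__ == '__main__':
    sys.exit(main())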
NEW-009: Security Gates Don’t Fail Builds
Severity: Medium
Component: CI/CD
File: .github/workflows/test.yml
Status: Open
Description
Snyk, Semgrep, and Checkov security scans all use continue-on-error: true. Security issues are logged but don’t block the build.
Recommended Fix
Change to fail on HIGH/CRITICAL findings to ensure security issues are addressed before merge.
NEW-010: No Token Refresh Logic in Frontend
Severity: High
Component: Web Application
File: web/src/hooks/useAuth.ts
Status: Open
Description
When Cognito tokens expire (1 hour default), users get authentication errors and must manually refresh the page. No automatic token refresh mechanism exists.
Recommended Fix
Implement token refresh on 401 responses or proactively before expiration.
Last updated: 2025-12-30