Part 3 of 4: Building Production-Grade Infrastructure
In Part 1, we identified the business problem and designed our architecture. In Part 2, we built secure, modular infrastructure with Terraform.
But infrastructure alone isn’t production-ready.
A system is only as good as our ability to observe it, deploy changes to it safely, and recover it quickly when something breaks.
This article covers making infrastructure operational: SLO-based monitoring and alerting with CloudWatch, keyless CI/CD with GitHub Actions and OIDC, the n8n approval workflow itself, and end-to-end testing.
Target audience: DevOps engineers, SREs, and anyone responsible for keeping systems running.
Early in my career, I made this promise:
“Our system will have 99.999% uptime.”
Five nines = 5.26 minutes downtime per year.
Then reality hit: our total actual uptime came in at 98.9% (not even three nines).
The lesson: Stop promising impossible uptime. Instead, set Service Level Objectives (SLOs) based on what users actually need.
Definitions:
| Term | Definition | Example |
|---|---|---|
| SLI (Indicator) | What we measure | “P95 response time” |
| SLO (Objective) | Target for measurement | “P95 < 2 seconds” |
| SLA (Agreement) | Contractual commitment with consequences | “99.9% uptime or refund” |
For our school absence system:
SLI: Percentage of successful absence submissions
SLO: 99.5% of submissions succeed
Error Budget: 0.5% = 5 failed requests per 1,000
SLI: P95 response time for form submission
SLO: 95th percentile < 2 seconds
Error Budget: 5% of requests can be slower
SLI: System availability (uptime)
SLO: 99.9% uptime per month
Error Budget: 43.8 minutes downtime allowed
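To make the budget concrete, here is a minimal shell sketch (the variable names and the 30-day month are illustrative assumptions, not part of the project) that converts measured uptime into remaining error-budget minutes:

```bash
# Sketch: remaining error budget for a 99.9% availability SLO over a 30-day month.
SLO=99.9
MEASURED=99.95                   # e.g., pulled from CloudWatch or a status page
MONTH_MINUTES=$((30 * 24 * 60))  # 43,200 minutes

BUDGET=$(echo "$MONTH_MINUTES * (100 - $SLO) / 100" | bc -l)      # 43.2 min
USED=$(echo "$MONTH_MINUTES * (100 - $MEASURED) / 100" | bc -l)   # 21.6 min
REMAINING=$(echo "$BUDGET - $USED" | bc -l)

printf "budget %.1f min, used %.1f min, remaining %.1f min\n" "$BUDGET" "$USED" "$REMAINING"
```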
Why error budgets matter:
If we’re at 99.95% uptime (exceeding our 99.9% SLO), we have budget left to spend on risk: deploys, experiments, planned maintenance.
If we’re at 99.85% uptime (below SLO), we freeze feature work and spend the time on reliability.
Interview talking point: “I use error budgets to balance innovation with reliability. SLOs aren’t ceilings to hit—they’re contracts with users. If we’re exceeding SLO, we can take risks. If we’re missing SLO, we focus on stability.”
Three types of alarms:
Symptom-based (user-facing problems)
Cause-based (infrastructure problems)
SLO-based (error budget consumption)
We prioritize symptom-based alarms because they tell us what users experience.
File: modules/monitoring/main.tf
# SNS Topic - central alerting hub
resource "aws_sns_topic" "alerts" {
name = "${var.project_name}-alerts"
tags = {
Name = "${var.project_name}-alerts"
}
}
# Email subscription (must confirm via email)
resource "aws_sns_topic_subscription" "email" {
topic_arn = aws_sns_topic.alerts.arn
protocol = "email"
endpoint = var.alert_email
}
How it works: each CloudWatch alarm publishes to this SNS topic, and the topic fans the message out to every confirmed subscription. The email endpoint must click the confirmation link AWS sends before it receives any alerts.
Production tip: Use PagerDuty, Slack, or OpsGenie instead of email for real production systems. Email is fine for small teams.
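One gotcha worth checking after apply: the email subscription stays pending until someone clicks the confirmation link, and pending subscriptions silently receive nothing. A quick check (the alerts_topic_arn output name is hypothetical; use whatever your module exposes):

```bash
# Subscriptions still showing "PendingConfirmation" will drop alerts.
aws sns list-subscriptions-by-topic \
  --topic-arn "$(terraform output -raw alerts_topic_arn)" \
  --query 'Subscriptions[].{endpoint:Endpoint,status:SubscriptionArn}' \
  --output table
```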
resource "aws_cloudwatch_metric_alarm" "ec2_status_check" {
alarm_name = "${var.project_name}-ec2-status-check-failed"
alarm_description = "EC2 instance status check failed - n8n may be down"
comparison_operator = "GreaterThanOrEqualToThreshold"
evaluation_periods = 2 # 2 consecutive failures required
metric_name = "StatusCheckFailed"
namespace = "AWS/EC2"
period = 60 # Check every 60 seconds
statistic = "Maximum"
threshold = 1 # Any failure triggers
alarm_actions = [aws_sns_topic.alerts.arn]
treat_missing_data = "notBreaching" # Important!
dimensions = {
InstanceId = var.instance_id
}
tags = {
Name = "${var.project_name}-ec2-health-alarm"
}
}
Configuration explained:
evaluation_periods = 2: two consecutive failing checks are required before the alarm fires, which filters out one-off blips.
period = 60: the metric is evaluated every 60 seconds, keeping detection fast.
statistic = "Maximum": status checks are binary, so any failed check within the period counts.
threshold = 1: a single failure is enough to mark the period as breaching.
treat_missing_data = "notBreaching": missing data is treated as healthy; "breaching" would alarm on missing data (too sensitive for AWS-managed metrics).
Real-world scenario:
8:00 AM: Instance healthy (StatusCheckFailed = 0)
8:01 AM: Instance fails (StatusCheckFailed = 1) - Evaluation 1
8:02 AM: Instance still failing (StatusCheckFailed = 1) - Evaluation 2 → ALARM
8:03 AM: SNS sends email
8:04 AM: You investigate
Detection time: 2 minutes
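It's worth proving the notification path works before a real incident does it for you. One way (assuming the absence-system project name used throughout) is to force the alarm state by hand; CloudWatch flips it back on the next evaluation:

```bash
# Manually push the alarm into ALARM to confirm SNS delivers the email.
aws cloudwatch set-alarm-state \
  --alarm-name "absence-system-ec2-status-check-failed" \
  --state-value ALARM \
  --state-reason "Testing the SNS notification path"
```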
resource "aws_cloudwatch_metric_alarm" "alb_unhealthy_targets" {
alarm_name = "${var.project_name}-alb-unhealthy-targets"
alarm_description = "ALB target is unhealthy - n8n not responding"
comparison_operator = "GreaterThanOrEqualToThreshold"
evaluation_periods = 2
metric_name = "UnHealthyHostCount"
namespace = "AWS/ApplicationELB"
period = 60
statistic = "Maximum"
threshold = 1
alarm_actions = [aws_sns_topic.alerts.arn]
treat_missing_data = "breaching" # Different from EC2!
dimensions = {
TargetGroup = var.target_group_arn_suffix
LoadBalancer = var.alb_arn_suffix
}
tags = {
Name = "${var.project_name}-alb-unhealthy-alarm"
}
}
Why treat_missing_data = "breaching" here?
This is a key distinction from the EC2 alarm:
EC2 status checks (AWS-managed):
notBreaching = don’t alarm on AWS infrastructure problems
ALB health checks (application-level):
breaching = if we can’t measure health, assume unhealthy
Interview talking point: “I use different treat_missing_data strategies based on what the metric measures. For AWS-managed infrastructure metrics, missing data is usually a collection issue. For application health checks, missing data indicates a real problem—the ALB can’t reach the backend.”
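If you ever need to double-check how deployed alarms handle gaps in data, the setting is visible from the CLI (alarm names here assume the absence-system prefix):

```bash
# Compare how the two alarms treat missing data points.
aws cloudwatch describe-alarms \
  --alarm-names "absence-system-ec2-status-check-failed" \
                "absence-system-alb-unhealthy-targets" \
  --query 'MetricAlarms[].{name:AlarmName,missingData:TreatMissingData}' \
  --output table
```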
resource "aws_cloudwatch_metric_alarm" "response_time_slo" {
alarm_name = "${var.project_name}-response-time-slo"
alarm_description = "Response time SLO breach - 95th percentile exceeds 2 seconds"
comparison_operator = "GreaterThanOrEqualToThreshold"
evaluation_periods = 2
metric_name = "TargetResponseTime"
namespace = "AWS/ApplicationELB"
period = 300 # 5 minutes (not 60!)
extended_statistic = "p95" # 95th percentile
threshold = 2 # 2 seconds
alarm_actions = [aws_sns_topic.alerts.arn]
treat_missing_data = "notBreaching"
dimensions = {
LoadBalancer = var.alb_arn_suffix
}
tags = {
Name = "${var.project_name}-response-time-slo-alarm"
}
}
This alarm is fundamentally different from the others:
| Aspect | Infrastructure Alarms | SLO Alarm |
|---|---|---|
| Purpose | Detect failures | Track user experience |
| Period | 60 seconds | 300 seconds |
| Why | Fast detection | Trend analysis |
| Statistic | Maximum | p95 |
| Why | Binary (up/down) | Percentile (performance) |
| Urgency | Page immediately | Review during business hours |
Why extended_statistic = "p95" instead of statistic = "Average"?
Average hides problems:
Request 1: 100ms
Request 2: 150ms
Request 3: 200ms
Request 4: 5,000ms ← One slow request
Average = 1,362ms (looks okay!)
P95 = 5,000ms (captures the outlier)
P95 reflects the experience of the slowest 5% of requests, the tail that averages hide. That tail is what matters for SLOs.
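The same math in a throwaway shell one-liner, using the four latencies above (nearest-rank percentile, purely illustrative):

```bash
printf '%s\n' 100 150 200 5000 | sort -n | awk '
  { v[NR] = $1; sum += $1 }
  END {
    rank = int(0.95 * NR); if (rank < 0.95 * NR) rank++   # nearest-rank p95
    printf "average = %.0f ms, p95 = %.0f ms\n", sum / NR, v[rank]
  }'
```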
Why period = 300 (5 minutes)? Percentiles need enough samples per window to be meaningful, and performance degradation is a trend rather than an outage, so a 5-minute window smooths noise while still catching sustained slowdowns.
Real-world scenario:
8:00-8:05: P95 = 1.8s (good)
8:05-8:10: P95 = 2.3s (slow) - Evaluation 1
8:10-8:15: P95 = 2.5s (still slow) - Evaluation 2 → ALARM
8:16: Email arrives: "Response time SLO breach - investigate"
Detection time: 10-15 minutes (acceptable for performance degradation)
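To see the exact number the alarm evaluates, you can query the same metric and statistic directly (the load balancer suffix below is a placeholder, and the date commands assume GNU date on Linux):

```bash
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name TargetResponseTime \
  --dimensions Name=LoadBalancer,Value="app/absence-system-alb/xxx" \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time   "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --extended-statistics p95
```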
File: modules/monitoring/variables.tf
variable "project_name" {
description = "Project name for resource naming"
type = string
}
variable "instance_id" {
description = "EC2 instance ID to monitor"
type = string
}
variable "alb_arn_suffix" {
description = "ALB ARN suffix for CloudWatch dimensions"
type = string
}
variable "target_group_arn_suffix" {
description = "Target group ARN suffix for CloudWatch dimensions"
type = string
}
variable "alert_email" {
description = "Email address for CloudWatch alarms"
type = string
}
Why ARN suffix instead of full ARN?
CloudWatch dimensions use ARN suffix:
# Wrong
dimensions = {
LoadBalancer = "arn:aws:elasticloadbalancing:us-east-1:123:loadbalancer/app/alb/xyz"
}
# Right
dimensions = {
LoadBalancer = "app/alb/xyz" # Just the suffix!
}
We extract suffix in load_balancer module:
output "alb_arn_suffix" {
value = aws_lb.main.arn_suffix # Terraform provides this!
}
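If you ever need the same suffix outside Terraform (in a script or a one-off CLI call), it is simply everything after loadbalancer/ in the full ARN, as in this small illustration:

```bash
# The CloudWatch dimension is the tail of the ARN, after "loadbalancer/".
ALB_ARN="arn:aws:elasticloadbalancing:us-east-1:123:loadbalancer/app/alb/xyz"
echo "${ALB_ARN#*loadbalancer/}"   # -> app/alb/xyz
```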
The old way (DON’T do this):
# .github/workflows/deploy.yml
- name: Configure AWS Credentials
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_KEY }}
Problems: the keys are long-lived and never expire on their own, rotation is manual, and a single leak (a misconfigured workflow, a fork, a stray commit) exposes real AWS credentials.
Real-world horror story: A company stored AWS keys in GitHub Secrets. An intern accidentally pushed them to a public repo’s commit history. $50,000 AWS bill from crypto miners before they noticed.
How OIDC works:
1. GitHub Actions job starts
2. GitHub generates short-lived JWT token (signed by GitHub)
3. Job presents token to AWS STS (Security Token Service)
4. AWS verifies:
- Is this really GitHub? (validates signature)
- Is it the right repo? (checks token claims)
- Is it the right branch? (checks ref)
5. AWS issues temporary credentials (15 min expiration)
6. Job uses credentials
7. Credentials automatically expire
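A quick way to see steps 5–7 from inside a workflow run (after the credentials step, shown later) is to ask STS who you are. GitHubActions is the default session name used by aws-actions/configure-aws-credentials:

```bash
# Prints the assumed-role ARN produced by the OIDC exchange, e.g.
# arn:aws:sts::<account-id>:assumed-role/absence-system-github-actions/GitHubActions
aws sts get-caller-identity --query Arn --output text
```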
Benefits: no long-lived secrets stored anywhere, credentials that expire on their own, and access scoped to one repository and one branch.
File: modules/cicd/main.tf
# One-time setup: Trust GitHub's OIDC provider
resource "aws_iam_openid_connect_provider" "github" {
url = "https://token.actions.githubusercontent.com"
client_id_list = ["sts.amazonaws.com"]
# GitHub's certificate thumbprints (static, provided by GitHub)
thumbprint_list = [
"6938fd4d98bab03faadb97b34396831e3780aea1",
"1c58a3a8518e8759bf075b76b750d4f2df264fcd"
]
tags = {
Name = "${var.project_name}-github-oidc"
}
}
What this does: Establishes trust between AWS and GitHub.
resource "aws_iam_role" "github_actions" {
name = "${var.project_name}-github-actions"
description = "Role for GitHub Actions to deploy frontend"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = {
Federated = aws_iam_openid_connect_provider.github.arn
}
Action = "sts:AssumeRoleWithWebIdentity"
Condition = {
StringEquals = {
"token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
}
StringLike = {
# CRITICAL: Only main branch of specific repo can assume
"token.actions.githubusercontent.com:sub" =
"repo:${var.github_org}/${var.github_repo}:ref:refs/heads/main"
}
}
}]
})
tags = {
Name = "${var.project_name}-github-actions-role"
}
}
The security magic is in the Condition:
"token.actions.githubusercontent.com:sub" =
"repo:brent-hollers/absence-system-infrastructure:ref:refs/heads/main"
This means:
Only the brent-hollers/absence-system-infrastructure repo can assume this role
Only the main branch is trusted
Interview talking point: “I use OIDC trust policies to enforce GitOps workflows. Only merged code on main can deploy to production. This prevents accidental deployments from feature branches and enforces code review.”
resource "aws_iam_policy" "github_actions" {
name = "${var.project_name}-github-actions-policy"
description = "Minimum permissions for deploying frontend"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "S3Upload"
Effect = "Allow"
Action = [
"s3:PutObject", # Upload files
"s3:GetObject" # Verify uploads
]
Resource = "arn:aws:s3:::${var.s3_bucket_name}/*"
},
{
Sid = "CloudFrontInvalidation"
Effect = "Allow"
Action = [
"cloudfront:CreateInvalidation", # Clear cache
"cloudfront:ListDistributions" # Find distribution
]
Resource = "*" # ListDistributions requires wildcard
}
]
})
}
# Attach policy to role
resource "aws_iam_role_policy_attachment" "github_actions" {
role = aws_iam_role.github_actions.name
policy_arn = aws_iam_policy.github_actions.arn
}
What’s allowed: uploading and reading objects in the frontend bucket, creating CloudFront invalidations, and listing distributions to find the right one.
What’s NOT allowed: deleting objects, touching any other bucket, or calling any other AWS service (EC2, IAM, RDS, and so on).
Principle: Give just enough permission to do the job, nothing more.
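A simple way to confirm the boundary holds is to run a few commands from a workflow step under the assumed role. The expected failures below are inferred from the policy, not captured output, and S3_BUCKET is a placeholder variable:

```bash
aws s3 cp frontend/index.html "s3://$S3_BUCKET/index.html"   # allowed: s3:PutObject
aws s3 rm "s3://$S3_BUCKET/index.html"                        # should fail: no s3:DeleteObject
aws ec2 describe-instances                                    # should fail: no EC2 permissions
```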
File: .github/workflows/deploy-frontend.yml
name: Deploy Frontend to S3
on:
push:
branches:
- main
paths:
- 'frontend/index.html' # Only trigger when form changes
# Required for OIDC authentication
permissions:
id-token: write # Generate OIDC token
contents: read # Read repo code
jobs:
deploy:
name: Deploy to S3 and Invalidate CloudFront
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Configure AWS Credentials via OIDC
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::005608856189:role/absence-system-github-actions
aws-region: us-east-1
- name: Upload to S3
run: |
aws s3 cp frontend/index.html \
s3://absence-system-frontend-xxx/index.html \
--content-type "text/html" \
--cache-control "max-age=300"
- name: Get CloudFront Distribution ID
id: cloudfront
run: |
DIST_ID=$(aws cloudfront list-distributions \
--query "DistributionList.Items[?contains(Origins.Items[0].DomainName, 'absence-system-frontend')].Id" \
--output text)
echo "distribution_id=$DIST_ID" >> $GITHUB_OUTPUT
- name: Invalidate CloudFront cache
run: |
aws cloudfront create-invalidation \
--distribution-id ${{ steps.cloudfront.outputs.distribution_id }} \
--paths "/index.html"
- name: Deployment summary
run: |
echo "✅ Frontend deployed successfully!"
echo "📦 S3 Bucket: absence-system-frontend-xxx"
echo "🔗 URL: https://form.absences.smaschool.org"
Workflow breakdown:
Trigger:
on:
push:
branches: [main]
paths: ['frontend/index.html']
The workflow only runs when frontend/index.html changes on the main branch.
OIDC authentication:
permissions:
id-token: write # THIS IS CRITICAL!
Cache control:
--cache-control "max-age=300"
max-age=300: Cache for 5 minutes (our choice: a reasonable balance of freshness and CloudFront cost)
max-age=0: No caching (always fetch latest)
max-age=3600: Cache for 1 hour (less fresh, cheaper)
Dynamic distribution lookup:
DIST_ID=$(aws cloudfront list-distributions ...)
The distribution ID is looked up at deploy time rather than hardcoded, so the workflow keeps working even if the distribution is ever recreated.
Invalidation optimization:
--paths "/index.html"
Invalidating /* costs more and takes longer, so we invalidate only the file that changed.
Developer workflow:
1. Developer edits frontend/index.html locally
2. git add frontend/index.html
3. git commit -m "feat: add period selection"
4. git push origin main
↓ (GitHub detects push to main)
5. GitHub Actions starts workflow
6. GitHub generates OIDC token
7. Workflow assumes AWS IAM role (temporary credentials)
8. Upload index.html to S3 (replaces old version)
9. Invalidate CloudFront cache
10. Within 5 minutes: Changes live!
Total time: ~2-3 minutes
Developer involvement: Just git push
Zero manual AWS console work!
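You can confirm the new version is actually being served (rather than a stale cached copy) with a quick header check; X-Cache is the header CloudFront adds, and the exact values depend on which edge you hit:

```bash
curl -sI https://form.absences.smaschool.org/index.html \
  | grep -iE 'x-cache|etag|cache-control'
# Typically "X-Cache: Miss from cloudfront" right after an invalidation,
# then "Hit from cloudfront" once the edge has re-cached the object.
```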
Our absence request workflow:
Teacher submits form
↓
Is date in the past? (already happened)
├─ YES → Auto-approve (retroactive sick leave)
│ → Log to Google Sheets
│ → Email confirmation to teacher
│
└─ NO → Send approval email to principal
↓
Principal clicks Approve/Deny
↓
If APPROVED:
├─ Coverage needed?
│ ├─ YES → Email front desk (periods 1-8)
│ └─ NO → Skip
├─ Log to Google Sheets
└─ Email confirmation to teacher
If DENIED:
└─ Email teacher (explain next steps)
Node 1: Webhook Trigger
{
"path": "/webhook/absence-request",
"method": "POST",
"responseMode": "onReceived",
"authentication": "none"
}
Receives data:
{
"name": "Jane Doe",
"email": "jane@smaschool.org",
"date": "2026-02-20",
"reason": "Medical appointment",
"coverageNeeded": "yes",
"periodsAbsent": ["Period 1", "Period 2", "Period 3"]
}
Node 2: Code - Date Logic
// Check if request date is in the past
const requestDate = new Date(items[0].json.date);
const today = new Date();
today.setHours(0, 0, 0, 0); // Normalize to midnight
const isPastDate = requestDate < today;
return items.map(item => ({
json: {
...item.json,
isPastDate: isPastDate,
requiresApproval: !isPastDate,
requestId: `REQ-${Date.now()}`, // Unique ID for tracking
timestamp: new Date().toISOString()
}
}));
What this adds:
isPastDate: Boolean for routing decision
requiresApproval: Inverse (clearer naming)
requestId: Unique identifier for this request
timestamp: When workflow processed it
Node 3: IF - Branch by Date
// Condition
{{ $json.isPastDate }} === true
Creates two paths: TRUE goes to auto-approval (Node 4a), FALSE goes to the principal approval request (Node 4b).
Node 4a (Past Date): Set - Format Data
{
"status": "Approved - Retroactive",
"approver": "System (Auto-approved sick leave)",
"approvalDate": "{{ $now }}",
"requiresCoverage": false
}
Node 4b (Future Date): Gmail - Send Approval Request
<h2>Absence Request from {{ $json.name }}</h2>
<p><strong>Email:</strong> {{ $json.email }}</p>
<p><strong>Date:</strong> {{ $json.date }}</p>
<p><strong>Reason:</strong> {{ $json.reason }}</p>
<p><strong>Coverage Required:</strong> {{ $json.coverageNeeded }}</p>
<p><strong>Periods:</strong> {{ $json.periodsAbsent.join(', ') }}</p>
<p>
<a href="https://absences.smaschool.org/webhook/approve?id={{ $json.requestId }}"
style="background: #28a745; color: white; padding: 12px 24px; text-decoration: none; border-radius: 4px;">
✅ Approve Request
</a>
<a href="https://absences.smaschool.org/webhook/deny?id={{ $json.requestId }}"
style="background: #dc3545; color: white; padding: 12px 24px; text-decoration: none; border-radius: 4px;">
❌ Deny Request
</a>
</p>
Principal receives an email with the request details and one-click Approve/Deny buttons.
Node 5: Wait for Webhook (Approval Response)
{
"path": "/webhook/approve",
"method": "GET"
}
Workflow pauses until principal clicks button.
When clicked:
GET https://absences.smaschool.org/webhook/approve?id=REQ-1708534800000
Workflow resumes with approval decision.
Node 6: IF - Check Coverage
{{ $json.coverageNeeded }} === "yes"
If TRUE: the front desk is emailed to arrange substitute coverage (Node 7).
If FALSE: the workflow skips straight to logging and confirmation.
Node 7: Gmail - Notify Front Desk
<h2>Coverage Needed: {{ $json.date }}</h2>
<p><strong>Teacher:</strong> {{ $json.name }}</p>
<p><strong>Date:</strong> {{ $json.date }}</p>
<p><strong>Periods requiring coverage:</strong></p>
<ul>
{{ $json.periodsAbsent.map(p => `<li>${p}</li>`).join('') }}
</ul>
<p>Please arrange substitute coverage for the periods listed above.</p>
Node 8: Google Sheets - Append Row
{
"Name": "{{ $json.name }}",
"Email": "{{ $json.email }}",
"Date": "{{ $json.date }}",
"Reason": "{{ $json.reason }}",
"Status": "{{ $json.status }}",
"Approver": "{{ $json.approver }}",
"Coverage": "{{ $json.coverageNeeded }}",
"Periods": "{{ $json.periodsAbsent.join(', ') }}",
"Timestamp": "{{ $json.timestamp }}"
}
Creates audit trail in Google Sheets.
Sheet columns:
| Name | Email | Date | Reason | Status | Approver | Coverage | Periods | Timestamp |
|---|---|---|---|---|---|---|---|---|
| Jane Doe | jane@… | 2026-02-20 | Medical | Approved | Principal | yes | P1, P2, P3 | 2026-02-19T15:30:00Z |
Node 9: Gmail - Confirmation to Teacher
<h2>Absence Request {{ $json.status }}</h2>
<p>Dear {{ $json.name }},</p>
<p>Your absence request for {{ $json.date }} has been {{ $json.status }}.</p>
<p><strong>Request Details:</strong></p>
<ul>
<li>Date: {{ $json.date }}</li>
<li>Reason: {{ $json.reason }}</li>
<li>Status: {{ $json.status }}</li>
<li>Request ID: {{ $json.requestId }}</li>
</ul>
<p>This request has been logged and the front desk has been notified.</p>
We export the workflow JSON and store it in Git:
workflows/
└── absence-approval-workflow.json
Benefits: version history for the workflow, code review on changes, and an easy restore path if the n8n instance ever has to be rebuilt.
Importing workflow:
# In n8n UI:
# Workflows → Import from File → Select JSON
# Or via API:
curl -X POST https://absences.smaschool.org/api/v1/workflows \
-H "Content-Type: application/json" \
-d @workflows/absence-approval-workflow.json
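To confirm the import landed, you can list workflows through the same API. This sketch assumes the public API's X-N8N-API-KEY header (generated under the n8n UI's API settings) and the usual { data: [...] } response shape:

```bash
curl -s https://absences.smaschool.org/api/v1/workflows \
  -H "X-N8N-API-KEY: $N8N_API_KEY" \
  | jq -r '.data[].name'
```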
Test 1: Date logic
curl -X POST http://localhost:5678/webhook-test/abc123 \
-H "Content-Type: application/json" \
-d '{
"date": "2026-02-10",
"name": "Test"
}'
# Expected: isPastDate = true
Test 2: Future date
curl -X POST http://localhost:5678/webhook-test/abc123 \
-H "Content-Type: application/json" \
-d '{
"date": "2026-03-01",
"name": "Test"
}'
# Expected: isPastDate = false, approval email sent
Test scenario: Complete approval workflow
# 1. Submit future absence request
curl -X POST https://absences.smaschool.org/webhook/absence-request \
-H "Content-Type: application/json" \
-d '{
"name": "Integration Test",
"email": "test@smaschool.org",
"date": "2026-03-15",
"reason": "Vacation",
"coverageNeeded": "yes",
"periodsAbsent": ["Period 1", "Period 2"]
}'
# 2. Check email inbox - approval request received?
# 3. Click "Approve" button
# 4. Check Google Sheets - row added?
# 5. Check email - confirmation sent to teacher?
# 6. Check email - front desk notified of coverage?
Success criteria: every step above completes without manual intervention and without errors in the n8n execution log.
Simulate load with 100 requests at a concurrency of 10:
# Using Apache Bench
ab -n 100 -c 10 -p request.json -T application/json \
https://absences.smaschool.org/webhook/absence-request
# Expected results:
# - All requests succeed (100%)
# - P95 response time < 2 seconds
# - No errors in CloudWatch logs
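The ab command posts whatever is in request.json; a minimal payload matching the fields the webhook expects might look like this (coverageNeeded is "no" so the test doesn't email the front desk):

```bash
cat > request.json <<'EOF'
{
  "name": "Load Test",
  "email": "loadtest@smaschool.org",
  "date": "2026-03-15",
  "reason": "Load testing",
  "coverageNeeded": "no",
  "periodsAbsent": []
}
EOF
```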
Complete deployment flow:
Developer changes index.html
↓
git push origin main
↓
GitHub Actions (OIDC auth)
↓
Upload to S3 + Invalidate CloudFront
↓
Changes live in 5 minutes
Teacher submits form
↓
CloudFront serves form
↓
Form POSTs to ALB (HTTPS)
↓
ALB → EC2 (private subnet)
↓
n8n processes workflow
├─ Past date → Auto-approve
└─ Future date → Principal approval
↓
Google Sheets logging
↓
Email confirmations
CloudWatch monitors
├─ EC2 health
├─ ALB targets
└─ Response time (SLO)
↓
SNS alerts if threshold exceeded
Everything is automated.
We’ve built the complete operational stack—infrastructure, monitoring, CI/CD, and workflows.
In Part 4 (the final article), we’ll reflect on cost, the challenges we hit, and what comes next.
This is Part 3 of a 4-part series. Part 4: Lessons Learned - Cost, Challenges, and What’s Next wraps up with honest reflections and future thinking.
Questions? Connect on LinkedIn or explore the GitHub repo.
Tags: #DevOps #CI/CD #CloudWatch #Monitoring #SRE #GitHubActions #OIDC #Automation #n8n