Part 3 of 4: Building Production-Grade Infrastructure
In Part 1, we identified the business problem and designed our architecture. In Part 2, we built secure, modular infrastructure with Terraform.
But infrastructure alone isn’t production-ready.
A system is only as good as our ability to observe it, deploy changes to it safely, and recover it quickly when something breaks.
This article covers making infrastructure operational: SLO-based monitoring and alerting with CloudWatch, keyless CI/CD with GitHub Actions and OIDC, the n8n approval workflow itself, and end-to-end testing.
Target audience: DevOps engineers, SREs, and anyone responsible for keeping systems running.
Early in my career, I made this promise:
“Our system will have 99.999% uptime.”
Five nines = 5.26 minutes downtime per year.
Then reality hit: our total actual uptime came in at 98.9% (not even three nines).
The lesson: Stop promising impossible uptime. Instead, set Service Level Objectives (SLOs) based on what users actually need.
Definitions:
| Term | Definition | Example |
|---|---|---|
| SLI (Indicator) | What we measure | “P95 response time” |
| SLO (Objective) | Target for measurement | “P95 < 2 seconds” |
| SLA (Agreement) | Contractual commitment with consequences | “99.9% uptime or refund” |
For our school absence system:
SLI: Percentage of successful absence submissions
SLO: 99.5% of submissions succeed
Error Budget: 0.5% = 5 failed requests per 1,000
SLI: P95 response time for form submission
SLO: 95th percentile < 2 seconds
Error Budget: 5% of requests can be slower
SLI: System availability (uptime)
SLO: 99.9% uptime per month
Error Budget: 43.8 minutes downtime allowed
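To make the budget concrete, here is a minimal shell sketch (the variable names and the 30-day month are illustrative assumptions, not part of the project) that converts measured uptime into remaining error-budget minutes:

```bash
# Sketch: remaining error budget for a 99.9% availability SLO over a 30-day month.
SLO=99.9
MEASURED=99.95                   # e.g., pulled from CloudWatch or a status page
MONTH_MINUTES=$((30 * 24 * 60))  # 43,200 minutes

BUDGET=$(echo "$MONTH_MINUTES * (100 - $SLO) / 100" | bc -l)      # 43.2 min
USED=$(echo "$MONTH_MINUTES * (100 - $MEASURED) / 100" | bc -l)   # 21.6 min
REMAINING=$(echo "$BUDGET - $USED" | bc -l)

printf "budget %.1f min, used %.1f min, remaining %.1f min\n" "$BUDGET" "$USED" "$REMAINING"
```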
Why error budgets matter:
If we’re at 99.95% uptime (exceeding our 99.9% SLO), we have budget left to spend on risk: deploys, experiments, planned maintenance.
If we’re at 99.85% uptime (below SLO), we freeze feature work and spend the time on reliability.
Interview talking point: “I use error budgets to balance innovation with reliability. SLOs aren’t ceilings to hit—they’re contracts with users. If we’re exceeding SLO, we can take risks. If we’re missing SLO, we focus on stability.”
Three types of alarms:
Symptom-based (user-facing problems)
Cause-based (infrastructure problems)
SLO-based (error budget consumption)
We prioritize symptom-based alarms because they tell us what users experience.
File: modules/monitoring/main.tf
# SNS Topic - central alerting hub
resource "aws_sns_topic" "alerts" {
name = "${var.project_name}-alerts"
tags = {
Name = "${var.project_name}-alerts"
}
}
# Email subscription (must confirm via email)
resource "aws_sns_topic_subscription" "email" {
topic_arn = aws_sns_topic.alerts.arn
protocol = "email"
endpoint = var.alert_email
}
How it works: each CloudWatch alarm publishes to this SNS topic, and the topic fans the message out to every confirmed subscription. The email endpoint must click the confirmation link AWS sends before it receives any alerts.
Production tip: Use PagerDuty, Slack, or OpsGenie instead of email for real production systems. Email is fine for small teams.
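One gotcha worth checking after apply: the email subscription stays pending until someone clicks the confirmation link, and pending subscriptions silently receive nothing. A quick check (the alerts_topic_arn output name is hypothetical; use whatever your module exposes):

```bash
# Subscriptions still showing "PendingConfirmation" will drop alerts.
aws sns list-subscriptions-by-topic \
  --topic-arn "$(terraform output -raw alerts_topic_arn)" \
  --query 'Subscriptions[].{endpoint:Endpoint,status:SubscriptionArn}' \
  --output table
```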
resource "aws_cloudwatch_metric_alarm" "ec2_status_check" {
alarm_name = "${var.project_name}-ec2-status-check-failed"
alarm_description = "EC2 instance status check failed - n8n may be down"
comparison_operator = "GreaterThanOrEqualToThreshold"
evaluation_periods = 2 # 2 consecutive failures required
metric_name = "StatusCheckFailed"
namespace = "AWS/EC2"
period = 60 # Check every 60 seconds
statistic = "Maximum"
threshold = 1 # Any failure triggers
alarm_actions = [aws_sns_topic.alerts.arn]
treat_missing_data = "notBreaching" # Important!
dimensions = {
InstanceId = var.instance_id
}
tags = {
Name = "${var.project_name}-ec2-health-alarm"
}
}
Configuration explained:
evaluation_periods = 2: two consecutive failing checks are required before the alarm fires, which filters out one-off blips.
period = 60: the metric is evaluated every 60 seconds, keeping detection fast.
statistic = "Maximum": status checks are binary, so any failed check within the period counts.
threshold = 1: a single failure is enough to mark the period as breaching.
treat_missing_data = "notBreaching": missing data is treated as healthy; "breaching" would alarm on missing data (too sensitive for AWS-managed metrics).
Real-world scenario:
8:00 AM: Instance healthy (StatusCheckFailed = 0)
8:01 AM: Instance fails (StatusCheckFailed = 1) - Evaluation 1
8:02 AM: Instance still failing (StatusCheckFailed = 1) - Evaluation 2 → ALARM
8:03 AM: SNS sends email
8:04 AM: You investigate
Detection time: 2 minutes
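It's worth proving the notification path works before a real incident does it for you. One way (assuming the absence-system project name used throughout) is to force the alarm state by hand; CloudWatch flips it back on the next evaluation:

```bash
# Manually push the alarm into ALARM to confirm SNS delivers the email.
aws cloudwatch set-alarm-state \
  --alarm-name "absence-system-ec2-status-check-failed" \
  --state-value ALARM \
  --state-reason "Testing the SNS notification path"
```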
resource "aws_cloudwatch_metric_alarm" "alb_unhealthy_targets" {
alarm_name = "${var.project_name}-alb-unhealthy-targets"
alarm_description = "ALB target is unhealthy - n8n not responding"
comparison_operator = "GreaterThanOrEqualToThreshold"
evaluation_periods = 2
metric_name = "UnHealthyHostCount"
namespace = "AWS/ApplicationELB"
period = 60
statistic = "Maximum"
threshold = 1
alarm_actions = [aws_sns_topic.alerts.arn]
treat_missing_data = "breaching" # Different from EC2!
dimensions = {
TargetGroup = var.target_group_arn_suffix
LoadBalancer = var.alb_arn_suffix
}
tags = {
Name = "${var.project_name}-alb-unhealthy-alarm"
}
}
Why treat_missing_data = "breaching" here?
This is a key distinction from the EC2 alarm:
EC2 status checks (AWS-managed):
notBreaching = don’t alarm on AWS infrastructure problems
ALB health checks (application-level):
breaching = if we can’t measure health, assume unhealthy
Interview talking point: “I use different treat_missing_data strategies based on what the metric measures. For AWS-managed infrastructure metrics, missing data is usually a collection issue. For application health checks, missing data indicates a real problem—the ALB can’t reach the backend.”
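If you ever need to double-check how deployed alarms handle gaps in data, the setting is visible from the CLI (alarm names here assume the absence-system prefix):

```bash
# Compare how the two alarms treat missing data points.
aws cloudwatch describe-alarms \
  --alarm-names "absence-system-ec2-status-check-failed" \
                "absence-system-alb-unhealthy-targets" \
  --query 'MetricAlarms[].{name:AlarmName,missingData:TreatMissingData}' \
  --output table
```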
resource "aws_cloudwatch_metric_alarm" "response_time_slo" {
alarm_name = "${var.project_name}-response-time-slo"
alarm_description = "Response time SLO breach - 95th percentile exceeds 2 seconds"
comparison_operator = "GreaterThanOrEqualToThreshold"
evaluation_periods = 2
metric_name = "TargetResponseTime"
namespace = "AWS/ApplicationELB"
period = 300 # 5 minutes (not 60!)
extended_statistic = "p95" # 95th percentile
threshold = 2 # 2 seconds
alarm_actions = [aws_sns_topic.alerts.arn]
treat_missing_data = "notBreaching"
dimensions = {
LoadBalancer = var.alb_arn_suffix
}
tags = {
Name = "${var.project_name}-response-time-slo-alarm"
}
}
This alarm is fundamentally different from the others:
| Aspect | Infrastructure Alarms | SLO Alarm |
|---|---|---|
| Purpose | Detect failures | Track user experience |
| Period | 60 seconds | 300 seconds |
| Why | Fast detection | Trend analysis |
| Statistic | Maximum | p95 |
| Why | Binary (up/down) | Percentile (performance) |
| Urgency | Page immediately | Review during business hours |
Why extended_statistic = "p95" instead of statistic = "Average"?
Average hides problems:
Request 1: 100ms
Request 2: 150ms
Request 3: 200ms
Request 4: 5,000ms ← One slow request
Average = 1,362ms (looks okay!)
P95 = 5,000ms (captures the outlier)
P95 reflects the experience of the slowest 5% of requests, the tail that averages hide. That tail is what matters for SLOs.
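The same math in a throwaway shell one-liner, using the four latencies above (nearest-rank percentile, purely illustrative):

```bash
printf '%s\n' 100 150 200 5000 | sort -n | awk '
  { v[NR] = $1; sum += $1 }
  END {
    rank = int(0.95 * NR); if (rank < 0.95 * NR) rank++   # nearest-rank p95
    printf "average = %.0f ms, p95 = %.0f ms\n", sum / NR, v[rank]
  }'
```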
Why period = 300 (5 minutes)? Percentiles need enough samples per window to be meaningful, and performance degradation is a trend rather than an outage, so a 5-minute window smooths noise while still catching sustained slowdowns.
Real-world scenario:
8:00-8:05: P95 = 1.8s (good)
8:05-8:10: P95 = 2.3s (slow) - Evaluation 1
8:10-8:15: P95 = 2.5s (still slow) - Evaluation 2 → ALARM
8:16: Email arrives: "Response time SLO breach - investigate"
Detection time: 10-15 minutes (acceptable for performance degradation)
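To see the exact number the alarm evaluates, you can query the same metric and statistic directly (the load balancer suffix below is a placeholder, and the date commands assume GNU date on Linux):

```bash
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name TargetResponseTime \
  --dimensions Name=LoadBalancer,Value="app/absence-system-alb/xxx" \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time   "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --extended-statistics p95
```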
File: modules/monitoring/variables.tf
variable "project_name" {
description = "Project name for resource naming"
type = string
}
variable "instance_id" {
description = "EC2 instance ID to monitor"
type = string
}
variable "alb_arn_suffix" {
description = "ALB ARN suffix for CloudWatch dimensions"
type = string
}
variable "target_group_arn_suffix" {
description = "Target group ARN suffix for CloudWatch dimensions"
type = string
}
variable "alert_email" {
description = "Email address for CloudWatch alarms"
type = string
}
Why ARN suffix instead of full ARN?
CloudWatch dimensions use ARN suffix:
# Wrong
dimensions = {
LoadBalancer = "arn:aws:elasticloadbalancing:us-east-1:123:loadbalancer/app/alb/xyz"
}
# Right
dimensions = {
LoadBalancer = "app/alb/xyz" # Just the suffix!
}
We extract suffix in load_balancer module:
output "alb_arn_suffix" {
value = aws_lb.main.arn_suffix # Terraform provides this!
}
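If you ever need the same suffix outside Terraform (in a script or a one-off CLI call), it is simply everything after loadbalancer/ in the full ARN, as in this small illustration:

```bash
# The CloudWatch dimension is the tail of the ARN, after "loadbalancer/".
ALB_ARN="arn:aws:elasticloadbalancing:us-east-1:123:loadbalancer/app/alb/xyz"
echo "${ALB_ARN#*loadbalancer/}"   # -> app/alb/xyz
```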
The old way (DON’T do this):
# .github/workflows/deploy.yml
- name: Configure AWS Credentials
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_KEY }}
Problems: the keys are long-lived and never expire on their own, rotation is manual, and a single leak (a misconfigured workflow, a fork, a stray commit) exposes real AWS credentials.
Real-world horror story: A company stored AWS keys in GitHub Secrets. An intern accidentally pushed them to a public repo’s commit history. $50,000 AWS bill from crypto miners before they noticed.
How OIDC works:
1. GitHub Actions job starts
2. GitHub generates short-lived JWT token (signed by GitHub)
3. Job presents token to AWS STS (Security Token Service)
4. AWS verifies:
- Is this really GitHub? (validates signature)
- Is it the right repo? (checks token claims)
- Is it the right branch? (checks ref)
5. AWS issues temporary credentials (15 min expiration)
6. Job uses credentials
7. Credentials automatically expire
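A quick way to see steps 5–7 from inside a workflow run (after the credentials step, shown later) is to ask STS who you are. GitHubActions is the default session name used by aws-actions/configure-aws-credentials:

```bash
# Prints the assumed-role ARN produced by the OIDC exchange, e.g.
# arn:aws:sts::<account-id>:assumed-role/absence-system-github-actions/GitHubActions
aws sts get-caller-identity --query Arn --output text
```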
Benefits: no long-lived secrets stored anywhere, credentials that expire on their own, and access scoped to one repository and one branch.
File: modules/cicd/main.tf
# One-time setup: Trust GitHub's OIDC provider
resource "aws_iam_openid_connect_provider" "github" {
url = "https://token.actions.githubusercontent.com"
client_id_list = ["sts.amazonaws.com"]
# GitHub's certificate thumbprints (static, provided by GitHub)
thumbprint_list = [
"6938fd4d98bab03faadb97b34396831e3780aea1",
"1c58a3a8518e8759bf075b76b750d4f2df264fcd"
]
tags = {
Name = "${var.project_name}-github-oidc"
}
}
What this does: Establishes trust between AWS and GitHub.
resource "aws_iam_role" "github_actions" {
name = "${var.project_name}-github-actions"
description = "Role for GitHub Actions to deploy frontend"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = {
Federated = aws_iam_openid_connect_provider.github.arn
}
Action = "sts:AssumeRoleWithWebIdentity"
Condition = {
StringEquals = {
"token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
}
StringLike = {
# CRITICAL: Only main branch of specific repo can assume
"token.actions.githubusercontent.com:sub" =
"repo:${var.github_org}/${var.github_repo}:ref:refs/heads/main"
}
}
}]
})
tags = {
Name = "${var.project_name}-github-actions-role"
}
}
The security magic is in the Condition:
"token.actions.githubusercontent.com:sub" =
"repo:brent-hollers/absence-system-infrastructure:ref:refs/heads/main"
This means:
Only the brent-hollers/absence-system-infrastructure repo can assume this role
Only the main branch is trusted
Interview talking point: “I use OIDC trust policies to enforce GitOps workflows. Only merged code on main can deploy to production. This prevents accidental deployments from feature branches and enforces code review.”
resource "aws_iam_policy" "github_actions" {
name = "${var.project_name}-github-actions-policy"
description = "Minimum permissions for deploying frontend"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "S3Upload"
Effect = "Allow"
Action = [
"s3:PutObject", # Upload files
"s3:GetObject" # Verify uploads
]
Resource = "arn:aws:s3:::${var.s3_bucket_name}/*"
},
{
Sid = "CloudFrontInvalidation"
Effect = "Allow"
Action = [
"cloudfront:CreateInvalidation", # Clear cache
"cloudfront:ListDistributions" # Find distribution
]
Resource = "*" # ListDistributions requires wildcard
}
]
})
}
# Attach policy to role
resource "aws_iam_role_policy_attachment" "github_actions" {
role = aws_iam_role.github_actions.name
policy_arn = aws_iam_policy.github_actions.arn
}
What’s allowed: uploading and reading objects in the frontend bucket, creating CloudFront invalidations, and listing distributions to find the right one.
What’s NOT allowed: deleting objects, touching any other bucket, or calling any other AWS service (EC2, IAM, RDS, and so on).
Principle: Give just enough permission to do the job, nothing more.
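A simple way to confirm the boundary holds is to run a few commands from a workflow step under the assumed role. The expected failures below are inferred from the policy, not captured output, and S3_BUCKET is a placeholder variable:

```bash
aws s3 cp frontend/index.html "s3://$S3_BUCKET/index.html"   # allowed: s3:PutObject
aws s3 rm "s3://$S3_BUCKET/index.html"                        # should fail: no s3:DeleteObject
aws ec2 describe-instances                                    # should fail: no EC2 permissions
```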
File: .github/workflows/deploy-frontend.yml
name: Deploy Frontend to S3
on:
push:
branches:
- main
paths:
- 'frontend/index.html' # Only trigger when form changes
# Required for OIDC authentication
permissions:
id-token: write # Generate OIDC token
contents: read # Read repo code
jobs:
deploy:
name: Deploy to S3 and Invalidate CloudFront
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Configure AWS Credentials via OIDC
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::005608856189:role/absence-system-github-actions
aws-region: us-east-1
- name: Upload to S3
run: |
aws s3 cp frontend/index.html \
s3://absence-system-frontend-xxx/index.html \
--content-type "text/html" \
--cache-control "max-age=300"
- name: Get CloudFront Distribution ID
id: cloudfront
run: |
DIST_ID=$(aws cloudfront list-distributions \
--query "DistributionList.Items[?contains(Origins.Items[0].DomainName, 'absence-system-frontend')].Id" \
--output text)
echo "distribution_id=$DIST_ID" >> $GITHUB_OUTPUT
- name: Invalidate CloudFront cache
run: |
aws cloudfront create-invalidation \
--distribution-id ${{ steps.cloudfront.outputs.distribution_id }} \
--paths "/index.html"
- name: Deployment summary
run: |
echo "✅ Frontend deployed successfully!"
echo "📦 S3 Bucket: absence-system-frontend-xxx"
echo "🔗 URL: https://form.absences.smaschool.org"
Workflow breakdown:
Trigger:
on:
push:
branches: [main]
paths: ['frontend/index.html']
The workflow only runs when frontend/index.html changes on the main branch.
OIDC authentication:
permissions:
id-token: write # THIS IS CRITICAL!
Cache control:
--cache-control "max-age=300"
max-age=300: Cache for 5 minutes (our choice: a reasonable balance of freshness and CloudFront cost)
max-age=0: No caching (always fetch latest)
max-age=3600: Cache for 1 hour (less fresh, cheaper)
Dynamic distribution lookup:
DIST_ID=$(aws cloudfront list-distributions ...)
The distribution ID is looked up at deploy time rather than hardcoded, so the workflow keeps working even if the distribution is ever recreated.
Invalidation optimization:
--paths "/index.html"
Invalidating /* costs more and takes longer, so we invalidate only the file that changed.
Developer workflow:
1. Developer edits frontend/index.html locally
2. git add frontend/index.html
3. git commit -m "feat: add period selection"
4. git push origin main
↓ (GitHub detects push to main)
5. GitHub Actions starts workflow
6. GitHub generates OIDC token
7. Workflow assumes AWS IAM role (temporary credentials)
8. Upload index.html to S3 (replaces old version)
9. Invalidate CloudFront cache
10. Within 5 minutes: Changes live!
Total time: ~2-3 minutes
Developer involvement: Just git push
Zero manual AWS console work!
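You can confirm the new version is actually being served (rather than a stale cached copy) with a quick header check; X-Cache is the header CloudFront adds, and the exact values depend on which edge you hit:

```bash
curl -sI https://form.absences.smaschool.org/index.html \
  | grep -iE 'x-cache|etag|cache-control'
# Typically "X-Cache: Miss from cloudfront" right after an invalidation,
# then "Hit from cloudfront" once the edge has re-cached the object.
```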
Our absence request workflow:
Teacher submits form
↓
Is date in the past? (already happened)
├─ YES → Auto-approve (retroactive sick leave)
│ → Log to Google Sheets
│ → Email confirmation to teacher
│
└─ NO → Send approval email to principal
↓
Principal clicks Approve/Deny
↓
If APPROVED:
├─ Coverage needed?
│ ├─ YES → Email front desk (periods 1-8)
│ └─ NO → Skip
├─ Log to Google Sheets
└─ Email confirmation to teacher
If DENIED:
└─ Email teacher (explain next steps)
Node 1: Webhook Trigger
{
"path": "/webhook/absence-request",
"method": "POST",
"responseMode": "onReceived",
"authentication": "none"
}
Receives data:
{
"name": "Jane Doe",
"email": "jane@smaschool.org",
"date": "2026-02-20",
"reason": "Medical appointment",
"coverageNeeded": "yes",
"periodsAbsent": ["Period 1", "Period 2", "Period 3"]
}
Node 2: Code - Date Logic
// Check if request date is in the past
const requestDate = new Date(items[0].json.date);
const today = new Date();
today.setHours(0, 0, 0, 0); // Normalize to midnight
const isPastDate = requestDate < today;
return items.map(item => ({
json: {
...item.json,
isPastDate: isPastDate,
requiresApproval: !isPastDate,
requestId: `REQ-${Date.now()}`, // Unique ID for tracking
timestamp: new Date().toISOString()
}
}));
What this adds:
isPastDate: Boolean for routing decision
requiresApproval: Inverse (clearer naming)
requestId: Unique identifier for this request
timestamp: When workflow processed it
Node 3: IF - Branch by Date
// Condition
{{ $json.isPastDate }} === true
Creates two paths: TRUE goes to auto-approval (Node 4a), FALSE goes to the principal approval request (Node 4b).
Node 4a (Past Date): Set - Format Data
{
"status": "Approved - Retroactive",
"approver": "System (Auto-approved sick leave)",
"approvalDate": "{{ $now }}",
"requiresCoverage": false
}
Node 4b (Future Date): Gmail - Send Approval Request
<h2>Absence Request from {{ $json.name }}</h2>
<p><strong>Email:</strong> {{ $json.email }}</p>
<p><strong>Date:</strong> {{ $json.date }}</p>
<p><strong>Reason:</strong> {{ $json.reason }}</p>
<p><strong>Coverage Required:</strong> {{ $json.coverageNeeded }}</p>
<p><strong>Periods:</strong> {{ $json.periodsAbsent.join(', ') }}</p>
<p>
<a href="https://absences.smaschool.org/webhook/approve?id={{ $json.requestId }}"
style="background: #28a745; color: white; padding: 12px 24px; text-decoration: none; border-radius: 4px;">
✅ Approve Request
</a>
<a href="https://absences.smaschool.org/webhook/deny?id={{ $json.requestId }}"
style="background: #dc3545; color: white; padding: 12px 24px; text-decoration: none; border-radius: 4px;">
❌ Deny Request
</a>
</p>
Principal receives an email with the request details and one-click Approve/Deny buttons.
Node 5: Wait for Webhook (Approval Response)
{
"path": "/webhook/approve",
"method": "GET"
}
Workflow pauses until principal clicks button.
When clicked:
GET https://absences.smaschool.org/webhook/approve?id=REQ-1708534800000
Workflow resumes with approval decision.
Node 6: IF - Check Coverage
{{ $json.coverageNeeded }} === "yes"
If TRUE: the front desk is emailed to arrange substitute coverage (Node 7).
If FALSE: the workflow skips straight to logging and confirmation.
Node 7: Gmail - Notify Front Desk
<h2>Coverage Needed: {{ $json.date }}</h2>
<p><strong>Teacher:</strong> {{ $json.name }}</p>
<p><strong>Date:</strong> {{ $json.date }}</p>
<p><strong>Periods requiring coverage:</strong></p>
<ul>
{{ $json.periodsAbsent.map(p => `<li>${p}</li>`).join('') }}
</ul>
<p>Please arrange substitute coverage for the periods listed above.</p>
Node 8: Google Sheets - Append Row
{
"Name": "{{ $json.name }}",
"Email": "{{ $json.email }}",
"Date": "{{ $json.date }}",
"Reason": "{{ $json.reason }}",
"Status": "{{ $json.status }}",
"Approver": "{{ $json.approver }}",
"Coverage": "{{ $json.coverageNeeded }}",
"Periods": "{{ $json.periodsAbsent.join(', ') }}",
"Timestamp": "{{ $json.timestamp }}"
}
Creates audit trail in Google Sheets.
Sheet columns:
| Name | Email | Date | Reason | Status | Approver | Coverage | Periods | Timestamp |
|---|---|---|---|---|---|---|---|---|
| Jane Doe | jane@… | 2026-02-20 | Medical | Approved | Principal | yes | P1, P2, P3 | 2026-02-19T15:30:00Z |
Node 9: Gmail - Confirmation to Teacher
<h2>Absence Request {{ $json.status }}</h2>
<p>Dear {{ $json.name }},</p>
<p>Your absence request for {{ $json.date }} has been {{ $json.status }}.</p>
<p><strong>Request Details:</strong></p>
<ul>
<li>Date: {{ $json.date }}</li>
<li>Reason: {{ $json.reason }}</li>
<li>Status: {{ $json.status }}</li>
<li>Request ID: {{ $json.requestId }}</li>
</ul>
<p>This request has been logged and the front desk has been notified.</p>
We export the workflow JSON and store it in Git:
workflows/
└── absence-approval-workflow.json
Benefits: version history for the workflow, code review on changes, and an easy restore path if the n8n instance ever has to be rebuilt.
Importing workflow:
# In n8n UI:
# Workflows → Import from File → Select JSON
# Or via API:
curl -X POST https://absences.smaschool.org/api/v1/workflows \
-H "Content-Type: application/json" \
-d @workflows/absence-approval-workflow.json
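To confirm the import landed, you can list workflows through the same API. This sketch assumes the public API's X-N8N-API-KEY header (generated under the n8n UI's API settings) and the usual { data: [...] } response shape:

```bash
curl -s https://absences.smaschool.org/api/v1/workflows \
  -H "X-N8N-API-KEY: $N8N_API_KEY" \
  | jq -r '.data[].name'
```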
Test 1: Date logic
curl -X POST http://localhost:5678/webhook-test/abc123 \
-H "Content-Type: application/json" \
-d '{
"date": "2026-02-10",
"name": "Test"
}'
# Expected: isPastDate = true
Test 2: Future date
curl -X POST http://localhost:5678/webhook-test/abc123 \
-H "Content-Type: application/json" \
-d '{
"date": "2026-03-01",
"name": "Test"
}'
# Expected: isPastDate = false, approval email sent
Test scenario: Complete approval workflow
# 1. Submit future absence request
curl -X POST https://absences.smaschool.org/webhook/absence-request \
-H "Content-Type: application/json" \
-d '{
"name": "Integration Test",
"email": "test@smaschool.org",
"date": "2026-03-15",
"reason": "Vacation",
"coverageNeeded": "yes",
"periodsAbsent": ["Period 1", "Period 2"]
}'
# 2. Check email inbox - approval request received?
# 3. Click "Approve" button
# 4. Check Google Sheets - row added?
# 5. Check email - confirmation sent to teacher?
# 6. Check email - front desk notified of coverage?
Success criteria: every step above completes without manual intervention and without errors in the n8n execution log.
Simulate load with 100 requests at a concurrency of 10:
# Using Apache Bench
ab -n 100 -c 10 -p request.json -T application/json \
https://absences.smaschool.org/webhook/absence-request
# Expected results:
# - All requests succeed (100%)
# - P95 response time < 2 seconds
# - No errors in CloudWatch logs
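The ab command posts whatever is in request.json; a minimal payload matching the fields the webhook expects might look like this (coverageNeeded is "no" so the test doesn't email the front desk):

```bash
cat > request.json <<'EOF'
{
  "name": "Load Test",
  "email": "loadtest@smaschool.org",
  "date": "2026-03-15",
  "reason": "Load testing",
  "coverageNeeded": "no",
  "periodsAbsent": []
}
EOF
```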
Complete deployment flow:
Developer changes index.html
↓
git push origin main
↓
GitHub Actions (OIDC auth)
↓
Upload to S3 + Invalidate CloudFront
↓
Changes live in 5 minutes
Teacher submits form
↓
CloudFront serves form
↓
Form POSTs to ALB (HTTPS)
↓
ALB → EC2 (private subnet)
↓
n8n processes workflow
├─ Past date → Auto-approve
└─ Future date → Principal approval
↓
Google Sheets logging
↓
Email confirmations
CloudWatch monitors
├─ EC2 health
├─ ALB targets
└─ Response time (SLO)
↓
SNS alerts if threshold exceeded
Everything is automated.
We’ve built the complete operational stack—infrastructure, monitoring, CI/CD, and workflows.
In Part 4 (the final article), we’ll reflect on cost, the challenges we hit, and what comes next.
This is Part 3 of a 4-part series. Part 4: Lessons Learned - Cost, Challenges, and What’s Next wraps up with honest reflections and future thinking.
Questions? Connect on LinkedIn or explore the GitHub repo.
Tags: #DevOps #CI/CD #CloudWatch #Monitoring #SRE #GitHubActions #OIDC #Automation #n8n