IaC Part 1: From Manual Chaos to Infrastructure as Code: Designing a School Absence System

PUBLISHED ON FEB 15, 2026 / 11 MIN READ


Part 1 of 4: Building Production-Grade Infrastructure


The 7:45 AM Problem

It’s 7:45 on a Tuesday morning. The school secretary’s phone rings. It’s Mrs. Johnson calling in sick—again. The secretary scribbles a note on a sticky pad, crosses off Mrs. Johnson’s name on the coverage spreadsheet, and starts making calls to find a substitute for periods 3, 5, and 7.

By 8:00 AM, three more teachers have called. Two sent emails. One left a voicemail. The principal needs to approve Mr. Smith’s vacation request from next week, but that conversation happened in the hallway yesterday and nobody documented it. The front desk has no idea which periods need coverage because half the notifications came via different channels.

Welcome to absence tracking at a typical small school.

This article series documents how we transformed this chaos into an automated, enterprise-grade system using Infrastructure as Code.


The Business Problem: Death by Sticky Notes

Current State Pain Points

Our school, like many small institutions, relied on a patchwork of manual processes:

For Teachers:

  • Call or email the front desk to report absences
  • No confirmation that the request was received
  • No visibility into approval status
  • Repeat the same information to multiple people

For the Principal:

  • Approval requests come via hallway conversations
  • No audit trail of who approved what
  • Can’t distinguish retroactive sick leave from planned vacations
  • No way to track patterns or generate reports

For the Front Desk:

  • Manually track absences in a spreadsheet
  • Phone calls interrupt other work
  • No standardized information collection
  • Coverage requirements unclear (which periods? full day?)
  • Reporting is impossible (“How many sick days did Mrs. Johnson take this semester?”)

The Real Cost:

  • ~10 hours per month of secretarial time on absence tracking
  • Missed coverage leading to classroom disruptions
  • Compliance risk (no audit trail for HR)
  • Principal approval bottlenecks

At $25/hour for administrative time, we’re spending $250/month on a process that could be automated.


Stakeholder Requirements

Through interviews with teachers, administrators, and front desk staff, we identified clear success criteria:

Teachers Need:

  • ✅ Submit absence request in under 2 minutes
  • ✅ Automatic confirmation email
  • ✅ No phone calls during planning period

Principal Needs:

  • ✅ Approve/deny planned absences with one click
  • ✅ Automatic approval for retroactive sick leave (already happened)
  • ✅ Audit trail for HR compliance

Front Desk Needs:

  • ✅ Know exactly which periods need coverage
  • ✅ Centralized tracking in Google Sheets (existing tool)
  • ✅ Email notifications for coverage requirements

IT (Me) Needs:

  • ✅ Secure, maintainable, cost-effective
  • ✅ No secret credentials to manage
  • ✅ Disaster recovery in under 30 minutes
  • ✅ 99.9% uptime (no more than about 44 minutes of downtime per month)

Why Infrastructure as Code?

Before diving into architecture, let’s address the fundamental question: Why build this with Infrastructure as Code instead of just clicking around the AWS console?

The Traditional Approach (What We Avoided)

The “ClickOps” Method:

  1. Log into AWS console
  2. Click through wizards to create VPC, subnets, security groups
  3. Launch EC2 instance, configure manually
  4. Set up load balancer by clicking through forms
  5. Screenshot settings for “documentation”
  6. Pray nothing breaks

Problems with This Approach:

  • “Works on my machine” syndrome: No guarantee staging matches production
  • Knowledge loss: When I leave, how does the next IT person recreate this?
  • No version control: Can’t track who changed what, when, or why
  • No rollback: If a change breaks things, good luck remembering what you clicked
  • Documentation drift: Screenshots go stale immediately
  • No peer review: Changes go live without oversight
  • Environment inconsistency: Staging is “kinda like” production

The IaC Approach (What We Built)

Infrastructure Defined in Code:

# This creates a VPC - same every time, reviewable, version controlled
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  
  tags = {
    Name = "absence-system-vpc"
  }
}

Benefits:

  • Reproducible: terraform apply creates identical infrastructure every time
  • Version controlled: Every change in Git with author, timestamp, reason
  • Peer reviewed: Pull requests catch issues before production
  • Self-documenting: Code IS the documentation (can’t drift)
  • Disaster recovery: Lost infrastructure? terraform apply rebuilds in minutes
  • Team collaboration: Multiple engineers can work without conflicts (state locking)
  • Environment parity: Dev, staging, and prod use the same code with different variables (see the sketch below)
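
That last point is worth making concrete. Here’s a minimal sketch of how one resource definition can serve every environment through variables; the variable names and tagging scheme are illustrative, not the project’s actual module interface:

# variables.tf - one set of definitions shared by every environment
variable "environment" {
  type        = string
  description = "Deployment environment: dev, staging, or prod"
}

variable "vpc_cidr" {
  type        = string
  description = "CIDR block for this environment's VPC"
  default     = "10.0.0.0/16"
}

# The same resource block serves every environment; only the values change,
# e.g. terraform apply -var-file=staging.tfvars
resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true

  tags = {
    Name        = "absence-system-vpc-${var.environment}"
    Environment = var.environment
  }
}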

Real-World Example: The 3 PM Incident

Two weeks after launch, our EC2 instance crashed at 3 PM—right when teachers submit afternoon absences.

Without IaC:

  1. Panic
  2. Try to remember: What instance type was it? Which AMI? What IAM role?
  3. Manually recreate from screenshots (45+ minutes)
  4. Cross fingers that it matches original config
  5. Hope nothing was forgotten

With IaC:

$ terraform apply
# 12 minutes later: identical infrastructure rebuilt
# Same configuration, same security, same monitoring
# Guaranteed.

Our actual downtime: 12 minutes. The manual approach would have taken 60+ minutes.

This incident alone justified the IaC investment.


Why This Matters for Schools

Small IT teams (often just 1-2 people) face unique challenges:

Limited Resources:

  • Can’t afford dedicated DevOps engineer
  • IaC multiplies effectiveness of small teams
  • Automation reduces toil

Budget Constraints:

  • Manual labor is expensive ($250/month in our case)
  • IaC enables self-service (teachers submit forms, no admin time)
  • Infrastructure costs are predictable and optimizable

Compliance Requirements:

  • Audit logs required for HR
  • No manual changes = no unauthorized modifications
  • Version control provides complete change history

Staff Turnover:

  • Knowledge captured in code, not in people’s heads
  • New IT staff can understand system by reading Terraform
  • Onboarding time reduced from weeks to days

Cost-Benefit Analysis:

  • Time investment in IaC: ~40 hours initial setup
  • Time saved per infrastructure change: ~2 hours (manual) → 10 minutes (code)
  • Break-even point: After ~20 changes (about 6 months)
  • ROI after 1 year: Massive (changes become trivial)

Architecture Design Decisions

Now that we understand why Infrastructure as Code, let’s examine how we designed the system.

Every architecture decision involves tradeoffs. Here’s our decision-making process:


Decision 1: Cloud Provider - AWS

Chosen: Amazon Web Services (AWS)

Why:

  • School already uses Google Workspace → needed integration
  • AWS has mature Gmail/Google Sheets SDK support
  • My team has AWS experience
  • Strong ecosystem (CloudFormation, Terraform support, extensive documentation)
  • Cost predictability with usage calculators

Alternatives Considered:

  • Azure: Excellent for Microsoft-heavy environments, but we’re Google Workspace
  • Google Cloud: Considered seriously (native Google integration), but AWS had a better Terraform module ecosystem
  • On-premises: Rejected immediately (hardware costs, maintenance, no HA)

Tradeoff:

  • AWS complexity vs. mature ecosystem
  • We chose maturity and community support
  • Accepted: Steeper learning curve for some AWS-specific services

Decision 2: Networking - Private Subnet Isolation

Chosen: VPC with public/private subnet architecture

Design:

Internet
    ↓
Internet Gateway
    ↓
Public Subnets (ALB, NAT Gateway)
    ↓
Private Subnet (EC2 - no public IP!)
    ↓
NAT Gateway (for outbound only)
    ↓
Internet (Gmail API, package updates)
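
In Terraform, that design boils down to a handful of resources. Here’s a trimmed-down sketch (CIDR ranges and names are illustrative; the full modules are covered in Part 2):

# Public subnet: hosts the ALB and the NAT Gateway
resource "aws_subnet" "public" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.1.0/24"
  map_public_ip_on_launch = true
}

# Private subnet: hosts the EC2 instance - no public IPs here
resource "aws_subnet" "private" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.10.0/24"
}

# The NAT Gateway lives in the public subnet and needs an Elastic IP
resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public.id
}

# Private route table: outbound traffic (Gmail API, package updates) leaves
# via the NAT Gateway; nothing can route in from the internet
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main.id
  }
}

resource "aws_route_table_association" "private" {
  subnet_id      = aws_subnet.private.id
  route_table_id = aws_route_table.private.id
}

The internet gateway, the public route table, and the second public subnet the ALB requires follow the same pattern.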

Why:

  • Defense in depth: EC2 instance has zero exposure to internet
  • Compliance: Meets security best practices for sensitive data
  • Attack surface reduction: Can’t SSH from internet even if port opened accidentally
  • Audit trail: All access via AWS Systems Manager Session Manager (logged)

Alternatives Considered:

  • Public subnet with SSH: Simple, but violates security principles
  • VPN access only: Considered, but overhead not justified for small team
  • Single subnet (public only): Rejected immediately (unacceptable risk)

Tradeoff:

  • Added complexity (NAT Gateway, routing tables) vs. security
  • Added cost (~$32/month for NAT Gateway) vs. peace of mind
  • We chose security. Worth every penny.

Interview talking point: “I implemented private subnet isolation because treating EC2 instances as internet-accessible by default is an anti-pattern. Defense in depth means assuming breach at every layer.”


Decision 3: HTTPS Everywhere with Custom Domain

Chosen: Application Load Balancer with SSL/TLS termination + custom domain

Design:

User visits: https://absences.smaschool.org
    ↓
ALB terminates SSL (port 443)
    ↓
Forwards plain HTTP to EC2 (port 5678)
    ↓
n8n responds
    ↓
ALB encrypts response, sends to user
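
A condensed Terraform sketch of that termination path; the health-check path is an assumption, and the ALB itself (aws_lb.main) is defined elsewhere:

# Certificate for the custom domain (DNS validation records omitted)
resource "aws_acm_certificate" "main" {
  domain_name       = "absences.smaschool.org"
  validation_method = "DNS"
}

# Target group pointing at n8n's port on the private instance
resource "aws_lb_target_group" "n8n" {
  name     = "absence-n8n"
  port     = 5678
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  health_check {
    path = "/healthz"
  }
}

# HTTPS listener: terminate TLS at the ALB, forward plain HTTP to n8n
resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.main.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = aws_acm_certificate.main.arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.n8n.arn
  }
}

The second listener, the one that redirects port 80 to 443, is omitted here for brevity.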

Why:

  • Trust: Teachers trust absences.smaschool.org more than random AWS URLs
  • Security: HTTPS required for OAuth (Gmail login)
  • Professionalism: Green padlock = legitimate school system
  • Simplicity: ALB handles SSL certificates, EC2 doesn’t need to

Alternatives Considered:

  • Direct EC2 with SSL: Rejected (certificate management burden, no HA)
  • CloudFront only: Considered, but ALB provides better health checks for our use case
  • HTTP only: Rejected immediately (unacceptable for authentication)

Tradeoff:

  • Cost (~$16/month for ALB) vs. professional appearance + security
  • Certificate management complexity vs. AWS Certificate Manager (ACM) automation
  • We chose professionalism. The custom domain alone increased teacher adoption.

Decision 4: Workflow Engine - n8n vs. Custom Code

Chosen: n8n (open-source workflow automation platform)

Why:

  • Visual workflows: Non-developers can understand and modify
  • Rapid iteration: Drag-and-drop beats coding for workflow changes
  • Built-in integrations: Gmail, Google Sheets, webhooks out-of-box
  • Lower maintenance: No custom code to debug or update

Alternatives Considered:

  • AWS Step Functions: Too expensive for our scale (~$25/month minimum)
  • Custom Lambda functions: Rejected (more code to maintain, slower iteration)
  • Zapier/Make: Rejected (vendor lock-in, recurring SaaS costs)
  • Pure code (Python/Node.js): Rejected (principal can’t modify workflow logic)

Tradeoff:

  • Running another service (Docker container) vs. development speed
  • n8n learning curve vs. Lambda familiarity
  • We chose speed. Workflow changes take minutes, not days.

Real-world benefit: When the principal requested adding “coverage period” tracking, I updated the n8n workflow in 15 minutes. With Lambda, this would have been a sprint story.
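
For context, here is roughly how n8n ends up on that instance: a minimal sketch of the EC2 resource with a Docker bootstrap in user_data, assuming Amazon Linux 2 and n8n’s default port. The real bootstrap does more (persistent storage, webhook URL, restart handling), and the AMI lookup, subnet, and security group names shown are illustrative:

# Most recent Amazon Linux 2 AMI
data "aws_ami" "amazon_linux_2" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}

resource "aws_instance" "n8n" {
  ami                    = data.aws_ami.amazon_linux_2.id
  instance_type          = "t3.micro"
  subnet_id              = aws_subnet.private.id
  vpc_security_group_ids = [aws_security_group.ec2.id]

  # Bootstrap: install Docker and start n8n on its default port (5678)
  user_data = <<-EOF
    #!/bin/bash
    yum update -y
    amazon-linux-extras install -y docker
    systemctl enable --now docker
    docker run -d --restart unless-stopped \
      -p 5678:5678 \
      -v n8n_data:/home/node/.n8n \
      n8nio/n8n
  EOF

  tags = {
    Name = "absence-system-n8n"
  }
}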


Decision 5: CI/CD - GitHub Actions with OIDC

Chosen: GitHub Actions with OpenID Connect (OIDC) authentication

Design:

Developer pushes to GitHub
    ↓
GitHub Actions triggers
    ↓
GitHub generates short-lived token
    ↓
AWS verifies token (OIDC)
    ↓
AWS issues temporary credentials (15 min)
    ↓
Deploy to S3, invalidate CloudFront
    ↓
Credentials expire automatically
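
The AWS half of that handshake is itself Terraform. Here’s a sketch of the OIDC provider and the deploy role’s trust policy that pins deployments to the main branch; the repository name is a placeholder, and the thumbprint is GitHub’s commonly published one, so verify it before use:

# Register GitHub's OIDC identity provider with AWS
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"]
}

# Role GitHub Actions assumes - only from the main branch of our repo
resource "aws_iam_role" "deploy" {
  name = "github-actions-deploy"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = "sts:AssumeRoleWithWebIdentity"
      Principal = {
        Federated = aws_iam_openid_connect_provider.github.arn
      }
      Condition = {
        StringEquals = {
          "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
          "token.actions.githubusercontent.com:sub" = "repo:example-org/absence-system:ref:refs/heads/main"
        }
      }
    }]
  })
}

On the GitHub side, the workflow assumes this role (typically with the aws-actions/configure-aws-credentials action), so no access keys ever live in the repository.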

Why:

  • Zero long-lived secrets: No AWS keys stored in GitHub
  • Automated deployment: Frontend changes go live in ~5 minutes
  • Audit trail: Every deployment tracked in GitHub Actions logs
  • Branch restrictions: Only main branch can deploy to production

Alternatives Considered:

  • AWS access keys in GitHub Secrets: Rejected (security anti-pattern)
  • Manual deployment: Rejected (slow, error-prone, doesn’t scale)
  • Jenkins self-hosted: Rejected (more infrastructure to manage)

Tradeoff:

  • Setup complexity (OIDC, IAM trust policies) vs. long-term security
  • We chose security. OIDC is industry best practice.

Interview talking point: “I use OIDC instead of long-lived credentials because credentials that can’t leak are credentials that won’t leak. The upfront complexity pays dividends in security posture.”


Decision 6: Monitoring - CloudWatch with SLO Tracking

Chosen: CloudWatch alarms with Service Level Objective (SLO) tracking

Design:

  • Error budget: 99.9% uptime = 43.8 minutes of downtime allowed per month (0.1% of the roughly 43,800 minutes in an average month)
  • SLO alarm: P95 latency < 2 seconds
  • Proactive alerting: Email when error budget consumption accelerates

Why:

  • Data-driven decisions: Know when we’re trending toward SLO breach
  • Prevents alert fatigue: SLO-based alarms vs. arbitrary thresholds
  • Business alignment: SLOs match what stakeholders care about (“Is the system fast enough?”)

Alternatives Considered:

  • Just uptime monitoring: Rejected (too coarse, misses degraded performance)
  • Third-party monitoring (Datadog): Rejected (cost not justified for our scale)
  • No monitoring: Rejected immediately (flying blind is malpractice)

Tradeoff:

  • Effort to define SLOs vs. ad-hoc “is it up?” checks
  • We chose rigor. SLOs provide confidence, not just hope.
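
Here’s what the latency SLO alarm looks like as a Terraform sketch: it watches the ALB’s p95 response time and notifies the SNS topic when the 2-second objective is breached. The evaluation window is illustrative, and aws_lb.main and aws_sns_topic.alerts are assumed to be defined elsewhere:

resource "aws_cloudwatch_metric_alarm" "p95_latency" {
  alarm_name          = "absence-system-p95-latency"
  alarm_description   = "P95 latency above the 2-second SLO"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "TargetResponseTime"
  extended_statistic  = "p95"
  period              = 300
  evaluation_periods  = 3
  threshold           = 2
  comparison_operator = "GreaterThanThreshold"

  dimensions = {
    LoadBalancer = aws_lb.main.arn_suffix
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}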

Final Architecture Overview

Here’s what we built:

High-Level System Flow

Teacher submits absence request
    ↓
Static HTML form (S3/CloudFront) - HTTPS
    ↓
Application Load Balancer - SSL termination
    ↓
EC2 instance (private subnet, no public IP)
    ↓
n8n workflow engine
    ↓
Decision: Past date or future date?
    ├─ Past → Auto-approve → Log to Google Sheets → Email confirmation
    └─ Future → Email principal for approval → Wait → Log → Notify

Tech Stack

Infrastructure Layer:

  • Infrastructure as Code: Terraform (7 modules, 30+ AWS resources)
  • Cloud Provider: AWS
  • Networking: Custom VPC (10.0.0.0/16), public/private subnets, NAT Gateway
  • Compute: EC2 t3.micro (Amazon Linux 2)
  • Load Balancing: Application Load Balancer with HTTPS
  • DNS/SSL: Route53 + AWS Certificate Manager (ACM)
  • Frontend: S3 + CloudFront (static HTML form)

Application Layer:

  • Workflow Engine: n8n (Docker container)
  • Integrations: Gmail (OAuth), Google Sheets (API)
  • Authentication: Google OAuth 2.0

DevOps Layer:

  • CI/CD: GitHub Actions with OIDC
  • Monitoring: CloudWatch (alarms, dashboards, SLO tracking)
  • Alerting: SNS → Email
  • State Management: S3 backend + DynamoDB locking
  • Version Control: Git (all code, workflows, documentation)

Security Layer:

  • Network: Private subnets, security groups (least privilege)
  • Access: IAM roles (no access keys), Systems Manager Session Manager
  • Encryption: HTTPS (in transit), S3 encryption (at rest)
  • Compliance: CloudTrail logs, audit trail in Google Sheets
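
Of the DevOps-layer items above, remote state is worth showing concretely, because it is what makes “multiple engineers without conflicts” real. A minimal backend configuration with S3 storage and DynamoDB locking (bucket and table names are placeholders):

terraform {
  backend "s3" {
    bucket         = "absence-system-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}

The DynamoDB table needs a string partition key named LockID; that one attribute is what prevents two people from applying changes at the same time.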

Architecture Diagram

System architecture diagram (auto-generated with the Python Diagrams library, because documentation that’s code can’t go stale)

Key architectural principles visible in the diagram:

  1. Defense in Depth: Multiple security layers (Internet Gateway → ALB → Security Groups → Private Subnet)
  2. Single Responsibility: Each component does one thing well (networking, compute, load balancing separate)
  3. Fail-Safe Defaults: Everything is blocked by default; only the traffic we need is explicitly allowed
  4. Least Privilege: Minimal IAM permissions, no SSH access, read-only where possible
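
Principles 3 and 4 are easiest to see in the security groups. Here’s a sketch of the EC2 instance’s group, which accepts traffic only from the ALB’s group and only on n8n’s port (the ALB group, aws_security_group.alb, is assumed to be defined alongside it):

resource "aws_security_group" "ec2" {
  name   = "absence-system-ec2"
  vpc_id = aws_vpc.main.id

  # Only the ALB's security group may reach n8n - no 0.0.0.0/0, no SSH
  ingress {
    from_port       = 5678
    to_port         = 5678
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  # Outbound allowed for Gmail API calls and package updates via the NAT Gateway
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

The ALB’s own group (not shown) only accepts ports 80 and 443 from the internet, mirroring the two listeners.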

Cost & Resource Summary

Monthly AWS Costs (estimated):

  • EC2 t3.micro: $7.50
  • Application Load Balancer: $16.20
  • NAT Gateway: $32.85
  • S3 + CloudFront: $1.50
  • Other (Route53, CloudWatch, etc.): $7.00
  • Total: ~$65/month

Resources Created:

  • 1 VPC with 3 subnets across 2 availability zones
  • 1 Internet Gateway
  • 1 NAT Gateway with Elastic IP
  • 2 Security Groups (ALB, EC2)
  • 1 EC2 instance (t3.micro)
  • 1 Application Load Balancer
  • 1 Target Group with health checks
  • 2 ALB listeners (HTTP → HTTPS redirect, HTTPS)
  • 1 S3 bucket + CloudFront distribution
  • 1 ACM certificate
  • 3 CloudWatch alarms
  • 1 SNS topic
  • IAM roles and policies

Development Time:

  • Initial setup: ~40 hours
  • Iterations and refinements: ~20 hours
  • Total: ~60 hours over 3 weeks

ROI Calculation:

  • Cost: $65/month infrastructure + $150/month amortized development time (first year)
  • Savings: $250/month in administrative labor
  • Net savings: $35/month (positive ROI from day one)
  • Year 2+: $250/month savings (development costs fully amortized)

What’s Next

We’ve explored the why (business problem, IaC benefits) and the what (architecture decisions, tradeoffs).

In Part 2, we’ll dive deep into the how:

  • Terraform module design philosophy
  • VPC networking implementation (with actual code)
  • HTTPS setup with ACM and ALB
  • Security group least-privilege implementation
  • Remote state configuration and team collaboration

You’ll see:

  • Real Terraform code with explanations
  • Common pitfalls and how to avoid them
  • Interview-worthy talking points for every decision
  • Production-ready patterns you can use tomorrow

Follow along: All code is available on GitHub


Key Takeaways

  1. Real problems drive better learning than toy examples
  2. Infrastructure as Code isn’t just for large teams - small organizations benefit even more
  3. Architecture is tradeoffs - document your reasoning, not just your choices
  4. Security and monitoring should be designed in, not retrofitted
  5. Start simple, iterate - we didn’t build everything on day one

This is Part 1 of a 4-part series on building production-grade infrastructure. Part 2: Building Secure, Modular Infrastructure with Terraform dives into the implementation details.

Questions or feedback? Connect with me on LinkedIn or check out the GitHub repo.


Tags: #DevOps #InfrastructureAsCode #Terraform #AWS #CloudArchitecture #SRE #SystemDesign #CaseStudy