Part 4 of 4: Building Production-Grade Infrastructure
In Parts 1-3, we built a production-grade absence tracking system with Terraform, monitoring, and CI/CD. The code worked. The architecture was sound. The deployment succeeded.
Then we turned it on.
This final article covers what actually happened in production: real costs, unexpected problems, decisions we’d change, and lessons that only come from running systems at scale.
No perfect code here. Just honest reflections.
| Service | Monthly Cost | % of Total |
|---|---|---|
| NAT Gateway | $32.85 | 50% |
| Application Load Balancer | $16.20 | 25% |
| EC2 t3.micro | $7.50 | 12% |
| Everything else | $8.45 | 13% |
The shock: Half our infrastructure budget is one service.
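As a quick sanity check, the table above can be recomputed from its line items (a minimal sketch; the figures are the ones reported):

```python
# Monthly costs from the table above (USD)
costs = {
    "NAT Gateway": 32.85,
    "Application Load Balancer": 16.20,
    "EC2 t3.micro": 7.50,
    "Everything else": 8.45,
}

total = sum(costs.values())
shares = {name: cost / total for name, cost in costs.items()}

print(f"Total: ${total:.2f}")                     # Total: $65.00
print(f"NAT share: {shares['NAT Gateway']:.0%}")  # just over half the bill
```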
NAT Gateway enables our EC2 instance (in a private subnet) to reach the internet for Gmail API calls and package updates. It’s essential for our security model, but expensive.
Optimization we implemented: changed EC2 from on-demand to a Reserved Instance.
Optimization we’re considering: replacing the NAT Gateway with a NAT instance (t3.nano).
- Total 5-year cost:
- Total 5-year savings:
- Net value: +$14,100 over 5 years
- Break-even: Month 11
Lesson: Infrastructure as Code has upfront costs but pays dividends. The ROI is in not spending time on manual work.
The error:
```
Error: Cycle: module.compute, module.load_balancer
```
What happened:
Our fix:
```hcl
# Made ALB DNS optional
module "compute" {
  webhook_url = var.webhook_url # Defaults to ""
}

# n8n auto-detects hostname from HTTP headers
# Workaround: works, but not ideal
```
Better solution (production):
Lesson: Separate infrastructure provisioning from application configuration. Terraform builds, Ansible configures.
The error:
```
GitHub Actions Error: Not authorized to perform sts:AssumeRoleWithWebIdentity
```
Root cause: the Terraform variable had the wrong GitHub repo name (`staff-absence-request-system` vs `absence-system-infrastructure`).

How we debugged:
```shell
# Checked the role's trust policy
aws iam get-role --role-name absence-system-github-actions \
  --query "Role.AssumeRolePolicyDocument.Statement[0].Condition"
# Found mismatch in repo name
```
The fix: Updated Terraform variable, redeployed IAM role
Lesson: IAM errors are usually configuration mismatches, not AWS problems. Check: Does resource exist? → Does policy allow? → Does identity match?
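The mismatch check itself is just string comparison. GitHub's OIDC trust policies pin the token's `sub` claim to `repo:<org>/<repo>:*`; a minimal sketch (the org name here is hypothetical):

```python
def expected_sub_claim(github_repo: str) -> str:
    """Build the sub-claim pattern an OIDC trust policy pins to one repo."""
    return f"repo:{github_repo}:*"

# Hypothetical org name, illustrating the repo-name mismatch we hit
configured = expected_sub_claim("example-org/staff-absence-request-system")
actual = expected_sub_claim("example-org/absence-system-infrastructure")

print(configured == actual)  # False: the repo names don't match
```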
The problem: `terraform apply` appeared to hang for 15 minutes during CloudFront creation.
What was actually happening: CloudFront legitimately takes 10-20 minutes to deploy to edge locations worldwide.
Our solution: Set expectations
```hcl
output "deployment_notice" {
  value = <<-EOT
    ⏳ CloudFront deploys to ~400 edge locations.
    Takes 10-20 minutes - this is normal!
    Monitor: https://console.aws.amazon.com/cloudfront/
  EOT
}
```
Lesson: Some AWS services are slow by design. Document this so teammates don’t panic.
The error:
```
docker: Error response from daemon:
EACCES: permission denied, open '/home/node/.n8n/config'
```
Root cause: the host directory was owned by root, but the container runs as user `node` (UID 1000)
Our workaround (demo):
```shell
# Don't mount a volume - acceptable for a POC
docker run -d --name n8n -p 5678:5678 n8nio/n8n
```
Production solution:
```hcl
# Use AWS EFS with proper IAM
resource "aws_efs_file_system" "n8n" {
  encrypted = true
}

# Mount in EC2, let AWS handle permissions
```
Lesson: Docker volume permissions are tricky. Use managed storage (EFS, EBS) for production.
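The underlying failure is plain POSIX permissions: UID 1000 has no write bit on a root-owned directory. A minimal sketch of the check the container's process effectively fails (the helper is hypothetical, for illustration):

```python
import os
import stat

def uid_can_write(path: str, uid: int, gid: int) -> bool:
    """Approximate the permission check that fails inside the container."""
    st = os.stat(path)
    if st.st_uid == uid:
        return bool(st.st_mode & stat.S_IWUSR)
    if st.st_gid == gid:
        return bool(st.st_mode & stat.S_IWGRP)
    return bool(st.st_mode & stat.S_IWOTH)

# A root-owned 0700 directory is unwritable for the container's UID 1000 -
# exactly the EACCES n8n hit on /home/node/.n8n
```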
Current: EC2 user data script

Problem: changes require instance replacement
Better:
```yaml
# ansible/playbook.yml
- hosts: n8n_servers
  roles:
    - docker
    - n8n
    - cloudwatch_agent
```
Benefits:
Current: Single production environment
Better:
```
environments/
├── dev/terraform.tfvars
├── staging/terraform.tfvars
└── prod/terraform.tfvars
```
Benefits:
Should have created:
```hcl
resource "aws_budgets_budget" "monthly" {
  budget_type  = "COST"
  limit_amount = "100"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80 # Alert at $80 (80% of the limit)
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["ops@example.com"] # replace with your address
  }
}
```
Why: First bill was $82 (over our $65 budget). Alert would have caught it.
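The alert math, made explicit (a minimal sketch of how a percentage threshold on an ACTUAL-cost budget resolves):

```python
def alert_fires(actual_spend: float, limit: float, threshold_pct: float) -> bool:
    """ACTUAL-type budget alert: fire once spend crosses the threshold."""
    return actual_spend >= limit * threshold_pct / 100

print(alert_fires(82.0, limit=100.0, threshold_pct=80))  # True: $82 crossed the $80 threshold
```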
Current: OAuth tokens in n8n (in memory)
Better:
```hcl
resource "aws_secretsmanager_secret" "gmail_oauth" {
  name = "${var.project_name}/gmail/oauth"
}
```

```shell
# EC2 retrieves at runtime
aws secretsmanager get-secret-value --secret-id ...
```
Why: Secrets Manager provides centralized rotation, audit logging, and IAM access control.
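On the instance side, retrieval is one API call. A minimal sketch (the client is injected so the parsing is testable without AWS credentials; the secret's JSON layout is an assumption):

```python
import json

def load_gmail_oauth(secrets_client, secret_id: str) -> dict:
    """Fetch and parse an OAuth token bundle from Secrets Manager.

    secrets_client is a boto3 "secretsmanager" client; injecting it
    keeps this function testable without real AWS access.
    """
    resp = secrets_client.get_secret_value(SecretId=secret_id)
    return json.loads(resp["SecretString"])

# Production usage (requires boto3 and instance-role permissions):
#   import boto3
#   tokens = load_gmail_oauth(boto3.client("secretsmanager"),
#                             "absence-system/gmail/oauth")
```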
Our timeline:
Lesson: Ship working software, improve incrementally. Perfect is the enemy of shipped.
Things we did right from day one:
Things we added later (should have been day one):
Lesson: Security debt compounds. Build it in from the start.
We set SLOs before launch:
First week: One alarm fired during AWS maintenance window. Caught issue before users noticed.
Lesson: You can’t improve what you don’t measure. Observability enables confidence.
What worked:
What didn’t:
Fix:
```shell
# Pre-commit hook
terraform-docs markdown table . > README.md
```
Lesson: If it’s not code, it will drift. Automate everything.
- Month 1: $85 (no optimization)
- Month 2: $65 (identified NAT Gateway as 50% of bill)
- Month 3: evaluating Reserved Instances and NAT alternatives
Lesson: Monitor costs like performance. Set budgets. Review monthly.
Every decision involved tradeoffs:
| Decision | Trade-off |
|---|---|
| NAT Gateway vs Instance | Reliability vs cost |
| Self-hosted n8n vs Zapier | Control vs convenience |
| OIDC vs AWS keys | Security vs setup time |
| Private subnet vs public EC2 | Security vs simplicity |
Lesson: No perfect answers. Document your reasoning.
Our strategy:
Lesson: You can’t test everything in staging. Production is the real test. Build safety nets.
- Initial investment: 60 hours
- Time saved per change:
- Break-even: 30 changes (~6 months for us)
After 1 year: Changes are trivial. Massive ROI.
Lesson: IaC feels slow at first. Stick with it. The payoff is real.
Surveyed 48 teachers:
Positive:
Negative:
We went from:
The bigger lesson:
Small teams with limited budgets can build enterprise-grade systems. You don’t need a 10-person team or $100K budget. You need:
This wasn’t just about absence tracking. It was about proving that DevOps/SRE practices work at any scale.
The Complete Series:
Connect:
If this series helped you: Star the repo, share your story, pay it forward.
Thank you for reading. Now go build something amazing. 🚀
Tags: #DevOps #LessonsLearned #Infrastructure #AWS #Terraform #CostOptimization #RealWorld