AWS Compute Best Practices: The Complete Well-Architected Guide

Master EC2, Lambda, and ECS/EKS compute architectures. Learn right-sizing, auto-scaling, serverless patterns, and cost optimization strategies aligned with AWS Well-Architected Framework.

8 min read
By CloudBridgeHub

Technical TL;DR


Compute typically drives 60-70% of an AWS bill. Key takeaways:

  • **Right-size instances** using measured utilization data (not guesses)
  • **Auto-scaling is mandatory** for production workloads
  • **Serverless first** for event-driven, spiky, or infrequent workloads
  • **Use Graviton** for 20-40% cost savings on Linux workloads
  • **Multi-AZ deployment** for HA, **Warm Standby** for DR

---


    1. Choose the Right Compute Service


    AWS offers multiple compute services. Selecting the wrong one costs money and creates operational overhead.


    Decision Framework


| Use Case | Best Service | Why |
|----------|--------------|-----|
| **Web servers with steady traffic** | EC2 + Auto Scaling | Predictable performance, full OS control |
| **Event-driven tasks (API triggers, file processing)** | Lambda | Pay-per-use, zero provisioning, auto-scales |
| **Containerized microservices** | ECS/Fargate | Managed containers, no cluster management |
| **Kubernetes workloads with complex orchestration** | EKS | Kubernetes consistency, hybrid portability |
| **Batch jobs, data processing** | Lambda or AWS Batch | Fault-tolerant Spot pricing, pay-per-job |


    Anti-Patterns to Avoid


  • Using EC2 for sporadic workloads (paying for idle time)
  • Using Lambda for long-running processes (the 15-minute maximum timeout will kill them)
  • Self-managing containers on EC2 when Fargate fits (unnecessary operational burden)
  • Single-AZ deployments (violates HA best practices)

---


    2. EC2 Best Practices


    2.1 Right-Size Your Instances


    Never default to large instance types. Most workloads run efficiently on smaller instances than expected.


    Sizing Methodology:

    1. Benchmark in non-production with CloudWatch metrics

    2. Target 70-80% CPU utilization at peak

3. Use memory-optimized instances only when profiling confirms the need

    4. Consider burstable instances (T3/T4g) for dev/test


    Tools:

  • AWS Compute Optimizer (ML-driven recommendations)
  • Trusted Advisor (underutilized instances)
  • CloudWatch Insights (performance baselines)
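The sizing methodology above boils down to simple arithmetic. Here is an illustrative helper (not an AWS API; the function name and example figures are made up): given a current vCPU count and the peak CPU observed in CloudWatch, estimate the smallest vCPU count that keeps peak utilization inside the 70-80% target band.

```python
import math

def right_size_vcpus(current_vcpus: int, peak_cpu_pct: float,
                     target_pct: float = 75.0) -> int:
    """Smallest whole vCPU count that keeps the observed peak at or
    below the target utilization band."""
    # Peak demand expressed in vCPU-equivalents
    demand = current_vcpus * (peak_cpu_pct / 100.0)
    # Choose capacity so that demand / capacity <= target_pct / 100
    return max(1, math.ceil(demand / (target_pct / 100.0)))

# An 8-vCPU m5.2xlarge peaking at 22% CPU needs only ~3 vCPUs,
# so a 4-vCPU instance is plenty:
print(right_size_vcpus(8, 22.0))  # -> 3
```

Feed it CloudWatch peak numbers from a representative load period, not an idle weekend.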

2.2 Implement Auto Scaling Groups


    Auto Scaling Groups (ASGs) are non-negotiable for production EC2.


    Required Configuration:

    ```yaml

    Minimum: 2 instances # HA across AZs

    Maximum: 10-20x baseline # Handle traffic spikes

    Desired: Match steady-state load

    Health Check: ELB + EC2

    Scaling Policies:

    - Target tracking: 70% CPU

    - Step scaling: +1 instance per 10% CPU above threshold

    - Scheduled scaling: Predictable patterns (business hours)

    ```


    Advanced Patterns:

  • Use **Launch Templates** for immutable infrastructure
  • Enable **Instance Refresh** for rolling deployments
  • Implement **Scale-in protection** for critical instances
  • Set **Warm pools** for rapid scale-out (pre-provisioned instances)
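As a concrete sketch, the required configuration and target-tracking policy above might look like this in CloudFormation (resource names, subnet IDs, and the launch template reference are placeholders):

```yaml
WebAsg:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: "2"                      # HA baseline across AZs
    MaxSize: "20"                     # headroom for traffic spikes
    DesiredCapacity: "2"
    VPCZoneIdentifier:                # subnets in at least two AZs
      - subnet-aaaa1111
      - subnet-bbbb2222
    HealthCheckType: ELB              # replace instances that fail LB checks
    HealthCheckGracePeriod: 120
    LaunchTemplate:
      LaunchTemplateId: !Ref WebLaunchTemplate
      Version: !GetAtt WebLaunchTemplate.LatestVersionNumber

CpuTargetPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AutoScalingGroupName: !Ref WebAsg
    PolicyType: TargetTrackingScaling
    TargetTrackingConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ASGAverageCPUUtilization
      TargetValue: 70.0               # the 70% CPU target above
```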

2.3 Leverage Spot Instances


    Spot instances offer up to **90% savings** for fault-tolerant workloads.


    Ideal Workloads:

  • Batch processing, CI/CD, data analysis
  • Stateless microservices with multiple instances
  • Container workloads with quick start/stop

Pattern: Spot Fleet + On-Demand Base

    ```yaml

    Spot Allocation Strategy: capacity-optimized

    On-Demand Base: 30% # Maintain minimum capacity

    Spot Percentage: 70% # Maximize savings

    Instance Pools: 10+ # Diversify for interruption resilience

    ```
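The 30/70 split above is easy to sanity-check with back-of-envelope math. An illustrative helper, assuming a uniform Spot discount across pools (real Spot prices vary by pool and over time):

```python
def blended_hourly_cost(on_demand_price: float, spot_discount: float,
                        od_share: float = 0.30, spot_share: float = 0.70) -> float:
    """Blended per-instance-hour cost for a mixed On-Demand/Spot fleet."""
    spot_price = on_demand_price * (1.0 - spot_discount)
    return od_share * on_demand_price + spot_share * spot_price

# An m5.large-class instance at $0.096/hr On-Demand with a 70% Spot discount:
cost = blended_hourly_cost(0.096, 0.70)
savings = 1.0 - cost / 0.096
print(f"${cost:.4f}/hr, {savings:.0%} saved")  # -> $0.0490/hr, 49% saved
```

Even with a 30% On-Demand floor for capacity assurance, the fleet roughly halves its compute cost.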


    2.4 Use AWS Graviton Processors


    Graviton (ARM-based) instances deliver **20-40% better price-performance** for most Linux workloads.


    Migration Path:

    1. Test workloads on r6g, c6g, m6g instances

    2. Verify ARM compatibility (most Linux apps work out-of-box)

    3. Rebuild or recompile if needed (minimal effort for most)


    When NOT to use Graviton:

  • Windows workloads (not supported)
  • Legacy x86 dependencies
  • Proprietary software without ARM support

---


    3. Lambda Best Practices


    3.1 Lambda Anti-Patterns


| Anti-Pattern | Why It's Problematic |
|--------------|---------------------|
| **Monolithic functions** (500+ lines) | Hard to test, slower cold starts, timeout risk |
| **Synchronous orchestration** | Chained functions accumulate latency |
| **Ignoring that CPU scales with memory** | A 256 MB function gets a sliver of a vCPU; compute-bound code crawls |
| **No dead-letter queue** | Failed async events are lost forever |
| **Provisioned Concurrency for everything** | Defeats the pay-per-use economics of serverless |


    3.2 Memory Configuration


    Rule: Memory is a performance dial. CPU, network, and disk all scale with memory.


    Optimal Sizing:

    ```python

    # Use AWS Lambda Power Tuning to find optimal memory

    # Most functions perform best at 1024-1792 MB

    # Cost sweet spot: 1024-1536 MB

    ```


    Power Tuning Pattern:

    1. Deploy with 128 MB, test execution time

    2. Increase to 1024 MB, test again

    3. Find inflection point where time decrease < cost increase

    4. Set to optimal memory (usually 1024-1792 MB)
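The loop above amounts to minimizing cost = duration × memory × price per GB-second. A hypothetical sketch using the published x86 Lambda compute price (confirm current pricing for your region); the benchmark figures are made-up sample data:

```python
PRICE_PER_GB_SECOND = 0.0000166667  # x86 Lambda compute price; verify for your region

def invocation_cost(memory_mb: int, duration_ms: float) -> float:
    """Compute cost of one invocation (ignores the flat per-request fee)."""
    gb_seconds = (memory_mb / 1024.0) * (duration_ms / 1000.0)
    return gb_seconds * PRICE_PER_GB_SECOND

def cheapest_config(measurements: dict) -> int:
    """measurements maps memory_mb -> observed average duration_ms."""
    return min(measurements, key=lambda m: invocation_cost(m, measurements[m]))

# Sample benchmark data: runtime stops improving much past 1024 MB,
# so extra memory only adds cost.
runs = {128: 12000.0, 512: 2400.0, 1024: 900.0, 1792: 620.0}
print(cheapest_config(runs))  # -> 1024
```

AWS Lambda Power Tuning automates exactly this sweep as a Step Functions state machine.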


    3.3 Control Cold Starts


    Cold starts add latency when Lambda scales out.


    Mitigation Strategies:

    ```yaml

    1. Keep deployment packages small (<50 MB zip)

    2. Minimize layers and dependencies

    3. Use Provisioned Concurrency for critical paths

    4. Implement keep-alive warming (scheduled pings)

5. Prefer fast-initializing runtimes (Python, Go, Node.js) over cold-start-heavy ones (e.g., JVM without SnapStart)

    ```


    Package Optimization:

  • Use AWS SDK v3 (modular, import only needed services)
  • Bundle only required dependencies
  • Consider Lambda container images for complex deps

3.4 Event-Driven Architecture


    Lambda excels when triggered by events, not invoked synchronously.


    Recommended Event Sources:

  • S3 (object uploads trigger processing)
  • SNS/SQS (decoupled async messaging)
  • EventBridge (cross-service event routing)
  • DynamoDB Streams (database changes trigger logic)

Pattern: SQS Buffer for Throttling

    ```yaml

    API Gateway -> Lambda -> SQS -> Lambda Worker

    ```

    Prevents throttling, enables retry logic, decouples producers/consumers.
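A minimal worker sketch for this pattern, assuming the SQS event source mapping has `ReportBatchItemFailures` enabled so only failed messages return to the queue; `process_record` is a stand-in for real business logic:

```python
import json

def process_record(body: dict) -> None:
    # Stand-in for real business logic; fails on demand for illustration.
    if body.get("fail"):
        raise ValueError("simulated processing failure")

def handler(event, context):
    """SQS-triggered Lambda worker reporting partial batch failures."""
    failures = []
    for record in event.get("Records", []):
        try:
            process_record(json.loads(record["body"]))
        except Exception:
            # Only this message is retried (and eventually lands in the DLQ)
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}

event = {"Records": [
    {"messageId": "m1", "body": json.dumps({"orderId": 1})},
    {"messageId": "m2", "body": json.dumps({"fail": True})},
]}
print(handler(event, None))  # -> {'batchItemFailures': [{'itemIdentifier': 'm2'}]}
```

Without partial batch responses, one poison message forces the whole batch to be retried.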


    ---


    4. Container Best Practices (ECS/EKS)


    4.1 ECS vs. EKS Decision


| Factor | ECS/Fargate | EKS |
|--------|-------------|-----|
| **Complexity** | Low (AWS-managed) | High (you operate Kubernetes itself, even though AWS manages the control plane) |
| **Startup Time** | Seconds | Minutes |
| **Cost** | Higher per vCPU | Lower at scale |
| **Portability** | AWS-only | Kubernetes everywhere |
| **Use Case** | Simple containerized apps | Complex orchestration needs |


    4.2 Fargate Best Practices


    Fargate eliminates server management. Use it unless you need custom kernel modules.


    Configuration:

    ```yaml

    Task CPU: Match application requirements

    Task Memory: Include application + container overhead

    Task Role: Least-privilege IAM per task

    Network Mode: awsvpc (enables security groups)

    ```
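A minimal CloudFormation sketch of such a Fargate task definition (account ID, image, and role references are placeholders):

```yaml
ApiTask:
  Type: AWS::ECS::TaskDefinition
  Properties:
    RequiresCompatibilities: [FARGATE]
    NetworkMode: awsvpc               # per-task ENI + security groups
    Cpu: "512"                        # 0.5 vCPU
    Memory: "1024"                    # MiB; application plus container overhead
    TaskRoleArn: !Ref ApiTaskRole     # least-privilege IAM per task
    ExecutionRoleArn: !Ref TaskExecutionRole
    ContainerDefinitions:
      - Name: api
        Image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/api:latest
        PortMappings:
          - ContainerPort: 8080
```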


    Cost Optimization:

  • Use Fargate Spot for dev/test and fault-tolerant prod
  • Right-size tasks (most oversize by 2x)
  • Enable autoscaling based on CPU/memory metrics

4.3 EKS Best Practices


    EKS is for teams committed to Kubernetes with complex orchestration needs.


    Cluster Design:

    ```yaml

    Managed Node Groups: Prefer over self-managed

    Cluster Autoscaler: Required for cost efficiency

    Multiple AZs: Required for HA

Pod Disruption Budgets: Limit voluntary disruptions during updates

    Horizontal Pod Autoscaler: Scale pods based on demand

    ```
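For the Horizontal Pod Autoscaler line above, a minimal `autoscaling/v2` manifest might look like this (the Deployment name and replica bounds are placeholders):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2                      # matches the multi-AZ HA baseline
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70      # add pods at 70% average CPU
```

Pair it with the Cluster Autoscaler so new pods also get nodes to land on.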


    Cost Control:

  • Use Spot instances via node selectors
  • Implement cluster autoscaler (scale down unused nodes)
  • Set resource limits on all pods (prevent runaway costs)
  • Use EKS for orchestrating, not for simple container hosting

---


    5. High Availability & Disaster Recovery


    5.1 Multi-AZ Deployments


    Requirement: All production workloads must span multiple Availability Zones.


    Implementation:

    ```yaml

EC2: ASG with subnets in multiple AZs

    Lambda: Regional service (automatic AZ redundancy)

    ECS: Tasks distributed across AZs

    EKS: Nodes in multiple AZs, pod anti-affinity

    ```


    5.2 Disaster Recovery Patterns


| RTO/RPO | Strategy | Cost | Complexity |
|---------|----------|------|------------|
| **Minutes / Near-zero data loss** | Multi-Region Active-Active | High | High |
| **Tens of minutes to hours / Minimal loss** | Warm Standby or Pilot Light | Medium | Medium |
| **Hours to days / Up to 24 hr loss** | Backup & Restore | Low | Low |


    Pilot Light Implementation:

    ```yaml

    Primary Region: Full production stack

    DR Region:

    - Minimal resources (single AZ, small instances)

    - Automated DNS failover (Route 53 health checks)

    - Database read replica (promote to primary)

    - S3 cross-region replication

    ```
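The automated DNS failover line can be sketched as Route 53 failover alias records (hosted zone, record name, and load balancer references are placeholders; in a real setup the DR region's values would come from a separate stack):

```yaml
PrimaryRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    HostedZoneId: !Ref AppZone
    Name: app.example.com.
    Type: A
    SetIdentifier: primary
    Failover: PRIMARY
    HealthCheckId: !Ref PrimaryHealthCheck   # Route 53 health check on the primary
    AliasTarget:
      DNSName: !GetAtt PrimaryAlb.DNSName
      HostedZoneId: !GetAtt PrimaryAlb.CanonicalHostedZoneID

SecondaryRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    HostedZoneId: !Ref AppZone
    Name: app.example.com.
    Type: A
    SetIdentifier: secondary
    Failover: SECONDARY                      # served only while the primary is unhealthy
    AliasTarget:
      DNSName: !GetAtt DrAlb.DNSName
      HostedZoneId: !GetAtt DrAlb.CanonicalHostedZoneID
```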


    ---


    6. Cost Optimization Checklist


    Immediate Actions (Week 1)

  • [ ] Run Compute Optimizer, apply recommendations
  • [ ] Enable Auto Scaling on all EC2 ASGs
  • [ ] Implement Spot Fleet for fault-tolerant workloads
  • [ ] Test Graviton instances for Linux workloads

Short-Term (Month 1)

  • [ ] Audit Lambda memory, optimize with Power Tuning
  • [ ] Migrate dev/test to Fargate Spot or EC2 Spot
  • [ ] Implement Scheduled Scaling for predictable workloads
  • [ ] Set up budget alerts via AWS Budgets and Cost Explorer

Long-Term (Quarter 1)

  • [ ] Architect for serverless where applicable
  • [ ] Implement multi-AZ for all production
  • [ ] Document DR runbook, test quarterly
  • [ ] Establish compute governance (tagging, policies)

---


    7. Monitoring & Observability


    Required Metrics

    ```yaml

    Compute Metrics:

    - CPU/Memory Utilization (CloudWatch)

    - Network In/Out (bottleneck detection)

    - Lambda errors and durations

    - ASG scaling events


    Alerting:

    - CPU > 80% for 5 minutes

    - Memory > 85% for 5 minutes

    - Lambda error rate > 1%

    - ASG at max capacity

    ```
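The "CPU > 80% for 5 minutes" alert above could be expressed as a CloudWatch alarm in CloudFormation (the ASG and SNS topic references are placeholders):

```yaml
HighCpuAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Average CPU above 80% for 5 minutes
    Namespace: AWS/EC2
    MetricName: CPUUtilization
    Dimensions:
      - Name: AutoScalingGroupName
        Value: !Ref WebAsg
    Statistic: Average
    Period: 300                       # one 5-minute period
    EvaluationPeriods: 1
    Threshold: 80
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - !Ref AlertTopic               # SNS topic for notifications
```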


    Recommended Tools

  • **CloudWatch dashboards:** Unified metrics view
  • **X-Ray:** Distributed tracing for microservices
  • **Compute Optimizer:** Continuous optimization
  • **Trusted Advisor:** Weekly performance checks

---


    8. Security Best Practices


    Compute Security Checklist

    ```yaml

    EC2:

    ☐ IMDSv2 required (prevent SSRF attacks)

    ☐ Security groups restrict ingress/egress

    ☐ IAM roles, never access keys

    ☐ AWS Systems Manager Session Manager (no SSH keys)


    Lambda:

    ☐ Least-privilege execution roles

    ☐ VPC configuration for private resources

☐ Secrets from Secrets Manager / Parameter Store (no plaintext in environment variables)

    ☐ Code signing in production


    Containers:

    ☐ Scan images for vulnerabilities (Amazon ECR)

    ☐ Run as non-root user

    ☐ Read-only root filesystems

    ☐ Secrets via AWS Secrets Manager (not env vars)

    ```
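The IMDSv2 checkbox, for example, can be enforced at the launch-template level. A hedged CloudFormation sketch (the resource name is a placeholder):

```yaml
WebLaunchTemplate:
  Type: AWS::EC2::LaunchTemplate
  Properties:
    LaunchTemplateData:
      MetadataOptions:
        HttpTokens: required          # IMDSv2 only; blocks SSRF-style credential theft
        HttpPutResponseHopLimit: 1    # keep metadata access on the instance itself
```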


    ---


    Summary: Compute Excellence Pillars


    1. **Right-size everything** (use data, not guesses)

    2. **Auto-scale or go serverless** (manual scaling is obsolete)

    3. **Leverage Spot/Graviton** (20-90% savings)

    4. **Multi-AZ by default** (HA is non-negotiable)

    5. **Monitor continuously** (you can't optimize what you don't measure)


    ---


    Need Help Architecting Your Compute?


    Our AWS-certified solutions architects can design scalable, cost-optimized compute architectures tailored to your workload patterns.


[Schedule a Free Architecture Review →](/contact)


Related Reading:

  • For **storage**, see [S3 + EBS Architecture Patterns](/blog/aws-storage-best-practices)
  • For **security**, refer to [Zero Trust Network Design](/blog/aws-security-best-practices)
  • For **databases**, explore [Database Selection Guide](/blog/aws-database-best-practices)

---


    *Last updated: January 5, 2025*

