AWS Compute Best Practices: The Complete Well-Architected Guide

Master EC2, Lambda, and ECS/EKS compute architectures. Learn right-sizing, auto-scaling, serverless patterns, and cost optimization strategies aligned with AWS Well-Architected Framework.

8 min read
By CloudBridgeHub

Technical TL;DR


Compute typically drives 60-70% of an AWS bill. Key takeaways:

  • **Right-size instances** using measured utilization data (not guesses)
  • **Auto-scaling is mandatory** for production workloads
  • **Serverless first** for event-driven, spiky, or infrequent workloads
  • **Use Graviton** for 20-40% cost savings on Linux workloads
  • **Multi-AZ deployment** for HA, **Warm Standby** for DR

---


    1. Choose the Right Compute Service


    AWS offers multiple compute services. Selecting the wrong one costs money and creates operational overhead.


    Decision Framework


| Use Case | Best Service | Why |
|----------|--------------|-----|
| **Web servers with steady traffic** | EC2 + Auto Scaling | Predictable performance, full OS control |
| **Event-driven tasks (API triggers, file processing)** | Lambda | Pay-per-use, zero provisioning, auto-scales |
| **Containerized microservices** | ECS/Fargate | Managed containers, no cluster management |
| **Kubernetes workloads with complex orchestration** | EKS | Kubernetes consistency, hybrid portability |
| **Batch jobs, data processing** | Lambda or AWS Batch | Fault-tolerant Spot pricing, pay-per-job |


    Anti-Patterns to Avoid


  • Using EC2 for sporadic workloads (paying for idle time)
  • Using Lambda for long-running processes (the 15-minute maximum timeout will kill them)
  • Self-managing containers on EC2 when Fargate fits (unnecessary operational burden)
  • Single-AZ deployments (violates HA best practices)

---


    2. EC2 Best Practices


    2.1 Right-Size Your Instances


    Never default to large instance types. Most workloads run efficiently on smaller instances than expected.


    Sizing Methodology:

    1. Benchmark in non-production with CloudWatch metrics

    2. Target 70-80% CPU utilization at peak

3. Use memory-optimized instances only when profiling confirms the need

    4. Consider burstable instances (T3/T4g) for dev/test


    Tools:

  • AWS Compute Optimizer (ML-driven recommendations)
  • Trusted Advisor (underutilized instances)
  • CloudWatch Insights (performance baselines)
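The sizing methodology above boils down to simple arithmetic. Here is an illustrative helper (not an AWS API; the function name and example figures are made up): given a current vCPU count and the peak CPU observed in CloudWatch, estimate the smallest vCPU count that keeps peak utilization inside the 70-80% target band.

```python
import math

def right_size_vcpus(current_vcpus: int, peak_cpu_pct: float,
                     target_pct: float = 75.0) -> int:
    """Smallest whole vCPU count that keeps the observed peak at or
    below the target utilization band."""
    # Peak demand expressed in vCPU-equivalents
    demand = current_vcpus * (peak_cpu_pct / 100.0)
    # Choose capacity so that demand / capacity <= target_pct / 100
    return max(1, math.ceil(demand / (target_pct / 100.0)))

# An 8-vCPU m5.2xlarge peaking at 22% CPU needs only ~3 vCPUs,
# so a 4-vCPU instance is plenty:
print(right_size_vcpus(8, 22.0))  # -> 3
```

Feed it CloudWatch peak numbers from a representative load period, not an idle weekend.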

2.2 Implement Auto Scaling Groups


    Auto Scaling Groups (ASGs) are non-negotiable for production EC2.


    Required Configuration:

    ```yaml

    Minimum: 2 instances # HA across AZs

    Maximum: 10-20x baseline # Handle traffic spikes

    Desired: Match steady-state load

    Health Check: ELB + EC2

    Scaling Policies:

    - Target tracking: 70% CPU

    - Step scaling: +1 instance per 10% CPU above threshold

    - Scheduled scaling: Predictable patterns (business hours)

    ```


    Advanced Patterns:

  • Use **Launch Templates** for immutable infrastructure
  • Enable **Instance Refresh** for rolling deployments
  • Implement **Scale-in protection** for critical instances
  • Set **Warm pools** for rapid scale-out (pre-provisioned instances)
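As a concrete sketch, the required configuration and target-tracking policy above might look like this in CloudFormation (resource names, subnet IDs, and the launch template reference are placeholders):

```yaml
WebAsg:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: "2"                      # HA baseline across AZs
    MaxSize: "20"                     # headroom for traffic spikes
    DesiredCapacity: "2"
    VPCZoneIdentifier:                # subnets in at least two AZs
      - subnet-aaaa1111
      - subnet-bbbb2222
    HealthCheckType: ELB              # replace instances that fail LB checks
    HealthCheckGracePeriod: 120
    LaunchTemplate:
      LaunchTemplateId: !Ref WebLaunchTemplate
      Version: !GetAtt WebLaunchTemplate.LatestVersionNumber

CpuTargetPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AutoScalingGroupName: !Ref WebAsg
    PolicyType: TargetTrackingScaling
    TargetTrackingConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ASGAverageCPUUtilization
      TargetValue: 70.0               # the 70% CPU target above
```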

2.3 Leverage Spot Instances


    Spot instances offer up to **90% savings** for fault-tolerant workloads.


    Ideal Workloads:

  • Batch processing, CI/CD, data analysis
  • Stateless microservices with multiple instances
  • Container workloads with quick start/stop

Pattern: Spot Fleet + On-Demand Base

    ```yaml

    Spot Allocation Strategy: capacity-optimized

    On-Demand Base: 30% # Maintain minimum capacity

    Spot Percentage: 70% # Maximize savings

    Instance Pools: 10+ # Diversify for interruption resilience

    ```
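The 30/70 split above is easy to sanity-check with back-of-envelope math. An illustrative helper, assuming a uniform Spot discount across pools (real Spot prices vary by pool and over time):

```python
def blended_hourly_cost(on_demand_price: float, spot_discount: float,
                        od_share: float = 0.30, spot_share: float = 0.70) -> float:
    """Blended per-instance-hour cost for a mixed On-Demand/Spot fleet."""
    spot_price = on_demand_price * (1.0 - spot_discount)
    return od_share * on_demand_price + spot_share * spot_price

# An m5.large-class instance at $0.096/hr On-Demand with a 70% Spot discount:
cost = blended_hourly_cost(0.096, 0.70)
savings = 1.0 - cost / 0.096
print(f"${cost:.4f}/hr, {savings:.0%} saved")  # -> $0.0490/hr, 49% saved
```

Even with a 30% On-Demand floor for capacity assurance, the fleet roughly halves its compute cost.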


    2.4 Use AWS Graviton Processors


    Graviton (ARM-based) instances deliver **20-40% better price-performance** for most Linux workloads.


    Migration Path:

    1. Test workloads on r6g, c6g, m6g instances

    2. Verify ARM compatibility (most Linux apps work out-of-box)

    3. Rebuild or recompile if needed (minimal effort for most)


    When NOT to use Graviton:

  • Windows workloads (not supported)
  • Legacy x86 dependencies
  • Proprietary software without ARM support

---


    3. Lambda Best Practices


    3.1 Lambda Anti-Patterns


| Anti-Pattern | Why It's Problematic |
|--------------|---------------------|
| **Monolithic functions** (500+ lines) | Hard to test, slower cold starts, timeout risk |
| **Synchronous orchestration** | Chained functions accumulate latency |
| **Ignoring that CPU scales with memory** | A 256 MB function gets a sliver of a vCPU; compute-bound code crawls |
| **No dead-letter queue** | Failed async events are lost forever |
| **Provisioned Concurrency for everything** | Defeats the pay-per-use economics of serverless |


    3.2 Memory Configuration


    Rule: Memory is a performance dial. CPU, network, and disk all scale with memory.


    Optimal Sizing:

    ```python

    # Use AWS Lambda Power Tuning to find optimal memory

    # Most functions perform best at 1024-1792 MB

    # Cost sweet spot: 1024-1536 MB

    ```


    Power Tuning Pattern:

    1. Deploy with 128 MB, test execution time

    2. Increase to 1024 MB, test again

    3. Find inflection point where time decrease < cost increase

    4. Set to optimal memory (usually 1024-1792 MB)
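The loop above amounts to minimizing cost = duration × memory × price per GB-second. A hypothetical sketch using the published x86 Lambda compute price (confirm current pricing for your region); the benchmark figures are made-up sample data:

```python
PRICE_PER_GB_SECOND = 0.0000166667  # x86 Lambda compute price; verify for your region

def invocation_cost(memory_mb: int, duration_ms: float) -> float:
    """Compute cost of one invocation (ignores the flat per-request fee)."""
    gb_seconds = (memory_mb / 1024.0) * (duration_ms / 1000.0)
    return gb_seconds * PRICE_PER_GB_SECOND

def cheapest_config(measurements: dict) -> int:
    """measurements maps memory_mb -> observed average duration_ms."""
    return min(measurements, key=lambda m: invocation_cost(m, measurements[m]))

# Sample benchmark data: runtime stops improving much past 1024 MB,
# so extra memory only adds cost.
runs = {128: 12000.0, 512: 2400.0, 1024: 900.0, 1792: 620.0}
print(cheapest_config(runs))  # -> 1024
```

AWS Lambda Power Tuning automates exactly this sweep as a Step Functions state machine.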


    3.3 Control Cold Starts


    Cold starts add latency when Lambda scales out.


    Mitigation Strategies:

    ```yaml

    1. Keep deployment packages small (<50 MB zip)

    2. Minimize layers and dependencies

    3. Use Provisioned Concurrency for critical paths

    4. Implement keep-alive warming (scheduled pings)

5. Prefer fast-initializing runtimes (Python, Go, Node.js) over cold-start-heavy ones (e.g., JVM without SnapStart)

    ```


    Package Optimization:

  • Use AWS SDK v3 (modular, import only needed services)
  • Bundle only required dependencies
  • Consider Lambda container images for complex deps

3.4 Event-Driven Architecture


    Lambda excels when triggered by events, not invoked synchronously.


    Recommended Event Sources:

  • S3 (object uploads trigger processing)
  • SNS/SQS (decoupled async messaging)
  • EventBridge (cross-service event routing)
  • DynamoDB Streams (database changes trigger logic)

Pattern: SQS Buffer for Throttling

    ```yaml

    API Gateway -> Lambda -> SQS -> Lambda Worker

    ```

    Prevents throttling, enables retry logic, decouples producers/consumers.
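A minimal worker sketch for this pattern, assuming the SQS event source mapping has `ReportBatchItemFailures` enabled so only failed messages return to the queue; `process_record` is a stand-in for real business logic:

```python
import json

def process_record(body: dict) -> None:
    # Stand-in for real business logic; fails on demand for illustration.
    if body.get("fail"):
        raise ValueError("simulated processing failure")

def handler(event, context):
    """SQS-triggered Lambda worker reporting partial batch failures."""
    failures = []
    for record in event.get("Records", []):
        try:
            process_record(json.loads(record["body"]))
        except Exception:
            # Only this message is retried (and eventually lands in the DLQ)
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}

event = {"Records": [
    {"messageId": "m1", "body": json.dumps({"orderId": 1})},
    {"messageId": "m2", "body": json.dumps({"fail": True})},
]}
print(handler(event, None))  # -> {'batchItemFailures': [{'itemIdentifier': 'm2'}]}
```

Without partial batch responses, one poison message forces the whole batch to be retried.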


    ---


    4. Container Best Practices (ECS/EKS)


    4.1 ECS vs. EKS Decision


| Factor | ECS/Fargate | EKS |
|--------|-------------|-----|
| **Complexity** | Low (AWS-managed) | High (you operate Kubernetes itself, even though AWS manages the control plane) |
| **Startup Time** | Seconds | Minutes |
| **Cost** | Higher per vCPU | Lower at scale |
| **Portability** | AWS-only | Kubernetes everywhere |
| **Use Case** | Simple containerized apps | Complex orchestration needs |


    4.2 Fargate Best Practices


    Fargate eliminates server management. Use it unless you need custom kernel modules.


    Configuration:

    ```yaml

    Task CPU: Match application requirements

    Task Memory: Include application + container overhead

    Task Role: Least-privilege IAM per task

    Network Mode: awsvpc (enables security groups)

    ```
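A minimal CloudFormation sketch of such a Fargate task definition (account ID, image, and role references are placeholders):

```yaml
ApiTask:
  Type: AWS::ECS::TaskDefinition
  Properties:
    RequiresCompatibilities: [FARGATE]
    NetworkMode: awsvpc               # per-task ENI + security groups
    Cpu: "512"                        # 0.5 vCPU
    Memory: "1024"                    # MiB; application plus container overhead
    TaskRoleArn: !Ref ApiTaskRole     # least-privilege IAM per task
    ExecutionRoleArn: !Ref TaskExecutionRole
    ContainerDefinitions:
      - Name: api
        Image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/api:latest
        PortMappings:
          - ContainerPort: 8080
```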


    Cost Optimization:

  • Use Fargate Spot for dev/test and fault-tolerant prod
  • Right-size tasks (most oversize by 2x)
  • Enable autoscaling based on CPU/memory metrics

4.3 EKS Best Practices


    EKS is for teams committed to Kubernetes with complex orchestration needs.


    Cluster Design:

    ```yaml

    Managed Node Groups: Prefer over self-managed

    Cluster Autoscaler: Required for cost efficiency

    Multiple AZs: Required for HA

Pod Disruption Budgets: Limit voluntary disruptions during updates

    Horizontal Pod Autoscaler: Scale pods based on demand

    ```
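For the Horizontal Pod Autoscaler line above, a minimal `autoscaling/v2` manifest might look like this (the Deployment name and replica bounds are placeholders):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2                      # matches the multi-AZ HA baseline
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70      # add pods at 70% average CPU
```

Pair it with the Cluster Autoscaler so new pods also get nodes to land on.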


    Cost Control:

  • Use Spot instances via node selectors
  • Implement cluster autoscaler (scale down unused nodes)
  • Set resource limits on all pods (prevent runaway costs)
  • Use EKS for orchestrating, not for simple container hosting

---


    5. High Availability & Disaster Recovery


    5.1 Multi-AZ Deployments


    Requirement: All production workloads must span multiple Availability Zones.


    Implementation:

    ```yaml

EC2: ASG with subnets in multiple AZs

    Lambda: Regional service (automatic AZ redundancy)

    ECS: Tasks distributed across AZs

    EKS: Nodes in multiple AZs, pod anti-affinity

    ```


    5.2 Disaster Recovery Patterns


| RTO/RPO | Strategy | Cost | Complexity |
|---------|----------|------|------------|
| **Minutes / Near-zero data loss** | Multi-Region Active-Active | High | High |
| **Tens of minutes to hours / Minimal loss** | Warm Standby or Pilot Light | Medium | Medium |
| **Hours to days / Up to 24 hr loss** | Backup & Restore | Low | Low |


    Pilot Light Implementation:

    ```yaml

    Primary Region: Full production stack

    DR Region:

    - Minimal resources (single AZ, small instances)

    - Automated DNS failover (Route 53 health checks)

    - Database read replica (promote to primary)

    - S3 cross-region replication

    ```
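The automated DNS failover line can be sketched as Route 53 failover alias records (hosted zone, record name, and load balancer references are placeholders; in a real setup the DR region's values would come from a separate stack):

```yaml
PrimaryRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    HostedZoneId: !Ref AppZone
    Name: app.example.com.
    Type: A
    SetIdentifier: primary
    Failover: PRIMARY
    HealthCheckId: !Ref PrimaryHealthCheck   # Route 53 health check on the primary
    AliasTarget:
      DNSName: !GetAtt PrimaryAlb.DNSName
      HostedZoneId: !GetAtt PrimaryAlb.CanonicalHostedZoneID

SecondaryRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    HostedZoneId: !Ref AppZone
    Name: app.example.com.
    Type: A
    SetIdentifier: secondary
    Failover: SECONDARY                      # served only while the primary is unhealthy
    AliasTarget:
      DNSName: !GetAtt DrAlb.DNSName
      HostedZoneId: !GetAtt DrAlb.CanonicalHostedZoneID
```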


    ---


    6. Cost Optimization Checklist


    Immediate Actions (Week 1)

  • [ ] Run Compute Optimizer, apply recommendations
  • [ ] Enable Auto Scaling on all EC2 ASGs
  • [ ] Implement Spot Fleet for fault-tolerant workloads
  • [ ] Test Graviton instances for Linux workloads

Short-Term (Month 1)

  • [ ] Audit Lambda memory, optimize with Power Tuning
  • [ ] Migrate dev/test to Fargate Spot or EC2 Spot
  • [ ] Implement Scheduled Scaling for predictable workloads
  • [ ] Set up budget alerts via AWS Budgets and Cost Explorer

Long-Term (Quarter 1)

  • [ ] Architect for serverless where applicable
  • [ ] Implement multi-AZ for all production
  • [ ] Document DR runbook, test quarterly
  • [ ] Establish compute governance (tagging, policies)

---


    7. Monitoring & Observability


    Required Metrics

    ```yaml

    Compute Metrics:

    - CPU/Memory Utilization (CloudWatch)

    - Network In/Out (bottleneck detection)

    - Lambda errors and durations

    - ASG scaling events


    Alerting:

    - CPU > 80% for 5 minutes

    - Memory > 85% for 5 minutes

    - Lambda error rate > 1%

    - ASG at max capacity

    ```
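The "CPU > 80% for 5 minutes" alert above could be expressed as a CloudWatch alarm in CloudFormation (the ASG and SNS topic references are placeholders):

```yaml
HighCpuAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Average CPU above 80% for 5 minutes
    Namespace: AWS/EC2
    MetricName: CPUUtilization
    Dimensions:
      - Name: AutoScalingGroupName
        Value: !Ref WebAsg
    Statistic: Average
    Period: 300                       # one 5-minute period
    EvaluationPeriods: 1
    Threshold: 80
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - !Ref AlertTopic               # SNS topic for notifications
```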


    Recommended Tools

  • **CloudWatch dashboards:** Unified metrics view
  • **X-Ray:** Distributed tracing for microservices
  • **Compute Optimizer:** Continuous optimization
  • **Trusted Advisor:** Weekly performance checks

---


    8. Security Best Practices


    Compute Security Checklist

    ```yaml

    EC2:

    ☐ IMDSv2 required (prevent SSRF attacks)

    ☐ Security groups restrict ingress/egress

    ☐ IAM roles, never access keys

    ☐ AWS Systems Manager Session Manager (no SSH keys)


    Lambda:

    ☐ Least-privilege execution roles

    ☐ VPC configuration for private resources

☐ Secrets from Secrets Manager / Parameter Store (no plaintext in environment variables)

    ☐ Code signing in production


    Containers:

    ☐ Scan images for vulnerabilities (Amazon ECR)

    ☐ Run as non-root user

    ☐ Read-only root filesystems

    ☐ Secrets via AWS Secrets Manager (not env vars)

    ```
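The IMDSv2 checkbox, for example, can be enforced at the launch-template level. A hedged CloudFormation sketch (the resource name is a placeholder):

```yaml
WebLaunchTemplate:
  Type: AWS::EC2::LaunchTemplate
  Properties:
    LaunchTemplateData:
      MetadataOptions:
        HttpTokens: required          # IMDSv2 only; blocks SSRF-style credential theft
        HttpPutResponseHopLimit: 1    # keep metadata access on the instance itself
```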


    ---


    Summary: Compute Excellence Pillars


    1. **Right-size everything** (use data, not guesses)

    2. **Auto-scale or go serverless** (manual scaling is obsolete)

    3. **Leverage Spot/Graviton** (20-90% savings)

    4. **Multi-AZ by default** (HA is non-negotiable)

    5. **Monitor continuously** (you can't optimize what you don't measure)


    ---


    Need Help Architecting Your Compute?


    Our AWS-certified solutions architects can design scalable, cost-optimized compute architectures tailored to your workload patterns.


[Schedule a Free Architecture Review →](/contact)


Related Reading:

  • For **storage**, see [S3 + EBS Architecture Patterns](/blog/aws-storage-best-practices)
  • For **security**, refer to [Zero Trust Network Design](/blog/aws-security-best-practices)
  • For **databases**, explore [Database Selection Guide](/blog/aws-database-best-practices)

---


    *Last updated: January 5, 2025*

