Serverless on AWS — Beyond the Hype
I've been building serverless systems on AWS since Lambda was limited to Node.js 0.10 and had a 60-second timeout. In the years since, I've helped organizations migrate millions of requests per day to serverless architectures — and I've also been the one to tell teams, "Actually, put that back on containers."
Serverless on AWS is genuinely transformative for the right workloads. But the hype has created a generation of architects who reach for Lambda as a default, when sometimes a container on Fargate would have been simpler, cheaper, and more maintainable.
Let me share what I've learned.
The Serverless Toolkit on AWS
Before we dive into patterns and opinions, let's establish the landscape:
| Service | Role | Key Characteristics |
|---|---|---|
| Lambda | Compute | Event-driven functions, 15 min max, 10 GB memory, pay-per-invocation |
| API Gateway | HTTP ingress | REST & WebSocket APIs, throttling, auth, request/response transformation |
| Step Functions | Orchestration | Visual workflows, state management, error handling, long-running processes |
| EventBridge | Event routing | Serverless event bus, schema registry, cross-account events |
| DynamoDB | Database | Serverless NoSQL, single-digit ms latency, auto-scaling |
| SQS | Queue | Fully managed message queue, dead letter queues, FIFO support |
| SNS | Pub/Sub | Push-based notifications, fan-out, multi-protocol delivery |
| S3 | Storage/Triggers | Object storage with event notifications for Lambda |
Let's Talk About Cold Starts — Honestly
Cold starts are the most discussed and most misunderstood aspect of Lambda. Let me give you the real numbers and context.
What Actually Happens During a Cold Start
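In short: AWS fetches your deployment package, boots the runtime, executes your module-level initialization code, and only then calls your handler. Warm invocations skip everything except the last step. A minimal sketch (the handler and config names are illustrative) showing which code runs in which phase:

```python
import time

# Everything at module level runs once, during the cold start "init" phase.
# Keep it lean: every import and every client you construct here adds latency.
_init_started = time.monotonic()
CONFIG = {"table": "orders"}  # hypothetical config loaded at init time
INIT_MS = (time.monotonic() - _init_started) * 1000

def handler(event, context):
    # Only this function runs on every invocation. On a warm execution
    # environment, the init phase above is skipped entirely.
    return {"init_ms": INIT_MS, "event": event}
```

This split is why the same function can respond in single-digit milliseconds when warm and hundreds of milliseconds (or seconds, for heavy runtimes) when cold.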
Real-World Cold Start Numbers (2026)
Based on my production measurements across dozens of Lambda functions:
| Runtime | Package Size | Cold Start (p50) | Cold Start (p99) | With SnapStart |
|---|---|---|---|---|
| Python 3.12 | 5 MB | ~200 ms | ~500 ms | N/A |
| Node.js 20 | 10 MB | ~250 ms | ~600 ms | N/A |
| Java 21 (Spring) | 50 MB | ~3,000 ms | ~6,000 ms | ~300 ms |
| Java 21 (Micronaut) | 15 MB | ~1,200 ms | ~2,500 ms | ~200 ms |
| .NET 8 | 20 MB | ~600 ms | ~1,200 ms | N/A |
| Rust (custom runtime) | 3 MB | ~12 ms | ~30 ms | N/A |
| Go | 8 MB | ~80 ms | ~150 ms | N/A |
My Cold Start Mitigation Strategy
- Choose the right runtime. If cold starts matter, Python, Node.js, Go, or Rust will give you the best experience. Java requires SnapStart.
- Use Lambda SnapStart for Java. This is a game-changer. SnapStart takes a snapshot of your initialized function and restores it instead of re-initializing. It reduces Java cold starts from seconds to milliseconds. If you're running Java on Lambda without SnapStart, you're doing it wrong.
- Minimize your deployment package. Every MB matters. Tree-shake your dependencies. Use Lambda layers for shared code. Bundle with esbuild for Node.js.
- Provisioned Concurrency for the critical path. If you absolutely cannot tolerate cold starts on your user-facing API, Provisioned Concurrency keeps warm environments ready. But it costs money: you're paying for compute whether it's used or not.
💡 Pro Tip: Don't use Provisioned Concurrency as your first solution. First optimize your package size and init code. I've seen teams cut cold starts by 80% just by removing unnecessary SDK imports and lazy-loading modules. Only reach for Provisioned Concurrency when you've exhausted free optimizations.
- Architecture-level mitigation. Put an SQS queue in front of non-latency-sensitive Lambda functions. The queue absorbs the cold start delay while providing buffering and retry semantics for free.
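One concrete form of the lazy-loading advice above: defer expensive imports so their cost lands on the rare code path that needs them, not on every cold start. A sketch, assuming boto3 is available and using a hypothetical event flag and bucket name:

```python
import functools

@functools.lru_cache(maxsize=1)
def s3_client():
    # Deferring the import moves its cost out of the init phase and onto the
    # first invocation that actually touches S3; the cache reuses the client
    # for the lifetime of the execution environment.
    import boto3  # lazy import: only paid for on the S3 code path
    return boto3.client("s3")

def handler(event, context):
    if event.get("archive_to_s3"):  # hypothetical flag; most invocations skip S3
        s3_client().put_object(
            Bucket="my-archive-bucket", Key=event["key"], Body=b"..."
        )
    return {"ok": True}
```

If every invocation needs the client, keep it at module level instead; lazy loading only pays off when the dependency is on a minority code path.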
Event-Driven Architecture Patterns
This is where serverless truly shines. When you stop thinking about servers and start thinking about events, your architectures become more resilient, scalable, and decoupled.
Pattern 1: Fan-Out with SNS + SQS + Lambda
The fan-out pattern distributes a single event to multiple consumers for parallel processing.
Why SNS → SQS → Lambda instead of SNS → Lambda directly?
This is a nuance many architects miss. By placing SQS between SNS and Lambda:
- Buffering: SQS absorbs traffic spikes, preventing Lambda throttling
- Retry control: SQS gives you configurable retry with backoff, not SNS's limited retry policy
- Dead letter queues: Failed messages go to a DLQ for investigation, not into the void
- Batch processing: Lambda can process SQS messages in batches of up to 10,000, reducing invocations
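To make batch consumption safe, pair it with partial batch responses: when the event source mapping enables ReportBatchItemFailures, the handler returns only the IDs of failed messages, and SQS redelivers just those instead of the whole batch. A sketch with placeholder business logic:

```python
import json

def process_message(body):
    # Placeholder business logic; raises on malformed input.
    order = json.loads(body)
    if "order_id" not in order:
        raise ValueError("missing order_id")
    return order["order_id"]

def handler(event, context):
    # With ReportBatchItemFailures enabled on the event source mapping,
    # returning these IDs makes SQS redeliver only the failed messages.
    failures = []
    for record in event["Records"]:
        try:
            process_message(record["body"])
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Without this response shape, a single bad message forces the entire batch back onto the queue, and every healthy message in it gets reprocessed.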
💡 Pro Tip: Always set up a DLQ with an alarm. I've seen production incidents where thousands of messages silently failed because nobody was monitoring the DLQ. A simple CloudWatch alarm on ApproximateNumberOfMessagesVisible > 0 on your DLQ can save your on-call team hours of debugging.
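Here's one way to define that alarm, sketched as the parameter set you'd pass to CloudWatch's put_metric_alarm via boto3; the queue and SNS topic names are hypothetical:

```python
def dlq_alarm_params(dlq_name, alert_topic_arn):
    # Builds the arguments for cloudwatch.put_metric_alarm(**params). The
    # metric lives in the AWS/SQS namespace and is dimensioned on QueueName.
    return {
        "AlarmName": f"{dlq_name}-messages-visible",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": dlq_name}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # an empty DLQ is the healthy state
        "AlarmActions": [alert_topic_arn],
    }

# Usage (requires boto3 and AWS credentials):
# boto3.client("cloudwatch").put_metric_alarm(**dlq_alarm_params(
#     "orders-dlq", "arn:aws:sns:us-east-1:123456789:oncall-alerts"))
```

Treating missing data as not breaching matters here: an idle DLQ emits no data points, and you don't want that to flap the alarm.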
Pattern 2: The Saga Pattern with Step Functions
When you need distributed transactions across microservices, the Saga pattern is your friend. Step Functions makes this dramatically easier than doing it yourself.
Here's the Step Functions state machine definition for this saga:
{
"Comment": "Order Processing Saga",
"StartAt": "ReserveInventory",
"States": {
"ReserveInventory": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789:function:reserve-inventory",
"Catch": [
{
"ErrorEquals": ["States.ALL"],
"Next": "OrderFailed"
}
],
"Next": "ProcessPayment"
},
"ProcessPayment": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789:function:process-payment",
"Catch": [
{
"ErrorEquals": ["States.ALL"],
"Next": "CompensateInventory"
}
],
"Next": "ShipOrder"
},
"ShipOrder": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789:function:ship-order",
"Catch": [
{
"ErrorEquals": ["States.ALL"],
"Next": "CompensatePayment"
}
],
"Next": "SendConfirmation"
},
"CompensatePayment": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789:function:refund-payment",
"Next": "CompensateInventory"
},
"CompensateInventory": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789:function:release-inventory",
"Next": "OrderFailed"
},
"SendConfirmation": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789:function:send-confirmation",
"Next": "OrderComplete"
},
"OrderComplete": { "Type": "Succeed" },
"OrderFailed": { "Type": "Fail", "Error": "OrderProcessingFailed" }
}
}
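The task functions themselves stay simple, because an unhandled exception is exactly what trips the Catch blocks above. A hypothetical sketch of the process-payment Lambda (field names and the declined-payment condition are illustrative):

```python
class PaymentDeclined(Exception):
    """Raised when the (omitted) charge step fails."""

def handler(event, context):
    order = event["order"]
    if order["amount"] <= 0:
        # Raising, rather than returning an error field, is what makes the
        # state machine's Catch route execution to CompensateInventory.
        raise PaymentDeclined(f"invalid amount for order {order['id']}")
    # ... call the payment provider here (omitted) ...
    return {**event, "payment_id": f"pay-{order['id']}"}
```

Returning the enriched event (rather than a bare status) lets downstream states like ShipOrder and CompensatePayment read the payment_id from the workflow state.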
Orchestration (Step Functions) vs. Choreography (EventBridge)?
| Aspect | Step Functions (Orchestration) | EventBridge (Choreography) |
|---|---|---|
| Visibility | Full workflow visualization | Distributed, harder to trace |
| Coupling | Central coordinator knows all steps | Services are fully decoupled |
| Error handling | Built-in catch/retry/compensation | Each service handles its own |
| Complexity | Simpler for linear workflows | Better for loosely coupled domains |
| Cost | ~$0.025 per 1K state transitions (Standard) | ~$1.00 per 1M events |
My rule of thumb: Use Step Functions when you need guaranteed ordering and compensation logic. Use EventBridge when services genuinely don't need to know about each other.
Pattern 3: CQRS with DynamoDB Streams + EventBridge
Command Query Responsibility Segregation (CQRS) separates read and write models. On AWS, this pattern is a natural fit: writes land in DynamoDB, DynamoDB Streams captures every change, and stream consumers publish through EventBridge to build read-optimized views.
When CQRS makes sense on AWS:
- Your read and write patterns are fundamentally different (e.g., write individual records, read complex aggregations)
- You need different query patterns than your write model supports
- You want to build materialized views optimized for specific read patterns
- You need to replay events (event sourcing combined with CQRS)
When it doesn't:
- Simple CRUD applications. CQRS adds significant complexity.
- Small teams that can't maintain the additional infrastructure
- When eventual consistency is unacceptable for your domain
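A sketch of the glue in this pattern: a Lambda attached to the table's stream, projecting new items into EventBridge events. The entry shape matches what you'd pass to events.put_events via boto3; the bus name, source, and attribute names are assumptions:

```python
import json

def to_event_entries(stream_event):
    # Converts DynamoDB stream records into EventBridge entries. Only INSERTs
    # matter to this particular projection; MODIFY/REMOVE are ignored.
    entries = []
    for record in stream_event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        new_image = record["dynamodb"]["NewImage"]
        entries.append({
            "Source": "com.myapp.orders",
            "DetailType": "OrderCreated",
            "Detail": json.dumps({"order_id": new_image["pk"]["S"]}),
            "EventBusName": "orders-bus",
        })
    return entries

# Usage (requires boto3 and AWS credentials):
# boto3.client("events").put_events(Entries=to_event_entries(event))
```

Because the stream delivers each change at least once, the read-model consumers on the other side of the bus must be idempotent.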
EventBridge: The Underrated Backbone
EventBridge is the service I wish more architects understood deeply. It's not just "another SNS." It's a schema-aware, content-based routing event bus that fundamentally changes how you build event-driven systems.
Why EventBridge Over SNS for Event Routing
Consider this EventBridge rule pattern, which matches on nested JSON content in the event body itself, something far beyond what SNS attribute-based message filtering offers:
{
"source": ["com.myapp.orders"],
"detail-type": ["OrderCreated"],
"detail": {
"amount": [{"numeric": [">", 1000]}],
"region": ["us-east-1", "eu-west-1"],
"customer": {
"tier": ["premium"]
}
}
}
EventBridge advantages:
- Schema Registry — Discover and validate event schemas automatically
- Archive & Replay — Replay historical events for debugging or reprocessing
- Cross-account events — Native support for multi-account architectures
- 150+ AWS service integrations — Direct integration without Lambda glue
- Content-based filtering — Route events based on JSON content, not just attributes
💡 Pro Tip: Use EventBridge's archive and replay feature as a poor man's event store. When you deploy a new consumer that needs historical data, just replay the last N days of events. I've used this pattern to bootstrap new microservices without any migration scripts.
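Sketched as the parameters you'd hand to EventBridge's start_replay call via boto3 (the archive and bus ARNs are hypothetical):

```python
from datetime import datetime, timedelta, timezone

def replay_last_days(archive_arn, event_bus_arn, days):
    # Builds the arguments for events.start_replay(**params). Replayed events
    # are delivered to rules on the destination bus, so consumers must be
    # idempotent: they will see events they may have already processed.
    end = datetime.now(timezone.utc)
    return {
        "ReplayName": f"bootstrap-{days}d",
        "EventSourceArn": archive_arn,
        "EventStartTime": end - timedelta(days=days),
        "EventEndTime": end,
        "Destination": {"Arn": event_bus_arn},
    }

# Usage (requires boto3 and AWS credentials):
# boto3.client("events").start_replay(**replay_last_days(
#     "arn:aws:events:us-east-1:123456789:archive/orders", 
#     "arn:aws:events:us-east-1:123456789:event-bus/orders-bus", 7))
```

One caveat worth knowing before you lean on this: replays preserve event content but not original ordering or timing, so consumers that depend on strict ordering need their own sequencing.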
Cost Optimization: The Math Nobody Talks About
Here's where I get controversial: serverless is not always cheaper. At scale, the per-invocation pricing model can become more expensive than containers.
The Crossover Point
Let me show you the math for a simple API endpoint:
Lambda pricing (us-east-1):
- $0.20 per 1M requests
- $0.0000166667 per GB-second
For a 256 MB function running for 200ms:
Per invocation: $0.0000002 (request) + $0.00000083 (duration) = ~$0.000001
At 10M requests/month: ~$10/month ← Lambda wins
At 100M requests/month: ~$100/month ← Still competitive
At 1B requests/month: ~$1,000/month ← Container territory
Fargate comparison (1 vCPU, 2 GB per task):
~$30/month per task
3 tasks for redundancy, sustaining ~100 req/s: ~$90/month
Handles: ~260M requests/month for $90
At 1B requests/month (~12 tasks): ~$350/month ← Fargate wins by 3x
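The same math as a small model you can rerun with your own numbers. Prices are the us-east-1 figures above; the monthly request capacity per Fargate task is the assumption that dominates the answer:

```python
import math

LAMBDA_REQ_PRICE = 0.20 / 1_000_000   # $ per request
LAMBDA_GBS_PRICE = 0.0000166667       # $ per GB-second
FARGATE_TASK_MONTHLY = 30.0           # $ per 1 vCPU / 2 GB task (approx.)

def lambda_monthly(requests, memory_gb=0.25, seconds=0.2):
    # Request charge plus duration charge for each invocation.
    return requests * (LAMBDA_REQ_PRICE + memory_gb * seconds * LAMBDA_GBS_PRICE)

def fargate_monthly(requests, reqs_per_task_month=90_000_000, min_tasks=3):
    # Scale task count to demand, with a redundancy floor of three tasks.
    tasks = max(min_tasks, math.ceil(requests / reqs_per_task_month))
    return tasks * FARGATE_TASK_MONTHLY
```

Sweeping `requests` from 10M to 1B shows the crossover: Lambda's bill grows linearly with traffic while Fargate's grows in coarse steps, and somewhere in the hundreds of millions of requests per month the lines cross.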
My Cost Optimization Rules
- Start serverless, migrate hot paths. Build everything on Lambda initially. Use CloudWatch metrics to identify functions that are constantly warm. Those are candidates for containers.
- Use Lambda Compute Savings Plans. Lambda participates in Compute Savings Plans, offering up to 17% savings with a 1-year commitment.
- Right-size your memory. Lambda allocates CPU proportionally to memory. I've seen teams running 1024 MB functions that only use 128 MB of memory; they were paying for CPU. Use AWS Lambda Power Tuning to find the optimal memory/cost/performance balance.
- Batch your SQS processing. Instead of processing 1 message per Lambda invocation, configure batchSize: 10 (or higher). You'll cut invocations by up to 10x.
- Use Step Functions Express for high-volume workflows. Standard workflows cost $0.025/1K state transitions. Express workflows are billed at $0.00001667/GB-second, which works out orders of magnitude cheaper for short-lived, high-volume workflows.
- API Gateway HTTP APIs vs. REST APIs. HTTP APIs are up to 71% cheaper than REST APIs and support most use cases. Only use REST APIs when you need request/response transformation, usage plans, or API keys.
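The SQS batching rule above, expressed as the parameters for Lambda's create_event_source_mapping (queue and function names are hypothetical):

```python
def sqs_batch_mapping(queue_arn, function_name):
    # Builds the arguments for lambda.create_event_source_mapping(**params).
    # A batching window lets small batches fill up before invoking, trading a
    # little latency for far fewer invocations.
    return {
        "EventSourceArn": queue_arn,
        "FunctionName": function_name,
        "BatchSize": 100,                     # up to 10,000 for standard queues
        "MaximumBatchingWindowInSeconds": 5,  # required when BatchSize > 10
        "FunctionResponseTypes": ["ReportBatchItemFailures"],
    }

# Usage (requires boto3 and AWS credentials):
# boto3.client("lambda").create_event_source_mapping(**sqs_batch_mapping(
#     "arn:aws:sqs:us-east-1:123456789:orders-queue", "process-orders"))
```

Enabling ReportBatchItemFailures here is what allows the handler to fail individual messages without forcing the whole batch back onto the queue.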
💡 Pro Tip: Lambda duration pricing is tiered for high-usage accounts: the per-GB-second rate drops by roughly 10% above 6 billion GB-seconds/month and 20% above 15 billion. If you're at that scale, you should be in conversation with your AWS account team about private pricing anyway.
When Serverless Is NOT the Answer
I love serverless. I've built my career on it. But intellectual honesty matters, and there are real scenarios where serverless hurts more than it helps:
1. Long-Running Processes
Lambda has a 15-minute timeout. If your workload runs for hours (video transcoding, large data processing), you need ECS/Fargate or EC2. Yes, you can chain Lambda invocations via Step Functions, but at that point you're fighting the platform.
2. High-Throughput, Steady-State Workloads
If you consistently process 10,000 requests/second, 24/7, containers will be significantly cheaper. Lambda's per-invocation pricing shines for variable traffic, not constant load.
3. WebSocket/Persistent Connection Heavy Applications
API Gateway WebSocket APIs exist, but they're expensive and awkward for high-frequency bidirectional communication (like gaming or real-time collaboration). Use an ALB with Fargate or EC2 instead.
4. Applications with Heavy Initialization
If your application takes 10+ seconds to initialize (loading ML models, building in-memory caches), Lambda cold starts will destroy your user experience. Even with Provisioned Concurrency, this is a poor fit.
5. Complex Local State Requirements
Lambda functions are stateless by design. If your application needs significant in-memory state (large caches, connection pools, session data), containers or EC2 give you much more control.
6. Vendor Lock-In Sensitivity
Let me be real: a Lambda-native architecture is deeply locked to AWS. Your Step Functions, EventBridge rules, DynamoDB tables, and API Gateway configurations don't port to other clouds. If multi-cloud portability is a genuine business requirement, containers with Kubernetes give you more flexibility. But don't overweight this concern — most organizations never actually switch clouds.
A Real-World Serverless Architecture
Here's an architecture I recently built for a fintech client processing ~50M events/day:
Key design decisions:
- EventBridge as the backbone — Decouples ingestion from processing. New consumers can subscribe without changing producers.
- SQS between EventBridge and Lambda — Provides buffering, retry, and DLQ capabilities.
- DynamoDB for hot data, Aurora Serverless for relational — Not everything fits in DynamoDB. The ledger needs ACID transactions and complex queries.
- Step Functions for compliance workflows — Multi-step processes with error handling and compensation.
- Firehose to S3 for analytics — Near-real-time data lake ingestion without managing infrastructure.
Monthly cost for 50M events/day: ~$2,800 (including all compute, storage, and data transfer). An equivalent container-based architecture was estimated at ~$4,500, and it would have carried higher operational overhead than the serverless design.
Key Takeaways
- Cold starts are manageable, not catastrophic. Choose the right runtime, optimize your package, and use SnapStart for Java. Save Provisioned Concurrency for when you've exhausted free optimizations.
- SNS → SQS → Lambda is superior to SNS → Lambda for production workloads. The buffering, retry control, and DLQ capabilities are worth the small added complexity.
- EventBridge is the most important serverless service you're probably underusing. Its content-based routing, archive/replay, and schema registry make it far more than "another SNS."
- Serverless isn't always cheaper. Do the math for your specific workload. The crossover point where containers win is lower than you think for steady-state, high-throughput systems.
- The Saga pattern with Step Functions makes distributed transactions manageable. Don't build your own orchestration framework.
- Know when to walk away from serverless. Long-running processes, persistent connections, heavy initialization, and constant high-throughput workloads are better served by containers.
- Start serverless, optimize later. The speed of development and deployment outweighs cost optimization in the early stages. You can always move hot paths to containers later.
What's Next
In the final article of this series, I tackle the hardest problem in cloud: AWS Governance at Scale — Guardrails Without the Red Tape. How do you enforce security, compliance, and cost controls across 50+ AWS accounts without becoming the bottleneck? I'll share the multi-account patterns, SCP strategies, and automation approaches I've used to keep enterprise organizations both secure and productive.
