AWS Infrastructure & Operations
Master CloudWatch monitoring, CloudFormation IaC, architecture patterns, cost optimization, and production troubleshooting.
This guide walks through each of these areas in depth, showing you how to build, operate, and optimize production AWS environments.
Monitoring and Management: Your Cloud Operations Center
Amazon CloudWatch - Your Eyes in the Cloud
CloudWatch is AWS’s monitoring service that collects and tracks metrics, logs, and events from every part of your infrastructure. Think of it as your 24/7 operations center.
Key CloudWatch Features:
1. Metrics and Alarms

# Create billing alarm to avoid surprises
# (single quotes prevent the shell from expanding $1 in the description)
aws cloudwatch put-metric-alarm \
  --alarm-name billing-alarm \
  --alarm-description 'Alert when AWS charges exceed $100' \
  --metric-name EstimatedCharges \
  --namespace AWS/Billing \
  --statistic Maximum \
  --period 86400 \
  --threshold 100 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=Currency,Value=USD

2. Custom Metrics for Application Monitoring

import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client('cloudwatch')

# Send custom metric
def track_user_login(user_id):
    cloudwatch.put_metric_data(
        Namespace='MyApp',
        MetricData=[
            {
                'MetricName': 'UserLogins',
                'Dimensions': [
                    {'Name': 'Environment', 'Value': 'Production'}
                ],
                'Value': 1,
                'Unit': 'Count',
                'Timestamp': datetime.now(timezone.utc)  # boto3 expects a datetime, not a float
            }
        ]
    )

3. Log Insights for Troubleshooting

-- Find slowest API endpoints
fields @timestamp, duration, path
| filter duration > 1000
| sort duration desc
| limit 20

-- Count errors by type
fields @timestamp, error_type
| filter @message like /ERROR/
| stats count() by error_type
Common CloudWatch Pitfall: Not setting up log retention, leading to unexpected costs. Always set retention policies:
aws logs put-retention-policy \
--log-group-name /aws/lambda/my-function \
--retention-in-days 30
AWS CloudFormation - Infrastructure as Code
CloudFormation lets you define your entire infrastructure in JSON or YAML templates. Instead of clicking through the console, you describe what you want and CloudFormation builds it.
Why CloudFormation Matters:
- Version control your infrastructure
- Replicate environments exactly
- Roll back changes if something breaks
- Share templates with your team
Practical CloudFormation Example:
# template.yml - Web application stack (subnets, security groups,
# launch template, and target group omitted here for brevity)
AWSTemplateFormatVersion: '2010-09-09'
Description: 'Web application with auto-scaling'
Parameters:
KeyName:
Type: AWS::EC2::KeyPair::KeyName
Description: EC2 Key Pair for SSH access
Resources:
# Application Load Balancer
LoadBalancer:
Type: AWS::ElasticLoadBalancingV2::LoadBalancer
Properties:
Subnets:
- !Ref PublicSubnet1
- !Ref PublicSubnet2
SecurityGroups:
- !Ref LoadBalancerSecurityGroup
# Auto Scaling Group
AutoScalingGroup:
Type: AWS::AutoScaling::AutoScalingGroup
Properties:
MinSize: 2
MaxSize: 10
DesiredCapacity: 4
LaunchTemplate:
LaunchTemplateId: !Ref LaunchTemplate
Version: !GetAtt LaunchTemplate.LatestVersionNumber
TargetGroupARNs:
- !Ref TargetGroup
HealthCheckType: ELB
HealthCheckGracePeriod: 300
# Scaling Policy
ScaleUpPolicy:
Type: AWS::AutoScaling::ScalingPolicy
Properties:
AutoScalingGroupName: !Ref AutoScalingGroup
PolicyType: TargetTrackingScaling
TargetTrackingConfiguration:
PredefinedMetricSpecification:
PredefinedMetricType: ASGAverageCPUUtilization
TargetValue: 70
Outputs:
LoadBalancerDNS:
Description: DNS name of load balancer
Value: !GetAtt LoadBalancer.DNSName
Export:
Name: !Sub ${AWS::StackName}-LoadBalancer-DNS
Deploy with:
aws cloudformation create-stack \
--stack-name my-web-app \
--template-body file://template.yml \
--parameters ParameterKey=KeyName,ParameterValue=my-key
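For updates, create a change set first so you can preview exactly what CloudFormation will modify before it touches production. A minimal sketch against the same stack (the change-set name is arbitrary):

# Preview the changes an updated template would make
aws cloudformation create-change-set \
  --stack-name my-web-app \
  --change-set-name preview-update \
  --template-body file://template.yml \
  --parameters ParameterKey=KeyName,ParameterValue=my-key

# Inspect the proposed resource changes
aws cloudformation describe-change-set \
  --stack-name my-web-app \
  --change-set-name preview-update

# Apply once you're satisfied
aws cloudformation execute-change-set \
  --stack-name my-web-app \
  --change-set-name preview-update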
Messaging and Integration: Connecting Your Services
Amazon SNS - Simple Notification Service
SNS is a pub/sub messaging service that lets you send messages to multiple subscribers. Think of it as a broadcasting system for your applications.
Real-world SNS Patterns:
1. Multi-Channel Notifications

import boto3

sns = boto3.client('sns')

# Create topic
topic = sns.create_topic(Name='order-updates')
topic_arn = topic['TopicArn']

# Subscribe email
sns.subscribe(
    TopicArn=topic_arn,
    Protocol='email',
    Endpoint='customer@example.com'
)

# Subscribe SMS
sns.subscribe(
    TopicArn=topic_arn,
    Protocol='sms',
    Endpoint='+1234567890'
)

# Subscribe Lambda for processing
sns.subscribe(
    TopicArn=topic_arn,
    Protocol='lambda',
    Endpoint='arn:aws:lambda:region:account:function:process-order'
)

# Send notification to all subscribers
sns.publish(
    TopicArn=topic_arn,
    Message='Order #12345 has been shipped!',
    Subject='Order Update'
)

2. Fan-out Pattern for Microservices

import json
from datetime import datetime

# Order service publishes once
def complete_order(order_id):
    sns.publish(
        TopicArn='arn:aws:sns:region:account:order-completed',
        Message=json.dumps({
            'orderId': order_id,
            'timestamp': datetime.now().isoformat(),
            'amount': 99.99
        })
    )

# Multiple services subscribe and react:
# - Inventory service updates stock
# - Email service sends confirmation
# - Analytics service records sale
# - Shipping service creates label
Amazon SQS - Simple Queue Service
SQS provides reliable, scalable message queues. Unlike SNS (push), SQS is pull-based - consumers request messages when ready to process them.
SQS Best Practices:
1. Decoupling with Standard Queues

import boto3
import json

sqs = boto3.client('sqs')
queue_url = 'https://sqs.region.amazonaws.com/account/my-queue'

# Producer: Send messages
def send_task(task_data):
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps(task_data),
        MessageAttributes={
            'Priority': {
                'StringValue': 'High',
                'DataType': 'String'
            }
        }
    )

# Consumer: Process messages
def process_messages():
    while True:
        response = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20  # Long polling
        )
        for message in response.get('Messages', []):
            # Process message
            process_task(json.loads(message['Body']))
            # Delete after successful processing
            sqs.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=message['ReceiptHandle']
            )

2. FIFO Queues for Order Guarantees

# Create FIFO queue for ordered processing
fifo_queue = sqs.create_queue(
    QueueName='payment-processing.fifo',
    Attributes={
        'FifoQueue': 'true',
        'ContentBasedDeduplication': 'true'
    }
)

# Send with message group ID for ordering
sqs.send_message(
    QueueUrl=fifo_queue['QueueUrl'],
    MessageBody=json.dumps(payment_data),
    MessageGroupId=user_id  # All messages for this user processed in order
)
Common SQS + SNS Pattern: Fan-out with buffering
SNS Topic → SQS Queue 1 → Lambda/EC2 processors
→ SQS Queue 2 → Different processors
→ SQS Queue 3 → Analytics pipeline
This pattern combines SNS’s broadcasting with SQS’s reliable delivery and buffering.
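Wiring one branch of this pattern takes two steps: grant the topic permission to write to the queue, then subscribe the queue. A minimal boto3 sketch (topic and queue names are placeholders):

import json
import boto3

sns = boto3.client('sns')
sqs = boto3.client('sqs')

topic_arn = sns.create_topic(Name='order-events')['TopicArn']
queue_url = sqs.create_queue(QueueName='order-analytics')['QueueUrl']
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=['QueueArn']
)['Attributes']['QueueArn']

# Step 1: allow the topic to deliver into the queue
sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={'Policy': json.dumps({
        'Version': '2012-10-17',
        'Statement': [{
            'Effect': 'Allow',
            'Principal': {'Service': 'sns.amazonaws.com'},
            'Action': 'sqs:SendMessage',
            'Resource': queue_arn,
            'Condition': {'ArnEquals': {'aws:SourceArn': topic_arn}}
        }]
    })}
)

# Step 2: subscribe the queue to the topic
sns.subscribe(
    TopicArn=topic_arn,
    Protocol='sqs',
    Endpoint=queue_arn,
    Attributes={'RawMessageDelivery': 'true'}
)

With RawMessageDelivery enabled, consumers receive the original message body instead of the SNS JSON envelope.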
Essential Resources for Your Journey
Key AWS resources and documentation:
Official Resources
- AWS Free Tier: Your risk-free playground with 12 months of free services
- AWS Well-Architected Framework: Learn how AWS experts design systems
- AWS Architecture Center: Real-world reference architectures and patterns
- AWS Training and Certification: Structured learning paths from beginner to expert
Developer Tools
- AWS SDKs: Integrate AWS services into your applications
- AWS CDK: Define infrastructure using TypeScript, Python, or Java
- AWS Amplify: Fastest way to build full-stack applications
Stay Connected
- AWS Blog: Daily updates on new features and best practices
- AWS Developer Forums: Get help from the community
- AWS re:Invent Videos: Hundreds of free technical sessions
Pro tip: Start with the Free Tier and Well-Architected Framework. These two resources alone will accelerate your learning by months.
Key AWS Updates (2023-2024)
Stay current with the latest AWS innovations:
Generative AI Services
- Amazon Bedrock: Fully managed service for foundation models (Claude, Llama 2, Stable Diffusion)
- Amazon Q: AI-powered assistant for developers and business users
- Amazon CodeWhisperer: AI code companion (now part of Amazon Q Developer)
- PartyRock: No-code playground for building AI apps
Compute & Serverless
- Lambda SnapStart: Up to 10x faster cold starts for Java functions
- Lambda Function URLs: HTTPS endpoints without API Gateway
- EC2 Graviton3: 25% better performance than Graviton2
- AWS App Runner: Automatic scaling for containerized web apps
Storage & Databases
- S3 Express One Zone: Single-digit millisecond latency storage class
- Aurora Limitless Database: Scales beyond a single Aurora cluster
- ElastiCache Serverless: Redis and Memcached without capacity planning
- RDS Blue/Green Deployments: Safe database updates with minimal downtime
AI/ML Platforms
- SageMaker Studio Code Editor: VS Code-based IDE for ML
- SageMaker HyperPod: Managed infrastructure for training foundation models
- AWS Trainium2: Next-gen ML training chips
- SageMaker Canvas: No-code ML model building
Developer Experience
- AWS Application Composer: Visual design for serverless apps
- Amazon CodeCatalyst: Unified software development service
- AWS CloudShell: Browser-based shell with AWS CLI pre-installed
- Step Functions Workflow Studio: Low-code visual workflow designer
Building Real Applications: Architecture Patterns
Now that you understand individual services, let’s see how they work together to solve real problems. These patterns progress from simple to complex, each building on concepts from the previous ones.
Pattern 1: Static Website Hosting (Beginner)
Let’s start with the simplest cloud architecture - hosting a static website. This pattern introduces core concepts with minimal complexity.
Components:
- S3: Stores your HTML, CSS, and JavaScript files
- CloudFront: Delivers content globally with low latency
- Route 53: Manages your domain name
Why this architecture? It’s serverless (no EC2 instances to manage), globally distributed (CloudFront edge locations), and costs pennies per month for most sites. Perfect for portfolios, documentation, or marketing sites.
Evolution path: Add API Gateway and Lambda for dynamic features, turning your static site into a full serverless application.
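A minimal sketch of the S3 half (the bucket name is a placeholder; CloudFront and Route 53 setup are omitted):

# Create the bucket and enable static website hosting
aws s3 mb s3://my-portfolio-site
aws s3 website s3://my-portfolio-site \
  --index-document index.html \
  --error-document error.html

# Upload the built site, removing files deleted locally
aws s3 sync ./dist s3://my-portfolio-site --delete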
Pattern 2: Traditional Web Application (Intermediate)
The classic three-tier architecture, modernized for the cloud. This pattern teaches you networking, security, and scaling concepts.
Components:
- VPC: Your isolated network with public/private subnets
- EC2 + Auto Scaling: Web servers that scale based on traffic
- Application Load Balancer: Distributes traffic across instances
- RDS Multi-AZ: Managed database with automatic failover
- ElastiCache: Redis/Memcached for session storage and caching
Why this architecture? It mirrors traditional on-premise setups but with cloud benefits - automatic scaling, managed databases, and high availability across multiple data centers.
Real-world example: An e-commerce platform starts with 2 EC2 instances. During sales events, Auto Scaling launches up to 20 instances. RDS handles thousands of concurrent transactions while ElastiCache reduces database load by caching product catalogs.
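The caching half of that example might look like the sketch below, assuming redis-py and a placeholder ElastiCache endpoint; the expensive catalog query only hits RDS on a cache miss:

import json
import redis

# Placeholder ElastiCache endpoint; in-VPC access assumed
cache = redis.Redis(host='my-cluster.xxxxxx.use1.cache.amazonaws.com', port=6379)

def get_product_catalog(db):
    cached = cache.get('catalog')
    if cached:
        return json.loads(cached)          # cache hit: skip the database
    catalog = db.query_catalog()           # cache miss: expensive RDS query
    cache.setex('catalog', 300, json.dumps(catalog))  # cache for 5 minutes
    return catalog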
Pattern 3: Serverless Microservices (Advanced)
Embrace modern cloud-native development. No servers to manage, automatic scaling, and pay-per-request pricing.
Components:
- API Gateway: RESTful API endpoint management
- Lambda: Individual functions for each microservice
- DynamoDB: NoSQL database with single-digit millisecond performance
- Step Functions: Orchestrate complex workflows
- EventBridge: Decouple services with event-driven architecture
Why this architecture? Each microservice scales independently, deploys separately, and costs nothing when idle. Perfect for variable workloads and rapid development.
Real-world example: A food delivery app uses Lambda functions for order processing, restaurant notifications, and driver assignments. DynamoDB stores order data with automatic scaling. Step Functions coordinate the entire delivery workflow. During lunch rush, the system handles 10,000 orders per minute without any manual scaling.
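One of those order-processing functions might look like this hypothetical sketch (the table name and payload shape are assumptions):

import json
import boto3

# Placeholder table name; provisioned by the surrounding stack
table = boto3.resource('dynamodb').Table('orders')

def lambda_handler(event, context):
    order = json.loads(event['body'])   # API Gateway proxy integration payload
    table.put_item(Item={
        'orderId': order['orderId'],
        'status': 'RECEIVED',
        'total': str(order['total'])    # DynamoDB numbers need Decimal; a string avoids float issues
    })
    return {
        'statusCode': 202,
        'body': json.dumps({'accepted': order['orderId']})
    }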
Pattern 4: Data Analytics Pipeline (Advanced)
Process massive amounts of data in real-time and batch modes. This pattern introduces big data concepts and tools.
Components:
- Kinesis Data Streams: Ingest real-time data from thousands of sources
- Kinesis Data Firehose: Load streaming data into data stores
- S3 Data Lake: Central repository for all your data
- AWS Glue: ETL service for data preparation
- Athena: Query data directly in S3 using SQL
- QuickSight: Create dashboards and visualizations
Why this architecture? It separates data ingestion, storage, processing, and analysis into specialized services. Each component scales independently and you only pay for what you process.
Real-world example: An IoT company collects sensor data from millions of devices. Kinesis ingests 1TB per hour, Glue transforms it for analysis, and data scientists query historical data with Athena. Business users create real-time dashboards in QuickSight showing device health and usage patterns.
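On the ingestion side, each device or gateway writes records to the stream. A sketch with a placeholder stream name; partitioning by device ID keeps each device's readings in order:

import json
import boto3

kinesis = boto3.client('kinesis')

def publish_reading(device_id, payload):
    # Records with the same partition key land on the same shard, in order
    kinesis.put_record(
        StreamName='sensor-readings',   # placeholder stream name
        Data=json.dumps({'deviceId': device_id, **payload}),
        PartitionKey=device_id
    )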
Pattern 5: Container-Based Microservices (Expert)
For teams needing more control than serverless offers. Containers provide consistency across development and production.
Components:
- ECS or EKS: Container orchestration (ECS for simplicity, EKS for Kubernetes)
- Fargate: Serverless compute for containers
- ECR: Container registry for your Docker images
- App Mesh: Service mesh for microservice communication
- CloudMap: Service discovery for dynamic environments
Why this architecture? Containers offer portability, consistency, and fine-grained resource control. Service mesh provides advanced traffic management and observability.
Real-world example: A fintech platform runs 50+ microservices in EKS. Each team owns their services, deploying independently. App Mesh handles service-to-service authentication and implements canary deployments. During market hours, critical services auto-scale based on trading volume.
Pattern 6: Multi-Region Global Application (Expert)
For applications requiring global presence, low latency, and extreme availability.
Components:
- Route 53: Geolocation and latency-based routing
- CloudFront: Global content delivery
- DynamoDB Global Tables: Multi-region replication
- Aurora Global Database: Cross-region read replicas
- AWS Global Accelerator: Improve global application availability
Why this architecture? Users get low latency regardless of location. The application survives entire region failures. Data replicates globally in seconds.
Real-world example: A social media platform serves users across continents. Route 53 directs users to the nearest region. DynamoDB Global Tables replicate user posts worldwide in under a second. If the US-East region fails, traffic automatically routes to US-West with minimal disruption.
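Under the current global tables version (2019.11.21), adding a replica region to an existing table is a single call; a sketch with a placeholder table name:

# Add an eu-west-1 replica to the posts table
aws dynamodb update-table \
  --table-name posts \
  --replica-updates '[{"Create": {"RegionName": "eu-west-1"}}]'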
Real-World AWS Case Studies: Learning from Production
These detailed case studies show how real companies solved complex problems with AWS. Each includes architecture decisions, challenges faced, and lessons learned.
Case Study 1: Netflix - Streaming at Planetary Scale
The Challenge: Stream video to 200+ million subscribers worldwide with perfect reliability and quality.
Architecture Overview:
Users → Route 53 → CloudFront (CDN) → Application Load Balancers
↓
EC2 Auto Scaling Groups (Microservices)
↓
DynamoDB (User Data) + S3 (Video Files)
↓
Kinesis (Real-time Analytics) → EMR (Big Data)
Key AWS Services:
- EC2: Thousands of instances running microservices
- S3: Stores the entire video catalog (petabytes)
- DynamoDB: Handles billions of reads/writes for user data
- CloudFront: Delivers video content globally
- Kinesis: Processes billions of events for recommendations
Technical Decisions:
1. Chaos Engineering with Chaos Monkey

import random

# Randomly terminate instances to test resilience
def chaos_monkey():
    if random.random() < 0.1:  # 10% chance
        instance = select_random_instance()
        terminate_instance(instance)
        log_termination(instance)

2. Multi-Region Active-Active
- Every region can serve any user
- Data replicates globally in seconds
- Automatic failover between regions
3. Microservices Architecture
- 700+ microservices
- Each team owns their service completely
- Deploy hundreds of times per day
Challenges and Solutions:
Challenge: Thundering herd when popular shows release.
Solution: Pre-scale based on ML predictions:
def predict_and_scale(show_id):
predicted_viewers = ml_model.predict(show_id)
required_capacity = calculate_capacity(predicted_viewers)
# Pre-scale 30 minutes before release
schedule_scaling(
time=release_time - timedelta(minutes=30),
capacity=required_capacity
)
Challenge: Cost optimization at scale.
Solution: Reserved Instances plus Spot for batch processing:
- 75% Reserved Instances for baseline
- 20% On-Demand for peaks
- 5% Spot for analytics workloads
Lessons Learned:
- Design for failure - everything will fail eventually
- Automate everything - manual processes don’t scale
- Data-driven decisions - measure everything
- Small teams with full ownership work best
Case Study 2: Airbnb - Global Marketplace Platform
The Challenge: Match millions of guests with hosts worldwide, handling payments, messaging, and trust.
Architecture Evolution:
2008: Monolithic Ruby on Rails → Single MySQL database
2012: Added caching layer → Memcached
2015: Service-oriented architecture → Multiple databases
2020: Kubernetes on AWS → Microservices
Current Architecture:
Mobile/Web → API Gateway → ALB → EKS (Kubernetes)
↓
Service Mesh (Envoy) → Microservices
↓
RDS (Transactions) + DynamoDB (Sessions) + S3 (Images)
↓
Kinesis → Data Lake (S3) → Athena/Spark
Key Technical Innovations:
1. Smart Pricing Algorithm

# Lambda function for dynamic pricing
def calculate_optimal_price(listing_id, date):
    listing = get_listing(listing_id)  # fetch the listing record
    factors = {
        'seasonality': get_seasonal_demand(date),
        'local_events': check_events_api(listing.location, date),
        'competitor_prices': analyze_nearby_listings(listing_id),
        'historical_booking': get_booking_patterns(listing_id)
    }
    base_price = listing.base_price
    optimal_price = ml_model.predict(base_price, factors)
    return {
        'price': optimal_price,
        'confidence': ml_model.confidence,
        'factors': factors
    }

2. Fraud Detection System
- Real-time analysis with Kinesis Analytics
- Graph database for relationship mapping
- ML models retrained daily on EMR
3. Image Processing Pipeline

# Step Functions workflow for image processing
{
  "ProcessListingImages": {
    "Type": "Parallel",
    "Branches": [
      {
        "StartAt": "GenerateThumbnails",
        "States": {
          "GenerateThumbnails": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:function:resize-images"
          }
        }
      },
      {
        "StartAt": "DetectInappropriateContent",
        "States": {
          "DetectInappropriateContent": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:function:content-moderation"
          }
        }
      },
      {
        "StartAt": "ExtractMetadata",
        "States": {
          "ExtractMetadata": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:function:image-analysis"
          }
        }
      }
    ]
  }
}
Scaling Challenges:
1. Search Performance
- Solution: ElasticSearch with custom ranking
- Geographical sharding for faster queries
- Cache warming for popular destinations
2. Payment Processing
- Challenge: Handle payments in 190+ countries
- Solution: Step Functions for complex workflows
- SQS for reliable payment retry logic
Key Metrics:
- 4 million listings worldwide
- 1 billion+ searches per day
- 99.99% uptime SLA
Case Study 3: Slack - Real-Time Messaging at Scale
The Challenge: Deliver messages instantly to millions of concurrent users with perfect reliability.
Architecture Highlights:
WebSocket Connections → ELB → EC2 Fleet (Connection Servers)
↓
Message Queue (Kafka on EC2)
↓
Worker Fleet (Process messages, send notifications)
↓
DynamoDB (Message history) + S3 (File uploads)
Real-Time Architecture:
1. WebSocket Management

import asyncio
import uuid

class ConnectionManager:
    def __init__(self):
        self.connections = {}  # Backed by Redis in production

    async def handle_connection(self, websocket, user_id):
        # Register connection
        connection_id = str(uuid.uuid4())
        await self.register(user_id, connection_id, websocket)
        # Handle messages
        try:
            async for message in websocket:
                await self.route_message(user_id, message)
        finally:
            await self.unregister(user_id, connection_id)

    async def broadcast_to_channel(self, channel_id, message):
        # Get all users in channel
        users = await self.get_channel_users(channel_id)
        # Send to all connected clients
        tasks = []
        for user_id in users:
            connections = await self.get_user_connections(user_id)
            for conn in connections:
                tasks.append(conn.send(message))
        await asyncio.gather(*tasks, return_exceptions=True)

2. Message Delivery Guarantees
- At-least-once delivery with idempotency
- Message ordering per channel
- Offline queue for disconnected users
3. Search Infrastructure
- Every message indexed in near real-time
- Elasticsearch cluster per workspace
- Query optimization for emoji and reactions
Scaling Milestones:
| Year | Daily Active Users | Messages/Day | Architecture Change |
|---|---|---|---|
| 2014 | 100K | 10M | Single database |
| 2016 | 4M | 100M | Sharded MySQL |
| 2018 | 8M | 1B | DynamoDB migration |
| 2020 | 12M | 5B | Multi-region active |
Performance Optimizations:
1. Connection Pooling

# Efficient database connection management
class ShardedConnectionPool:
    def __init__(self, shard_map):
        self.pools = {
            shard_id: ConnectionPool(config)
            for shard_id, config in shard_map.items()
        }

    def get_connection(self, workspace_id):
        shard_id = self.get_shard(workspace_id)
        return self.pools[shard_id].get_connection()

2. Caching Strategy
- User presence in Redis (15-second TTL)
- Channel membership in ElastiCache
- Recent messages in memory
Lessons for Real-Time Apps:
- Design for connection drops - mobile networks are unreliable
- Batch operations where possible
- Use backpressure to prevent overload
- Monitor everything - latency matters
Case Study 4: Robinhood - Financial Services Platform
The Challenge: Process millions of stock trades with zero downtime and SEC compliance.
Regulatory Requirements:
- Every transaction must be logged
- Data retention for 7 years
- Disaster recovery with < 1-hour RPO
- Encryption at rest and in transit
Architecture:
Mobile Apps → API Gateway → WAF → ALB
↓
ECS Fargate (Microservices)
↓
Aurora (Transactions) + DynamoDB (Market Data)
↓
Kinesis Data Firehose → S3 (Compliance Archive)
↓
Redshift (Analytics)
Critical Components:
1. Order Execution Engine

import asyncio

class OrderExecutor:
    def __init__(self):
        self.market_connection = MarketConnection()
        self.risk_checker = RiskChecker()

    async def execute_order(self, order):
        # Pre-trade compliance checks
        compliance_result = await self.check_compliance(order)
        if not compliance_result.passed:
            return OrderResult(status='rejected', reason=compliance_result.reason)
        # Risk checks
        risk_result = await self.risk_checker.check(order)
        if risk_result.score > RISK_THRESHOLD:
            return OrderResult(status='rejected', reason='risk_limit')
        # Execute with retry logic
        for attempt in range(3):
            try:
                result = await self.market_connection.submit(order)
                await self.log_execution(order, result)
                return result
            except MarketUnavailable:
                await asyncio.sleep(0.1 * (attempt + 1))
        return OrderResult(status='failed', reason='market_unavailable')

2. Real-Time Market Data Pipeline
- 100,000+ price updates per second
- Sub-millisecond latency requirements
- DynamoDB with DAX for caching
3. Compliance and Audit System
- Every API call logged to Kinesis
- Immutable audit trail in S3
- Daily reports generated with Athena
Scaling for Market Events:
# Auto-scaling based on market volatility
def calculate_required_capacity():
volatility = get_market_volatility()
normal_capacity = 100
if volatility > HIGH_VOLATILITY_THRESHOLD:
return normal_capacity * 5 # 5x during high volatility
elif volatility > MEDIUM_VOLATILITY_THRESHOLD:
return normal_capacity * 2
else:
return normal_capacity
# Pre-scale before market open
schedule.every().day.at("09:00").do(
lambda: scale_to_capacity(calculate_required_capacity())
)
Security Architecture:
- All data encrypted with KMS
- Network isolation with PrivateLink
- API Gateway with rate limiting
- WAF rules for common attacks
Case Study 5: Pinterest - Visual Discovery Engine
The Challenge: Serve billions of images with personalized recommendations to 400+ million users.
Data Scale:
- 300+ billion Pins
- 5 billion boards
- 600 million searches per month
- 2 billion recommendations per day
Architecture:
CDN (CloudFront) → Image Servers (EC2 + S3)
↓
API Gateway → Service Mesh → Microservices (EKS)
↓
Graph Database (Neptune) + Feature Store (DynamoDB)
↓
ML Pipeline (SageMaker) → Recommendation Service
Key Innovations:
1. Visual Search System

class VisualSearchEngine:
    def __init__(self):
        self.feature_extractor = load_model('resnet50')
        self.index = FaissIndex()  # Billion-scale similarity search

    def process_image(self, image_url):
        # Extract visual features
        image = download_image(image_url)
        features = self.feature_extractor.extract(image)
        # Store in feature database
        image_id = generate_id(image_url)
        self.store_features(image_id, features)
        # Find similar images
        similar = self.index.search(features, k=100)
        return self.rank_results(similar)

    def build_index_shard(self, shard_id):
        # Build index for billions of images
        features = self.load_features_for_shard(shard_id)
        index = FaissIndex()
        # Add in batches for efficiency
        for batch in chunks(features, 10000):
            index.add_batch(batch)
        # Save to S3
        index.save_to_s3(f"index/shard_{shard_id}")

2. Personalization Pipeline
- User signals processed in real-time
- Graph neural networks for recommendations
- A/B testing framework for algorithms
3. Content Moderation
- ML models detect inappropriate content
- Human review queue with SQS
- Feedback loop to improve models
Performance Optimizations:
- Image serving through CloudFront
- Aggressive caching at every layer
- Progressive image loading
- WebP format for modern browsers
Lessons Learned:
- Cache Everything: 99% cache hit rate saves millions
- Precompute When Possible: Recommendations generated offline
- Shard by User: Better cache locality
- Monitor User Experience: Not just system metrics
Key Takeaways from All Case Studies
1. Start Simple, Evolve Gradually
- Every company started with basic architecture
- Complexity added only when needed
- Technical debt managed actively
2. Data is Everything
- Instrument everything from day one
- Use data to drive decisions
- Build data pipelines early
3. Failure is Normal
- Design for failure at every level
- Practice failure scenarios
- Automate recovery procedures
4. Scale Horizontally
- Vertical scaling hits limits quickly
- Design for distributed systems
- Embrace eventual consistency
5. Security Cannot Be an Afterthought
- Build security into architecture
- Automate security scanning
- Regular security audits
These case studies demonstrate that successful AWS architectures share common patterns: they start simple, measure everything, automate aggressively, and evolve based on real needs rather than predicted ones.
Cost Optimization: Spending Smart in the Cloud
The cloud’s pay-as-you-go model is powerful, but without proper management, costs can spiral. The key is understanding how pricing works and implementing automated controls from the start.
The Cost Evolution Pattern
Most teams follow this cost optimization journey:
- Shock Phase: First AWS bill surprises everyone
- Panic Cuts: Turning off resources randomly
- Understanding: Learning what actually drives costs
- Optimization: Right-sizing and automated management
- Mastery: Costs become predictable and optimized
Understanding Your Bill
AWS costs break down into three main categories:
Compute Costs
- On-Demand: Like hotel rooms - flexible but expensive
- Reserved Instances: Like apartment leases - cheaper with commitment
- Spot Instances: Like last-minute deals - up to 90% off but can be interrupted
- Savings Plans: Flexible commitment across instance types
Real example: A startup’s API servers cost $5,000/month on-demand. After analyzing usage patterns, they buy Reserved Instances for baseline capacity and use Spot for batch processing, reducing costs to $2,000/month.
Storage Costs
- S3 Storage Classes: Match storage to access patterns
- Standard: Frequently accessed data
- Infrequent Access: 50% cheaper for archived data
- Glacier: 90% cheaper for long-term archives
- Lifecycle Policies: Automatically move data to cheaper storage
Real example: A photo sharing app automatically moves photos older than 30 days to Infrequent Access, and after 1 year to Glacier. Storage costs drop 70% with no user impact.
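A sketch of that policy via the CLI (bucket name and prefix are placeholders):

aws s3api put-bucket-lifecycle-configuration \
  --bucket my-photo-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "archive-old-photos",
      "Status": "Enabled",
      "Filter": {"Prefix": "photos/"},
      "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 365, "StorageClass": "GLACIER"}
      ]
    }]
  }'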
Data Transfer Costs
- Within Region: Free between services
- Cross-Region: Charged per GB
- Internet Egress: Most expensive
Advanced Cost Management Tools
# cost-optimization.tf - Cost management and optimization
# Cost anomaly detection
resource "aws_ce_anomaly_monitor" "main" {
name = "${var.environment}-cost-anomaly-monitor"
monitor_type = "DIMENSIONAL"
monitor_dimension = "SERVICE"
}
resource "aws_ce_anomaly_subscription" "main" {
name = "${var.environment}-cost-anomaly-subscription"
threshold = 100.0 # USD
frequency = "DAILY"
monitor_arn_list = [
aws_ce_anomaly_monitor.main.arn
]
subscriber {
type = "EMAIL"
address = var.cost_alert_email
}
subscriber {
type = "SNS"
address = aws_sns_topic.cost_alerts.arn
}
}
# Budget alerts
resource "aws_budgets_budget" "monthly" {
name = "${var.environment}-monthly-budget"
budget_type = "COST"
limit_amount = var.monthly_budget_limit
limit_unit = "USD"
time_unit = "MONTHLY"
time_period_start = "2024-01-01_00:00"
cost_types {
include_credit = false
include_discount = true
include_other_subscription = true
include_recurring = true
include_refund = false
include_subscription = true
include_support = true
include_tax = true
include_upfront = true
use_blended = false
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 80
threshold_type = "PERCENTAGE"
notification_type = "FORECASTED"
subscriber_email_addresses = [var.cost_alert_email]
subscriber_sns_topic_arns = [aws_sns_topic.cost_alerts.arn]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 100
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_email_addresses = [var.cost_alert_email]
subscriber_sns_topic_arns = [aws_sns_topic.cost_alerts.arn]
}
}
# Service-specific budgets
resource "aws_budgets_budget" "service_budgets" {
for_each = var.service_budgets
name = "${var.environment}-${each.key}-budget"
budget_type = "COST"
limit_amount = each.value.limit
limit_unit = "USD"
time_unit = "MONTHLY"
time_period_start = "2024-01-01_00:00"
cost_filter {
name = "Service"
values = [each.key]
}
cost_types {
include_credit = false
include_discount = true
include_other_subscription = true
include_recurring = true
include_refund = false
include_subscription = true
include_support = false
include_tax = true
include_upfront = true
use_blended = false
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 90
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_email_addresses = [var.cost_alert_email]
}
}
# Compute Optimizer enrollment
resource "aws_organizations_policy" "compute_optimizer" {
name = "ComputeOptimizerEnrollment"
description = "Enable Compute Optimizer for all accounts"
type = "SERVICE_CONTROL_POLICY"
content = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = "compute-optimizer:*"
Resource = "*"
}
]
})
}
# Lambda for cost optimization recommendations
resource "aws_lambda_function" "cost_optimizer" {
filename = "cost_optimizer.zip"
function_name = "${var.environment}-cost-optimizer"
role = aws_iam_role.cost_optimizer.arn
handler = "index.handler"
runtime = "python3.9"
timeout = 900
memory_size = 3008
environment {
variables = {
SNS_TOPIC_ARN = aws_sns_topic.cost_recommendations.arn
S3_BUCKET = aws_s3_bucket.cost_reports.id
ENVIRONMENT = var.environment
}
}
layers = [
"arn:aws:lambda:${var.aws_region}:336392948345:layer:AWSSDKPandas-Python39:1"
]
}
# EventBridge rule for weekly cost analysis
resource "aws_cloudwatch_event_rule" "cost_analysis" {
name = "${var.environment}-weekly-cost-analysis"
description = "Trigger weekly cost analysis"
schedule_expression = "cron(0 9 ? * MON *)"
}
resource "aws_cloudwatch_event_target" "cost_optimizer" {
rule = aws_cloudwatch_event_rule.cost_analysis.name
target_id = "CostOptimizer"
arn = aws_lambda_function.cost_optimizer.arn
}
# Cost and Usage Report
resource "aws_s3_bucket" "cost_reports" {
bucket = "${var.environment}-cost-reports-${data.aws_caller_identity.current.account_id}"
}
resource "aws_s3_bucket_policy" "cost_reports" {
bucket = aws_s3_bucket.cost_reports.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Principal = {
Service = "billingreports.amazonaws.com"
}
Action = [
"s3:GetBucketAcl",
"s3:GetBucketPolicy"
]
Resource = aws_s3_bucket.cost_reports.arn
},
{
Effect = "Allow"
Principal = {
Service = "billingreports.amazonaws.com"
}
Action = "s3:PutObject"
Resource = "${aws_s3_bucket.cost_reports.arn}/*"
}
]
})
}
resource "aws_cur_report_definition" "main" {
report_name = "${var.environment}-cost-usage-report"
time_unit = "DAILY"
format = "Parquet"
compression = "Parquet"
additional_schema_elements = ["RESOURCES"]
s3_bucket = aws_s3_bucket.cost_reports.id
s3_prefix = "cur"
s3_region = var.aws_region
additional_artifacts = ["QUICKSIGHT"]
report_versioning = "OVERWRITE_REPORT"
}
# Reserved Instance utilization alerts
resource "aws_cloudwatch_metric_alarm" "ri_utilization" {
alarm_name = "${var.environment}-low-ri-utilization"
comparison_operator = "LessThanThreshold"
evaluation_periods = "1"
metric_name = "ReservedInstanceUtilization"
namespace = "AWS/CE"
period = "86400" # 24 hours
statistic = "Average"
threshold = "75"
alarm_description = "Reserved Instance utilization below 75%"
alarm_actions = [aws_sns_topic.cost_alerts.arn]
dimensions = {
Currency = "USD"
}
}
# Savings Plans utilization alerts
resource "aws_cloudwatch_metric_alarm" "sp_utilization" {
alarm_name = "${var.environment}-low-sp-utilization"
comparison_operator = "LessThanThreshold"
evaluation_periods = "1"
metric_name = "SavingsPlansUtilization"
namespace = "AWS/CE"
period = "86400"
statistic = "Average"
threshold = "90"
alarm_description = "Savings Plans utilization below 90%"
alarm_actions = [aws_sns_topic.cost_alerts.arn]
}
# Cost allocation tags
resource "aws_organizations_policy" "tagging" {
name = "MandatoryTaggingPolicy"
description = "Enforce cost allocation tags"
type = "TAG_POLICY"
content = jsonencode({
tags = {
Environment = {
tag_key = {
"@@assign" = "Environment"
}
tag_value = {
"@@assign" = ["Production", "Staging", "Development"]
}
enforced_for = {
"@@assign" = ["ec2:instance", "s3:bucket", "rds:db"]
}
}
CostCenter = {
tag_key = {
"@@assign" = "CostCenter"
}
enforced_for = {
"@@assign" = ["ec2:*", "s3:*", "rds:*"]
}
}
Project = {
tag_key = {
"@@assign" = "Project"
}
enforced_for = {
"@@assign" = ["ec2:*", "s3:*", "rds:*"]
}
}
}
})
}
# Attach tagging policy to organization
resource "aws_organizations_policy_attachment" "tagging" {
policy_id = aws_organizations_policy.tagging.id
target_id = aws_organizations_organization.main.roots[0].id
}
# Instance Scheduler for non-production environments
module "instance_scheduler" {
source = "aws-ia/instance-scheduler/aws"
version = "2.0.0"
scheduler_frequency = "5"
schedules = [
{
name = "business-hours"
description = "Run instances during business hours only"
timezone = "America/New_York"
periods = [
{
name = "weekdays"
description = "Monday to Friday"
begintime = "08:00"
endtime = "18:00"
weekdays = "mon-fri"
}
]
}
]
tag_name = "Schedule"
}
# Spot Instance configuration
resource "aws_launch_template" "spot" {
name_prefix = "${var.environment}-spot-"
instance_market_options {
market_type = "spot"
spot_options {
max_price = "0.5" # Bid cap in USD per hour, not a percentage
spot_instance_type = "persistent"
instance_interruption_behavior = "stop"
}
}
tag_specifications {
resource_type = "instance"
tags = {
Environment = var.environment
InstanceType = "spot"
}
}
}
# S3 lifecycle policies for cost optimization
resource "aws_s3_bucket_lifecycle_configuration" "logs" {
bucket = aws_s3_bucket.logs.id
rule {
id = "transition-old-logs"
status = "Enabled"
transition {
days = 30
storage_class = "STANDARD_IA"
}
transition {
days = 90
storage_class = "GLACIER"
}
transition {
days = 180
storage_class = "DEEP_ARCHIVE"
}
expiration {
days = 365
}
}
rule {
id = "delete-incomplete-uploads"
status = "Enabled"
abort_incomplete_multipart_upload {
days_after_initiation = 7
}
}
}
# Athena for cost analysis
resource "aws_athena_database" "cost_analysis" {
name = "${var.environment}_cost_analysis"
bucket = aws_s3_bucket.cost_reports.id
}
resource "aws_athena_workgroup" "cost_analysis" {
name = "${var.environment}-cost-analysis"
configuration {
enforce_workgroup_configuration = true
publish_cloudwatch_metrics_enabled = true
result_configuration {
output_location = "s3://${aws_s3_bucket.cost_reports.id}/athena-results/"
encryption_configuration {
encryption_option = "SSE_S3"
}
}
}
}
# QuickSight for cost visualization
resource "aws_quicksight_data_source" "cost_data" {
data_source_id = "${var.environment}-cost-data"
name = "Cost and Usage Report"
parameters {
athena {
work_group = aws_athena_workgroup.cost_analysis.name
}
}
type = "ATHENA"
}
The same cost allocation tags can also be turned into Cost Explorer cost categories programmatically; a minimal sketch (tag_schema is a hypothetical mapping of tag keys to their allowed values):

import boto3

ce_client = boto3.client('ce')

def create_cost_categories(tag_schema):
    # One cost category per tag key, with a rule per allowed value
    for key, values in tag_schema.items():
        ce_client.create_cost_category_definition(
            Name=f'CostCategory-{key}',
            RuleVersion='CostCategoryExpression.v1',
            Rules=[
                {
                    'Value': value,
                    'Rule': {
                        'Tags': {
                            'Key': key,
                            'Values': [value]
                        }
                    }
                } for value in values
            ]
        )
Spot Instance management:

import boto3
from typing import List

class SpotInstanceManager:
    def __init__(self):
        self.ec2 = boto3.client('ec2')
def create_spot_fleet(self,
target_capacity: int,
instance_types: List[str],
max_price: str,
subnets: List[str]) -> str:
"""Create diversified Spot Fleet"""
# Build launch specifications for each instance type
launch_specs = []
for instance_type in instance_types:
for subnet in subnets:
launch_specs.append({
'InstanceType': instance_type,
'ImageId': 'ami-12345678', # Your AMI
'KeyName': 'your-key-pair',
'SecurityGroups': [{'GroupId': 'sg-12345678'}],
'SubnetId': subnet,
'IamInstanceProfile': {
'Arn': 'arn:aws:iam::account:instance-profile/role'
},
'TagSpecifications': [
{
'ResourceType': 'instance',
'Tags': [
{'Key': 'Name', 'Value': 'SpotFleet-Instance'},
{'Key': 'Type', 'Value': 'Spot'}
]
}
]
})
response = self.ec2.request_spot_fleet(
SpotFleetRequestConfig={
'AllocationStrategy': 'diversified',
'TargetCapacity': target_capacity,
'SpotPrice': max_price,
'IamFleetRole': 'arn:aws:iam::account:role/aws-ec2-spot-fleet-role',
'LaunchSpecifications': launch_specs,
'TerminateInstancesWithExpiration': True,
'Type': 'maintain',
'ReplaceUnhealthyInstances': True,
'InstanceInterruptionBehavior': 'terminate',
'TagSpecifications': [
{
'ResourceType': 'spot-fleet-request',
'Tags': [
{'Key': 'Name', 'Value': 'MySpotFleet'}
]
}
]
}
)
return response['SpotFleetRequestId']
Infrastructure as Code: Never Click Again
The biggest shift in cloud operations? Treating infrastructure like software. Instead of clicking through the AWS console, you define infrastructure in code. This enables version control, peer review, and automated deployments.
Why Infrastructure as Code Changes Everything
The Old Way:
- Click through AWS console to create resources
- Document steps in a wiki (that nobody updates)
- Hope you can recreate it in another region
- Fear making changes that might break production
The IaC Way:
- Define infrastructure in configuration files
- Version control shows exactly what changed and when
- Deploy identical environments with one command
- Test changes in staging before production
Choosing Your IaC Tool
CloudFormation (AWS Native)
- Pros: Deep AWS integration, no extra tools needed
- Cons: Verbose syntax, AWS-only
- Best for: Teams fully committed to AWS
Terraform (Multi-Cloud)
- Pros: Works across cloud providers, huge community
- Cons: Requires learning HCL syntax
- Best for: Multi-cloud strategies or teams wanting flexibility
AWS CDK (Developer-Friendly)
- Pros: Use familiar programming languages (Python, TypeScript)
- Cons: Newer tool, smaller community
- Best for: Development teams wanting to use existing skills
Real-World IaC Evolution
A startup’s infrastructure journey:
- Month 1: Everything created via console clicks
- Month 3: Production breaks, nobody remembers how to rebuild
- Month 4: Team adopts Terraform, documents existing infrastructure
- Month 6: All changes go through pull requests
- Year 1: Disaster recovery test - entire production rebuilt in 30 minutes
Advanced Patterns That Save Your Sanity
from aws_cdk import (
core as cdk,
aws_ec2 as ec2,
aws_ecs as ecs,
aws_ecs_patterns as ecs_patterns,
aws_elasticloadbalancingv2 as elbv2,
aws_rds as rds,
aws_secretsmanager as sm,
aws_cloudwatch as cloudwatch,
aws_cloudwatch_actions as cw_actions,
aws_sns as sns,
aws_lambda as lambda_,
aws_apigateway as apigw,
custom_resources as cr
)
from constructs import Construct
import json
class MicroservicesStack(cdk.Stack):
def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
super().__init__(scope, construct_id, **kwargs)
# Create VPC with custom configuration
vpc = ec2.Vpc(
self, "MicroservicesVPC",
max_azs=3,
nat_gateways=2,
subnet_configuration=[
ec2.SubnetConfiguration(
name="Public",
subnet_type=ec2.SubnetType.PUBLIC,
cidr_mask=24
),
ec2.SubnetConfiguration(
name="Private",
subnet_type=ec2.SubnetType.PRIVATE,
cidr_mask=24
),
ec2.SubnetConfiguration(
name="Isolated",
subnet_type=ec2.SubnetType.ISOLATED,
cidr_mask=24
)
]
)
        # Create ECS Cluster with Container Insights
        cluster = ecs.Cluster(
            self, "Cluster",
            vpc=vpc,
            container_insights=True
        )

        # Enable the FARGATE and FARGATE_SPOT capacity providers
        cluster.enable_fargate_capacity_providers()
# Create RDS Aurora Serverless v2
db_secret = sm.Secret(
self, "DBSecret",
generate_secret_string=sm.SecretStringGenerator(
secret_string_template=json.dumps({"username": "admin"}),
generate_string_key="password",
exclude_characters=" %+~`#$&*()|[]{}:;<>?!'/\\"
)
)
db_cluster = rds.DatabaseCluster(
self, "AuroraCluster",
engine=rds.DatabaseClusterEngine.aurora_mysql(
version=rds.AuroraMysqlEngineVersion.VER_3_01_0
),
serverless_v2_scaling_configuration=rds.ServerlessV2ScalingConfiguration(
min_capacity=0.5,
max_capacity=2
),
credentials=rds.Credentials.from_secret(db_secret),
vpc=vpc,
vpc_subnets=ec2.SubnetSelection(
subnet_type=ec2.SubnetType.ISOLATED
),
backup=rds.BackupProps(
retention=cdk.Duration.days(7)
),
deletion_protection=True
)
# Create shared ALB
alb = elbv2.ApplicationLoadBalancer(
self, "ALB",
vpc=vpc,
internet_facing=True,
http2_enabled=True
)
        # Alarm on slow ALB responses (target response time is reported in seconds)
        alarm = cloudwatch.Alarm(
            self, "SlowTargetResponse",
            metric=alb.metric_target_response_time(),
            threshold=1,
            evaluation_periods=2
        )
# SNS topic for alarms
alarm_topic = sns.Topic(
self, "AlarmTopic",
display_name="Microservices Alarms"
)
alarm.add_alarm_action(cw_actions.SnsAction(alarm_topic))
# Deploy microservices
self.deploy_microservice(
cluster=cluster,
alb=alb,
service_name="users",
image="users-service:latest",
port=8080,
priority=1,
path_pattern="/users/*",
environment={
"DB_SECRET_ARN": db_secret.secret_arn,
"DB_CLUSTER_ARN": db_cluster.cluster_arn
}
)
self.deploy_microservice(
cluster=cluster,
alb=alb,
service_name="orders",
image="orders-service:latest",
port=8081,
priority=2,
path_pattern="/orders/*",
environment={
"DB_SECRET_ARN": db_secret.secret_arn,
"DB_CLUSTER_ARN": db_cluster.cluster_arn
}
)
# Create API Gateway for serverless endpoints
api = apigw.RestApi(
self, "MicroservicesAPI",
deploy_options=apigw.StageOptions(
logging_level=apigw.MethodLoggingLevel.INFO,
data_trace_enabled=True,
tracing_enabled=True
)
)
# Lambda function for async processing
async_processor = lambda_.Function(
self, "AsyncProcessor",
runtime=lambda_.Runtime.PYTHON_3_9,
handler="index.handler",
code=lambda_.Code.from_asset("lambda"),
vpc=vpc,
environment={
"DB_SECRET_ARN": db_secret.secret_arn
},
reserved_concurrent_executions=100,
tracing=lambda_.Tracing.ACTIVE
)
# Grant permissions
db_secret.grant_read(async_processor)
db_cluster.grant_connect(async_processor)
# Custom resource for database initialization
db_init = cr.AwsCustomResource(
self, "DBInit",
on_create=cr.AwsSdkCall(
service="RDS",
action="executeStatement",
parameters={
"resourceArn": db_cluster.cluster_arn,
"secretArn": db_secret.secret_arn,
"database": "mysql",
"sql": "CREATE DATABASE IF NOT EXISTS microservices;"
},
physical_resource_id=cr.PhysicalResourceId.of("DBInit")
),
policy=cr.AwsCustomResourcePolicy.from_sdk_calls(
resources=[db_cluster.cluster_arn]
)
)
# Output values
cdk.CfnOutput(
self, "ALBDNSName",
value=alb.load_balancer_dns_name,
description="ALB DNS Name"
)
cdk.CfnOutput(
self, "APIEndpoint",
value=api.url,
description="API Gateway Endpoint"
)
def deploy_microservice(self,
cluster: ecs.Cluster,
alb: elbv2.ApplicationLoadBalancer,
service_name: str,
image: str,
port: int,
priority: int,
path_pattern: str,
environment: dict):
"""Deploy a microservice to ECS"""
# Create task definition
task_definition = ecs.FargateTaskDefinition(
self, f"{service_name}TaskDef",
memory_limit_mib=512,
cpu=256
)
# Add container
container = task_definition.add_container(
f"{service_name}Container",
image=ecs.ContainerImage.from_registry(image),
logging=ecs.LogDrivers.aws_logs(
stream_prefix=service_name
),
environment=environment,
health_check=ecs.HealthCheck(
command=["CMD-SHELL", f"curl -f http://localhost:{port}/health || exit 1"],
interval=cdk.Duration.seconds(30),
timeout=cdk.Duration.seconds(5),
retries=3
)
)
container.add_port_mappings(
ecs.PortMapping(
container_port=port,
protocol=ecs.Protocol.TCP
)
)
# Create service
service = ecs.FargateService(
self, f"{service_name}Service",
cluster=cluster,
task_definition=task_definition,
desired_count=2,
capacity_provider_strategies=[
ecs.CapacityProviderStrategy(
capacity_provider="FARGATE_SPOT",
weight=2
),
ecs.CapacityProviderStrategy(
capacity_provider="FARGATE",
weight=1
)
],
circuit_breaker=ecs.DeploymentCircuitBreaker(
rollback=True
)
)
# Configure auto-scaling
scaling = service.auto_scale_task_count(
min_capacity=2,
max_capacity=10
)
scaling.scale_on_cpu_utilization(
"CpuScaling",
target_utilization_percent=70,
scale_in_cooldown=cdk.Duration.seconds(60),
scale_out_cooldown=cdk.Duration.seconds(60)
)
        # Route traffic via a shared HTTP listener, created once and reused
        # across services (a listener with only rule-based targets needs a default action)
        if not hasattr(self, "_http_listener"):
            self._http_listener = alb.add_listener(
                "HttpListener",
                port=80,
                default_action=elbv2.ListenerAction.fixed_response(404)
            )
        target_group = self._http_listener.add_targets(
            f"{service_name}Targets",
            port=port,
            targets=[service],
            priority=priority,
            conditions=[
                elbv2.ListenerCondition.path_patterns([path_pattern])
            ],
            health_check=elbv2.HealthCheck(
                path=f"/{service_name}/health",
                interval=cdk.Duration.seconds(30)
            )
        )

        scaling.scale_on_request_count(
            "RequestScaling",
            requests_per_target=1000,
            target_group=target_group
        )
AWS Troubleshooting Guide: When Things Go Wrong
Even experienced cloud architects encounter issues. This guide helps you diagnose and fix common AWS problems quickly.
The Troubleshooting Mindset
Before diving into specific issues, adopt this systematic approach:
- Check the obvious first - Is it plugged in? (Is the service running?)
- Isolate the problem - What changed recently?
- Use AWS tools - CloudWatch Logs, X-Ray, Systems Manager
- Document everything - Future you will thank present you
Common Issues and Solutions
1. “Access Denied” - The Most Common AWS Error
Symptoms:
- API calls fail with “Access Denied”
- Console shows “You don’t have permissions”
- Lambda functions can’t access resources
Diagnosis Checklist:
# Check who you are
aws sts get-caller-identity
# Check attached policies (derive the user name from the caller ARN;
# get-caller-identity's UserId is an opaque ID, not the user name)
aws iam list-attached-user-policies \
  --user-name $(aws sts get-caller-identity --query Arn --output text | cut -d/ -f2)
# Test specific permissions
aws iam simulate-principal-policy \
--policy-source-arn $(aws sts get-caller-identity --query Arn --output text) \
--action-names s3:GetObject \
--resource-arns arn:aws:s3:::my-bucket/*
Common Fixes:
1. Wrong Region

# Check current region
aws configure get region

# Set correct region
export AWS_DEFAULT_REGION=us-east-1

2. Missing Resource Permissions

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::my-bucket/*"
  }]
}

Note the trailing /* in the resource ARN: object-level actions match objects, not the bucket itself.

3. Service-Linked Roles

# For Lambda accessing VPC
aws iam create-service-linked-role --aws-service-name lambda.amazonaws.com
2. “Instance Connection Timeout” - Can’t SSH to EC2
Symptoms:
- SSH hangs or times out
- Can’t reach web server on instance
- Instance is running but unreachable
Systematic Diagnosis:
1. Check Security Group

# List security group rules
aws ec2 describe-security-groups --group-ids sg-xxxxxx

# Fix: Allow SSH (use your own IP instead of 0.0.0.0/0 for security)
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxxxxx \
  --protocol tcp \
  --port 22 \
  --cidr 0.0.0.0/0

2. Check Network ACLs

# Default NACLs allow all traffic - custom ones might not
aws ec2 describe-network-acls \
  --filters "Name=association.subnet-id,Values=subnet-xxxxx"

3. Check Route Table

# Ensure a route to an Internet Gateway exists
aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=subnet-xxxxx"

4. Check Instance Status

# Both status checks should pass
aws ec2 describe-instance-status --instance-id i-xxxxx
Quick Fix Script:
#!/bin/bash
INSTANCE_ID="i-xxxxx"
SG_ID=$(aws ec2 describe-instances --instance-ids $INSTANCE_ID --query 'Reservations[0].Instances[0].SecurityGroups[0].GroupId' --output text)
# Allow SSH from your IP
MY_IP=$(curl -s checkip.amazonaws.com)
aws ec2 authorize-security-group-ingress \
--group-id $SG_ID \
--protocol tcp \
--port 22 \
--cidr $MY_IP/32
echo "SSH access enabled from $MY_IP"
3. “Throttling Errors” - Rate Limit Exceeded
Symptoms:
- “Rate exceeded” errors
- Intermittent API failures
- Bulk operations failing
Solutions:
1. Implement Exponential Backoff

import time
import random
from botocore.exceptions import ClientError

def retry_with_backoff(func, max_retries=5):
    for attempt in range(max_retries):
        try:
            return func()
        except ClientError as e:
            if e.response['Error']['Code'] == 'Throttling':
                # Exponential backoff with jitter
                wait_time = (2 ** attempt) + random.uniform(0, 1)
                time.sleep(wait_time)
            else:
                raise
    raise Exception(f"Max retries ({max_retries}) exceeded")

2. Use Service Quotas

# Check current limits (L-1216C47A = Running On-Demand instances)
aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-1216C47A

# Request an increase
aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code L-1216C47A \
  --desired-value 100
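Alternatively, boto3 can handle most throttling for you through its built-in retry configuration; a minimal sketch:

import boto3
from botocore.config import Config

# 'adaptive' mode adds client-side rate limiting on top of retries with backoff
config = Config(retries={'max_attempts': 10, 'mode': 'adaptive'})
ec2 = boto3.client('ec2', config=config)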
4. “Out of Memory” - Lambda/Container Crashes
Symptoms:
- Lambda function fails with no clear error
- ECS tasks stopping unexpectedly
- Application becomes unresponsive
Diagnosis:
1. Check Lambda Logs

# Find memory usage from Lambda's REPORT lines
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --filter-pattern "REPORT" \
  --query 'events[*].message' \
  --output text | grep "Memory"

2. Monitor with CloudWatch

# Add memory tracking to Lambda
import resource

def lambda_handler(event, context):
    # Track peak memory usage (ru_maxrss is reported in KB on Linux)
    memory_usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"Memory used: {memory_usage / 1024:.2f} MB")
    # Your code here

3. Fix: Increase Memory or Optimize

# Update Lambda memory
aws lambda update-function-configuration \
  --function-name my-function \
  --memory-size 1024
5. “Slow Application Performance”
Symptoms:
- API responses taking seconds
- Database queries timing out
- Users complaining about speed
Performance Troubleshooting Toolkit:
1. Enable X-Ray Tracing

from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all

patch_all()  # Automatically trace AWS SDK calls

@xray_recorder.capture('process_order')
def process_order(order_id):
    # X-Ray will show time spent in each service
    validate_order(order_id)
    charge_payment(order_id)
    update_inventory(order_id)

2. Analyze RDS Performance

-- Enable Performance Insights, then query slow operations
SELECT query, calls, total_time, mean_time, max_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;

3. Check CloudFront Cache Hit Ratio

# A low cache hit rate means slow performance
aws cloudwatch get-metric-statistics \
  --namespace AWS/CloudFront \
  --metric-name CacheHitRate \
  --dimensions Name=DistributionId,Value=XXXXX \
  --statistics Average \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-02T00:00:00Z \
  --period 3600
Emergency Response Playbook
When production is down, follow this checklist:
1. Immediate Actions (First 5 Minutes)
# Check service health
aws health describe-events --filter eventTypeCategories=issue
# Check CloudWatch alarms
aws cloudwatch describe-alarms --state-value ALARM
# Recent changes?
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=EventName,AttributeValue=UpdateStack \
--max-items 10
2. Common Quick Fixes
“Everything is Down!”
- Check Route 53 health checks
- Verify load balancer target health
- Check Auto Scaling group size
“Database Connection Errors”
- Check RDS security groups
- Verify connection limits not exceeded
- Check if automated backup is running
“API Gateway 5XX Errors”
- Check Lambda function errors
- Verify integration timeout settings
- Check concurrent execution limits
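To confirm throttling is behind those 5XX errors, check the function's Throttles metric (the function name is a placeholder; the date flags assume GNU date):

# Sum Lambda throttles over the last hour, in 5-minute buckets
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Throttles \
  --dimensions Name=FunctionName,Value=my-function \
  --statistics Sum \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 300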
Proactive Monitoring Setup
Prevent issues before they happen:
# Create comprehensive CloudWatch dashboard
aws cloudwatch put-dashboard \
--dashboard-name ProductionHealth \
--dashboard-body file://dashboard.json
# Set up alerts for common issues
# High error rate
aws cloudwatch put-metric-alarm \
--alarm-name high-error-rate \
--alarm-description "Alert when 4XX errors exceed 50 in 5 minutes" \
--metric-name 4XXError \
--namespace AWS/ApiGateway \
--statistic Sum \
--period 300 \
--threshold 50 \
--comparison-operator GreaterThanThreshold
# Database CPU
aws cloudwatch put-metric-alarm \
--alarm-name rds-high-cpu \
--alarm-description "RDS CPU above 80%" \
--metric-name CPUUtilization \
--namespace AWS/RDS \
--statistic Average \
--period 300 \
--threshold 80 \
--comparison-operator GreaterThanThreshold
Tools Every AWS Developer Should Know
1. AWS CLI with Query Powers

# Find specific resources quickly
aws ec2 describe-instances \
  --query 'Reservations[*].Instances[?State.Name==`running`].[InstanceId,Tags[?Key==`Name`].Value|[0]]' \
  --output table

2. Systems Manager Session Manager

# Connect without SSH keys or bastion hosts
aws ssm start-session --target i-xxxxx

3. CloudWatch Logs Insights

-- Find errors across all Lambda functions
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() by bin(5m)

4. AWS Personal Health Dashboard
- Check for AWS service issues affecting you
- Get advance notice of maintenance
Remember: Most AWS issues fall into these categories:
- Permissions (IAM)
- Networking (Security Groups, NACLs)
- Limits (Service Quotas)
- Configuration (Wrong region, missing parameters)
Master troubleshooting these four areas and you’ll solve 90% of AWS problems.
Your Cloud Journey: From Here to Mastery
You’ve learned the core concepts, explored architecture patterns, and understand optimization strategies. Where do you go from here?
The Path Forward
Next 30 Days: Build Your Foundation
- Get Hands-On: Launch your first EC2 instance, create an S3 bucket, set up a simple website
- Break Things: Experiment in a sandbox account - failure is the best teacher
- Automate One Thing: Convert a manual process to Lambda or use CloudFormation
- Monitor Costs: Set up billing alerts and understand your first bill
Next 90 Days: Develop Expertise
- Build a Real Project: Create something you’ll actually use
- Master One Service Deeply: Whether it’s Lambda, DynamoDB, or ECS
- Practice Troubleshooting: Learn to read CloudWatch logs and traces
- Join the Community: AWS user groups, re:Invent videos, forums
Next Year: Achieve Mastery
- Design for Scale: Build systems that can grow 100x
- Optimize Everything: Cost, performance, security, operations
- Share Knowledge: Blog, speak, mentor others
- Stay Current: AWS releases new features daily - follow what matters to you
Emerging Trends to Watch
Serverless Everything: The trend toward managed services accelerates. Focus on business logic, not infrastructure.
AI-Powered Operations: From cost optimization to security, AI will automate routine cloud management tasks.
Edge Computing: Processing moves closer to users. 5G and IoT drive computing to the edge.
Sustainability Focus: Carbon-aware computing becomes standard. Green architectures will be the default.
Remember: Cloud is a Journey, Not a Destination
AWS evolves constantly. The services you master today will have new features tomorrow. The architectures you build will need to adapt. That’s not a bug - it’s the feature that makes cloud computing exciting.
Start small, think big, and build amazing things. The cloud is your platform for innovation. What will you create?
See Also
- AWS Hub - Overview of all AWS documentation
- Compute Services - EC2 and Lambda deployment
- Storage Services - S3 and EBS management
- Database Services - RDS and DynamoDB
- Networking - VPC and load balancers
- Security - Security automation and compliance
- Terraform - Alternative IaC approach
- Docker - Containerization for deployments
- Kubernetes - Container orchestration