EKS Networking Gotchas: AWS VPC CNI Plugin Issues That Cost Us $50K/Month (And How to Avoid Them)


It started as a minor issue. A few pods weren’t communicating. Then latency spiked. Then our entire EKS cluster became intermittently unresponsive. Three weeks later, we traced it back to our VPC CNI plugin configuration. The problem cost us $50,000 in unnecessary infrastructure spend that month alone. This guide covers the five EKS networking gotchas that catch most teams off guard.

Understanding VPC CNI Plugin Basics

The AWS VPC CNI plugin is how pods get network connectivity in EKS. Unlike overlay-based CNI plugins, which give pods addresses from a separate virtual network, it assigns each pod a real VPC IP address from your subnets. This is powerful but dangerous if misconfigured.

How it works: Each worker node gets multiple ENIs (Elastic Network Interfaces). Each ENI has multiple secondary IP addresses. The CNI plugin allocates these IPs to pods. When a pod needs an IP, it gets one from the pool. Simple in theory, but the reality is filled with hidden gotchas.

Key limits to understand:

  • c5.large = 3 ENIs, 10 IPs per ENI = 30 total IPs
  • t3.medium = 3 ENIs, 6 IPs per ENI = 18 total IPs
  • m5.xlarge = 4 ENIs, 15 IPs per ENI = 60 total IPs

These limits are per instance type. Each ENI's primary IP is not handed to pods, so the usable pod count is slightly lower than the total. Exceed the limit, and pods won’t get IP addresses.
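The per-node pod ceiling follows from the ENI figures. Here is a minimal sketch, assuming the commonly documented formula for the VPC CNI in its default secondary-IP mode (max pods = ENIs × (IPs per ENI − 1) + 2). The function name and the hard-coded limits are illustrative assumptions, so check the EC2 documentation for your own instance type:

```python
def max_pods(enis: int, ips_per_eni: int) -> int:
    """Pods a node can host with the VPC CNI in default secondary-IP mode.

    Each ENI's primary IP is not given to pods (hence the -1); the +2 accounts
    for host-network pods such as aws-node and kube-proxy.
    """
    return enis * (ips_per_eni - 1) + 2


# (ENIs, IPv4 addresses per ENI): assumed values, verify for your instance type.
LIMITS = {"t3.medium": (3, 6), "c5.large": (3, 10), "m5.xlarge": (4, 15)}

for instance_type, (enis, ips) in LIMITS.items():
    print(f"{instance_type}: {enis * ips} IPs, {max_pods(enis, ips)} pods max")
```

The gap between total IPs and max pods is why a node can look like it has spare addresses and still refuse new pods.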

Gotcha #1: ENI Exhaustion (The Silent Killer)

This is what happened to us. We had many small pods running on each node. Each pod consumed one IP. After scaling, we hit the ENI limit hard.

Symptoms:

  • New pods stuck in Pending state
  • kubectl describe pod shows: “no ENI/IP available”
  • CloudWatch shows: eni_max_retry_failures increasing
  • Networking works for existing pods but new ones fail

How to detect it:
Run this command from any machine with AWS CLI credentials (it does not have to be a worker node):

aws ec2 describe-instances --instance-ids i-xxxxxxxxx --query 'Reservations[0].Instances[0].NetworkInterfaces | length(@)'

If the result is less than the instance’s maximum ENI count, you’re not maxed out on ENIs. But also check how many private IPs each ENI holds:

aws ec2 describe-network-interfaces --network-interface-ids eni-xxxxx --query 'NetworkInterfaces[0].PrivateIpAddresses | length(@)'
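To check every ENI on a node at once, you can save the JSON from `aws ec2 describe-network-interfaces --filters Name=attachment.instance-id,Values=i-xxxxxxxxx --output json` and summarize it. The sketch below assumes that output shape; the helper name `count_ips` and the sample data are made up for illustration:

```python
import json


def count_ips(describe_output: str) -> dict:
    """Count total and secondary private IPs for each ENI in the CLI output."""
    doc = json.loads(describe_output)
    summary = {}
    for eni in doc["NetworkInterfaces"]:
        addresses = eni.get("PrivateIpAddresses", [])
        secondary = sum(1 for a in addresses if not a.get("Primary", False))
        summary[eni["NetworkInterfaceId"]] = {
            "total": len(addresses),
            "secondary": secondary,
        }
    return summary


# Illustrative sample, not real output from a cluster.
SAMPLE = json.dumps({
    "NetworkInterfaces": [
        {"NetworkInterfaceId": "eni-aaa", "PrivateIpAddresses": [
            {"Primary": True, "PrivateIpAddress": "10.0.0.10"},
            {"Primary": False, "PrivateIpAddress": "10.0.0.11"},
            {"Primary": False, "PrivateIpAddress": "10.0.0.12"}]},
        {"NetworkInterfaceId": "eni-bbb", "PrivateIpAddresses": [
            {"Primary": True, "PrivateIpAddress": "10.0.0.20"}]},
    ]
})

print(count_ips(SAMPLE))
```

Compare each ENI's total against the per-ENI limit for the instance type: a node is exhausted when every ENI slot is in use and every ENI is full.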

The fix:

  1. Increase node count (split pods across more nodes)
  2. Use larger instance types (more ENIs)
  3. Configure WARM_IP_TARGET (explained below)

We moved from t3.medium (3 ENIs × 6 IPs = 18 addresses) to m5.large (3 ENIs × 10 IPs = 30 addresses) across our cluster. Cost went up initially, but pod density improved and the $50K issue disappeared.

Gotcha #2: Security Group Misconfigurations

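The VPC CNI plugin assigns actual VPC IPs to pods. This means your security groups apply to pod traffic: by default pods inherit the security groups of the node they run on, and with the security groups for pods feature they can carry their own. Most teams don’t realize this until pods mysteriously can’t communicate.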

Common scenario:

  • Pod A in security group SG-A
  • Pod B in security group SG-B
  • SG-B doesn’t have an inbound rule for SG-A
  • Pod B receives no traffic from Pod A

Solution: Configure security groups for pod-to-pod communication

EKS best practice:

  1. Create a “pod” security group
  2. Allow all inbound traffic from the same security group
  3. Assign this security group to all worker nodes

Example security group rule:
Inbound rule: All TCP, All UDP from security group sg-xxxxxxx (self-reference)

Verify your setup:
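aws ec2 describe-instances --instance-ids i-xxxxxxxxx --query 'Reservations[0].Instances[0].SecurityGroups'
This lists the security groups attached to a node (kubectl get nodes -o wide does not show them). Then verify their rules in the AWS console.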

Gotcha #3: IP Address Exhaustion in Subnets

This is different from ENI exhaustion. You can have plenty of ENIs but run out of IP addresses in your subnet.

Example:

  • Subnet CIDR: 10.0.0.0/26 (64 addresses, 59 usable after the 5 AWS reserves in every subnet)
  • 2 nodes running
  • Each node needs 10 IPs reserved (ENIs + pod IPs)
  • That leaves about 39 addresses for everything else, so the subnet fills up after a few dozen pods

In multi-AZ clusters, this compounds. If you have subnets in 3 AZs with /26 blocks:

  • Total address space: 192 IPs across all subnets
  • Minus AWS-reserved IPs (5 per subnet): 177 usable
  • Minus node IPs: ~120 for pods

For production clusters, use at least /24 subnets (251 usable IPs per AZ).

Calculate your needs:
Expected pods per node × Max nodes per AZ = Total IPs needed
Add 20% buffer for failover
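The sizing rule above can be turned into a quick calculation. This is a rough sketch using Python's standard `ipaddress` module, assuming AWS reserves five addresses per subnet; the helper names and the example figures (17 pods per node, 10 nodes per AZ) are hypothetical:

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # first four addresses plus the last one


def usable_ips(cidr: str) -> int:
    """Addresses in a subnet that AWS will actually hand out."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def required_ips(pods_per_node: int, max_nodes_per_az: int, buffer_pct: int = 20) -> int:
    """Pods per node x max nodes per AZ, plus a failover buffer (rounded up)."""
    base = pods_per_node * max_nodes_per_az
    return (base * (100 + buffer_pct) + 99) // 100


def smallest_prefix(needed: int) -> int:
    """Longest prefix length (smallest subnet) with enough usable addresses."""
    for prefix in range(28, 15, -1):  # AWS subnets range from /28 to /16
        if 2 ** (32 - prefix) - AWS_RESERVED_PER_SUBNET >= needed:
            return prefix
    raise ValueError("needs more than a /16; split across several subnets")


needed = required_ips(pods_per_node=17, max_nodes_per_az=10)
print(f"need {needed} IPs, smallest subnet is /{smallest_prefix(needed)}")
print(f"a /26 offers only {usable_ips('10.0.0.0/26')} usable IPs")
```

The formula counts pod addresses only. Node primary IPs and warm pools come on top, which is one more reason to round up to the next larger subnet.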

Our mistake: We started with /26 subnets. At 100 pods per node scale, we were maxed out. Expanding to /24 required VPC reconfiguration but solved the issue permanently.

Gotcha #4: Warm IP Target Misconfiguration

When a new pod needs an IP, the CNI plugin must allocate one. This takes time (sometimes 1-2 seconds per pod). To minimize this latency, you can pre-warm IPs.

Two configuration approaches:

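  1. WARM_IP_TARGET
    Keeps N free IPs ready on each node
    Default: unset (the plugin then falls back to WARM_ENI_TARGET)
    Example: With 20 warm IPs, a burst of up to 20 new pods can start without waiting on an EC2 API call
  2. WARM_ENI_TARGET
    Pre-allocates entire ENIs
    Default: 1
    More aggressive, and it holds a full ENI's worth of IPs whether or not pods use them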

Configuration (in aws-node daemonset):

kubectl set env daemonset aws-node -n kube-system WARM_IP_TARGET=20

The trap: Higher WARM_IP_TARGET = more IPs held idle on every node = fewer subnet IPs left for other nodes and pods
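To see how quickly warm pools eat into a subnet, here is a deliberately simplified model. It assumes each node holds one IP per running pod plus WARM_IP_TARGET spare IPs, capped by what its ENIs can carry, and it ignores related settings such as MINIMUM_IP_TARGET. The function names and fleet numbers are hypothetical:

```python
def ips_held_by_node(running_pods: int, warm_ip_target: int, pod_ip_capacity: int) -> int:
    """Simplified model: one IP per pod plus the warm pool, capped by ENI capacity."""
    return min(running_pods + warm_ip_target, pod_ip_capacity)


def subnet_ips_consumed(nodes: int, running_pods_per_node: int,
                        warm_ip_target: int, pod_ip_capacity: int) -> int:
    """Subnet addresses taken by pods and warm pools across all nodes.

    Node primary IPs are ignored to keep the model small.
    """
    return nodes * ips_held_by_node(running_pods_per_node, warm_ip_target, pod_ip_capacity)


# Hypothetical fleet: 10 nodes, 5 pods each, 27 pod-assignable IPs per node.
for target in (5, 20):
    used = subnet_ips_consumed(10, 5, target, 27)
    print(f"WARM_IP_TARGET={target}: {used} subnet IPs held for 50 pods")
```

In this model the same 50 pods hold 100 subnet addresses at a target of 5 and 250 at a target of 20, so the setting has to be sized against the subnet, not just against pod startup time.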

Optimal value depends on your workload:

  • Batch processing: Low WARM_IP_TARGET (5-10)
  • Auto-scaling microservices: High WARM_IP_TARGET (15-30)
  • Serverless platforms: Very high (40+)

Monitor the impact:
Watch pod launch time: kubectl logs -n kube-system -f ds/aws-node
Watch IP utilization: Custom Prometheus metrics

We set WARM_IP_TARGET=20 and saw 60% faster pod startup times.

Gotcha #5: Custom Networking and CIDR Overlap

Large clusters often use secondary CIDR blocks. This is powerful but creates subtle routing problems.

Scenario:

  • Primary CIDR: 10.0.0.0/16
  • Secondary CIDR: 10.1.0.0/16
  • Pods allocated from both blocks
  • Pod routes configured incorrectly
  • Traffic between primary and secondary pods fails silently

The fix requires:

  1. Custom networking enabled on aws-node (AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true) with an ENIConfig resource per AZ
  2. Proper route table setup
  3. Correct security group rules

Common mistakes:

  • Overlapping CIDR blocks (for example, 10.0.0.0/16 and 10.0.1.0/24)
  • Not updating route tables for secondary blocks
  • Security groups not allowing cross-CIDR traffic
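Overlap is easy to check before you touch the VPC. A small sketch using Python's standard `ipaddress` module; the helper name `find_overlaps` is made up for illustration:

```python
import ipaddress
from itertools import combinations


def find_overlaps(cidrs):
    """Return every pair of CIDR blocks that share at least one address."""
    # ip_network raises ValueError for malformed blocks such as 10.0.1.0/16,
    # where host bits are set.
    networks = [ipaddress.ip_network(c) for c in cidrs]
    return [(str(a), str(b)) for a, b in combinations(networks, 2) if a.overlaps(b)]


print(find_overlaps(["10.0.0.0/16", "10.1.0.0/16"]))  # disjoint blocks
print(find_overlaps(["10.0.0.0/16", "10.0.1.0/24"]))  # second block sits inside the first
```

Running the same check against the CIDRs of peered VPCs and on-premises networks catches conflicts that only show up once routes are added.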

Before deploying multi-CIDR:
Document your topology: Nodes in primary? Pods in secondary? Cross-AZ distribution?
Test routing between blocks exhaustively.
Monitor latency differences between primary and secondary ranges.

Monitoring and Debugging: The Essential Metrics

Prevention is better than crisis management. Track these metrics in Prometheus:

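Key metrics from aws-node (names as exposed by recent VPC CNI releases; confirm them against the plugin's metrics endpoint for your version):

  • awscni_ipamd_error_count and awscni_aws_api_error_count (allocation failure indicators)
  • awscni_assigned_ip_addresses (current usage)
  • awscni_total_ip_addresses and awscni_ip_max (allocated and maximum capacity)
  • awscni_eni_allocated and awscni_eni_max (ENI usage versus limit)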

Setup alerts:
Alarm if: (allocated_ips / max_ips) > 0.8
Alarm if: eni_allocation_errors > 0
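The two alarm conditions are simple enough to prototype before writing real alerting rules. The sketch below assumes you already have per-node readings for assigned IPs, maximum IPs, and an allocation error count; the function and parameter names are hypothetical and not tied to any particular metric name:

```python
def ip_alerts(assigned_ips: int, max_ips: int, allocation_errors: int,
              utilization_threshold: float = 0.8) -> list:
    """Return the alert messages that should fire for one node's readings."""
    alerts = []
    if max_ips > 0 and assigned_ips / max_ips > utilization_threshold:
        alerts.append(
            f"IP utilization {assigned_ips}/{max_ips} above {utilization_threshold:.0%}"
        )
    if allocation_errors > 0:
        alerts.append(f"{allocation_errors} ENI/IP allocation errors")
    return alerts


print(ip_alerts(assigned_ips=25, max_ips=30, allocation_errors=0))
print(ip_alerts(assigned_ips=10, max_ips=30, allocation_errors=2))
```

The same two comparisons translate directly into threshold expressions in whichever alerting system you use.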

Debugging commands:

# Check node IP usage
kubectl describe node

# Check aws-node logs
kubectl logs -n kube-system -l k8s-app=aws-node

# View current WARM_IP_TARGET
kubectl describe ds -n kube-system aws-node

Our story: We set up these alerts 2 months after the $50K incident. We immediately caught a subnet exhaustion issue before it impacted production.

The Production Checklist: Before Going Live

Use this checklist before deploying any EKS cluster:

☑ Subnets sized for 3x expected pod capacity (/24 minimum)
☑ ENI limits calculated and node types selected accordingly
☑ Security groups allow pod-to-pod communication
☑ WARM_IP_TARGET tuned for your workload
☑ Prometheus metrics collecting for aws-node daemonset
☑ CloudWatch alarms set for IP exhaustion
☑ Route tables configured for all CIDR blocks
☑ Multi-AZ topology documented and validated
☑ Scaling tested with real pod density (not empty clusters)
☑ Network performance tested before production traffic

Capacity Planning Formula:
Max pods per AZ = (Subnets × IPs per subnet) – Reserved – Node IPs
Reserved IPs = AWS-reserved (5 addresses per subnet) + LoadBalancer IPs (estimate)
Node IPs = (Number of nodes × network overhead)

Be conservative. Better to have extra capacity than to hit the wall.
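As a worked example of the capacity formula, here is a short sketch. It assumes AWS reserves five addresses per subnet; the function name, the load balancer estimate, and the per-node overhead are placeholder assumptions to replace with your own numbers:

```python
AWS_RESERVED_PER_SUBNET = 5


def max_pods_per_az(subnets: int, ips_per_subnet: int, load_balancer_ips: int,
                    nodes: int, overhead_per_node: int) -> int:
    """(Subnets x IPs per subnet) - Reserved - Node IPs, floored at zero."""
    reserved = subnets * AWS_RESERVED_PER_SUBNET + load_balancer_ips
    node_ips = nodes * overhead_per_node
    return max(0, subnets * ips_per_subnet - reserved - node_ips)


# Hypothetical AZ: one /24 (256 addresses), 8 load balancer IPs,
# 12 nodes that each hold 3 addresses of overhead (one primary IP per ENI).
print(max_pods_per_az(subnets=1, ips_per_subnet=256, load_balancer_ips=8,
                      nodes=12, overhead_per_node=3))
```

Plugging in a /26 with two nodes that each hold 10 addresses gives 39, which shows how little room a small subnet leaves.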

The Real Numbers: Our Cost Breakdown

Before fixing:

  • 50 t3.medium nodes
  • ~400 pods running
  • Constant out-of-capacity issues
  • Pod startup latency: 8-12 seconds
  • Monthly cost: $1,200/month
  • Waste (unschedulable pods): ~30%

After fixes:

  • 35 m5.large nodes (fewer but larger)
  • ~500 pods running
  • No out-of-capacity incidents
  • Pod startup latency: 1-2 seconds
  • Monthly cost: $1,050/month
  • Waste: < 5%

Total savings: $150/month + improved application performance
ROI on investigation time: Paid for itself in 30 days

The $50K figure from our title? That was one month of running at 60% capacity due to networking issues, plus emergency infrastructure added to band-aid the problem.

Conclusion: Don’t Learn These Lessons the Hard Way

EKS networking looks simple on the surface but has many hidden complexities. The five gotchas we covered (ENI exhaustion, security group misconfigs, IP exhaustion, WARM_IP_TARGET settings, and CIDR overlap) account for most of the production EKS networking issues we have run into.

The good news: They’re all preventable with proper planning and monitoring.

Start today: Audit your current EKS cluster using the debugging commands above. If you’re building new clusters, use the production checklist. Don’t wait for a crisis like we did.

Your infrastructure performance – and your AWS bill – will thank you.

