Architecting for Interruption: How to Safely Use Spot Instances in Production
Imagine your cloud provider offered you a deal: “We will give you this server for 90% off. But we can take it back whenever we want, and we’ll only give you a two-minute warning.”
To a finance manager, that sounds like a risk. To a seasoned systems architect, it sounds like an engineering challenge, and a lucrative one. According to AWS case studies, companies like Lyft have cut their monthly compute costs by 75% simply by adapting their architecture to this model.
Spot Instances (Google Cloud calls them Spot VMs, formerly Preemptible VMs) are the ultimate test of your system’s resilience. They offer immense compute power for pennies on the dollar, but using them effectively isn’t just about configuration; it’s about code.
This is why advanced cost optimization is less about finance and more about better design.
Here is how you architect for the rug pull.
Understanding Spot Instance Eviction Mechanics
First, let’s understand the threat. Spot capacity is essentially spare inventory. When the cloud provider needs that capacity back for full-paying On-Demand customers, they issue an interruption notice.
On a technical level, the notice is surfaced through the provider’s instance metadata service, and for containerized workloads it typically reaches your application as a SIGTERM. From that moment, the clock starts ticking.
- AWS gives you a 2-minute warning.
- Azure and Google Cloud are even stricter, offering just 30 seconds before the instance is preempted.
Your mission is simple: Do not die in those few seconds.
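On AWS, for example, you don’t have to wait passively for an OS-level signal; you can poll the instance metadata service for a pending interruption yourself. The sketch below is AWS-specific and assumes IMDSv2; the polling interval and the drain step at the end are placeholders you would replace with your own shutdown path.

```python
import time
import urllib.error
import urllib.request

# AWS-specific: the Spot interruption notice appears at this metadata path.
# It returns 404 until an interruption is scheduled.
IMDS = "http://169.254.169.254/latest"

def imds_token() -> str:
    # IMDSv2 requires a short-lived session token for every metadata read.
    req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    return urllib.request.urlopen(req, timeout=2).read().decode()

def interruption_pending() -> bool:
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": imds_token()},
    )
    try:
        urllib.request.urlopen(req, timeout=2)
        return True               # 200 with an action/time payload: reclaim is coming
    except urllib.error.HTTPError as err:
        return err.code != 404    # 404 means no interruption is scheduled yet

if __name__ == "__main__":
    while not interruption_pending():
        time.sleep(5)             # poll every few seconds, well inside the 2-minute window
    # Begin draining: stop accepting work, checkpoint, deregister from the load balancer.
```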
1. Catch the Signal, Save the State (Checkpointing)
The biggest mistake engineers make is treating Spot instances like standard servers. You cannot assume an instance will live to finish its task.
You need to handle the termination signal at the application level. In a standard application, a SIGTERM might just initiate a graceful shutdown. In a Spot environment, your handler needs to be more aggressive: stop accepting new work immediately and flush any in-memory state to durable storage.
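Here is a minimal Python sketch of that handler. The work loop and the flush step are placeholders; the key idea is that the handler only flips a flag, and the loop winds down between small units of work so the final flush always fits inside the warning window.

```python
import signal
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # Don't do heavy work inside the handler itself; just flip the flag so
    # the worker loop stops accepting new work and winds down cleanly.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

def flush_state():
    # Placeholder: push any in-memory progress to a durable store
    # (S3, Redis, DynamoDB, ...) before the instance disappears.
    print("flushing in-memory state to durable storage")

def main():
    while not shutting_down:
        # Placeholder for one small unit of work; keeping units short means
        # the current one can always finish before the clock runs out.
        time.sleep(1)
    flush_state()

if __name__ == "__main__":
    main()
```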
If you run long-running batch jobs, you must implement the Checkpoint Pattern (a minimal sketch follows the checklist below).
- Don’t process a 10GB file in one monolithic operation.
- Break it into small chunks.
- As your worker processes each chunk, update the status in an external store like Redis or DynamoDB.
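Here is a bare-bones sketch of that pattern, assuming a Redis checkpoint store; the host name, key scheme, chunk size, and per-chunk handler below are all illustrative.

```python
import redis  # assumes redis-py and a reachable Redis endpoint

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB per chunk; tune for your workload
store = redis.Redis(host="checkpoint-store.internal", port=6379)  # hypothetical host

def handle_chunk(chunk: bytes) -> None:
    pass  # placeholder for your actual processing logic

def process_file(job_id: str, path: str) -> None:
    # Resume from the last checkpointed offset instead of starting over.
    offset = int(store.get(f"job:{job_id}:offset") or 0)
    with open(path, "rb") as f:
        f.seek(offset)
        while chunk := f.read(CHUNK_SIZE):
            handle_chunk(chunk)
            offset += len(chunk)
            # Checkpoint after every chunk: an interruption now costs at most one chunk.
            store.set(f"job:{job_id}:offset", offset)
```

If the worker is evicted, a replacement instance picks up the same job_id, reads the stored offset, and continues where the last one stopped.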
Netflix set the industry standard for this approach. By building an internal “spot market” and architecting their encoding service to be interruption-tolerant, they achieved a reported 92% cost efficiency compared to standard scaling.
2. The Golden Rule: Ruthless Statelessness
Spot instances expose bad architecture. If your application stores session data, temporary files, or process logs locally on the disk, a Spot interruption is a catastrophe.
To use Spot safely in production, you must be ruthlessly stateless:
- Logs: Ship them immediately to a centralized logging service (e.g., CloudWatch, Datadog). Do not wait for a batch upload.
- Sessions: Never store state on the instance. Use an external distributed cache like Redis or Memcached.
- Files: Process streams directly from object storage (S3/Blob Storage) whenever possible.
If your Spot instance disappears right now, the only thing you should lose is CPU cycles.
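As one concrete illustration of the “Files” rule above, here is a small sketch using boto3 to stream an object straight from S3 instead of staging it on local disk. The bucket, key, and per-record handler are placeholders; the point is that nothing ever lands on the instance’s filesystem.

```python
import boto3  # assumes boto3 is installed and AWS credentials are available

s3 = boto3.client("s3")

def handle_record(line: bytes) -> None:
    pass  # placeholder for your per-record logic

def process_object(bucket: str, key: str) -> None:
    # Stream the object: if the instance is reclaimed mid-way, there is no
    # half-written temp file to lose, and a retry simply re-reads from S3.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    for line in body.iter_lines():
        handle_record(line)
```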
3. The Mixed Fleet Strategy: Combining Spot and On-Demand
You rarely want to go 100% Spot for a customer-facing service. The risk of a “mass extinction event”, where an entire availability zone runs out of Spot capacity, is non-zero.
The smart architectural pattern is the Mixed Fleet. According to the 2025 Kubernetes Cost Benchmark Report by Cast AI, clusters running a strategic mix of On-Demand and Spot instances recorded average savings of 59%, a strong balance between savings and stability.
- The Base Capacity: Configure your minimum required healthy percentage (e.g., the first 20% of your fleet) to run on reliable On-Demand instances. (Pro Tip: For even better savings on this baseline, you should use reserved capacity. See our guide on [Commitment vs. Flexibility]).
- The Burst Capacity: Configure the rest of your scaling needs to be fulfilled by Spot Instances.
Crucial Consideration: Don’t bet on a single instance family. If you only request m5.large Spot instances, you are competing with everyone else who wants exactly that instance type. Configure your fleet to accept a prioritized list of acceptable types (e.g., m5.large, m4.large, c5.large). This capacity diversification dramatically lowers your chances of getting evicted.
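On AWS, the base/burst split and the type diversification can both be expressed in a single Auto Scaling group. The sketch below is illustrative rather than a drop-in config; the group name, subnets, launch template, and capacity numbers are placeholders, but it shows where each knob lives.

```python
import boto3  # AWS-specific sketch; all names and numbers are illustrative

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-fleet",             # hypothetical group name
    MinSize=4,
    MaxSize=20,
    VPCZoneIdentifier="subnet-aaa,subnet-bbb",    # hypothetical subnets
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "web-template",  # hypothetical launch template
                "Version": "$Latest",
            },
            # Diversification: several interchangeable types so a capacity
            # crunch in one Spot pool doesn't take out the whole fleet.
            "Overrides": [
                {"InstanceType": "m5.large"},
                {"InstanceType": "m4.large"},
                {"InstanceType": "c5.large"},
            ],
        },
        "InstancesDistribution": {
            # Base capacity: a reliable On-Demand floor...
            "OnDemandBaseCapacity": 4,
            # ...and everything above it filled entirely with Spot.
            "OnDemandPercentageAboveBaseCapacity": 0,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    },
)
```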
The CI/CD Sandbox: A Safe Place to Start
If you are nervous about deploying this architecture directly to production, start with your CI/CD pipeline.
Build agents and test runners are the perfect use case for Spot instances. They are ephemeral by nature, entirely stateless, and if a build fails because an instance died, your CI tool can usually just retry it automatically. Companies like Delivery Hero have used this exact strategy to run their Kubernetes-based CI/CD pipelines on Spot, achieving 70% savings.
Conclusion: Resilience is a Feature
Here is the irony: Architects who design specifically for Spot instances usually end up with better, more robust systems than those who rely solely on stable infrastructure.
By forcing yourself to handle random terminations, you eliminate fragility. You build systems that heal themselves. The 90% cost savings? That’s just the reward for good engineering.
For a broader look at how different pricing structures impact your system design, check out The Software Engineer’s Guide to Cloud Billing Models.
