About Anthony Mattson

If you want to know more, read my About section.

Cloud Outages: Why Developers Must Plan for Failure

Over the past several months, outages at AWS, Azure, and Cloudflare have been a stark reminder: cloud services are powerful, but they are not infallible. Too often, teams overestimate the reliability of cloud platforms and overlook their own responsibility for building resilient systems.

The Value of the Cloud

Let’s be clear—cloud providers deliver incredible value. Having managed a data center myself, I know firsthand the relief of not having to maintain racks of servers, cooling systems, and endless hardware replacements. The cloud abstracts away much of that pain and enables developers to focus on building solutions instead of babysitting infrastructure.

Reliability vs. Resiliency

Here’s the catch: reliability and resiliency are not the same thing, and too much weight is being placed on the providers alone. Yes, AWS, Azure, and Cloudflare engineer their platforms for resilience. But their guarantees—“99.99% availability”—still fall short of perfection. Hardware fails. Networks degrade. Services stall. Outages happen, and when they do, they can hit hard.

The responsibility doesn’t end with the provider. It’s on us, the developers and architects, to design for resiliency in our own solutions.
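To put a figure like “99.99% availability” in perspective, here is a small sketch of the arithmetic. The function names are my own for illustration, and the second calculation assumes the dependencies fail independently, which real services often do not.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes per year a service can be down and still meet its target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def combined_availability(*percents: float) -> float:
    """Availability of a chain of services that must ALL be up.

    Assumption: failures are independent, so the fractions multiply.
    """
    result = 1.0
    for percent in percents:
        result *= percent / 100
    return result * 100


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% allows about {minutes:.1f} minutes of downtime per year")
    chained = combined_availability(99.99, 99.99, 99.99)
    print(f"Three 99.99% services in series: about {chained:.4f}%")
```

At 99.99%, the budget works out to roughly 52.6 minutes per year, and every additional dependency your solution requires pulls the combined number lower.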

Planning for the Inevitable

If you deploy to the cloud, assume it will fail. Then plan accordingly:

  • Disaster Recovery (DR): Architect for regional outages. Have a failover plan ready to spin up workloads in another region or provider.
  • Performance Degradation: Build strategies for graceful degradation when services slow down. Don’t let one bottleneck grind your entire system to a halt.
  • Network Dependencies: Pay special attention to DNS. If your solution relies on resolving hostnames, plan for what happens when DNS lookups fail or slow down.
  • Failure Scenarios: If you can imagine a failure, expect it. Hardware, APIs, authentication, networking—every layer is a potential point of failure.
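As one way to picture graceful degradation from the list above, here is a minimal sketch: retry a flaky dependency a few times with exponential backoff, then fall back to a degraded answer (a cached value, a default, a reduced feature) instead of failing outright. The name `call_with_fallback` and its parameters are hypothetical, not from any particular library.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `primary` up to `attempts` times, then return `fallback()`.

    The wait between attempts doubles each time (base_delay, 2x, 4x, ...).
    `sleep` is injectable so the backoff can be skipped in tests.
    """
    for attempt in range(attempts):
        try:
            return primary()
        except Exception:
            # Broad catch for illustration only; real code should catch
            # the specific errors its dependency is known to raise.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return fallback()


if __name__ == "__main__":
    def flaky_lookup() -> str:
        raise ConnectionError("upstream service unavailable")

    # The no-op sleep keeps the demo instant; the fallback is served
    # after all three attempts fail.
    print(call_with_fallback(flaky_lookup, lambda: "cached result",
                             sleep=lambda seconds: None))
```

The same shape applies to a DNS hiccup: the primary call performs the lookup, and the fallback returns the last known good address.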

The mindset should be simple: expect the worst, hope for the best.

Final Thoughts

Cloud providers are amazing, but they are not magic. Outages are inevitable, and resiliency is a shared responsibility. As developers, we must stop treating the cloud as a silver bullet and start treating it as a powerful tool—one that still requires careful planning, redundancy, and foresight.

Explore the Python Network Tester on GitHub and Docker

Building My Python Network Tester

Over the past couple of months, I’ve been tinkering with a side project: a Python Network Tester. The idea came directly from my time troubleshooting EKS and ECS solutions, where I often walked customers through a familiar set of network checks. While the tool may feel basic, it reflects the common gauntlet of tests I relied on whenever network concerns came up.

What I noticed in those scenarios was that, more often than not, the “network issue” wasn’t really the network at all. Instead, it was a misunderstanding of how networks behave—or how much application performance itself can influence perceived network performance. Sometimes, the simplest visibility into the basics is all someone needs to figure out their next step.

After plenty of tweaking, debugging, and refactoring, I’ve reached a point where I’m ready to share the project publicly.


Using AI as a Programming Partner

One thing that helped me along the way was experimenting with Kiro, Amazon’s agentic AI IDE. I often struggle with getting projects off the ground and into the right mindset, so I leaned on Kiro to help me draft code. But I made sure to carefully read every single line it suggested. That way, I understood exactly what the code was doing, which made refactoring and debugging much easier when things didn’t behave as expected.

In short, Kiro became a surprisingly good programming buddy. Beyond code, it was especially helpful for writing comments and documentation—an area I’ve always wrestled with, unsure if I’m writing too little or too much. Having that support gave me confidence to keep the docs clear and useful.


Try It Out

If you’d like to explore the project yourself:

Feedback, recommendations, or issues are welcome—please report them in the GitHub Issues tab. I plan to keep iterating on this project to see what features make the most sense and where it can provide the most value.

Build a Python Network Troubleshooting Tool

While settling into my new job, I’ve also been tinkering with some side projects. One of them was inspired by the countless network connectivity and latency issues I used to troubleshoot as a Cloud Support Engineer.

I’ll admit — networking has always been one of my weaker areas in IT. I can get around it well enough, but shifting back into the “network mindset” is always a bit of a slog. There are so many different tests to remember, each useful for diagnosing a different type of issue.

So, just for fun, I decided to build a comprehensive network troubleshooting tool that bundles together the tests I ran most often. I chose Python for the project, partly to refresh my skills with the language.

As an extra experiment, I’ve been trying out AWS’s new IDE, Kiro, and giving “vibe coding” a spin. That said, I’m not blindly trusting whatever the GenAI assistant spits out. I’m reviewing every line of code to make sure I understand it, confirm it makes sense, and check for any security concerns.

Right now, the tool can run DNS resolution, ping, and TCP connectivity tests. You can feed it a URL or IP, specify which port to test, and set how many ping attempts to run. Since I mostly supported containerized solutions at AWS, I also built a container image so the tool can run in that environment. As a bonus, I’m bundling in the most common network testing CLI tools, so you can run one-off checks directly inside the container.
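For a rough idea of what two of these checks can look like, here is a minimal sketch using only the Python standard library. This is not the tool's actual code; `resolve_host` and `tcp_check` are hypothetical names, and ping is left out because it typically needs raw sockets or a call out to the system `ping` command.

```python
import socket
import time
from typing import List, Optional


def resolve_host(host: str) -> List[str]:
    """Return the unique IP addresses `host` resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address. dict.fromkeys dedupes, keeping order.
    return list(dict.fromkeys(info[4][0] for info in infos))


def tcp_check(host: str, port: int, timeout: float = 3.0) -> Optional[float]:
    """Return the TCP connect time in milliseconds, or None if it fails."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000
    except OSError:
        # Covers refused connections, timeouts, and resolution errors.
        return None
```

Calling `resolve_host("localhost")` followed by `tcp_check("localhost", 443)` gives a quick read on whether a problem sits in name resolution or in reaching the port.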

It’s still a work in progress, but I’m happy with how it’s shaping up. Once I feel confident enough, I might even push it to a public repo. And while I don’t recommend relying on “vibe coding” as a strategy, it can be handy for breaking through roadblocks. At the end of the day, though, I’d rather refactor the code myself so it works exactly the way I want.

2025: A Year of Change and Renewal

2025 has been a powerful reminder that change is the only constant in life (thanks, Heraclitus). Despite my best intentions to focus on advancing my career, depression, anxiety, and an overly demanding job had other plans — draining my energy and stealing my focus. On top of that, there was the looming uncertainty of when my employer would finally enforce a return-to-office policy, one that would require me to move to a significantly more expensive state without any increase in pay.

Fast forward to a month ago, and everything shifted. I was faced with a decision: accept a severance package and leave, or keep pushing my stress levels higher to hold onto a job that already felt like it was on borrowed time. Needless to say, I chose the severance. It turned out to be one of the best decisions I’ve ever made, allowing me to walk away from a mountain of stress.

With unemployment on the horizon, I dove into the process of filing for benefits and began applying for new roles. Luckily, one of the local companies I had applied to before the severance reached out for an interview. I’m thrilled to share that this interview quickly led to a new position as a dedicated Solutions Architect.

Now in week three of my new role, I’m absolutely loving it! Things have been relatively calm so far, but I fully expect the pace to pick up — and I’m ready for it. I’m finally working on projects where I can see real progress and play a direct role in shaping outcomes. There’s also a ton of new material to learn, which I’m genuinely excited about. Most importantly, I no longer have to constantly micromanage my time.

On top of all that, I recently renewed my AWS Certified Solutions Architect Associate certification and have started working toward my AWS Certified AI Practitioner credential. While I stay cautious about AI integration and the way GenAI is being handled, I can clearly see its potential. It’s time to dive in and learn everything I can.