Principal Architect · Site Reliability Engineering
Building highly-available, fault-tolerant systems at scale. Passionate about observability, chaos engineering, platform engineering, and developer experience. Founder of PulseTechOps — sharing SRE & cloud-native insights with the community.
I'm Arunsanthoshkumar Annamalai, a Principal Architect specialising in Site Reliability Engineering and the founder of PulseTechOps. My work sits at the intersection of software engineering, infrastructure, and operations: turning complex distributed systems into reliable, observable, and self-healing platforms.
As an author, I publish deep-dive articles on SRE, cloud-native infrastructure, and platform engineering on the PulseTechOps Substack. As an architect, I lead SRE practices across large-scale cloud environments, championing SLOs, SLAs, and error-budget policies, blameless post-mortems, and progressive delivery.
Full-stack observability with metrics, logs, and traces. Designing alerting strategies that reduce MTTD and MTTR.
Building Internal Developer Platforms (IDPs) and golden-path templates to accelerate engineering velocity.
Game-day exercises, fault injection, and resilience testing to proactively uncover system weaknesses.
Multi-cloud, hybrid, and cloud-native architecture design on AWS, GCP, and Azure with IaC-first principles.
End-to-end delivery pipelines, canary and blue-green deployments, and GitOps workflows at enterprise scale.
DevSecOps integration, policy-as-code, zero-trust networking, and compliance automation.
Amazon Web Services
Cloud Native Computing Foundation
Microsoft Azure
Google Cloud Platform
HashiCorp
Define error budgets and SLO policies. Align engineering priorities with reliability targets. Build automated alerting and remediation workflows.
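As a rough illustration of the error-budget arithmetic behind an SLO policy, here is a minimal Python sketch. The function names, the 30-day window, and the request-based accounting are assumptions made for this example, not a description of any specific policy or tool.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget that is still unspent.

    A negative result means the budget for the window is exhausted.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # A 99.9% target over 30 days leaves roughly 43 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # 250 failures out of 1,000,000 requests spends about a quarter of the budget.
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

A team might use the remaining fraction to decide when to prioritise reliability work over feature work. The threshold for that decision is a policy choice and is not shown here.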
Implement the three pillars—metrics, logs, traces—to achieve full system visibility. Use OpenTelemetry as the instrumentation standard.
Automate repetitive operational tasks. Measure toil, set targets, and track progress. Free engineering time for high-value work.
Run regular game days and fault-injection experiments. Validate assumptions about system resilience before production incidents do it for you.
Structured incident response with clear severity levels, on-call rotations, and blameless post-mortems to drive continuous learning.
Demand forecasting, load testing, and autoscaling strategies to ensure infrastructure keeps pace with growth.
"Hope is not a strategy. Reliability is an engineering discipline—not an accident. Every production incident is a gift: an opportunity to learn, to harden systems, and to build the culture of continuous improvement that high-performing teams rely on."
— Arunsanthoshkumar Annamalai
Deep-dive SRE knowledge, platform engineering patterns, and reliability best practices — published on PulseTechOps Substack.
Get deep-dive SRE and cloud-native articles delivered to your inbox. Join the community of engineers building reliable systems.
Read & Subscribe on Substack →
Interested in SRE consulting, speaking engagements, or collaboration on reliability engineering? Let's connect.
Available for conference talks, internal workshops, and webinars on SRE, reliability engineering, and cloud-native operations.
Help your organization build or mature its SRE practice—from establishing SLOs and error budgets to building observability platforms.
Cloud architecture and platform engineering reviews focusing on reliability, scalability, security, and operational excellence.
Collaborating on whitepapers, engineering blog posts, and documentation covering SRE best practices and cloud infrastructure.