Arunsanthoshkumar Annamalai

Principal Architect · Site Reliability Engineering

🏛️ Architect ✍️ Author 🚀 Founder

Building highly-available, fault-tolerant systems at scale. Passionate about observability, chaos engineering, platform engineering, and developer experience. Founder of PulseTechOps — sharing SRE & cloud-native insights with the community.

Open to collaboration ☁️ Cloud Native 📍 Global Remote
📰 Read on Substack
15+ Years in Tech
99.99% SLA Target
10+ Cloud Certifications
👤 About Me

I'm Arunsanthoshkumar Annamalai, a Principal Architect specialising in Site Reliability Engineering and the Founder of PulseTechOps. My work sits at the intersection of software engineering, infrastructure, and operations: turning complex distributed systems into reliable, observable, and self-healing platforms.


As an Author, I share deep-dive articles on SRE, cloud-native infrastructure, and platform engineering on the PulseTechOps Substack. As an Architect, I lead SRE practices across large-scale cloud environments, championing SLOs, SLAs, and error-budget policies, blameless post-mortems, and progressive delivery.

🎯 Core Competencies

🔍 Observability & Monitoring

Full-stack observability with metrics, logs, and traces. Designing alerting strategies that reduce MTTD and MTTR.

🚀 Platform Engineering

Building Internal Developer Platforms (IDPs) and golden-path templates to accelerate engineering velocity.

💥 Chaos Engineering

Game-day exercises, fault injection, and resilience testing to proactively uncover system weaknesses.

📦 Cloud Architecture

Multi-cloud, hybrid, and cloud-native architecture design on AWS, GCP, and Azure with IaC-first principles.

🔄 CI/CD & GitOps

End-to-end delivery pipelines, canary and blue-green deployments, and GitOps workflows at enterprise scale.

🔒 Security & Compliance

DevSecOps integration, policy-as-code, zero-trust networking, and compliance automation.

💼 Experience
2022 – Present
Principal Architect – SRE & Platform Engineering
Enterprise Cloud & Platform Engineering · Founder, PulseTechOps
Defined SRE strategy across 50+ production services. Reduced P0 incidents by 60% through proactive SLO-driven alerts and automated runbooks. Founded PulseTechOps to democratise SRE knowledge for the wider engineering community.
2019 – 2022
Senior Cloud Architect / SRE Lead
Cloud Infrastructure & Reliability
Architected multi-region Kubernetes platform serving millions of requests/day. Led migration from on-prem to cloud with zero downtime. Introduced chaos engineering practices and error-budget policies across engineering teams.
2016 – 2019
DevOps / Infrastructure Engineer
Software Engineering & Operations
Built CI/CD pipelines, configuration management, and monitoring stacks for distributed applications. Automated infrastructure provisioning with Terraform and Ansible across AWS environments.
2012 – 2016
Systems & Network Engineer
IT Infrastructure
Managed on-premises infrastructure, Linux/Windows server administration, virtualisation (VMware), and network operations. Laid the foundation for a cloud-first engineering mindset.
⚙️ Technology Stack

☁️ Cloud Platforms

AWS Google Cloud Azure Oracle Cloud

🐳 Containers & Orchestration

Kubernetes Docker Helm Kustomize Istio Envoy ArgoCD Flux

📈 Observability & Monitoring

Prometheus Grafana Loki Tempo OpenTelemetry Datadog PagerDuty ELK Stack Jaeger Zipkin

🏗️ Infrastructure as Code

Terraform Pulumi Ansible Chef CloudFormation Crossplane

🔄 CI/CD & GitOps

GitHub Actions GitLab CI Jenkins CircleCI Tekton Spinnaker ArgoCD

💻 Languages & Scripting

Python Go Bash / Shell YAML HCL TypeScript

🗄️ Databases & Messaging

PostgreSQL MySQL Redis Kafka RabbitMQ Cassandra MongoDB
📊 Proficiency Levels
Kubernetes & Cloud Native: 95%
Observability (O11y): 92%
Terraform / IaC: 90%
Python / Go: 85%
CI/CD & GitOps: 92%
AWS / GCP / Azure: 88%
Security & Compliance: 80%
Chaos Engineering: 78%
🏅 Certifications
☁️

AWS Certified Solutions Architect – Professional

Amazon Web Services

☸️

Certified Kubernetes Administrator (CKA)

Cloud Native Computing Foundation

☸️

Certified Kubernetes Security Specialist (CKS)

Cloud Native Computing Foundation

🟦

Azure DevOps Engineer Expert

Microsoft Azure

🟧

Google Professional Cloud Architect

Google Cloud Platform

🏗️

HashiCorp Certified: Terraform Associate

HashiCorp

📊 SRE Practice
SLO: Service Level Objectives
SLA: Service Level Agreements
SLI: Service Level Indicators
🧱 SRE Pillars

📐 Reliability Engineering

Define error budgets and SLO policies. Align engineering priorities with reliability targets. Build automated alerting and remediation workflows.

🔍 Observability

Implement the three pillars—metrics, logs, traces—to achieve full system visibility. Use OpenTelemetry as the instrumentation standard.

🤖 Toil Reduction

Automate repetitive operational tasks. Measure toil, set targets, and track progress. Free engineering time for high-value work.

💥 Chaos Engineering

Run regular game days and fault-injection experiments. Validate assumptions about system resilience before production incidents do it for you.

📋 Incident Management

Run structured incident response with clear severity levels, on-call rotations, and blameless post-mortems to drive continuous learning.

🚀 Capacity Planning

Use demand forecasting, load testing, and autoscaling strategies to ensure infrastructure keeps pace with growth.
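
The error-budget and SLO-driven alerting ideas under Reliability Engineering above come down to a little arithmetic. The following is a minimal Python sketch, not a description of any production tooling: the `Slo` class, its method names, and the 30-day rolling window are illustrative assumptions, and the 99.99% figure is simply the target quoted on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """An availability objective over a rolling window (hypothetical type)."""

    target: float          # e.g. 0.9999 for "four nines"
    window_days: int = 30  # assumed rolling window length

    def error_budget_minutes(self) -> float:
        """Total downtime the objective tolerates within the window."""
        window_minutes = self.window_days * 24 * 60
        return (1.0 - self.target) * window_minutes

    def budget_remaining(self, downtime_minutes: float) -> float:
        """Fraction of the budget still unspent (negative once overspent)."""
        return 1.0 - downtime_minutes / self.error_budget_minutes()

    def burn_rate(self, failed: int, total: int) -> float:
        """Observed error rate divided by the allowed error rate.

        A value of 1.0 spends the budget exactly over the window.
        """
        if total == 0:
            return 0.0  # no traffic, so nothing was spent
        return (failed / total) / (1.0 - self.target)


if __name__ == "__main__":
    slo = Slo(target=0.9999)
    print(f"budget: {slo.error_budget_minutes():.2f} min per {slo.window_days} days")
    print(f"remaining after 1 min of downtime: {slo.budget_remaining(1.0):.1%}")
    print(f"burn rate at 5 failures in 10,000 requests: {slo.burn_rate(5, 10_000):.1f}x")
```

A burn rate that stays above 1.0 means the budget would run out before the window ends, which is a common trigger for paging in SLO-based alerting. Expressing the objective as a budget rather than a raw uptime percentage is what lets teams trade reliability work against feature work explicitly.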

🛠️ Operational Tools

Incident & On-Call

PagerDuty OpsGenie VictorOps StatusPage Rootly FireHydrant

Chaos & Reliability Testing

Chaos Monkey LitmusChaos Gremlin k6 Gatling Locust

Service Mesh & Networking

Istio Linkerd Envoy Consul NGINX Traefik
📖 SRE Philosophy

"Hope is not a strategy. Reliability is an engineering discipline—not an accident. Every production incident is a gift: an opportunity to learn, to harden systems, and to build the culture of continuous improvement that high-performing teams rely on."

— Arunsanthoshkumar Annamalai

📝 Articles & Thoughts

Deep-dive SRE knowledge, platform engineering patterns, and reliability best practices — published on PulseTechOps Substack.

01

Error Budgets: Moving Beyond Uptime Theater

Why raw uptime percentages are a misleading reliability metric and how error budgets create the right incentives for development and operations teams.

02

Building an Observability Platform from Scratch with OpenTelemetry

A practical walkthrough of instrumenting a microservices estate using OpenTelemetry collectors, Prometheus, Loki, Tempo, and Grafana dashboards.

03

Kubernetes at Scale: Lessons Learned Running 1000+ Node Clusters

Operational insights from running large-scale Kubernetes in production—node lifecycle management, control-plane tuning, and etcd performance.

04

Chaos Engineering Playbook for Production Systems

Step-by-step guide to designing, running, and learning from chaos experiments—from hypothesis formulation to blast-radius control.

05

GitOps Patterns That Actually Work in Enterprise

Practical GitOps patterns using ArgoCD and Flux—multi-tenancy, secrets management, progressive delivery, and drift detection at scale.

06

Internal Developer Platforms: The SRE's Role in Developer Experience

How SRE teams can drive developer productivity by building paved-road platforms, golden-path templates, and self-service infrastructure.

📰 Subscribe to PulseTechOps

Get deep-dive SRE and cloud-native articles delivered to your inbox. Join the community of engineers building reliable systems.

Read & Subscribe on Substack →
✉️ Get in Touch

Interested in SRE consulting, speaking engagements, or collaboration on reliability engineering? Let's connect.

🤝 Areas of Collaboration

🎤 Speaking & Workshops

Available for conference talks, internal workshops, and webinars on SRE, reliability engineering, and cloud-native operations.

🔧 SRE Advisory

Help your organisation build or mature its SRE practice, from establishing SLOs and error budgets to building observability platforms.

🏗️ Architecture Review

Cloud architecture and platform engineering reviews focusing on reliability, scalability, security, and operational excellence.

📝 Technical Writing

Collaborating on whitepapers, engineering blog posts, and documentation covering SRE best practices and cloud infrastructure.

📍 Based globally, working remotely.
Reach out via LinkedIn or GitHub for professional enquiries.