🛠️

Arunsanthoshkumar Annamalai

Principal Architect · Site Reliability Engineering

🏛️ Architect ✍️ Author 🚀 Founder

Building highly-available, fault-tolerant systems at scale. Passionate about observability, chaos engineering, platform engineering, and developer experience. Founder of PulseTechOps — sharing SRE & cloud-native insights with the community.

Open to collaboration ☁️ Cloud Native 📍 Global Remote

📰 Read on Substack

15+ Years in Tech

99.99% SLA Target

10+ Cloud Certifications

👤 About Me

I'm Arunsanthoshkumar Annamalai, a Principal Architect specialising in Site Reliability Engineering and the Founder of PulseTechOps. My work sits at the intersection of software engineering, infrastructure, and operations: turning complex distributed systems into reliable, observable, and self-healing platforms.

As an Author, I share deep-dive articles on SRE, cloud-native infrastructure, and platform engineering on the PulseTechOps Substack. As an Architect, I lead SRE practices across large-scale cloud environments, championing SLOs, SLAs, and error-budget policies, blameless post-mortems, and progressive delivery.

🎯 Core Competencies

🔍 Observability & Monitoring

Full-stack observability with metrics, logs, and traces. Designing alerting strategies that reduce MTTD and MTTR.

🚀 Platform Engineering

Building Internal Developer Platforms (IDPs) and golden-path templates to accelerate engineering velocity.

💥 Chaos Engineering

Game-day exercises, fault injection, and resilience testing to proactively uncover system weaknesses.

📦 Cloud Architecture

Multi-cloud, hybrid, and cloud-native architecture design on AWS, GCP, and Azure with IaC-first principles.

🔄 CI/CD & GitOps

End-to-end delivery pipelines, canary and blue-green deployments, and GitOps workflows at enterprise scale.

🔒 Security & Compliance

DevSecOps integration, policy-as-code, zero-trust networking, and compliance automation.

💼 Experience

2022 – Present

Principal Architect – SRE & Platform Engineering

Enterprise Cloud & Platform Engineering · Founder, PulseTechOps

Defined SRE strategy across 50+ production services. Reduced P0 incidents by 60% through proactive SLO-driven alerts and automated runbooks. Founded PulseTechOps to democratise SRE knowledge for the wider engineering community.

2019 – 2022

Senior Cloud Architect / SRE Lead

Cloud Infrastructure & Reliability

Architected a multi-region Kubernetes platform serving millions of requests per day. Led the migration from on-premises to cloud with zero downtime. Introduced chaos engineering practices and error-budget policies across engineering teams.

2016 – 2019

DevOps / Infrastructure Engineer

Software Engineering & Operations

Built CI/CD pipelines, configuration management, and monitoring stacks for distributed applications. Automated infrastructure provisioning with Terraform and Ansible across AWS environments.

2012 – 2016

Systems & Network Engineer

IT Infrastructure

Managed on-premises infrastructure, Linux/Windows server administration, virtualisation (VMware), and network operations. Laid the foundation for a cloud-first engineering mindset.

⚙️ Technology Stack

☁️ Cloud Platforms

🐳 Containers & Orchestration

📈 Observability & Monitoring

🏗️ Infrastructure as Code

🔄 CI/CD & GitOps

💻 Languages & Scripting

🗄️ Databases & Messaging

📊 Proficiency Levels

Kubernetes & Cloud Native: 95%

Observability (O11y): 92%

Terraform / IaC: 90%

Python / Go: 85%

CI/CD & GitOps: 92%

AWS / GCP / Azure: 88%

Security & Compliance: 80%

Chaos Engineering: 78%

🏅 Certifications

☁️

AWS Certified Solutions Architect – Professional

Amazon Web Services

☸️

Certified Kubernetes Administrator (CKA)

Cloud Native Computing Foundation

☸️

Certified Kubernetes Security Specialist (CKS)

Cloud Native Computing Foundation

🟦

Azure DevOps Engineer Expert

Microsoft Azure

🟧

Google Professional Cloud Architect

Google Cloud Platform

🏗️

HashiCorp Certified: Terraform Associate

HashiCorp

📊 SRE Practice

SLO: Service Level Objectives

SLA: Service Level Agreements

SLI: Service Level Indicators

🧱 SRE Pillars

📐 Reliability Engineering

Define error budgets and SLO policies. Align engineering priorities with reliability targets. Build automated alerting and remediation workflows.
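To make the error-budget idea concrete, here is a minimal sketch of the arithmetic behind an availability target such as the 99.99% figure quoted on this page. The function names (`error_budget`, `budget_remaining`), the 30-day window, and the one minute of downtime are illustrative assumptions, not part of any specific tool or policy.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Return the downtime an availability SLO allows over a window.

    slo_target is a fraction, e.g. 0.9999 for 99.99% (hypothetical helper).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return window * (1.0 - slo_target)


def budget_remaining(slo_target: float, window: timedelta, downtime: timedelta) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    budget = error_budget(slo_target, window)
    return 1.0 - downtime / budget


if __name__ == "__main__":
    window = timedelta(days=30)  # assumed rolling window
    budget = error_budget(0.9999, window)
    print(f"Budget: {budget.total_seconds() / 60:.2f} minutes")
    spent = timedelta(minutes=1)  # assumed downtime so far
    print(f"Remaining: {budget_remaining(0.9999, window, spent):.0%}")
```

Under these assumptions a 99.99% target leaves roughly 4.32 minutes of allowable downtime in 30 days, which is why a single short outage can consume a large share of the budget.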

🔍 Observability

Implement the three pillars—metrics, logs, traces—to achieve full system visibility. Use OpenTelemetry as the instrumentation standard.

🤖 Toil Reduction

Automate repetitive operational tasks. Measure toil, set targets, and track progress. Free engineering time for high-value work.

💥 Chaos Engineering

Run regular game days and fault-injection experiments. Validate assumptions about system resilience before production incidents do it for you.

📋 Incident Management

Structured incident response with clear severity levels, on-call rotations, and blameless post-mortems to drive continuous learning.

🚀 Capacity Planning

Demand forecasting, load testing, and autoscaling strategies to ensure infrastructure keeps pace with growth.

🛠️ Operational Tools

Incident & On-Call

Chaos & Reliability Testing

Service Mesh & Networking

📖 SRE Philosophy

"Hope is not a strategy. Reliability is an engineering discipline—not an accident. Every production incident is a gift: an opportunity to learn, to harden systems, and to build the culture of continuous improvement that high-performing teams rely on."

— Arunsanthoshkumar Annamalai

📝 Articles & Thoughts

Deep-dive SRE knowledge, platform engineering patterns, and reliability best practices — published on PulseTechOps Substack.

Error Budgets: Moving Beyond Uptime Theater

Why raw uptime percentages are a misleading reliability metric and how error budgets create the right incentives for development and operations teams.

📅 2024 · SLO · Reliability

Building an Observability Platform from Scratch with OpenTelemetry

A practical walkthrough of instrumenting a microservices estate using OpenTelemetry collectors, Prometheus, Loki, Tempo, and Grafana dashboards.

📅 2024 · Observability · OpenTelemetry

Kubernetes at Scale: Lessons Learned Running 1000+ Node Clusters

Operational insights from running large-scale Kubernetes in production—node lifecycle management, control-plane tuning, and etcd performance.

📅 2023 · Kubernetes · Platform

Chaos Engineering Playbook for Production Systems

Step-by-step guide to designing, running, and learning from chaos experiments—from hypothesis formulation to blast-radius control.

📅 2023 · Chaos · Resilience

GitOps Patterns That Actually Work in Enterprise

Practical GitOps patterns using Argo CD and Flux: multi-tenancy, secrets management, progressive delivery, and drift detection at scale.

📅 2023 · GitOps · CI/CD

Internal Developer Platforms: The SRE's Role in Developer Experience

How SRE teams can drive developer productivity by building paved-road platforms, golden-path templates, and self-service infrastructure.

📅 2022 · Platform Engineering · DX

📰 Subscribe to PulseTechOps

Get deep-dive SRE and cloud-native articles delivered to your inbox. Join the community of engineers building reliable systems.

Read & Subscribe on Substack →

✉️ Get in Touch

Interested in SRE consulting, speaking engagements, or collaboration on reliability engineering? Let's connect.

🌐 Website: pages.pulsetechops.com

💼 LinkedIn: linkedin.com/in/arun7pulse

🐙 GitHub: github.com/arun7pulse

📰 Substack: pulsetechops.substack.com

🤝 Areas of Collaboration

🎤 Speaking & Workshops

Available for conference talks, internal workshops, and webinars on SRE, reliability engineering, and cloud-native operations.

🔧 SRE Advisory

Helping your organisation build or mature its SRE practice, from establishing SLOs and error budgets to building observability platforms.

🏗️ Architecture Review

Cloud architecture and platform engineering reviews focusing on reliability, scalability, security, and operational excellence.

📝 Technical Writing

Collaborating on whitepapers, engineering blog posts, and documentation covering SRE best practices and cloud infrastructure.

📍 Based globally, working remotely.
Reach out via LinkedIn or GitHub for professional enquiries.