CI/CD Explained: From Git Push to Production

Episode - CI/CD Explained: From Git Push to Production

Hey everyone! Welcome back to the tutorial series. Today we are going to learn about one of the MOST important topics in modern software engineering - CI/CD (Continuous Integration / Continuous Deployment)!

I am super excited about this topic because understanding CI/CD is what separates a junior developer from a senior developer. Trust me, this is asked in almost every backend and DevOps interview!

Have you ever wondered - what happens after you type "git push"? How does your code magically reach production, serve millions of users, and do it with ZERO downtime? Let's find out!

What we will cover:

  • What is CI/CD?
  • The Problem Before CI/CD
  • Step 1: Code Review (PR → Review → Merge)
  • Step 2: Automated Quality Gates (Tests, Security Scans, Linting)
  • Step 3: Containerization (Docker Build → Push to Registry)
  • Step 4: Staging Validation (Deploy to Staging + Smoke Tests)
  • Step 5: Progressive Production Rollout (Canary Deployment)
  • Step 6: Auto-Rollback Safety Net
  • Complete CI/CD Pipeline Example
  • Interview Questions
  • Key Points to Remember
The Complete Journey - From Git Push to Production:
====================================================

  git push
     │
     ▼
┌──────────┐   ┌──────────────┐   ┌────────────────┐   ┌──────────────┐   ┌─────────────┐   ┌──────────────┐
│  STEP 1  │──→│    STEP 2    │──→│    STEP 3      │──→│   STEP 4     │──→│   STEP 5    │──→│   STEP 6     │
│   Code   │   │   Quality    │   │ Containerize   │   │   Staging    │   │  Production │   │  Auto        │
│  Review  │   │   Gates      │   │  (Docker)      │   │  Validation  │   │  Rollout    │   │  Rollback    │
└──────────┘   └──────────────┘   └────────────────┘   └──────────────┘   └─────────────┘   └──────────────┘
   PR +            Tests +           Docker Build        Deploy to           5% → 25%          If error > 5%
   Review          Scans +           + Push to           Staging +           → 50% → 100%      → Revert!
   + Merge         Linting           Registry            Smoke Tests         (Canary)           → Alert!

⏱️ Total Time: 15-30 minutes | Zero Downtime | Fully Automated

What is CI/CD?

Let's start with the most basic question - What exactly is CI/CD?

"CI/CD is a set of practices that automate the process of integrating code changes, running tests, and deploying applications to production."

Wait, what does that mean? Let me break it down for you!

  • CI (Continuous Integration) - Automatically merge and test code changes frequently
  • CD (Continuous Delivery) - Automatically prepare code for release to production
  • CD (Continuous Deployment) - Automatically deploy every change that passes all tests to production

In simple words:

CI/CD = You push code → Machines do EVERYTHING else → Code reaches production safely!

CI vs CD vs CD:
===============

┌──────────────────────────────────────────────────────────────────┐
│                                                                  │
│  CI (Continuous Integration)                                     │
│  ════════════════════════════                                    │
│  Developer pushes code → Automatically builds → Runs tests      │
│                                                                  │
│  "Merge code frequently and catch bugs early"                    │
│                                                                  │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  CD (Continuous Delivery)                                        │
│  ═════════════════════════                                       │
│  CI + Automatically prepares release → ONE CLICK to deploy      │
│                                                                  │
│  "Code is always in a deployable state"                          │
│                                                                  │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  CD (Continuous Deployment)                                      │
│  ══════════════════════════                                      │
│  CI + Automatically deploys to production → NO human approval    │
│                                                                  │
│  "Every change that passes tests goes live automatically"        │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

The Problem Before CI/CD

To understand why CI/CD was created, let's see what problems existed before it!

The Old Way - Manual Deployment:
=================================

Developer 1: "I'll push my changes on Friday"
Developer 2: "Me too"
Developer 3: "Same here"

Friday Night (Deployment Day):
==============================

Step 1: Merge all code manually
  → Merge conflicts everywhere! 😱

Step 2: Run tests manually
  → "Who broke the tests?" 🤔

Step 3: Build manually
  → "Works on my machine!" 🤷

Step 4: SSH into production server
  → "Let me just copy these files..." 😰

Step 5: Deploy manually
  → Server goes down at 2 AM! 🔥
  → Weekend ruined! 😭

RESULT:
- 6-8 hours of manual work
- Stressful deployments
- Bugs reach production
- Server downtime
- Angry customers
- Unhappy developers

CI/CD solves ALL these problems!

The CI/CD Way:
==============

Developer pushes code → Pipeline takes over!

  git push origin main
     │
     ▼
  ✅ Tests run automatically
  ✅ Code quality checked
  ✅ Security scanned
  ✅ Docker image built
  ✅ Deployed to staging
  ✅ Smoke tests pass
  ✅ Rolled out to production (gradually)
  ✅ Monitored for errors

RESULT:
- 15-30 minutes, fully automated
- Zero downtime
- Zero human error
- Bugs caught before production
- Happy customers
- Happy developers!

Step 1: Code Review - "PR → Review → Merge"

Everything starts with a Pull Request (PR). A developer writes code, pushes to a feature branch, and raises a PR!

The Code Review Flow:
=====================

Developer writes code on feature branch
     │
     ▼
┌──────────────────────────────────────────────────────┐
│                 PULL REQUEST (PR)                      │
│                                                        │
│  Title: "Add user authentication endpoint"             │
│  Branch: feature/auth → main                           │
│                                                        │
│  Changes:                                              │
│  + src/routes/auth.js         (new file)              │
│  + src/middleware/jwt.js      (new file)              │
│  ~ src/app.js                 (modified)              │
│  + tests/auth.test.js         (new file)              │
│                                                        │
│  Status:                                               │
│  ┌────────────────────────────────────────────┐       │
│  │  ✅ All CI checks passed                    │       │
│  │  ✅ Code coverage: 92%                      │       │
│  │  ✅ No security vulnerabilities             │       │
│  │  👀 Waiting for 2 approvals                 │       │
│  └────────────────────────────────────────────┘       │
└──────────────────────────────────────────────────────┘
     │
     ▼
┌──────────────────────────────────────────────────────┐
│                 TEAM REVIEWS                           │
│                                                        │
│  Reviewer 1 (Senior Dev):                              │
│  "Looks good! But add input validation on line 42"     │
│  Status: Changes Requested                             │
│                                                        │
│  Reviewer 2 (Tech Lead):                               │
│  "Clean code! Approved ✅"                              │
│                                                        │
│  Developer fixes line 42 → Pushes again                │
│                                                        │
│  Reviewer 1: "Approved ✅"                              │
└──────────────────────────────────────────────────────┘
     │
     ▼
┌──────────────────────────────────────────────────────┐
│                 MERGE TO MAIN                          │
│                                                        │
│  ✅ 2 approvals received                               │
│  ✅ All CI checks passed                               │
│  ✅ No merge conflicts                                 │
│                                                        │
│  → Squash and Merge into main branch                   │
│  → Feature branch deleted                              │
│  → CI/CD Pipeline triggered!                           │
└──────────────────────────────────────────────────────┘

Q: Why does code review matter?

Benefits of Code Review:
========================

1. Catch bugs BEFORE they reach production
2. Share knowledge across the team
3. Maintain code quality and consistency
4. Security issues caught early
5. Better architecture decisions
6. Junior developers learn from seniors

Branch Protection Rules:

GitHub Branch Protection (Settings → Branches):
================================================

main branch rules:
  ✅ Require pull request before merging
  ✅ Require at least 2 approvals
  ✅ Require status checks to pass (CI pipeline)
  ✅ Require branch to be up to date
  ✅ Do not allow bypassing
  ❌ No direct push to main allowed!

This ensures NO code reaches main without review!
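
The same rules can be applied programmatically. Here's a sketch of the request body for GitHub's branch-protection REST API (`PUT /repos/{owner}/{repo}/branches/main/protection`) - the field names follow the public API, but treat the exact values as an assumption to verify against your own setup:

```json
{
  "required_pull_request_reviews": {
    "required_approving_review_count": 2,
    "dismiss_stale_reviews": true
  },
  "required_status_checks": {
    "strict": true,
    "contexts": ["lint", "test", "integration", "security"]
  },
  "enforce_admins": true,
  "restrictions": null
}
```

The `strict: true` flag is what implements "Require branch to be up to date", and `contexts` lists the CI jobs that must pass before merging.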

Step 2: Automated Quality Gates - "The Gatekeepers"

The moment code is pushed or a PR is created, GitHub Actions triggers automatically and runs a series of checks. Think of these as gatekeepers - your code MUST pass ALL of them!

Quality Gates Flow:
===================

  git push / PR created
          │
          ▼
  GitHub Actions Triggered!
          │
    ┌─────┴─────┬──────────────┬───────────────┐
    │           │              │               │
    ▼           ▼              ▼               ▼
┌────────┐ ┌─────────┐ ┌───────────┐ ┌──────────────┐
│  Unit  │ │Integra- │ │ Security  │ │ Code Quality │
│  Tests │ │  tion   │ │   Scan    │ │    Check     │
│ (Jest) │ │  Tests  │ │(SonarQube)│ │  (ESLint)    │
└───┬────┘ └────┬────┘ └─────┬─────┘ └──────┬───────┘
    │           │            │              │
    ▼           ▼            ▼              ▼
   ✅/❌       ✅/❌        ✅/❌          ✅/❌
    │           │            │              │
    └─────┬─────┴────────────┴──────────────┘
          │
     ALL PASSED?
          │
    ┌─────┴─────┐
    │           │
   YES          NO
    │           │
    ▼           ▼
  Next      ❌ PIPELINE STOPS!
  Step      No broken code reaches production.

MIND BLOWN, right? Any single failure stops the ENTIRE pipeline. No broken code can sneak through!

2A) Unit Tests (Jest, PyTest)

Unit tests check if individual functions work correctly.

// auth.test.js - Unit Tests

// Helpers under test (module path assumed for illustration)
const { hashPassword, generateToken, login, getProfile } = require("../src/auth");

describe("Authentication", () => {

    test("should hash password correctly", async () => {
        const password = "myPassword123";
        const hashed = await hashPassword(password);

        expect(hashed).not.toBe(password);
        expect(hashed.length).toBeGreaterThan(50);
    });

    test("should generate valid JWT token", () => {
        const user = { id: "123", email: "dev@test.com" };
        const token = generateToken(user);

        expect(token).toBeDefined();
        expect(token.split(".").length).toBe(3); // Header.Payload.Signature
    });

    test("should reject invalid credentials", async () => {
        const result = await login("wrong@email.com", "wrongPassword");

        expect(result.status).toBe(401);
        expect(result.body.error).toBe("Invalid credentials");
    });

    test("should return user profile for valid token", async () => {
        const token = generateToken({ id: "123" });
        const result = await getProfile(token);

        expect(result.status).toBe(200);
        expect(result.body.userId).toBe("123");
    });
});
OUTPUT (Jest):
==============

 PASS  tests/auth.test.js
  Authentication
    ✓ should hash password correctly (45ms)
    ✓ should generate valid JWT token (12ms)
    ✓ should reject invalid credentials (89ms)
    ✓ should return user profile for valid token (34ms)

Test Suites: 1 passed, 1 total
Tests:       4 passed, 4 total
Coverage:    92.5%

All tests passed! ✅

2B) Integration Tests (API Contract Validation)

Integration tests check if multiple components work together correctly - like your API endpoints with the database!

// integration/api.test.js

const request = require("supertest");        // HTTP assertions against the app
const app = require("../src/app");           // Express app (path assumed)
const User = require("../src/models/user");  // User model (path assumed)

describe("API Integration Tests", () => {

    test("POST /signup → should create user in database", async () => {
        const res = await request(app)
            .post("/signup")
            .send({
                name: "Akshay",
                email: "akshay@test.com",
                password: "securePass123"
            });

        expect(res.status).toBe(201);
        expect(res.body.message).toBe("User created successfully");

        // Verify user exists in DB
        const user = await User.findOne({ email: "akshay@test.com" });
        expect(user).toBeDefined();
        expect(user.name).toBe("Akshay");
    });

    test("POST /login → should return JWT token", async () => {
        const res = await request(app)
            .post("/login")
            .send({
                email: "akshay@test.com",
                password: "securePass123"
            });

        expect(res.status).toBe(200);
        expect(res.body.token).toBeDefined();
    });

    test("GET /profile → should reject without token", async () => {
        const res = await request(app).get("/profile");

        expect(res.status).toBe(401);
    });
});

2C) SAST Security Scans (SonarQube, Snyk)

SAST stands for Static Application Security Testing. It scans your code for security vulnerabilities without running it!

What SAST Catches:
==================

┌──────────────────────────────────────────────────────┐
│               SECURITY SCAN REPORT                    │
├──────────────────────────────────────────────────────┤
│                                                       │
│  ❌ CRITICAL: SQL Injection in routes/user.js:42      │
│     const query = "SELECT * FROM users WHERE          │
│                    id = " + req.params.id;            │
│     FIX: Use parameterized queries!                   │
│                                                       │
│  ❌ HIGH: Hardcoded secret in config.js:15            │
│     const SECRET = "mySecretKey123";                  │
│     FIX: Use environment variables!                   │
│                                                       │
│  ⚠️ MEDIUM: Outdated dependency (lodash 4.17.15)     │
│     Known vulnerability: CVE-2021-23337              │
│     FIX: Update to lodash 4.17.21+                   │
│                                                       │
│  ⚠️ LOW: console.log in production code              │
│     FIX: Use proper logging library                   │
│                                                       │
├──────────────────────────────────────────────────────┤
│  Result: ❌ FAILED (2 critical issues found)          │
│  Pipeline: STOPPED                                    │
└──────────────────────────────────────────────────────┘

Tools:
- SonarQube  → Code quality + security analysis
- Snyk       → Dependency vulnerability scanning
- ESLint     → Code pattern analysis (security rules)
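
To see why the scanner flags string-built SQL as CRITICAL, compare the two query styles. This is an illustrative sketch with no real database - the `?` placeholder syntax matches drivers like mysql2, but check your own driver's docs:

```javascript
// What the scanner flagged: user input concatenated directly into SQL
const userInput = "1 OR 1=1"; // attacker-controlled req.params.id
const vulnerable = "SELECT * FROM users WHERE id = " + userInput;
console.log(vulnerable);
// The WHERE clause is now always true: every row leaks!

// The fix: parameterized query - input is bound as DATA, not as SQL
const safeQuery = "SELECT * FROM users WHERE id = ?";
const params = [userInput];
// db.query(safeQuery, params)  <- the driver escapes/binds the value
console.log(safeQuery, params);
```

With the parameterized form, `1 OR 1=1` is treated as a literal value to match against `id`, not as executable SQL.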

2D) Code Quality Checks (ESLint, Prettier)

ESLint + Prettier:
==================

ESLint checks:
  ✅ No unused variables
  ✅ No console.log in production
  ✅ Consistent error handling
  ✅ No eval() usage
  ✅ Proper async/await patterns

Prettier checks:
  ✅ Consistent indentation
  ✅ Consistent quotes (single/double)
  ✅ Line length limits
  ✅ Trailing commas
  ✅ Semicolons
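
Several of these checks map directly to ESLint rules. A minimal flat-config sketch - the rule names are real ESLint core rules, but the file layout is an assumption about this project:

```javascript
// eslint.config.js - minimal flat config (ESLint 9+ style)
const config = [
    {
        rules: {
            "no-unused-vars": "error",  // dead variables fail CI
            "no-console": "error",      // no console.log in production
            "no-eval": "error",         // eval() is a security hazard
            "require-await": "warn",    // flag async functions that never await
        },
    },
];

module.exports = config;
```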

Here's a real GitHub Actions workflow that runs ALL these checks:

# .github/workflows/ci.yml

name: CI Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  # ===========================
  # JOB 1: Lint and Format
  # ===========================
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '18'
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run ESLint
        run: npm run lint

      - name: Check Prettier formatting
        run: npm run format:check

  # ===========================
  # JOB 2: Unit Tests
  # ===========================
  test:
    runs-on: ubuntu-latest
    needs: lint    # Only runs if lint passes!
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '18'
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run unit tests
        run: npm test -- --coverage

      - name: Check coverage threshold
        run: |
          COVERAGE=$(cat coverage/coverage-summary.json | jq '.total.lines.pct')
          if (( $(echo "$COVERAGE < 80" | bc -l) )); then
            echo "Coverage is $COVERAGE%, must be at least 80%!"
            exit 1
          fi

  # ===========================
  # JOB 3: Integration Tests
  # ===========================
  integration:
    runs-on: ubuntu-latest
    needs: lint
    services:
      mongodb:
        image: mongo:6
        ports:
          - 27017:27017
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '18'

      - name: Install dependencies
        run: npm ci

      - name: Run integration tests
        run: npm run test:integration
        env:
          DATABASE_URL: mongodb://localhost:27017/testdb

  # ===========================
  # JOB 4: Security Scan
  # ===========================
  security:
    runs-on: ubuntu-latest
    needs: lint
    steps:
      - uses: actions/checkout@v4

      - name: Run Snyk security scan
        uses: snyk/actions/node@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}

      - name: Run SonarQube scan
        uses: sonarsource/sonarqube-scan-action@v5
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
          SONAR_HOST_URL: ${{ secrets.SONAR_HOST_URL }}

GitHub Actions UI:
==================

┌──────────────────────────────────────────────────────┐
│  CI Pipeline                                          │
│                                                       │
│  ✅ lint          (32s)   ESLint + Prettier passed    │
│  ├── ✅ test      (1m 45s) 48 tests passed, 92% cov │
│  ├── ✅ integration (2m 12s) API contracts valid     │
│  └── ✅ security  (1m 30s) No vulnerabilities found  │
│                                                       │
│  Status: All checks passed ✅                         │
│  Ready to merge!                                      │
└──────────────────────────────────────────────────────┘

Step 3: Containerization - "Package It Up!"

All quality gates passed? Now the code gets packaged into a Docker container - an immutable artifact that runs the same EVERYWHERE!

Q: Why containerize?

Why Docker in CI/CD:
====================

Without Docker:
  Dev: "Works on my machine!"    → Node 18, npm 9, Ubuntu 22
  Staging: Different Node version → Node 16, npm 7, Ubuntu 20
  Prod: Different OS entirely    → Node 18, npm 9, Amazon Linux

  Same code, different behavior! 😱

With Docker:
  Dev:     Same Docker image
  Staging: Same Docker image
  Prod:    Same Docker image

  Same image everywhere = Same behavior everywhere! ✅

This is what we call an IMMUTABLE ARTIFACT!

The Dockerfile:

# Dockerfile - Production optimized

# Stage 1: Build
FROM node:18-alpine AS builder

WORKDIR /app

COPY package*.json ./

RUN npm ci --omit=dev    # Clean install, production deps only (modern form of --production)

COPY . .

# Stage 2: Production image (smaller!)
FROM node:18-alpine

WORKDIR /app

# Don't run as root!
RUN addgroup -S appgroup && adduser -S appuser -G appgroup

# Copy only what we need from builder
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/src ./src
COPY --from=builder /app/package.json ./

USER appuser

EXPOSE 3000

HEALTHCHECK --interval=30s --timeout=3s \
    CMD wget -qO- http://localhost:3000/health || exit 1

CMD ["node", "src/index.js"]
What Makes This Dockerfile Production-Ready:
=============================================

✅ Multi-stage build     → Smaller final image (~150MB vs ~900MB)
✅ npm ci --omit=dev     → Only production dependencies
✅ Non-root user         → Security best practice
✅ HEALTHCHECK           → Container self-monitoring
✅ Alpine base           → Minimal OS footprint
✅ .dockerignore         → Excludes node_modules, .git, .env
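
Since the checklist mentions it, here's what a matching .dockerignore might contain (a sketch - adjust to your repo):

```
# .dockerignore - keep the build context small and secret-free
node_modules
.git
.env
coverage
*.md
Dockerfile
.github
```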

The CI/CD step that builds and pushes:

# .github/workflows/cd.yml (continued)

  # ===========================
  # JOB 5: Build & Push Docker Image
  # ===========================
  build:
    runs-on: ubuntu-latest
    needs: [test, integration, security]   # ALL must pass!
    steps:
      - uses: actions/checkout@v4

      - name: Login to container registry
        uses: docker/login-action@v3
        with:
          registry: registry.acme.com
          username: ${{ secrets.REGISTRY_USER }}
          password: ${{ secrets.REGISTRY_TOKEN }}

      - name: Build Docker image
        run: |
          docker build \
            -t registry.acme.com/app:${{ github.sha }} \
            -t registry.acme.com/app:latest \
            .

      - name: Push to registry
        run: |
          docker push registry.acme.com/app:${{ github.sha }}
          docker push registry.acme.com/app:latest
Image Tagging Strategy:
=======================

registry.acme.com/app:v2.4.1-abc123f
                      │     │
                      │     └── Git commit SHA (unique!)
                      └── Semantic version

Why this tag format?
- v2.4.1    → Human readable version
- abc123f   → Exact commit this image was built from
- Can always trace image back to exact code!

Every build = New unique image
Old images = Still available for rollback
Docker Image Flow:
==================

┌──────────────┐          ┌──────────────────┐          ┌──────────────┐
│   Source     │          │   Docker Build   │          │   Registry   │
│   Code       │─────────→│                  │─────────→│              │
│              │          │  FROM node:18    │          │  ECR /       │
│ src/         │          │  COPY . .        │          │  Docker Hub  │
│ package.json │          │  RUN npm ci      │          │              │
│ Dockerfile   │          │  CMD ["node"...] │          │ app:v2.4.1   │
└──────────────┘          └──────────────────┘          │ app:v2.4.0   │
                                                        │ app:v2.3.9   │
                          Immutable artifact!           │ ...          │
                          Same everywhere!              └──────────────┘

Step 4: Staging Validation - "Test Before Going Live!"

Before touching production, the image gets deployed to a staging environment - an exact replica of production!

Staging = Mini Production:
==========================

┌──────────────────────────────────────────────────────┐
│                 STAGING ENVIRONMENT                    │
│              (Exact replica of production)             │
│                                                       │
│  ✅ Same Kubernetes version                           │
│  ✅ Same database schema                              │
│  ✅ Same resource limits (CPU, RAM)                   │
│  ✅ Same environment variables (different values)     │
│  ✅ Same network configuration                        │
│  ✅ Same container image                              │
│                                                       │
│  Only difference:                                     │
│  - Smaller scale (fewer replicas)                     │
│  - Test database (not real user data)                 │
│  - Not public facing                                  │
└──────────────────────────────────────────────────────┘

Why staging?
- Catch environment-specific bugs
- Test database migrations
- Verify configuration
- Run end-to-end tests
- QA team can manually test

Automated Smoke Tests on Staging:

Smoke Tests = Quick Health Checks:
===================================

After deploying to staging, automated tests run:

┌──────────────────────────────────────────────────────┐
│              SMOKE TEST SUITE                         │
├──────────────────────────────────────────────────────┤
│                                                       │
│  Test 1: Health Check                                 │
│  GET /health → Expected: 200 OK                      │
│  Result: ✅ PASSED (response in 23ms)                 │
│                                                       │
│  Test 2: Database Connectivity                        │
│  GET /health/db → Expected: 200 OK                   │
│  Result: ✅ PASSED (MongoDB connected)                │
│                                                       │
│  Test 3: Login Flow                                   │
│  POST /login → Expected: 200 + JWT Token             │
│  Result: ✅ PASSED (token received)                   │
│                                                       │
│  Test 4: Checkout Flow                                │
│  POST /checkout → Expected: 201 Created              │
│  Result: ✅ PASSED (order created)                    │
│                                                       │
│  Test 5: Search                                       │
│  GET /search?q=test → Expected: 200 + results        │
│  Result: ✅ PASSED (42 results returned)              │
│                                                       │
├──────────────────────────────────────────────────────┤
│  All 5 smoke tests passed ✅                          │
│  Staging deployment healthy!                          │
│  Ready for production rollout!                        │
└──────────────────────────────────────────────────────┘
#!/bin/bash
# smoke-tests.sh - Automated smoke test script

STAGING_URL="https://staging.acme.com"
FAILED=0

echo "Running smoke tests on $STAGING_URL..."

# Test 1: Health endpoint
STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$STAGING_URL/health")
if [ "$STATUS" -eq 200 ]; then
    echo "✅ Health check passed"
else
    echo "❌ Health check failed (status: $STATUS)"
    FAILED=1
fi

# Test 2: Database connectivity
STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$STAGING_URL/health/db")
if [ "$STATUS" -eq 200 ]; then
    echo "✅ Database connectivity passed"
else
    echo "❌ Database connectivity failed"
    FAILED=1
fi

# Test 3: Login flow
RESPONSE=$(curl -s -w "\n%{http_code}" -X POST "$STAGING_URL/login" \
    -H "Content-Type: application/json" \
    -d '{"email":"test@test.com","password":"testPass123"}')
STATUS=$(echo "$RESPONSE" | tail -1)
if [ "$STATUS" -eq 200 ]; then
    echo "✅ Login flow passed"
else
    echo "❌ Login flow failed"
    FAILED=1
fi

# Final result
if [ "$FAILED" -eq 1 ]; then
    echo "❌ SMOKE TESTS FAILED! Blocking production deployment."
    exit 1
else
    echo "✅ ALL SMOKE TESTS PASSED! Ready for production."
    exit 0
fi
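
In the pipeline, this script runs as its own job so that a failing smoke test blocks the rollout. A sketch of how it could be wired into the workflow (the deploy script and job names are assumptions):

```yaml
  # ===========================
  # JOB 6: Deploy to Staging + Smoke Tests
  # ===========================
  staging:
    runs-on: ubuntu-latest
    needs: build                 # Only runs if the image was built & pushed
    steps:
      - uses: actions/checkout@v4

      - name: Deploy image to staging
        run: ./scripts/deploy-staging.sh ${{ github.sha }}   # assumed helper script

      - name: Run smoke tests
        run: ./smoke-tests.sh    # exits 1 on failure -> pipeline stops here
```

Because the script exits non-zero on any failed check, GitHub Actions marks the job failed and nothing downstream (the production rollout) ever runs.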

Step 5: Progressive Production Rollout - "Go Live Gradually!"

This is where it gets really exciting! We DON'T deploy to 100% of users at once. Instead, we use a canary deployment strategy - rolling out gradually!

Q: What is a Canary Deployment?

A: The name comes from coal miners who used canaries to detect toxic gas. If the canary died, miners knew to evacuate. Similarly, we send a small percentage of traffic to the new version first - if something goes wrong, only a few users are affected!

Canary Deployment Flow:
=======================

PHASE 1: 5% of traffic (Canary Pods)
==========================================

┌──────────────┐
│   USERS      │
│  (100%)      │
└──────┬───────┘
       │
       ▼
┌──────────────────────────────────────────────┐
│              LOAD BALANCER                    │
└──────┬──────────────────────────────┬────────┘
       │ 95% traffic                  │ 5% traffic
       ▼                              ▼
┌──────────────────┐         ┌──────────────────┐
│   OLD VERSION    │         │   NEW VERSION    │
│   v2.3.9         │         │   v2.4.1         │
│                  │         │   (Canary)       │
│  Pod 1           │         │  Pod 1           │
│  Pod 2           │         │                  │
│  Pod 3           │         │                  │
│  Pod 4           │         │                  │
└──────────────────┘         └──────────────────┘

→ Monitor for 10 minutes
→ Grafana tracks: error rate, latency, CPU, memory
→ If healthy: proceed to Phase 2
→ If unhealthy: ROLLBACK immediately!


PHASE 2: 25% of traffic
==========================================

┌──────────────────┐         ┌──────────────────┐
│   OLD VERSION    │         │   NEW VERSION    │
│   v2.3.9         │         │   v2.4.1         │
│                  │         │                  │
│  Pod 1           │         │  Pod 1           │
│  Pod 2           │         │  Pod 2           │
│  Pod 3           │         │                  │
└──────────────────┘         └──────────────────┘
     75%                          25%

→ Run synthetic transactions (automated user flows)
→ Validate business KPIs (conversion rate, checkout success)


PHASE 3: 50% → 100%
==========================================

┌──────────────────┐         ┌──────────────────┐
│   OLD VERSION    │         │   NEW VERSION    │
│   (draining)     │         │   v2.4.1         │
│                  │         │                  │
│                  │         │  Pod 1           │
│                  │         │  Pod 2           │
│                  │         │  Pod 3           │
│                  │         │  Pod 4           │
└──────────────────┘         └──────────────────┘
      0%                          100%

→ Full rollout complete!
→ Old pods terminated gracefully
→ Zero downtime! Users didn't notice a thing!

Kubernetes + Argo Rollouts Configuration:

# argo-rollout.yaml

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 4
  strategy:
    canary:
      # Phase 1: 5% traffic
      steps:
        - setWeight: 5
        - pause: { duration: 10m }    # Monitor for 10 minutes

      # Phase 2: 25% traffic
        - setWeight: 25
        - pause: { duration: 10m }

      # Phase 3: 50% traffic
        - setWeight: 50
        - pause: { duration: 5m }

      # Phase 4: Full rollout
        - setWeight: 100

      # Auto-rollback conditions
      analysis:
        templates:
          - templateName: success-rate
        args:
          - name: service-name
            value: my-app

  template:
    spec:
      containers:
        - name: my-app
          image: registry.acme.com/app:v2.4.1-abc123f
          ports:
            - containerPort: 3000
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
What Argo Rollouts Does:
========================

Manual Kubernetes deployment:
  kubectl apply → ALL pods updated at once → Risky!

Argo Rollouts:
  kubectl apply → 5% → wait → 25% → wait → 50% → 100%
                   ↑          ↑           ↑
              Monitor     Monitor     Monitor
              metrics     metrics     metrics

If anything goes wrong at ANY phase → Auto rollback!

Step 6: Auto-Rollback Safety Net - "The Insurance Policy"

This is the most critical safety feature of the entire pipeline. If something goes wrong in production, the system automatically reverts to the last stable version!

Auto-Rollback Triggers:
=======================

IF error rate > 5%
   OR p99 latency > 500ms
   OR CPU usage > 90%
   OR memory usage > 85%
   OR health check fails

THEN:
   ┌──────────────────────────────────────────────────┐
   │           AUTO-ROLLBACK TRIGGERED!                │
   │                                                    │
   │  1. Argo detects unhealthy metrics                │
   │  2. Traffic shifted back to stable version        │
   │  3. Canary pods terminated                        │
   │  4. Stable version serves 100% traffic            │
   │  5. Alert fires to Slack + PagerDuty              │
   │  6. Deployment marked as FAILED                   │
   │  7. Pipeline blocked until root cause fixed       │
   │                                                    │
   │  Time to rollback: ~30 seconds                    │
   │  Users affected: Only those in canary (5-25%)     │
   └──────────────────────────────────────────────────┘
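The IF/THEN logic above boils down to simple threshold comparisons. A minimal sketch in shell, using the example thresholds from this section (only error rate and latency shown; in reality Argo evaluates these against Prometheus, as in the analysis template later in this step):

```shell
# Hypothetical rollback decision: succeeds (exit 0) when metrics are unhealthy.
should_rollback() {
  error_rate="$1"   # percent of 5xx responses, e.g. "12.3"
  p99_ms="$2"       # p99 latency in milliseconds
  awk -v e="$error_rate" -v l="$p99_ms" \
    'BEGIN { exit !(e + 0 > 5 || l + 0 > 500) }'
}

if should_rollback 12.3 120; then echo "ROLLBACK"; else echo "HEALTHY"; fi   # → ROLLBACK
if should_rollback 0.3 120;  then echo "ROLLBACK"; else echo "HEALTHY"; fi   # → HEALTHY
```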

Rollback Timeline:
==================

  00:00 - New version deployed (5% canary)
  00:02 - Grafana detects error rate spike: 12%
  00:02 - Argo Rollouts: "Error rate exceeds 5% threshold!"
  00:02 - AUTO-ROLLBACK INITIATED
  00:03 - Traffic shifted: 100% → stable version
  00:03 - Canary pods terminating...
  00:04 - Stable version serving all traffic
  00:04 - Slack alert: "🚨 Deployment v2.4.1 rolled back!"
  00:05 - PagerDuty alert to on-call engineer
  00:05 - All systems normal

  Total impact: ~3 minutes, only 5% of users affected!
  Without auto-rollback: HOURS of downtime, ALL users affected!

Argo Analysis Template (What Metrics to Watch):

# analysis-template.yaml

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    # Metric 1: Error Rate
    - name: error-rate
      interval: 60s
      failureLimit: 3       # 3 failures = rollback
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{
              status=~"5.*",
              app="my-app"
            }[5m]))
            /
            sum(rate(http_requests_total{
              app="my-app"
            }[5m]))
            * 100
      successCondition: result[0] < 5    # Must be under 5%

    # Metric 2: Latency
    - name: p99-latency
      interval: 60s
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                app="my-app"
              }[5m])) by (le)
            )
      successCondition: result[0] < 0.5  # Must be under 500ms
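The error-rate query above is just (5xx rate ÷ total rate) × 100. The same arithmetic with plain numbers, using illustrative request counts:

```shell
# Error rate = 5xx requests / total requests * 100,
# mirroring the Prometheus query above (counts are made-up examples).
error_rate() {
  errs="$1"; total="$2"
  awk -v e="$errs" -v t="$total" 'BEGIN { printf "%.1f\n", e / t * 100 }'
}

error_rate 12 400    # → 3.0  (under the 5% threshold: healthy)
error_rate 48 390    # → 12.3 (over the threshold: rollback)
```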

Monitoring Dashboard (Grafana):
===============================

┌──────────────────────────────────────────────────────┐
│                DEPLOYMENT MONITOR                     │
│                v2.4.1 Canary Rollout                  │
├──────────────────────────────────────────────────────┤
│                                                       │
│  Error Rate:           Canary: 0.3% ✅  Stable: 0.2% │
│  ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁  Threshold: < 5%               │
│                                                       │
│  p99 Latency:          Canary: 120ms ✅  Stable: 95ms │
│  ▁▁▁▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁  Threshold: < 500ms            │
│                                                       │
│  CPU Usage:            Canary: 45% ✅   Stable: 42%   │
│  ▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃                                 │
│                                                       │
│  Memory:               Canary: 280MB ✅  Stable: 265MB│
│  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄                                 │
│                                                       │
│  Traffic Split:        Canary: 25%      Stable: 75%  │
│  Phase: 2 of 4                                        │
│  Status: HEALTHY ✅                                    │
└──────────────────────────────────────────────────────┘

Alert Configuration:

Where Alerts Go:
================

┌─────────────────────┐
│   Argo detects      │
│   unhealthy metrics │
└──────────┬──────────┘
           │
     ┌─────┴─────┬─────────────┐
     │           │             │
     ▼           ▼             ▼
┌─────────┐ ┌─────────┐ ┌──────────┐
│  Slack  │ │PagerDuty│ │  Email   │
│ #alerts │ │ On-call │ │  Team    │
│ channel │ │ engineer│ │  Lead    │
└─────────┘ └─────────┘ └──────────┘

Slack message:
  🚨 ROLLBACK: Deployment v2.4.1 failed
  Error rate: 12.3% (threshold: 5%)
  Rolled back to: v2.3.9
  Affected users: ~5%
  Action required: Fix and redeploy

The Complete CI/CD Pipeline - GitHub Actions

Let me show you a real-world complete pipeline that ties ALL 6 steps together!

# .github/workflows/pipeline.yml

name: CI/CD Pipeline

on:
  push:
    branches: [main]

env:
  REGISTRY: registry.acme.com
  IMAGE_NAME: app
  STAGING_URL: https://staging.acme.com
  K8S_NAMESPACE: production

jobs:

  # ═══════════════════════════════════════
  # STEP 2: Quality Gates (Step 1 is PR review - done before merge)
  # ═══════════════════════════════════════

  lint:
    name: Code Quality
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '18', cache: 'npm' }
      - run: npm ci
      - run: npm run lint
      - run: npm run format:check

  test:
    name: Unit Tests
    runs-on: ubuntu-latest
    needs: lint
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '18', cache: 'npm' }
      - run: npm ci
      - run: npm test -- --coverage
      - name: Upload coverage
        uses: codecov/codecov-action@v4

  integration:
    name: Integration Tests
    runs-on: ubuntu-latest
    needs: lint
    services:
      mongodb:
        image: mongo:6
        ports: ['27017:27017']
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '18' }
      - run: npm ci
      - run: npm run test:integration

  security:
    name: Security Scan
    runs-on: ubuntu-latest
    needs: lint
    steps:
      - uses: actions/checkout@v4
      - name: Snyk vulnerability scan
        uses: snyk/actions/node@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}

  # ═══════════════════════════════════════
  # STEP 3: Containerization
  # ═══════════════════════════════════════

  build:
    name: Build & Push Docker Image
    runs-on: ubuntu-latest
    needs: [test, integration, security]   # ALL gates must pass!
    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@v4

      - name: Generate image tag
        id: meta
        run: echo "tags=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}" >> $GITHUB_OUTPUT

      - name: Log in to registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ secrets.REGISTRY_USER }}      # hypothetical secret names
          password: ${{ secrets.REGISTRY_PASSWORD }}

      - name: Build Docker image
        run: docker build -t ${{ steps.meta.outputs.tags }} .

      - name: Push to registry
        run: docker push ${{ steps.meta.outputs.tags }}

  # ═══════════════════════════════════════
  # STEP 4: Staging Validation
  # ═══════════════════════════════════════

  deploy-staging:
    name: Deploy to Staging
    runs-on: ubuntu-latest
    needs: build
    environment: staging
    steps:
      - uses: actions/checkout@v4

      - name: Deploy to staging cluster
        run: |
          kubectl set image deployment/app \
            app=${{ needs.build.outputs.image-tag }} \
            -n staging

      - name: Wait for rollout
        run: kubectl rollout status deployment/app -n staging --timeout=120s

      - name: Run smoke tests
        run: bash scripts/smoke-tests.sh ${{ env.STAGING_URL }}

  # ═══════════════════════════════════════
  # STEP 5 + 6: Production Rollout + Auto-Rollback
  # ═══════════════════════════════════════

  deploy-production:
    name: Production Canary Rollout
    runs-on: ubuntu-latest
    needs: [build, deploy-staging]   # build is listed too, so its image-tag output is accessible
    environment: production          # Requires manual approval!
    steps:
      - uses: actions/checkout@v4

      - name: Update Argo Rollout
        run: |
          kubectl argo rollouts set image my-app \
            my-app=${{ needs.build.outputs.image-tag }} \
            -n ${{ env.K8S_NAMESPACE }}

      - name: Monitor canary rollout
        run: |
          kubectl argo rollouts status my-app \
            -n ${{ env.K8S_NAMESPACE }} \
            --watch --timeout 30m

      - name: Notify success
        if: success()
        run: |
          curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
            -d '{"text":"✅ v2.4.1 deployed to production successfully!"}'

      - name: Notify failure
        if: failure()
        run: |
          curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
            -d '{"text":"🚨 v2.4.1 deployment failed! Auto-rolled back."}'
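The staging job invokes scripts/smoke-tests.sh, which the episode doesn't show. Here is a hypothetical sketch of what such a script might contain (the endpoints and checks are made-up examples):

```shell
#!/usr/bin/env sh
# scripts/smoke-tests.sh — hypothetical sketch of the smoke-test script
# called by the staging job above. Endpoints are illustrative only.

check() {
  base="$1"; path="$2"; want="$3"
  got=$(curl -s -o /dev/null -w '%{http_code}' "$base$path")
  if [ "$got" = "$want" ]; then
    echo "PASS $path ($got)"
  else
    echo "FAIL $path (want $want, got $got)" >&2
    return 1
  fi
}

run_smoke_tests() {
  base="$1"
  check "$base" /health 200 &&
  check "$base" /api/login 200 &&
  check "$base" /api/products 200
}

# Only run when invoked with a base URL, e.g.:
#   sh scripts/smoke-tests.sh https://staging.acme.com
[ -z "$1" ] || run_smoke_tests "$1"
```

Any non-200 response makes the script exit non-zero, which fails the staging job and blocks the production rollout.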

CI/CD Tools Landscape

Category         | Tools                                        | Purpose
-----------------|----------------------------------------------|---------------------------------------
CI/CD Platform   | GitHub Actions, Jenkins, GitLab CI, CircleCI | Orchestrate the pipeline
Testing          | Jest, PyTest, Mocha, Cypress                 | Unit, integration, E2E tests
Security         | SonarQube, Snyk, Trivy                       | SAST scans, dependency vulnerabilities
Code Quality     | ESLint, Prettier, Husky                      | Linting, formatting, git hooks
Containerization | Docker, Podman                               | Build immutable artifacts
Registry         | Amazon ECR, Docker Hub, GitHub (GHCR)        | Store container images
Orchestration    | Kubernetes, Docker Swarm, ECS                | Run containers at scale
Rollout          | Argo Rollouts, Flagger, Spinnaker            | Canary/blue-green deployments
Monitoring       | Prometheus, Grafana, Datadog                 | Metrics, dashboards, alerts
Alerting         | PagerDuty, Slack, OpsGenie                   | Incident notifications

Interview Questions - Quick Fire!

Q: What is CI/CD?

"CI (Continuous Integration) is the practice of automatically building and testing code changes whenever developers push to the repository. CD (Continuous Delivery/Deployment) extends this by automatically deploying the tested code to staging or production. Together, they create an automated pipeline from code commit to production deployment."

Q: What is the difference between Continuous Delivery and Continuous Deployment?

"Continuous Delivery means code is always in a deployable state and can be deployed with one click (manual approval). Continuous Deployment goes further - every change that passes all tests is automatically deployed to production with no human intervention."

Q: What are Quality Gates in CI/CD?

"Quality gates are automated checks that code must pass before it can proceed in the pipeline. They include unit tests, integration tests, security scans (SAST), code quality checks (linting), and coverage thresholds. If any gate fails, the pipeline stops and broken code cannot reach production."

Q: What is a Canary Deployment?

"Canary deployment is a strategy where the new version is rolled out to a small percentage of users first (like 5%), while the majority still use the stable version. If the canary is healthy (low error rate, good latency), traffic is gradually increased to 25%, 50%, then 100%. If issues are detected, traffic is immediately routed back to the stable version."

Q: What is the difference between Canary and Blue-Green deployment?

"In Blue-Green deployment, you have two identical environments - Blue (current) and Green (new). Traffic switches 100% from Blue to Green at once. In Canary deployment, traffic shifts gradually (5% → 25% → 50% → 100%), giving more time to detect issues. Canary is safer but takes longer; Blue-Green is faster but riskier."

Q: Why do we containerize applications in CI/CD?

"Containerization with Docker creates an immutable artifact that includes the application, its dependencies, and runtime environment. This ensures the exact same code runs in development, staging, and production - eliminating the 'works on my machine' problem. It also makes rollbacks trivial - just switch to the previous image."

Q: What is an immutable artifact?

"An immutable artifact is a build output (like a Docker image) that never changes after creation. Each build produces a new unique artifact tagged with the git commit SHA. If you need to change something, you create a new artifact rather than modifying the existing one. This ensures traceability and reliable rollbacks."
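As a tiny illustration, composing such a tag is pure string assembly (registry and app name are the example values used throughout this episode):

```shell
# Sketch: minting an immutable image tag from version + short git SHA
# (the v2.4.1-abc123f pattern from this episode).
image_tag() {
  version="$1"; sha="$2"
  printf 'registry.acme.com/app:%s-%s\n' "$version" "$sha"
}

image_tag v2.4.1 abc123f   # → registry.acme.com/app:v2.4.1-abc123f
# In CI you would derive the SHA automatically:
#   image_tag v2.4.1 "$(git rev-parse --short HEAD)"
```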

Q: What happens if a deployment fails in production?

"With a proper CI/CD pipeline using canary deployments and Argo Rollouts, the system automatically detects failures by monitoring metrics like error rate and latency. If thresholds are exceeded, an auto-rollback triggers within seconds - shifting all traffic back to the last stable version. Alerts fire to Slack and PagerDuty, and the deployment is blocked until the root cause is fixed."

Q: What is SAST and why is it important in CI/CD?

"SAST (Static Application Security Testing) scans source code for security vulnerabilities without executing it. Tools like SonarQube and Snyk check for SQL injection, hardcoded secrets, outdated dependencies with known CVEs, and other security issues. Running SAST in the CI pipeline ensures vulnerable code never reaches production."

Q: Why do we need a staging environment?

"Staging is a production replica used to validate deployments before going live. It catches environment-specific issues like database migration problems, configuration errors, and integration failures that unit tests might miss. Smoke tests on staging verify critical flows (health check, login, checkout) work correctly with the production-like setup."

Q: What metrics should you monitor during a production deployment?

"The key metrics are: error rate (percentage of 5xx responses), p99 latency (99th percentile response time), CPU and memory usage, and health check status. Business KPIs like conversion rate and checkout success are also important. If any metric crosses the threshold during canary rollout, an automatic rollback is triggered."

Quick Recap

Step             | What Happens                               | Tools
-----------------|--------------------------------------------|-------------------------------
1. Code Review   | PR raised → Team reviews → Merge to main   | GitHub, GitLab
2. Quality Gates | Tests + security scans + linting (auto)    | Jest, Snyk, ESLint, SonarQube
3. Containerize  | Build Docker image → Push to registry      | Docker, ECR, Docker Hub
4. Staging       | Deploy to staging → Run smoke tests        | Kubernetes, curl, Cypress
5. Production    | Canary rollout: 5% → 25% → 50% → 100%      | Argo Rollouts, Kubernetes
6. Safety Net    | Auto-rollback if metrics exceed thresholds | Prometheus, Grafana, PagerDuty

Key Points to Remember

  • CI = Merge + Build + Test automatically on every push
  • CD (Delivery) = Code always deployable, one-click deploy
  • CD (Deployment) = Auto-deploy every passing change, no human needed
  • Pull Requests = Code review before merge, branch protection rules
  • Quality Gates = Unit tests, integration tests, SAST scans, linting
  • Any gate failure = Pipeline stops, broken code blocked
  • Docker image = Immutable artifact, same everywhere (dev/staging/prod)
  • Image tag = version + git SHA for traceability (v2.4.1-abc123f)
  • Staging = Production replica for final validation + smoke tests
  • Canary Deployment = Gradual rollout (5% → 25% → 50% → 100%)
  • Auto-Rollback = If error rate > 5% OR latency > 500ms → revert instantly
  • Argo Rollouts = Progressive delivery controller for Kubernetes
  • Prometheus + Grafana = Metrics collection + visualization
  • PagerDuty + Slack = Alert on-call engineers immediately
  • Total time = 15-30 minutes, zero downtime, fully automated
  • Security is shift-left = Catch vulnerabilities early in the pipeline, not in production

What's Next?

Now you understand the complete CI/CD journey from git push to production! In the next episode, we can explore:

  • Kubernetes Deep Dive - Pods, Services, Deployments
  • Infrastructure as Code (Terraform)
  • GitOps with ArgoCD
  • Monitoring & Observability (Prometheus, Grafana, ELK Stack)

Keep coding, keep learning! See you in the next one!