CI/CD Explained: From Git Push to Production
Hey everyone! Welcome back to the tutorial series. Today we are going to learn about one of the MOST important topics in modern software engineering - CI/CD (Continuous Integration / Continuous Deployment)!
I am super excited about this topic because understanding CI/CD is what separates a junior developer from a senior developer. Trust me, this is asked in almost every backend and DevOps interview!
Have you ever wondered - what happens after you type "git push"? How does your code magically reach production, serve millions of users, and do it with ZERO downtime? Let's find out!
What we will cover:
- What is CI/CD?
- The Problem Before CI/CD
- Step 1: Code Review (PR → Review → Merge)
- Step 2: Automated Quality Gates (Tests, Security Scans, Linting)
- Step 3: Containerization (Docker Build → Push to Registry)
- Step 4: Staging Validation (Deploy to Staging + Smoke Tests)
- Step 5: Progressive Production Rollout (Canary Deployment)
- Step 6: Auto-Rollback Safety Net
- Complete CI/CD Pipeline Example
- Interview Questions
- Key Points to Remember
The Complete Journey - From Git Push to Production:
====================================================
git push
│
▼
┌──────────┐ ┌──────────────┐ ┌────────────────┐ ┌──────────────┐ ┌─────────────┐ ┌──────────────┐
│ STEP 1 │──→│ STEP 2 │──→│ STEP 3 │──→│ STEP 4 │──→│ STEP 5 │──→│ STEP 6 │
│ Code │ │ Quality │ │ Containerize │ │ Staging │ │ Production │ │ Auto │
│ Review │ │ Gates │ │ (Docker) │ │ Validation │ │ Rollout │ │ Rollback │
└──────────┘ └──────────────┘ └────────────────┘ └──────────────┘ └─────────────┘ └──────────────┘
PR + Tests + Docker Build Deploy to 5% → 25% If error > 5%
Review Scans + + Push to Staging + → 50% → 100% → Revert!
+ Merge Linting Registry Smoke Tests (Canary) → Alert!
⏱️ Total Time: 15-30 minutes | Zero Downtime | Fully Automated
What is CI/CD?
Let's start with the most basic question - What exactly is CI/CD?
"CI/CD is a set of practices that automate the process of integrating code changes, running tests, and deploying applications to production."
Wait, what does that mean? Let me break it down for you!
- CI (Continuous Integration) - Automatically merge and test code changes frequently
- CD (Continuous Delivery) - Automatically prepare code for release to production
- CD (Continuous Deployment) - Automatically deploy every change that passes all tests to production
In simple words:
CI/CD = You push code → Machines do EVERYTHING else → Code reaches production safely!
CI vs CD vs CD:
===============
CI (Continuous Integration)
════════════════════════════
Developer pushes code → Automatically builds → Runs tests
"Merge code frequently and catch bugs early"

CD (Continuous Delivery)
═════════════════════════
CI + Automatically prepares release → ONE CLICK to deploy
"Code is always in a deployable state"

CD (Continuous Deployment)
══════════════════════════
CI + Automatically deploys to production → NO human approval
"Every change that passes tests goes live automatically"
The Problem Before CI/CD
To understand why CI/CD was created, let's see what problems existed before it!
The Old Way - Manual Deployment:
=================================
Developer 1: "I'll push my changes on Friday"
Developer 2: "Me too"
Developer 3: "Same here"

Friday Night (Deployment Day):
==============================
Step 1: Merge all code manually → Merge conflicts everywhere! 😱
Step 2: Run tests manually → "Who broke the tests?" 🤔
Step 3: Build manually → "Works on my machine!" 🤷
Step 4: SSH into production server → "Let me just copy these files..." 😰
Step 5: Deploy manually → Server goes down at 2 AM! 🔥
        → Weekend ruined! 😠

RESULT:
- 6-8 hours of manual work
- Stressful deployments
- Bugs reach production
- Server downtime
- Angry customers
- Unhappy developers
CI/CD solves ALL these problems!
The CI/CD Way:
==============
Developer pushes code → Pipeline takes over!
git push origin main
│
▼
✅ Tests run automatically
✅ Code quality checked
✅ Security scanned
✅ Docker image built
✅ Deployed to staging
✅ Smoke tests pass
✅ Rolled out to production (gradually)
✅ Monitored for errors
RESULT:
- 15-30 minutes, fully automated
- Zero downtime
- Zero human error
- Bugs caught before production
- Happy customers
- Happy developers!
Step 1: Code Review - "PR → Review → Merge"
Everything starts with a Pull Request (PR). A developer writes code, pushes to a feature branch, and raises a PR!
The Code Review Flow:
=====================
Developer writes code on feature branch
│
▼
┌──────────────────────────────────────────────────────┐
│ PULL REQUEST (PR) │
│ │
│ Title: "Add user authentication endpoint" │
│ Branch: feature/auth → main │
│ │
│ Changes: │
│ + src/routes/auth.js (new file) │
│ + src/middleware/jwt.js (new file) │
│ ~ src/app.js (modified) │
│ + tests/auth.test.js (new file) │
│ │
│ Status: │
│ ┌────────────────────────────────────────────┐ │
│ │ ✅ All CI checks passed │ │
│ │ ✅ Code coverage: 92% │ │
│ │ ✅ No security vulnerabilities │ │
│ │ 👀 Waiting for 2 approvals │ │
│ └────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ TEAM REVIEWS │
│ │
│ Reviewer 1 (Senior Dev): │
│ "Looks good! But add input validation on line 42" │
│ Status: Changes Requested │
│ │
│ Reviewer 2 (Tech Lead): │
│ "Clean code! Approved ✅" │
│ │
│ Developer fixes line 42 → Pushes again │
│ │
│ Reviewer 1: "Approved ✅" │
└──────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ MERGE TO MAIN │
│ │
│ ✅ 2 approvals received │
│ ✅ All CI checks passed │
│ ✅ No merge conflicts │
│ │
│ → Squash and Merge into main branch │
│ → Feature branch deleted │
│ → CI/CD Pipeline triggered! │
└──────────────────────────────────────────────────────┘
Q: Why does code review matter?
Benefits of Code Review:
========================
1. Catch bugs BEFORE they reach production
2. Share knowledge across the team
3. Maintain code quality and consistency
4. Security issues caught early
5. Better architecture decisions
6. Junior developers learn from seniors
Branch Protection Rules:
GitHub Branch Protection (Settings → Branches):
================================================
main branch rules:
✅ Require pull request before merging
✅ Require at least 2 approvals
✅ Require status checks to pass (CI pipeline)
✅ Require branch to be up to date
✅ Do not allow bypassing
❌ No direct push to main allowed!

This ensures NO code reaches main without review!
Step 2: Automated Quality Gates - "The Gatekeepers"
The moment code is pushed or a PR is created, GitHub Actions triggers automatically and runs a series of checks. Think of these as gatekeepers - your code MUST pass ALL of them!
Quality Gates Flow:
===================
git push / PR created
│
▼
GitHub Actions Triggered!
│
┌─────┴─────┬──────────────┬───────────────┐
│ │ │ │
▼ ▼ ▼ ▼
┌────────┐ ┌─────────┐ ┌───────────┐ ┌──────────────┐
│ Unit │ │Integra- │ │ Security │ │ Code Quality │
│ Tests │ │ tion │ │ Scan │ │ Check │
│ (Jest) │ │ Tests │ │(SonarQube)│ │ (ESLint) │
└───┬────┘ └────┬────┘ └─────┬─────┘ └──────┬───────┘
│ │ │ │
▼ ▼ ▼ ▼
✅/❌ ✅/❌ ✅/❌ ✅/❌
│ │ │ │
└─────┬─────┴────────────┴──────────────┘
│
ALL PASSED?
│
┌─────┴─────┐
│ │
YES NO
│ │
▼ ▼
Next ❌ PIPELINE STOPS!
Step No broken code reaches production.
MIND BLOWN, right? Any single failure stops the ENTIRE pipeline. No broken code can sneak through!
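This fan-out-then-gate pattern is easy to model in plain Node. Here's a hypothetical sketch (the gate names and shape are illustrative, not the GitHub Actions API):

```javascript
// Sketch: run all quality gates in parallel; a single failure
// stops the pipeline, exactly like the fan-out diagram above.
async function runQualityGates(gates) {
  const results = await Promise.all(
    gates.map(async (gate) => ({ name: gate.name, passed: await gate.run() }))
  );
  const failed = results.filter((r) => !r.passed).map((r) => r.name);
  return { passed: failed.length === 0, failed };
}

// Example wiring (dummy gates standing in for Jest, SonarQube, ESLint):
const gates = [
  { name: "unit-tests", run: async () => true },
  { name: "integration-tests", run: async () => true },
  { name: "security-scan", run: async () => true },
  { name: "code-quality", run: async () => true },
];

runQualityGates(gates).then(({ passed, failed }) => {
  console.log(passed ? "ALL PASSED → next step" : `PIPELINE STOPS! Failed: ${failed}`);
});
```

Notice there is no "mostly passed" state: one red gate fails the whole set.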
2A) Unit Tests (Jest, PyTest)
Unit tests check if individual functions work correctly.
// auth.test.js - Unit Tests
// (imports assumed for illustration - paths follow the PR's file layout)
const { generateToken } = require("../src/middleware/jwt");
const { hashPassword, login, getProfile } = require("../src/routes/auth");

describe("Authentication", () => {
test("should hash password correctly", async () => {
const password = "myPassword123";
const hashed = await hashPassword(password);
expect(hashed).not.toBe(password);
expect(hashed.length).toBeGreaterThan(50);
});
test("should generate valid JWT token", () => {
const user = { id: "123", email: "dev@test.com" };
const token = generateToken(user);
expect(token).toBeDefined();
expect(token.split(".").length).toBe(3); // Header.Payload.Signature
});
test("should reject invalid credentials", async () => {
const result = await login("wrong@email.com", "wrongPassword");
expect(result.status).toBe(401);
expect(result.body.error).toBe("Invalid credentials");
});
test("should return user profile for valid token", async () => {
const token = generateToken({ id: "123" });
const result = await getProfile(token);
expect(result.status).toBe(200);
expect(result.body.userId).toBe("123");
});
});
OUTPUT (Jest):
==============
PASS tests/auth.test.js
Authentication
✓ should hash password correctly (45ms)
✓ should generate valid JWT token (12ms)
✓ should reject invalid credentials (89ms)
✓ should return user profile for valid token (34ms)
Test Suites: 1 passed, 1 total
Tests: 4 passed, 4 total
Coverage: 92.5%
All tests passed! ✅
2B) Integration Tests (API Contract Validation)
Integration tests check if multiple components work together correctly - like your API endpoints with the database!
// integration/api.test.js
// (supertest drives HTTP requests against the app; paths assumed)
const request = require("supertest");
const app = require("../src/app");
const { User } = require("../src/models"); // used to verify DB writes
describe("API Integration Tests", () => {
test("POST /signup → should create user in database", async () => {
const res = await request(app)
.post("/signup")
.send({
name: "Akshay",
email: "akshay@test.com",
password: "securePass123"
});
expect(res.status).toBe(201);
expect(res.body.message).toBe("User created successfully");
// Verify user exists in DB
const user = await User.findOne({ email: "akshay@test.com" });
expect(user).toBeDefined();
expect(user.name).toBe("Akshay");
});
test("POST /login → should return JWT token", async () => {
const res = await request(app)
.post("/login")
.send({
email: "akshay@test.com",
password: "securePass123"
});
expect(res.status).toBe(200);
expect(res.body.token).toBeDefined();
});
test("GET /profile → should reject without token", async () => {
const res = await request(app).get("/profile");
expect(res.status).toBe(401);
});
});
2C) SAST Security Scans (SonarQube, Snyk)
SAST stands for Static Application Security Testing. It scans your code for security vulnerabilities without running it!
What SAST Catches:
==================
SECURITY SCAN REPORT
--------------------
❌ CRITICAL: SQL Injection in routes/user.js:42
   const query = "SELECT * FROM users WHERE id = " + req.params.id;
   FIX: Use parameterized queries!

❌ HIGH: Hardcoded secret in config.js:15
   const SECRET = "mySecretKey123";
   FIX: Use environment variables!

⚠️ MEDIUM: Outdated dependency (lodash 4.17.15)
   Known vulnerability: CVE-2021-23337
   FIX: Update to lodash 4.17.21+

⚠️ LOW: console.log in production code
   FIX: Use proper logging library

Result: ❌ FAILED (2 critical issues found)
Pipeline: STOPPED

Tools:
- SonarQube → Code quality + security analysis
- Snyk → Dependency vulnerability scanning
- ESLint → Code pattern analysis (security rules)
2D) Code Quality Checks (ESLint, Prettier)
ESLint + Prettier:
==================
ESLint checks:
✅ No unused variables
✅ No console.log in production
✅ Consistent error handling
✅ No eval() usage
✅ Proper async/await patterns

Prettier checks:
✅ Consistent indentation
✅ Consistent quotes (single/double)
✅ Line length limits
✅ Trailing commas
✅ Semicolons
Here's a real GitHub Actions workflow that runs ALL these checks:
# .github/workflows/ci.yml
name: CI Pipeline
on:
push:
branches: [main]
pull_request:
branches: [main]
jobs:
# ===========================
# JOB 1: Lint and Format
# ===========================
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: '18'
cache: 'npm'
- name: Install dependencies
run: npm ci
- name: Run ESLint
run: npm run lint
- name: Check Prettier formatting
run: npm run format:check
# ===========================
# JOB 2: Unit Tests
# ===========================
test:
runs-on: ubuntu-latest
needs: lint # Only runs if lint passes!
steps:
- uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: '18'
cache: 'npm'
- name: Install dependencies
run: npm ci
- name: Run unit tests
run: npm test -- --coverage
- name: Check coverage threshold
run: |
COVERAGE=$(cat coverage/coverage-summary.json | jq '.total.lines.pct')
if (( $(echo "$COVERAGE < 80" | bc -l) )); then
echo "Coverage is $COVERAGE%, must be at least 80%!"
exit 1
fi
# ===========================
# JOB 3: Integration Tests
# ===========================
integration:
runs-on: ubuntu-latest
needs: lint
services:
mongodb:
image: mongo:6
ports:
- 27017:27017
steps:
- uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: '18'
- name: Install dependencies
run: npm ci
- name: Run integration tests
run: npm run test:integration
env:
DATABASE_URL: mongodb://localhost:27017/testdb
# ===========================
# JOB 4: Security Scan
# ===========================
security:
runs-on: ubuntu-latest
needs: lint
steps:
- uses: actions/checkout@v4
- name: Run Snyk security scan
uses: snyk/actions/node@master
env:
SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
- name: Run SonarQube scan
uses: SonarSource/sonarqube-scan-action@v5
env:
SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
GitHub Actions UI:
==================
CI Pipeline
✅ lint (32s)               ESLint + Prettier passed
├── ✅ test (1m 45s)        48 tests passed, 92% cov
├── ✅ integration (2m 12s) API contracts valid
└── ✅ security (1m 30s)    No vulnerabilities found

Status: All checks passed ✅
Ready to merge!
Step 3: Containerization - "Package It Up!"
All quality gates passed? Now the code gets packaged into a Docker container - an immutable artifact that runs the same EVERYWHERE!
Q: Why containerize?
Why Docker in CI/CD:
====================
Without Docker:
Dev:     "Works on my machine!" → Node 18, npm 9, Ubuntu 22
Staging: Different Node version → Node 16, npm 7, Ubuntu 20
Prod:    Different OS entirely  → Node 18, npm 9, Amazon Linux
Same code, different behavior! 😱

With Docker:
Dev:     Same Docker image
Staging: Same Docker image
Prod:    Same Docker image
Same image everywhere = Same behavior everywhere! ✅

This is what we call an IMMUTABLE ARTIFACT!
The Dockerfile:
# Dockerfile - Production optimized
# Stage 1: Build
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --production # Clean install, production only!
COPY . .
# Stage 2: Production image (smaller!)
FROM node:18-alpine
WORKDIR /app
# Don't run as root!
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
# Copy only what we need from builder
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/src ./src
COPY --from=builder /app/package.json ./
USER appuser
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s \
CMD wget -qO- http://localhost:3000/health || exit 1
CMD ["node", "src/index.js"]
What Makes This Dockerfile Production-Ready:
=============================================
✅ Multi-stage build → Smaller final image (~150MB vs ~900MB)
✅ npm ci --production → Only production dependencies
✅ Non-root user → Security best practice
✅ HEALTHCHECK → Container self-monitoring
✅ Alpine base → Minimal OS footprint
✅ .dockerignore → Excludes node_modules, .git, .env
The CI/CD step that builds and pushes:
# .github/workflows/cd.yml (continued)
# ===========================
# JOB 5: Build & Push Docker Image
# ===========================
build:
runs-on: ubuntu-latest
needs: [test, integration, security] # ALL must pass!
steps:
- uses: actions/checkout@v4
- name: Login to container registry
  uses: docker/login-action@v3
  with:
    registry: registry.acme.com
    username: ${{ secrets.REGISTRY_USER }} # hypothetical secret names
    password: ${{ secrets.REGISTRY_TOKEN }}
- name: Build Docker image
run: |
docker build \
-t registry.acme.com/app:${{ github.sha }} \
-t registry.acme.com/app:latest \
.
- name: Push to registry
run: |
docker push registry.acme.com/app:${{ github.sha }}
docker push registry.acme.com/app:latest
Image Tagging Strategy:
=======================
registry.acme.com/app:v2.4.1-abc123f
│ │
│ └── Git commit SHA (unique!)
└── Semantic version
Why this tag format?
- v2.4.1 → Human readable version
- abc123f → Exact commit this image was built from
- Can always trace image back to exact code!
Every build = New unique image
Old images = Still available for rollback
Docker Image Flow:
==================
┌──────────────┐ ┌─────────────────┐ ┌──────────────┐
│ Source │ │ Docker Build │ │ Registry │
│ Code │────────→│ │───────→│ │
│ │ │ FROM node:18 │ │ ECR / │
│ src/ │ │ COPY . . │ │ Docker Hub │
│ package.json │ │ RUN npm ci │ │ │
│ Dockerfile │ │ CMD ["node"...] │ │ app:v2.4.1 │
└──────────────┘ └─────────────────┘ │ app:v2.4.0 │
│ app:v2.3.9 │
Immutable artifact! │ ... │
Same everywhere! └──────────────┘
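The tagging scheme above can be generated and parsed with two small helpers. These are hypothetical names, just to show how a tag stays traceable back to a commit:

```javascript
// Build "registry/app:vX.Y.Z-<short-sha>" from version + commit SHA
function buildImageTag(registry, app, version, commitSha) {
  return `${registry}/${app}:v${version}-${commitSha.slice(0, 7)}`;
}

// Recover version and commit from a tag, for rollback/audit tooling
function parseImageTag(tag) {
  const [image, ref] = tag.split(":");
  const match = ref.match(/^v(\d+\.\d+\.\d+)-([0-9a-f]{7})$/);
  if (!match) throw new Error(`Unexpected tag format: ${ref}`);
  return { image, version: match[1], commitSha: match[2] };
}
```

Given the tag alone, you can always answer "which exact commit is running in production right now?"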
Step 4: Staging Validation - "Test Before Going Live!"
Before touching production, the image gets deployed to a staging environment - an exact replica of production!
Staging = Mini Production:
==========================
STAGING ENVIRONMENT (exact replica of production)
✅ Same Kubernetes version
✅ Same database schema
✅ Same resource limits (CPU, RAM)
✅ Same environment variables (different values)
✅ Same network configuration
✅ Same container image

Only differences:
- Smaller scale (fewer replicas)
- Test database (not real user data)
- Not public facing

Why staging?
- Catch environment-specific bugs
- Test database migrations
- Verify configuration
- Run end-to-end tests
- QA team can manually test
Automated Smoke Tests on Staging:
Smoke Tests = Quick Health Checks:
===================================
After deploying to staging, automated tests run:

SMOKE TEST SUITE
----------------
Test 1: Health Check
  GET /health → Expected: 200 OK
  Result: ✅ PASSED (response in 23ms)

Test 2: Database Connectivity
  GET /health/db → Expected: 200 OK
  Result: ✅ PASSED (MongoDB connected)

Test 3: Login Flow
  POST /login → Expected: 200 + JWT Token
  Result: ✅ PASSED (token received)

Test 4: Checkout Flow
  POST /checkout → Expected: 201 Created
  Result: ✅ PASSED (order created)

Test 5: Search
  GET /search?q=test → Expected: 200 + results
  Result: ✅ PASSED (42 results returned)

All 5 smoke tests passed ✅
Staging deployment healthy!
Ready for production rollout!
#!/bin/bash
# smoke-tests.sh - Automated smoke test script
STAGING_URL="${1:-https://staging.acme.com}"  # URL can be passed as first argument
FAILED=0
echo "Running smoke tests on $STAGING_URL..."
# Test 1: Health endpoint
STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$STAGING_URL/health")
if [ "$STATUS" -eq 200 ]; then
echo "✅ Health check passed"
else
echo "❌ Health check failed (status: $STATUS)"
FAILED=1
fi
# Test 2: Database connectivity
STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$STAGING_URL/health/db")
if [ "$STATUS" -eq 200 ]; then
echo "✅ Database connectivity passed"
else
echo "❌ Database connectivity failed"
FAILED=1
fi
# Test 3: Login flow
RESPONSE=$(curl -s -w "\n%{http_code}" -X POST "$STAGING_URL/login" \
-H "Content-Type: application/json" \
-d '{"email":"test@test.com","password":"testPass123"}')
STATUS=$(echo "$RESPONSE" | tail -1)
if [ "$STATUS" -eq 200 ]; then
echo "✅ Login flow passed"
else
echo "❌ Login flow failed"
FAILED=1
fi
# Final result
if [ "$FAILED" -eq 1 ]; then
echo "❌ SMOKE TESTS FAILED! Blocking production deployment."
exit 1
else
echo "✅ ALL SMOKE TESTS PASSED! Ready for production."
exit 0
fi
Step 5: Progressive Production Rollout - "Go Live Gradually!"
This is where it gets really exciting! We DON'T deploy to 100% of users at once. Instead, we use a canary deployment strategy - rolling out gradually!
Q: What is a Canary Deployment?
A: The name comes from coal miners who used canaries to detect toxic gas. If the canary died, miners knew to evacuate. Similarly, we send a small percentage of traffic to the new version first - if something goes wrong, only a few users are affected!
Canary Deployment Flow:
=======================
PHASE 1: 5% of traffic (Canary Pods)
==========================================
┌──────────────┐
│ USERS │
│ (100%) │
└──────┬───────┘
│
▼
┌──────────────────────────────────────────────┐
│ LOAD BALANCER │
└──────┬──────────────────────────────┬────────┘
│ 95% traffic │ 5% traffic
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ OLD VERSION │ │ NEW VERSION │
│ v2.3.9 │ │ v2.4.1 │
│ │ │ (Canary) │
│ Pod 1 │ │ Pod 1 │
│ Pod 2 │ │ │
│ Pod 3 │ │ │
│ Pod 4 │ │ │
└──────────────────┘ └──────────────────┘
→ Monitor for 10 minutes
→ Grafana tracks: error rate, latency, CPU, memory
→ If healthy: proceed to Phase 2
→ If unhealthy: ROLLBACK immediately!
PHASE 2: 25% of traffic
==========================================
┌──────────────────┐ ┌──────────────────┐
│ OLD VERSION │ │ NEW VERSION │
│ v2.3.9 │ │ v2.4.1 │
│ │ │ │
│ Pod 1 │ │ Pod 1 │
│ Pod 2 │ │ Pod 2 │
│ Pod 3 │ │ │
└──────────────────┘ └──────────────────┘
75% 25%
→ Run synthetic transactions (automated user flows)
→ Validate business KPIs (conversion rate, checkout success)
PHASE 3: 50% → 100%
==========================================
┌──────────────────┐ ┌──────────────────┐
│ OLD VERSION │ │ NEW VERSION │
│ (draining) │ │ v2.4.1 │
│ │ │ │
│ │ │ Pod 1 │
│ │ │ Pod 2 │
│ │ │ Pod 3 │
│ │ │ Pod 4 │
└──────────────────┘ └──────────────────┘
0% 100%
→ Full rollout complete!
→ Old pods terminated gracefully
→ Zero downtime! Users didn't notice a thing!
Kubernetes + Argo Rollouts Configuration:
# argo-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: my-app
spec:
replicas: 4
strategy:
canary:
# Phase 1: 5% traffic
steps:
- setWeight: 5
- pause: { duration: 10m } # Monitor for 10 minutes
# Phase 2: 25% traffic
- setWeight: 25
- pause: { duration: 10m }
# Phase 3: 50% traffic
- setWeight: 50
- pause: { duration: 5m }
# Phase 4: Full rollout
- setWeight: 100
# Auto-rollback conditions
analysis:
templates:
- templateName: success-rate
args:
- name: service-name
value: my-app
template:
spec:
containers:
- name: my-app
image: registry.acme.com/app:v2.4.1-abc123f
ports:
- containerPort: 3000
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
What Argo Rollouts Does:
========================
Manual Kubernetes deployment:
kubectl apply → ALL pods updated at once → Risky!
Argo Rollouts:
kubectl apply → 5% → wait → 25% → wait → 50% → 100%
↑ ↑ ↑
Monitor Monitor Monitor
metrics metrics metrics
If anything goes wrong at ANY phase → Auto rollback!
Step 6: Auto-Rollback Safety Net - "The Insurance Policy"
This is the most critical safety feature of the entire pipeline. If something goes wrong in production, the system automatically reverts to the last stable version!
Auto-Rollback Triggers:
=======================
IF error rate > 5%
OR p99 latency > 500ms
OR CPU usage > 90%
OR memory usage > 85%
OR health check fails

THEN → AUTO-ROLLBACK TRIGGERED!
1. Argo detects unhealthy metrics
2. Traffic shifted back to stable version
3. Canary pods terminated
4. Stable version serves 100% traffic
5. Alert fires to Slack + PagerDuty
6. Deployment marked as FAILED
7. Pipeline blocked until root cause fixed

Time to rollback: ~30 seconds
Users affected: Only those in canary (5-25%)
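Those trigger conditions boil down to a simple predicate. Here's a sketch of the decision the analysis controller effectively makes (thresholds taken from the list above; the function and metric field names are illustrative):

```javascript
// Thresholds mirroring the auto-rollback triggers above
const THRESHOLDS = {
  errorRatePercent: 5,
  p99LatencyMs: 500,
  cpuPercent: 90,
  memPercent: 85,
};

function shouldRollback(metrics) {
  // Any single unhealthy signal is enough to trigger a rollback
  return (
    metrics.errorRatePercent > THRESHOLDS.errorRatePercent ||
    metrics.p99LatencyMs > THRESHOLDS.p99LatencyMs ||
    metrics.cpuPercent > THRESHOLDS.cpuPercent ||
    metrics.memPercent > THRESHOLDS.memPercent ||
    !metrics.healthCheckPassing
  );
}
```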
Rollback Timeline:
==================
00:00 - New version deployed (5% canary)
00:02 - Grafana detects error rate spike: 12%
00:02 - Argo Rollouts: "Error rate exceeds 5% threshold!"
00:02 - AUTO-ROLLBACK INITIATED
00:03 - Traffic shifted: 100% → stable version
00:03 - Canary pods terminating...
00:04 - Stable version serving all traffic
00:04 - Slack alert: "🚨 Deployment v2.4.1 rolled back!"
00:05 - PagerDuty alert to on-call engineer
00:05 - All systems normal

Total impact: ~3 minutes, only 5% of users affected!
Without auto-rollback: HOURS of downtime, ALL users affected!
Argo Analysis Template (What Metrics to Watch):
# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: success-rate
spec:
metrics:
# Metric 1: Error Rate
- name: error-rate
interval: 60s
failureLimit: 3 # 3 failures = rollback
provider:
prometheus:
address: http://prometheus:9090
query: |
sum(rate(http_requests_total{
status=~"5.*",
app="my-app"
}[5m]))
/
sum(rate(http_requests_total{
app="my-app"
}[5m]))
* 100
successCondition: result[0] < 5 # Must be under 5%
# Metric 2: Latency
- name: p99-latency
interval: 60s
failureLimit: 3
provider:
prometheus:
address: http://prometheus:9090
query: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{
app="my-app"
}[5m])) by (le)
)
successCondition: result[0] < 0.5 # Must be under 500ms
Monitoring Dashboard (Grafana):
===============================
DEPLOYMENT MONITOR - v2.4.1 Canary Rollout

Error Rate:   Canary: 0.3%  ✅  Stable: 0.2%    Threshold: < 5%
p99 Latency:  Canary: 120ms ✅  Stable: 95ms    Threshold: < 500ms
CPU Usage:    Canary: 45%   ✅  Stable: 42%
Memory:       Canary: 280MB ✅  Stable: 265MB

Traffic Split: Canary: 25%  Stable: 75%
Phase: 2 of 4
Status: HEALTHY ✅
Alert Configuration:
Where Alerts Go:
================
┌─────────────────────┐
│ Argo detects │
│ unhealthy metrics │
└──────────┬──────────┘
│
┌─────┴─────┬─────────────┐
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌──────────┐
│ Slack │ │PagerDuty│ │ Email │
│ #alerts │ │ On-call │ │ Team │
│ channel │ │ engineer│ │ Lead │
└─────────┘ └─────────┘ └──────────┘
Slack message:
🚨 ROLLBACK: Deployment v2.4.1 failed
Error rate: 12.3% (threshold: 5%)
Rolled back to: v2.3.9
Affected users: ~5%
Action required: Fix and redeploy
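The Slack payload above can be produced by a small formatter before it's POSTed to the webhook. This is a hypothetical helper, not part of Argo:

```javascript
// Build the Slack webhook payload for a rollback alert
function formatRollbackAlert({ version, errorRate, threshold, rolledBackTo, affectedPercent }) {
  return {
    text: [
      `🚨 ROLLBACK: Deployment ${version} failed`,
      `Error rate: ${errorRate}% (threshold: ${threshold}%)`,
      `Rolled back to: ${rolledBackTo}`,
      `Affected users: ~${affectedPercent}%`,
      "Action required: Fix and redeploy",
    ].join("\n"),
  };
}

// The payload would then be sent with something like:
// fetch(SLACK_WEBHOOK_URL, { method: "POST", body: JSON.stringify(payload) })
```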
The Complete CI/CD Pipeline - GitHub Actions
Let me show you a real-world complete pipeline that ties ALL 6 steps together!
# .github/workflows/pipeline.yml
name: CI/CD Pipeline
on:
push:
branches: [main]
env:
REGISTRY: registry.acme.com
IMAGE_NAME: app
STAGING_URL: https://staging.acme.com
K8S_NAMESPACE: production
jobs:
# ═══════════════════════════════════════
# STEP 2: Quality Gates (Step 1 is PR review - done before merge)
# ═══════════════════════════════════════
lint:
name: Code Quality
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with: { node-version: '18', cache: 'npm' }
- run: npm ci
- run: npm run lint
- run: npm run format:check
test:
name: Unit Tests
runs-on: ubuntu-latest
needs: lint
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with: { node-version: '18', cache: 'npm' }
- run: npm ci
- run: npm test -- --coverage
- name: Upload coverage
uses: codecov/codecov-action@v4
integration:
name: Integration Tests
runs-on: ubuntu-latest
needs: lint
services:
mongodb:
image: mongo:6
ports: ['27017:27017']
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with: { node-version: '18' }
- run: npm ci
- run: npm run test:integration
security:
name: Security Scan
runs-on: ubuntu-latest
needs: lint
steps:
- uses: actions/checkout@v4
- name: Snyk vulnerability scan
uses: snyk/actions/node@master
env:
SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
# ═══════════════════════════════════════
# STEP 3: Containerization
# ═══════════════════════════════════════
build:
name: Build & Push Docker Image
runs-on: ubuntu-latest
needs: [test, integration, security] # ALL gates must pass!
outputs:
image-tag: ${{ steps.meta.outputs.tags }}
steps:
- uses: actions/checkout@v4
- name: Generate image tag
id: meta
run: echo "tags=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}" >> $GITHUB_OUTPUT
- name: Build Docker image
run: docker build -t ${{ steps.meta.outputs.tags }} .
- name: Push to registry
run: docker push ${{ steps.meta.outputs.tags }}
# ═══════════════════════════════════════
# STEP 4: Staging Validation
# ═══════════════════════════════════════
deploy-staging:
name: Deploy to Staging
runs-on: ubuntu-latest
needs: build
environment: staging
steps:
- uses: actions/checkout@v4
- name: Deploy to staging cluster
run: |
kubectl set image deployment/app \
app=${{ needs.build.outputs.image-tag }} \
-n staging
- name: Wait for rollout
run: kubectl rollout status deployment/app -n staging --timeout=120s
- name: Run smoke tests
run: bash scripts/smoke-tests.sh ${{ env.STAGING_URL }}
# ═══════════════════════════════════════
# STEP 5 + 6: Production Rollout + Auto-Rollback
# ═══════════════════════════════════════
deploy-production:
name: Production Canary Rollout
runs-on: ubuntu-latest
needs: [build, deploy-staging] # build listed so its image-tag output is accessible here
environment: production # Requires manual approval!
steps:
- uses: actions/checkout@v4
- name: Update Argo Rollout
run: |
kubectl argo rollouts set image my-app \
app=${{ needs.build.outputs.image-tag }} \
-n ${{ env.K8S_NAMESPACE }}
- name: Monitor canary rollout
run: |
kubectl argo rollouts status my-app \
-n ${{ env.K8S_NAMESPACE }} \
--watch --timeout 30m
- name: Notify success
if: success()
run: |
curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
-d '{"text":"✅ v2.4.1 deployed to production successfully!"}'
- name: Notify failure
if: failure()
run: |
curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
-d '{"text":"🚨 v2.4.1 deployment failed! Auto-rolled back."}'
CI/CD Tools Landscape
| Category | Tools | Purpose |
|---|---|---|
| CI/CD Platform | GitHub Actions, Jenkins, GitLab CI, CircleCI | Orchestrate the pipeline |
| Testing | Jest, PyTest, Mocha, Cypress | Unit, integration, E2E tests |
| Security | SonarQube, Snyk, Trivy | SAST scans, dependency vulnerabilities |
| Code Quality | ESLint, Prettier, Husky | Linting, formatting, git hooks |
| Containerization | Docker, Podman | Build immutable artifacts |
| Registry | Amazon ECR, Docker Hub, GitHub GHCR | Store container images |
| Orchestration | Kubernetes, Docker Swarm, ECS | Run containers at scale |
| Rollout | Argo Rollouts, Flagger, Spinnaker | Canary/blue-green deployments |
| Monitoring | Prometheus, Grafana, Datadog | Metrics, dashboards, alerts |
| Alerting | PagerDuty, Slack, OpsGenie | Incident notifications |
Interview Questions - Quick Fire!
Q: What is CI/CD?
"CI (Continuous Integration) is the practice of automatically building and testing code changes whenever developers push to the repository. CD (Continuous Delivery/Deployment) extends this by automatically deploying the tested code to staging or production. Together, they create an automated pipeline from code commit to production deployment."
Q: What is the difference between Continuous Delivery and Continuous Deployment?
"Continuous Delivery means code is always in a deployable state and can be deployed with one click (manual approval). Continuous Deployment goes further - every change that passes all tests is automatically deployed to production with no human intervention."
Q: What are Quality Gates in CI/CD?
"Quality gates are automated checks that code must pass before it can proceed in the pipeline. They include unit tests, integration tests, security scans (SAST), code quality checks (linting), and coverage thresholds. If any gate fails, the pipeline stops and broken code cannot reach production."
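As a rough sketch, these gates can be wired up as a single CI job. The commands and coverage threshold below are illustrative assumptions, not taken from the pipeline above:

```yaml
# Illustrative quality-gates job (npm scripts and the 80% threshold are assumptions)
quality-gates:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Unit and integration tests with coverage gate
      run: |
        npm ci
        npx jest --coverage --coverageThreshold='{"global":{"lines":80}}'
    - name: Lint
      run: npm run lint
```

If any step exits non-zero, the job fails and the pipeline stops right there, which is exactly the "broken code cannot reach production" guarantee.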
Q: What is a Canary Deployment?
"Canary deployment is a strategy where the new version is rolled out to a small percentage of users first (like 5%), while the majority still use the stable version. If the canary is healthy (low error rate, good latency), traffic is gradually increased to 25%, 50%, then 100%. If issues are detected, traffic is immediately routed back to the stable version."
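The 5% → 25% → 50% → 100% progression maps directly onto an Argo Rollouts canary strategy. A minimal sketch (app name and pause durations are illustrative):

```yaml
# Sketch of a canary strategy matching the 5% → 25% → 50% → 100% stages
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {duration: 5m}   # watch metrics before shifting more traffic
        - setWeight: 25
        - pause: {duration: 5m}
        - setWeight: 50
        - pause: {duration: 5m}
        # after the final step, the new version receives 100% of traffic
```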
Q: What is the difference between Canary and Blue-Green deployment?
"In Blue-Green deployment, you have two identical environments - Blue (current) and Green (new). Traffic switches 100% from Blue to Green at once. In Canary deployment, traffic shifts gradually (5% → 25% → 50% → 100%), giving more time to detect issues. Canary is safer but takes longer; Blue-Green is faster but riskier."
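For comparison, the same Rollout resource can express blue-green instead of canary. A minimal sketch (service names are assumptions):

```yaml
# Sketch of a blue-green strategy: traffic flips 100% at promotion time
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  strategy:
    blueGreen:
      activeService: my-app-active      # "Blue": currently serving live traffic
      previewService: my-app-preview    # "Green": new version, reachable for testing
      autoPromotionEnabled: false       # require an explicit promotion to switch
```

Setting `autoPromotionEnabled: false` keeps a human in the loop for the all-at-once switch, which is one way to manage the extra risk mentioned above.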
Q: Why do we containerize applications in CI/CD?
"Containerization with Docker creates an immutable artifact that includes the application, its dependencies, and runtime environment. This ensures the exact same code runs in development, staging, and production - eliminating the 'works on my machine' problem. It also makes rollbacks trivial - just switch to the previous image."
Q: What is an immutable artifact?
"An immutable artifact is a build output (like a Docker image) that never changes after creation. Each build produces a new unique artifact tagged with the git commit SHA. If you need to change something, you create a new artifact rather than modifying the existing one. This ensures traceability and reliable rollbacks."
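One common way to produce such a tag in CI is to combine the release version with the short git SHA. The registry path and version below are illustrative:

```yaml
# Illustrative build-and-push step; the registry path and version are assumptions
- name: Build and push immutable image
  run: |
    IMAGE=ghcr.io/my-org/my-app:v2.4.1-${GITHUB_SHA::7}
    docker build -t "$IMAGE" .
    docker push "$IMAGE"
```

Because every commit yields a unique tag, rolling back is just re-deploying the previous tag.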
Q: What happens if a deployment fails in production?
"With a proper CI/CD pipeline using canary deployments and Argo Rollouts, the system automatically detects failures by monitoring metrics like error rate and latency. If thresholds are exceeded, an auto-rollback triggers within seconds - shifting all traffic back to the last stable version. Alerts fire to Slack and PagerDuty, and the deployment is blocked until the root cause is fixed."
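In Argo Rollouts, that automatic detection is typically defined as an AnalysisTemplate that queries Prometheus during the canary. A sketch, assuming a standard `http_requests_total` counter and a Prometheus service in the `monitoring` namespace:

```yaml
# Sketch: fail (and roll back) if the 5xx error rate exceeds 5%
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  metrics:
    - name: error-rate
      interval: 1m
      failureLimit: 1          # a single failing measurement aborts the rollout
      successCondition: result[0] < 0.05
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{app="my-app",status=~"5.."}[2m]))
            / sum(rate(http_requests_total{app="my-app"}[2m]))
```

When the analysis fails, the Rollout is aborted and traffic shifts back to the stable ReplicaSet without manual intervention.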
Q: What is SAST and why is it important in CI/CD?
"SAST (Static Application Security Testing) scans source code for security vulnerabilities without executing it. Tools like SonarQube analyze the code itself for issues such as SQL injection and hardcoded secrets, while Snyk and Trivy check dependencies and images for known CVEs. Running these scans in the CI pipeline ensures vulnerable code never reaches production."
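As one hedged example, an image scan with Trivy can be added as a pipeline step via its official GitHub Action (the image reference below is illustrative):

```yaml
# Illustrative Trivy scan step; the image-ref is an assumption
- name: Scan image for known CVEs
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: ghcr.io/my-org/my-app:v2.4.1-abc123f
    severity: CRITICAL,HIGH
    exit-code: '1'   # non-zero exit fails the pipeline on findings
```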
Q: Why do we need a staging environment?
"Staging is a production replica used to validate deployments before going live. It catches environment-specific issues like database migration problems, configuration errors, and integration failures that unit tests might miss. Smoke tests on staging verify critical flows (health check, login, checkout) work correctly with the production-like setup."
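A smoke test can be as simple as a few `curl` calls against the critical endpoints after the staging deploy. The URLs below are placeholders:

```yaml
# Illustrative smoke-test step; the staging URLs are assumptions
- name: Smoke-test staging
  run: |
    curl --fail --max-time 10 https://staging.example.com/health
    curl --fail --max-time 10 https://staging.example.com/api/status
```

`--fail` makes curl exit non-zero on any 4xx/5xx response, so a broken endpoint fails the step and blocks promotion to production.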
Q: What metrics should you monitor during a production deployment?
"The key metrics are: error rate (percentage of 5xx responses), p99 latency (99th percentile response time), CPU and memory usage, and health check status. Business KPIs like conversion rate and checkout success are also important. If any metric crosses the threshold during canary rollout, an automatic rollback is triggered."
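The p99 latency gate can be written as an Argo Rollouts analysis metric using `histogram_quantile` over a standard Prometheus histogram. A sketch, assuming an `http_request_duration_seconds` histogram is exposed:

```yaml
# Sketch: abort the rollout if p99 latency rises above 500 ms
- name: p99-latency
  interval: 1m
  failureLimit: 1
  successCondition: result[0] < 0.5   # seconds, i.e. 500 ms
  provider:
    prometheus:
      address: http://prometheus.monitoring:9090
      query: |
        histogram_quantile(0.99,
          sum(rate(http_request_duration_seconds_bucket{app="my-app"}[2m])) by (le))
```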
Quick Recap
| Step | What Happens | Tools |
|---|---|---|
| 1. Code Review | PR raised → Team reviews → Merge to main | GitHub, GitLab |
| 2. Quality Gates | Tests + Security scans + Linting (auto) | Jest, Snyk, ESLint, SonarQube |
| 3. Containerize | Build Docker image → Push to registry | Docker, ECR, Docker Hub |
| 4. Staging | Deploy to staging → Run smoke tests | Kubernetes, curl, Cypress |
| 5. Production | Canary rollout: 5% → 25% → 50% → 100% | Argo Rollouts, Kubernetes |
| 6. Safety Net | Auto-rollback if metrics exceed thresholds | Prometheus, Grafana, PagerDuty |
Key Points to Remember
- CI = Merge + Build + Test automatically on every push
- CD (Delivery) = Code always deployable, one-click deploy
- CD (Deployment) = Auto-deploy every passing change, no human needed
- Pull Requests = Code review before merge, branch protection rules
- Quality Gates = Unit tests, integration tests, SAST scans, linting
- Any gate failure = Pipeline stops, broken code blocked
- Docker image = Immutable artifact, same everywhere (dev/staging/prod)
- Image tag = version + git SHA for traceability (v2.4.1-abc123f)
- Staging = Production replica for final validation + smoke tests
- Canary Deployment = Gradual rollout (5% → 25% → 50% → 100%)
- Auto-Rollback = If error rate > 5% OR latency > 500ms → revert instantly
- Argo Rollouts = Progressive delivery controller for Kubernetes
- Prometheus + Grafana = Metrics collection + visualization
- PagerDuty + Slack = Alert on-call engineers immediately
- Total time = 15-30 minutes, zero downtime, fully automated
- Security is shift-left = Catch vulnerabilities early in the pipeline, not in production
What's Next?
Now you understand the complete CI/CD journey from git push to production! In the next episode, we can explore:
- Kubernetes Deep Dive - Pods, Services, Deployments
- Infrastructure as Code (Terraform)
- GitOps with ArgoCD
- Monitoring & Observability (Prometheus, Grafana, ELK Stack)
Keep coding, keep learning! See you in the next one!