in_depth
Business Features
- Admin Portal:
Developed an Admin Portal for the support team to streamline high-frequency operations.
- Implemented role-based access control (RBAC) with a dedicated “Support Read-Only” role.
- Integrated a Metabase analytics dashboard for storing and executing read-only SQL queries.
- Built an AWS S3–backed ingestion pipeline for processing Excel (XLSX/CSV) datasets.
- Gated non-user-facing features behind feature flags.
- Used AWS EventBridge to orchestrate on-demand and scheduled task execution through separate code execution pipelines.
- Reduced JIRA ticket volume by 70% and improved SLA compliance by 60%.
- App Catalog:
Designed and developed an App Catalog ticketing platform for managing application access and support requests across the organization.
- Integrated workflow automation to intelligently route requests through designated approvers, reducing manual coordination and ensuring compliance.
- Implemented configurable single- and multi-step sequential approval workflows with custom approve/reject rules (see the sketch after this list).
- Integrated Webhook notifications to external systems with robust error handling for Webhook Delivery Failure scenarios.
- Designed escalation management, including Escalation Path routing for overdue approvals.
- Leveraged PostgreSQL, Sidekiq for background job processing, and AWS S3/EventBridge for asset storage and asynchronous workflow triggers.
- Drove higher adoption by keeping application-related requests within the App Catalog instead of diverting them to the organization’s JIRA.
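
A minimal sketch of the sequential approval chain, written in Python for illustration (the production service was Rails/Sidekiq-based); step names, statuses, and the single-rejection-halts rule shown here are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class ApprovalStep:
    approver: str
    status: str = "pending"   # pending | approved | rejected

@dataclass
class ApprovalRequest:
    steps: list[ApprovalStep] = field(default_factory=list)

    def current_step(self):
        # The chain is sequential: only the first pending step may act.
        return next((s for s in self.steps if s.status == "pending"), None)

    def decide(self, approver: str, approve: bool) -> str:
        step = self.current_step()
        if step is None or step.approver != approver:
            raise ValueError("not this approver's turn")
        step.status = "approved" if approve else "rejected"
        if not approve:
            return "rejected"          # one rejection halts the whole chain
        return "approved" if self.current_step() is None else "in_progress"

request = ApprovalRequest(steps=[ApprovalStep("manager"), ApprovalStep("it_owner")])
print(request.decide("manager", True))    # -> in_progress
print(request.decide("it_owner", True))   # -> approved
```

In the real platform these steps were configurable records, which is what makes the same engine serve both single- and multi-step flows.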
- Smart Contracts:
Built a GPT-3.5-powered Smart Contracts system with automated clause extraction, compliance alerts, spend benchmarking, and savings insights.
- Used pretrained large language models to classify clauses (SLAs, termination terms) and extract critical dates, obligations, and pricing tables with >90% precision.
- Event-driven alerts via Sidekiq CRON jobs and AWS EventBridge, triggering notifications for renewals (30/60/90-day windows), spend anomalies, or non-compliant terms.
- Benchmarking engine comparing rates against our historical spend data in Snowflake and third-party APIs (e.g., Spend Intelligence).
- Generated savings recommendations via aggregated time-series analysis (Python/Pandas) and outlier detection with DBSCAN clustering (see the sketch after this list).
- UI dashboards visualizing contract health (burn rate, utilization) and benchmark gaps.
- The underlying LLM was the GPT-3.5 base model (trained on ~500B tokens) with a 4K-token context window and a per-request cost of <$0.002.
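
A minimal sketch of the DBSCAN-based spend-anomaly step; the spend figures and the eps/min_samples values are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Monthly spend per contract (synthetic numbers for illustration).
spend = np.array([[1000], [1020], [980], [995], [4800], [1010]])

# DBSCAN labels points without enough dense neighbors as noise (-1);
# those noise points are treated as spend anomalies.
labels = DBSCAN(eps=100, min_samples=3).fit_predict(spend)
anomalies = spend[labels == -1].ravel()
print(anomalies)  # -> [4800]
```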
- Product Sentiments:
An automated survey tool that analyzes user sentiment toward apps via targeted feedback, helping identify unused or inefficient tools.
- Enables admins to launch targeted email campaigns to assess user sentiment about specific apps.
- Workspace owners select apps based on spend data (overlapping/expensive) or SSO logs (unused).
- Users receive personalized survey links via email to provide feedback.
- Aggregates responses into an interactive dashboard showing trends and suggestions.
- Plots app sentiment on a 4-quadrant grid (e.g., "High Cost vs. Low Satisfaction") for prioritization (see the sketch after this list).
- Helped organizations cut costs (unused apps) and improve ROI (high-value tools).
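
A minimal sketch of the 4-quadrant bucketing using median cutoffs; the app names, costs, and satisfaction scores are made-up illustration data:

```python
import pandas as pd

# Hypothetical per-app aggregates; the real inputs came from spend
# reports and survey responses.
apps = pd.DataFrame({
    "app": ["Zoom", "Figma", "LegacyCRM", "Notion"],
    "annual_cost": [50_000, 30_000, 80_000, 10_000],
    "satisfaction": [4.2, 4.6, 2.1, 3.9],
})

cost_cutoff = apps["annual_cost"].median()
sat_cutoff = apps["satisfaction"].median()

def quadrant(row):
    cost = "High Cost" if row["annual_cost"] >= cost_cutoff else "Low Cost"
    sat = "High Satisfaction" if row["satisfaction"] >= sat_cutoff else "Low Satisfaction"
    return f"{cost} / {sat}"

apps["quadrant"] = apps.apply(quadrant, axis=1)
print(apps[["app", "quadrant"]])  # "High Cost / Low Satisfaction" = retire candidates
```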
Technical Features
- SAML:
Implemented SAML 2.0 (Security Assertion Markup Language) for enterprise-grade SSO.
- Launched with Okta (v1) – Enabled enterprise SSO via SAML 2.0, later expanded to Azure AD, OneLogin, and custom IdPs.
- Challenges included varied XML formats, certificate rotations, and strict NameID requirements, all of which caused integration hurdles.
- Testing and debugging were painful due to ACS URL mismatches across environments; we relied exclusively on network tunneling to expose local endpoints (a minimal ACS-validation sketch follows this list).
- In v2, rolled out automated user provisioning using SCIM flows.
- Post-acquisition: consolidated IdPs under Auth0, migrated SAML customers, established cross-domain federated identity, and managed sessions across the products.
- Scaled to 150+ customers; cut support tickets by 90% and sped up integrations from 1 hour to 10 minutes.
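
A minimal sketch of an ACS endpoint using python3-saml (the actual service may have used a different SAML library); the route, the settings location, and the request mapping are assumptions:

```python
from flask import Flask, request
from onelogin.saml2.auth import OneLogin_Saml2_Auth

app = Flask(__name__)

def build_auth(req):
    # python3-saml reads IdP/SP settings (entity IDs, certs, the ACS URL)
    # from a settings.json under the given base path.
    return OneLogin_Saml2_Auth({
        "https": "on" if req.scheme == "https" else "off",
        "http_host": req.host,
        "script_name": req.path,
        "get_data": req.args.copy(),
        "post_data": req.form.copy(),
    }, custom_base_path=".")

@app.route("/saml/acs", methods=["POST"])
def acs():
    auth = build_auth(request)
    auth.process_response()  # validates signature, audience, and NameID
    if auth.get_errors() or not auth.is_authenticated():
        return "SAML validation failed", 401
    return f"Logged in as {auth.get_nameid()}"
```

This is exactly where the ACS URL mismatches bite: the IdP posts to the URL registered in its metadata, so local testing needs a tunnel that serves that URL.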
- Multi-Tenant & Microservices:
- Architected multi-tenant microservices using Django, PostgreSQL schema isolation, and a Node API Gateway, implementing tenant-aware routing via subdomains.
- Designed a schema-per-tenant architecture leveraging PostgreSQL's CREATE SCHEMA and Django-Tenants, developing middleware for automatic search_path switching (see the sketch after this list).
- Implemented event-driven communication using Apache Kafka with tenant ID headers, enabling asynchronous processing while maintaining schema isolation
- Optimized database performance with PgBouncer connection pooling & schema-aware Django ORM extensions, achieving 2ms schema switches and 30% faster tenant-specific queries
- The Node API Gateway handled service discovery, tenant routing, schema injection, authentication, upload capabilities, and Elasticsearch housekeeping.
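
A minimal sketch of the search_path-switching middleware, similar in spirit to what Django-Tenants provides; the subdomain-to-schema naming convention here is an assumption:

```python
from django.db import connection

class TenantSchemaMiddleware:
    """Switch the Postgres search_path per request based on subdomain."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        subdomain = request.get_host().split(".")[0]
        schema = f"tenant_{subdomain}"  # hypothetical naming convention
        with connection.cursor() as cursor:
            # Postgres accepts a quoted string literal for search_path;
            # a production version must also reset it when the pooled
            # connection is returned, or the schema leaks across requests.
            cursor.execute("SET search_path TO %s, public", [schema])
        return self.get_response(request)
```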
- Table Saw:
A tool that dumps a referentially intact, minimal subset of a Postgres database, with custom query selection and PII masking.
- Long-running production queries (debugging, reports, load-testing) required full DB restores or read-only replicas, slowing workflows.
- Built an open-source tool to extract minimal, referentially intact subsets of Postgres data instead of full dumps.
- Used topological sorting to auto-include all parent/child records via FK relationships from any seed row (see the sketch after this list).
- Enabled targeted, data-driven debugging (e.g., a single customer workspace) without multi-TB restores.
- Handled dumps up to ~50GB before hitting VM memory limits.
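
A minimal sketch of the dump-ordering step: Kahn's topological sort over the FK graph so parent tables restore before children. Table names are illustrative; a real implementation would read the FK graph from pg_catalog:

```python
from collections import defaultdict, deque

# FK graph: child table -> parent tables it references.
fk_parents = {
    "orders": ["customers"],
    "order_items": ["orders", "products"],
    "customers": [],
    "products": [],
}

def dump_order(tables):
    """Topologically sort tables so parents are restored before children."""
    indegree = {t: len(fk_parents[t]) for t in tables}
    children = defaultdict(list)
    for child, parents in fk_parents.items():
        for parent in parents:
            children[parent].append(child)
    queue = deque(t for t, d in indegree.items() if d == 0)
    order = []
    while queue:
        table = queue.popleft()
        order.append(table)
        for child in children[table]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return order

print(dump_order(list(fk_parents)))
# -> ['customers', 'products', 'orders', 'order_items']
```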
- Email Server:
Linux-based email server using Postfix (MTA) and Dovecot (IMAP/POP3) with TLS encryption for secure SMTP relay.
- Configured NodeMailer for programmatic email sending, integrating OAuth2 and SMTP authentication.
- Implemented SPF, DKIM, DMARC, and Reverse DNS (PTR) to ensure inbox placement (reduced spam rate from ~50% to <5%).
- Monitored sender reputation using Google Postmaster Tools and MXToolbox to maintain high deliverability.
- Developed a queue-based scheduling system using Redis/BullMQ to delay emails and send them at predefined times (see the sketch after this list).
- Engineered an email retraction feature (for unread emails) via IMAP IDLE tracking and custom API hooks.
- Optimized Postfix with rate limiting, connection pooling, and failover SMTP relays (AWS SES backup).
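
A minimal sketch of the delayed-send queue using a Redis sorted set scored by send time; BullMQ (used in production) implements the same pattern, and the key name and payload shape here are assumptions:

```python
import json
import time
import redis

r = redis.Redis()

def schedule_email(payload: dict, send_at: float):
    # Score each pending email by its Unix send time.
    r.zadd("scheduled_emails", {json.dumps(payload): send_at})

def drain_due_emails():
    # A worker polls for everything whose score has come due.
    now = time.time()
    for raw in r.zrangebyscore("scheduled_emails", 0, now):
        if r.zrem("scheduled_emails", raw):  # atomic claim; skip if another worker won
            email = json.loads(raw)
            print("sending", email["to"])    # hand off to Postfix/NodeMailer here

schedule_email({"to": "user@example.com", "subject": "hi"}, time.time() + 3600)
```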
- Resume Parser & Ranking:
- Built parallel parsers using an Apache Tika OCR microservice for scanned PDFs (92% text recovery).
- Extracted key fields (skills, experience, education) via rule-based matching and NER (spaCy/Stanford NLP).
- Created TF-IDF and Word2Vec embeddings for semantic similarity between resumes and job descriptions (see the ranking sketch after this list).
- Added handcrafted features (years of experience, skill overlap, education tier) for ML modeling.
- Experimented with Logistic Regression, Random Forests, and XGBoost (Bayesian hyperparameter tuning) to rank resumes by JD fit.
- Fine-tuned weights for tech vs. non-tech roles (e.g., heavier skill weighting for engineering jobs).
- Incorporated hard filters (e.g., "Must have: Python") to auto-reject mismatches.
- Achieved ~84% precision in top-5 shortlisting via cross-validation (human-annotated dataset).
- Addressed the sparse-data challenge via synthetic oversampling of niche roles.
- Reduced bias by anonymizing resumes (removing names/gender cues) during ranking.
- Served predictions via a Flask API with caching (Redis) for batch processing.
- Designed Kafka topics with a 3-partition architecture for load balancing.
- Implemented idempotent consumers for resume/JD processing with exactly-once semantics (see the consumer sketch after this list).
- Scaled to 50 docs/minute using Kafka Connect S3 sink
- Integrated with AWS S3 for resume storage and Airflow for scheduled JD updates.
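
A minimal sketch of the TF-IDF ranking step; the job description and resume snippets are made-up illustration data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

job_description = "Senior Python engineer with Django, PostgreSQL and AWS"
resumes = [
    "5 years Python and Django, deployed services on AWS",
    "Java developer, Spring Boot, Oracle",
]

# Fit TF-IDF on the JD plus resumes, then rank resumes by cosine
# similarity to the JD vector.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform([job_description] + resumes)
scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()
for score, resume in sorted(zip(scores, resumes), reverse=True):
    print(f"{score:.2f}  {resume}")
```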
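
And a minimal sketch of an idempotent consumer using kafka-python with a Redis SETNX dedup key; the topic, group, and key scheme are assumptions:

```python
import redis
from kafka import KafkaConsumer  # kafka-python

def process_resume(raw: bytes):
    ...  # parse + index; stub for illustration

r = redis.Redis()
consumer = KafkaConsumer(
    "resume-uploads",
    bootstrap_servers="localhost:9092",
    group_id="resume-parser",
    enable_auto_commit=False,
)

# Messages are assumed keyed by document id.
for message in consumer:
    doc_id = message.key.decode()
    # First writer wins; a redelivered message finds the key and is skipped,
    # which gives effectively-once processing on top of at-least-once delivery.
    if r.set(f"processed:{doc_id}", 1, nx=True, ex=86400):
        process_resume(message.value)
    consumer.commit()
```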
- Performance Tuning:
- Stack Upgrade:
- Upgraded an 8-year-old Ruby/Rails monolith using iterative, zero-downtime strategies.
- Replaced Unicorn with Puma (thread-safe scaling) and transitioned from the Asset Pipeline to Webpacker.
- Rolled out compatibility tweaks for native extensions and deprecated methods/arguments, keeping code changes minimal.
- Hardware / VM
- Upgraded to NVMe SSDs for disk I/O-bound workloads -> Reduced CPU wait states -> Disk IOPS increased from 15k to 22k (random read/write)
- Enabled huge pages (2MB) for memory-intensive apps -> Improved TLB hit rate -> CPU utilization dropped from 85% to <70% sustained
- Switched to ARM-based Graviton3 instances -> Better price-performance -> Cost per 1000 requests reduced by 35%
- Infrastructure (AWS)
- Implemented spot instances for batch processing -> Cost savings -> EC2 costs decreased by 60% for non-critical workloads
- Right-sized RDS to r6gd.2xlarge -> Balanced memory/CPU -> Query throughput increased from 1.2k to 2.1k QPS
- Configured VPC flow logs -> Identified network bottlenecks -> Cross-AZ traffic reduced by 40%
- Docker
- Multi-stage builds -> Smaller images -> Image size reduced from 1.8GB to 450MB
- Set CPU limits (4 cores) -> Prevented noisy neighbors -> Container throttling events dropped from 12/hr to 0
- Switched to distroless base images -> Reduced attack surface -> CVE vulnerabilities decreased by 90%
- Language (Ruby)
- Enabled YJIT -> Faster execution -> Median request latency improved from 48ms to 29ms
- Tuned GC (RUBY_GC_HEAP_GROWTH_MAX_SLOTS=300k) -> Fewer GC pauses -> GC time per request reduced from 8ms to 3ms
- Adopted jemalloc -> Less fragmentation -> RSS memory stabilized at 1.2GB (was fluctuating 1-2GB)
- Database (PostgreSQL)
- Added partial indexes (WHERE status='active') -> Faster queries -> SELECT latency (p95) dropped from 120ms to 45ms (see the sketch after this sub-list)
- Tuned autovacuum (autovacuum_vacuum_scale_factor=0.1) -> Fewer dead tuples -> Vacuum runs decreased from 20/day to 5/day
- Enabled parallel queries (max_parallel_workers=8) -> Improved analytics -> COUNT(*) runtime reduced from 12s to 3.2s
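
A minimal sketch of applying such a partial index from Python via psycopg2; the table and column names are illustrative:

```python
import psycopg2

conn = psycopg2.connect("dbname=app")
conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run inside a transaction
with conn.cursor() as cur:
    # The index covers only active rows, so it stays small and hot.
    cur.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_subscriptions_active "
        "ON subscriptions (account_id) WHERE status = 'active'"
    )
    # Verify the planner actually uses it for the hot-path query.
    cur.execute(
        "EXPLAIN SELECT * FROM subscriptions "
        "WHERE account_id = 42 AND status = 'active'"
    )
    for line, in cur.fetchall():
        print(line)
```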
- Server (Puma)
- Adjusted workers:threads (4:8 -> 2:16) -> Better throughput -> Requests/sec increased from 850 to 1,100
- Enabled socket activation -> Zero-downtime restarts -> Deployment downtime reduced from 8s to 0s
- Set worker timeout (worker_timeout=30) -> Killed hung workers -> 5xx errors decreased by 75%
- Framework (Rails)
- Russian doll caching -> Fewer DB hits -> Cache hit rate improved from 65% to 92%
- Optimized ActiveRecord (pluck vs. select) -> Less memory -> Allocations/request dropped from 45k to 12k objects
- Enabled bootsnap -> Faster boots -> Application startup reduced from 12s to 4s
- Background Jobs (Sidekiq)
- Weighted queues (critical=5, default=1) -> Priority handling -> Critical job latency (p95) improved from 8s to 1.2s
- Set job expiration (30m) -> Redis memory control -> Redis memory usage stabilized at 800MB (was spiking to 2GB)
- Added idempotency keys -> Fewer duplicates -> Duplicate jobs dropped from 5% to 0.1%
- GraphQL
- Persisted queries -> Smaller payloads -> Network traffic reduced by 40%
- Dataloader batching -> Eliminated N+1 -> Resolver calls/query decreased from 32 to 5
- Query complexity limits (max_depth=10) -> Blocked expensive queries -> Timeout errors dropped by 90%
- Frontend (Angular)
- AOT compilation -> Faster rendering -> First Contentful Paint improved from 2.1s to 1.3s
- Lazy-loaded modules -> Smaller bundles -> Main.js size reduced from 1.4MB to 580KB
- OnPush change detection -> Less CPU usage -> Animation jank decreased from 12% to 2% frames dropped
- Build & Deployment
- Parallelized RSpec (--jobs 8) -> Faster CI -> Test suite runtime reduced from 18m to 6m
- Cached node_modules -> Less rebuilds -> Docker build time decreased from 5m to 90s
- Canary deployments (5% traffic) -> Safer releases -> Rollback rate dropped from 8% to 1%
- Scalability
- Read replicas -> Offloaded primary DB -> Primary DB CPU reduced from 80% to 45%
- HPA (CPU=70%) -> Auto-scaling pods -> Peak traffic capacity increased from 1k to 5k RPS
- Database sharding (by region) -> Reduced contention -> Write latency (p99) improved from 250ms to 90ms
- Observability
- Distributed tracing -> Faster debugging -> MTTR (Mean Time to Repair) reduced from 47m to 12m
- SLO-based alerts -> Fewer false positives -> Alert volume decreased by 70%
- Log sampling (10%) -> Cost control -> CloudWatch costs reduced by $1,200/month
Non-Tech Features:
- PR Review
- Hackathon
- BiWeekly Tech Talks
- Conferences
Drives:
- Better Service Design
- Sidekiq Pro Adoption
- NewRelic Adoption
- EC2 to ECS
- DB to S3 using AWS Glue