in_depth
Business Features
- Admin Portal:
  Developed an Admin Portal for the support team to streamline high-frequency operations.
  - Implemented role-based access control (RBAC) with a dedicated “Support Read-Only” role.
  - Integrated a Metabase analytics dashboard for storing and executing read-only SQL queries.
  - Built an AWS S3–backed ingestion pipeline for processing Excel (XLSX/CSV) datasets.
  - Enabled non-user-facing features behind feature flags.
  - Used AWS EventBridge to orchestrate on-demand and scheduled task execution through separate code-execution pipelines (see the sketch after this list).
  - Reduced JIRA ticket volume by 70% and improved SLA compliance time by 60%.
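A minimal sketch of the scheduled-execution idea with boto3's EventBridge client. The rule name, target ID, and Lambda ARN are illustrative placeholders, not details from the original system.

```python
# Hypothetical sketch: registering a scheduled EventBridge rule via boto3.
import boto3

events = boto3.client("events")

# Run the export task every day at 02:00 UTC.
events.put_rule(
    Name="admin-portal-nightly-export",   # hypothetical rule name
    ScheduleExpression="cron(0 2 * * ? *)",
    State="ENABLED",
)

# Point the rule at the worker that executes the task.
events.put_targets(
    Rule="admin-portal-nightly-export",
    Targets=[{
        "Id": "nightly-export-worker",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:export-task",  # placeholder ARN
    }],
)
```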
- App Catalog:
  Designed and developed an App Catalog ticketing platform for managing application access and support requests across the organization.
  - Integrated workflow automation to intelligently route requests through designated approvers, reducing manual coordination and ensuring compliance.
  - Implemented configurable single- and multi-step sequential approval workflows with custom approve/reject rules (see the sketch after this list).
  - Integrated webhook notifications to external systems with robust error handling for webhook delivery failures.
  - Designed escalation management, including escalation-path routing for overdue approvals.
  - Leveraged PostgreSQL, Sidekiq for background job processing, and AWS S3/EventBridge for asset storage and asynchronous workflow triggers.
  - Saw higher adoption by keeping application-related requests within the App Catalog instead of diverting them to the organization’s JIRA.
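A toy model of the sequential approval chain, not the App Catalog's actual schema: each request advances one approver at a time, and a single rejection short-circuits the chain.

```python
# Minimal sequential-approval state machine (illustrative only).
from dataclasses import dataclass, field

@dataclass
class ApprovalRequest:
    approvers: list[str]                      # ordered approver chain
    decisions: dict[str, bool] = field(default_factory=dict)

    @property
    def current_approver(self) -> str | None:
        for approver in self.approvers:
            if approver not in self.decisions:
                return approver
        return None                           # chain exhausted

    def decide(self, approver: str, approved: bool) -> str:
        if approver != self.current_approver:
            raise ValueError("not this approver's turn")
        self.decisions[approver] = approved
        if not approved:
            return "rejected"                 # reject rule: stop immediately
        return "approved" if self.current_approver is None else "pending"

req = ApprovalRequest(approvers=["manager", "app_owner", "security"])
assert req.decide("manager", True) == "pending"
assert req.decide("app_owner", True) == "pending"
assert req.decide("security", True) == "approved"
```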
- Smart Contracts:
  Built a GPT-3.5-powered Smart Contracts system with automated clause extraction, compliance alerts, spend benchmarking, and savings insights.
  - Used pretrained large language models to classify clauses (SLAs, termination terms) and extract critical dates, obligations, and pricing tables with >90% precision.
  - Event-driven alerts via Sidekiq cron jobs and AWS EventBridge, triggering notifications for renewals (30/60/90-day windows), spend anomalies, or non-compliant terms.
  - Benchmarking engine comparing rates against our historical spend data in Snowflake and third-party APIs (e.g., Spend Intelligence).
  - Generated savings recommendations via aggregated time-series analysis (Python Pandas) and outlier detection with DBSCAN clustering (see the sketch after this list).
  - UI dashboards visualizing contract health (burn rate, utilization) and benchmark gaps.
  - The LLM was GPT-3.5 base (trained on ~500B tokens) with a 4K-token context window and a per-request cost under $0.002.
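An illustrative sketch of the outlier-detection step: DBSCAN labels points in sparse regions as noise (-1), which maps naturally onto spend anomalies. The column names and sample frame are assumptions, not the production schema.

```python
# Flag contracts whose spend deviates from the cluster norm.
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

spend = pd.DataFrame({
    "annual_spend": [12_000, 12_500, 11_800, 95_000, 12_200],
    "unit_rate":    [10.0, 10.4, 9.8, 55.0, 10.1],
})

# DBSCAN marks low-density points as -1 (noise) -> treat as spend outliers.
X = StandardScaler().fit_transform(spend)
labels = DBSCAN(eps=0.8, min_samples=2).fit_predict(X)
spend["is_outlier"] = labels == -1

print(spend[spend["is_outlier"]])  # the 95k contract stands out
```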
- Product Sentiments:
  An automated survey tool that analyzes user-app sentiment via targeted feedback to identify unused or inefficient tools.
  - Enables admins to launch targeted email campaigns to assess user sentiment about specific apps.
  - Workspace owners select apps based on spend data (overlapping/expensive) or SSO logs (unused).
  - Users receive personalized survey links via email to provide feedback.
  - Aggregates responses into an interactive dashboard showing trends and suggestions.
  - Plots app sentiment on a 4-quadrant grid (e.g., "High Cost vs. Low Satisfaction") for prioritization (see the sketch after this list).
  - Helped organizations cut costs (unused apps) and improve ROI (high-value tools).
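A hedged sketch of the quadrant placement: bucket each app by cost and satisfaction relative to the median. App names, columns, and thresholds are illustrative, not real data.

```python
# 4-quadrant classification of apps by cost vs. satisfaction (toy data).
import pandas as pd

apps = pd.DataFrame({
    "app":          ["Zoom", "OldCRM", "Figma", "LegacyBI"],
    "annual_cost":  [40_000, 90_000, 15_000, 70_000],
    "satisfaction": [4.4, 2.1, 4.6, 2.8],   # mean survey score out of 5
})

cost_hi = apps["annual_cost"] >= apps["annual_cost"].median()
sat_hi = apps["satisfaction"] >= apps["satisfaction"].median()

quadrant = {
    (True, False): "High Cost / Low Satisfaction",    # cut or renegotiate
    (True, True): "High Cost / High Satisfaction",    # justify spend
    (False, True): "Low Cost / High Satisfaction",    # keep
    (False, False): "Low Cost / Low Satisfaction",    # candidate to sunset
}
apps["quadrant"] = [quadrant[(c, s)] for c, s in zip(cost_hi, sat_hi)]
print(apps[["app", "quadrant"]])
```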
Technical Features
- SAML:
  Implemented SAML 2.0 (Security Assertion Markup Language) for enterprise-grade SSO (see the SP-side sketch after this list).
  - Launched with Okta (v1): enabled enterprise SSO via SAML 2.0, later expanded to Azure AD, OneLogin, and custom IdPs.
  - Challenges: varied XML formats, certificate rotations, and strict NameID requirements caused integration hurdles.
  - Testing and debugging were painful due to ACS URL mismatches across environments; relied exclusively on network tunneling.
  - In v2, rolled out automated user provisioning using SCIM.
  - Post-acquisition: consolidated IdPs under Auth0, migrated SAML customers, and established cross-domain federated identity and session management across the products.
  - Scaled to 150+ customers: cut support tickets by 90% and sped up integrations from 1 hour to 10 minutes.
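A minimal SP-side sketch using the python3-saml library. This is an assumed stand-in: the language and library of the original implementation aren't stated above, and the settings/request shapes follow python3-saml's documented conventions.

```python
# Assertion Consumer Service (ACS) handling, sketched with python3-saml.
from onelogin.saml2.auth import OneLogin_Saml2_Auth

def acs(request_data: dict, settings: dict) -> dict:
    """Validate the IdP's SAMLResponse posted to the ACS URL.

    request_data mirrors the incoming HTTP request (host, path, GET/POST data)
    per python3-saml's expected format; settings is the SP/IdP config dict.
    """
    auth = OneLogin_Saml2_Auth(request_data, settings)
    auth.process_response()            # verifies signature, conditions, audience
    if auth.get_errors() or not auth.is_authenticated():
        raise PermissionError(auth.get_last_error_reason())
    # Strict NameID handling was one of the integration hurdles noted above.
    return {"name_id": auth.get_nameid(), "attributes": auth.get_attributes()}
```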
- Multi-Tenant & Microservices:
  - Architected multi-tenant microservices using Django, PostgreSQL schema isolation, and a Node API Gateway, implementing tenant-aware routing via subdomains.
  - Designed a schema-per-tenant architecture leveraging PostgreSQL's CREATE SCHEMA and Django-Tenants, developing middleware for automatic search_path switching (see the sketch after this list).
  - Implemented event-driven communication using Apache Kafka with tenant-ID headers, enabling asynchronous processing while maintaining schema isolation.
  - Optimized database performance with PgBouncer connection pooling and schema-aware Django ORM extensions, achieving 2ms schema switches and 30% faster tenant-specific queries.
  - The Node API Gateway handled service discovery, tenant routing, schema injection, authentication, uploads, and Elasticsearch chores.
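A simplified, hand-rolled sketch of the search_path-switching middleware (django-tenants provides this out of the box). The subdomain-to-schema mapping is illustrative; real code would validate the tenant against a shared table first.

```python
# Django middleware: map the request subdomain to a Postgres schema.
from django.db import connection

class TenantSchemaMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        # "acme.example.com" -> schema "acme" (illustrative; validate in production).
        subdomain = request.get_host().split(".")[0]
        with connection.cursor() as cursor:
            # psycopg2 interpolates the value as a quoted literal, which SET accepts.
            cursor.execute("SET search_path TO %s, public", [subdomain])
        return self.get_response(request)
```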
- Table Saw:
  A tool that dumps a referentially intact, minimal subset of a Postgres database with custom query selection and PII masking.
  - Long-running production queries (debugging, reports, load-testing) required full DB restores or read-only replicas, slowing workflows.
  - Built an open-source tool to extract minimal, referentially intact subsets of Postgres data instead of full dumps.
  - Used topological sorting to auto-include all parent/child records via FK relationships from any seed row (see the sketch after this list).
  - Enabled faster, targeted data debugging (e.g., a single customer workspace) without multi-TB restores.
  - Handled dumps up to ~50GB before hitting VM memory limits.
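A sketch of the core ordering idea: topologically sort tables over the FK graph so every row's parents are dumped before the row itself. The schema below is illustrative.

```python
# Kahn's algorithm over FK edges (parent -> children, child references parent).
from collections import defaultdict, deque

fk_edges = {
    "customers": ["workspaces"],
    "workspaces": ["projects", "members"],
    "projects": ["tasks"],
    "members": [],
    "tasks": [],
}

def topo_order(edges: dict[str, list[str]]) -> list[str]:
    indegree = defaultdict(int)
    for parent, children in edges.items():
        indegree.setdefault(parent, 0)
        for child in children:
            indegree[child] += 1
    queue = deque(t for t, d in indegree.items() if d == 0)
    order = []
    while queue:
        table = queue.popleft()
        order.append(table)
        for child in edges.get(table, []):
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return order  # parents always precede children

print(topo_order(fk_edges))
# ['customers', 'workspaces', 'projects', 'members', 'tasks'] (one valid order)
```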
- Email Server:
  Linux-based email server using Postfix (MTA) and Dovecot (IMAP/POP3) with TLS encryption for secure SMTP relay.
  - Configured NodeMailer for programmatic email sending, integrating OAuth2 and SMTP authentication.
  - Implemented SPF, DKIM, DMARC, and reverse DNS (PTR) to ensure inbox placement (reduced spam rate from ~50% to <5%).
  - Monitored sender reputation using Google Postmaster Tools and MXToolbox to maintain high deliverability.
  - Developed a queue-based scheduling system using Redis/BullMQ to delay emails and send them at predefined times (see the sketch after this list).
  - Engineered an email-retraction feature (for unread emails) via IMAP IDLE tracking and custom API hooks.
  - Optimized Postfix with rate limiting, connection pooling, and failover SMTP relays (AWS SES backup).
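The production scheduler used Node's BullMQ; this Python sketch shows the same delayed-delivery idea with a bare Redis sorted set, where the score is the intended send time.

```python
# Delayed email queue on a Redis sorted set (BullMQ stand-in, illustrative).
import json
import time
import redis

r = redis.Redis()

def schedule_email(message: dict, send_at: float) -> None:
    # Score = Unix timestamp at which the message becomes due.
    r.zadd("email:scheduled", {json.dumps(message): send_at})

def pop_due_emails(now: float | None = None) -> list[dict]:
    now = now or time.time()
    due = r.zrangebyscore("email:scheduled", 0, now)
    for raw in due:
        r.zrem("email:scheduled", raw)   # a worker loop would hand these to SMTP
    return [json.loads(raw) for raw in due]

schedule_email({"to": "user@example.com", "subject": "Reminder"}, time.time() + 3600)
```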
- Resume Parser & Ranking:
  - Built parallel parsers using an Apache Tika OCR microservice for scanned PDFs (92% text recovery).
  - Extracted key fields (skills, experience, education) via rule-based matching and NER (spaCy/Stanford NLP).
  - Created TF-IDF & Word2Vec embeddings for semantic similarity between resumes and job descriptions (see the sketch after this list).
  - Added handcrafted features (years of experience, skill overlap, education tier) for ML modeling.
  - Experimented with Logistic Regression, Random Forests, and XGBoost (Bayesian hyperparameter tuning) to rank resumes by JD fit.
  - Fine-tuned weights for tech vs. non-tech roles (e.g., heavier skill weighting for engineering jobs).
  - Incorporated hard filters (e.g., "Must have: Python") to auto-reject mismatches.
  - Achieved ~84% precision in top-5 shortlisting via cross-validation (human-annotated dataset).
  - Addressed sparse-data challenges via synthetic oversampling of niche roles.
  - Reduced bias by anonymizing resumes (removing names/gender cues) during ranking.
  - Served predictions via a Flask API with Redis caching for batch processing.
  - Designed Kafka topics with a 3-partition architecture for load balancing.
  - Implemented idempotent consumers for resume/JD processing (exactly-once semantics).
  - Scaled to 50 docs/minute using the Kafka Connect S3 sink.
  - Integrated with AWS S3 for resume storage and Airflow for scheduled JD updates.
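An illustrative sketch of the TF-IDF ranking step: score resumes against a job description by cosine similarity. The real pipeline layered Word2Vec embeddings and handcrafted features on top; the texts below are toy data.

```python
# Rank resumes by TF-IDF cosine similarity to a job description.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

jd = "Senior Python engineer: Django, PostgreSQL, Kafka, AWS"
resumes = [
    "5 years Python and Django, PostgreSQL tuning, some Kafka",
    "Java and Spring developer, Oracle DB",
    "Data engineer: Kafka pipelines, AWS, Python scripting",
]

vec = TfidfVectorizer(stop_words="english")
matrix = vec.fit_transform([jd] + resumes)        # row 0 = JD, rest = resumes

scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
for score, text in sorted(zip(scores, resumes), reverse=True):
    print(f"{score:.2f}  {text[:50]}")
```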
- Performance Tuning:
  - Stack Upgrade:
    - Upgraded an 8-year-old Ruby/Rails monolith using iterative, zero-downtime strategies.
    - Replaced Unicorn with Puma (thread-safe scaling) & transitioned from the Asset Pipeline to Webpacker.
    - Rolling tweaks for compatibility of native extensions and deprecated methods & args, with minimal code changes.
  - Hardware / VM
    - Upgraded to NVMe SSDs for disk I/O-bound workloads -> Higher disk throughput -> Disk IOPS increased from 15k to 22k (random read/write)
    - Enabled huge pages (2MB) for memory-intensive apps -> Improved TLB hit rate, reduced CPU wait states -> CPU utilization dropped from 85% to <70% sustained
    - Switched to ARM-based Graviton3 instances -> Better price-performance -> Cost per 1,000 requests reduced by 35%
  - Infrastructure (AWS)
    - Implemented spot instances for batch processing -> Cost savings -> EC2 costs decreased by 60% for non-critical workloads
    - Right-sized RDS to r6gd.2xlarge -> Balanced memory/CPU -> Query throughput increased from 1.2k to 2.1k QPS
    - Configured VPC flow logs -> Identified network bottlenecks -> Cross-AZ traffic reduced by 40%
  - Docker
    - Multi-stage builds -> Smaller images -> Image size reduced from 1.8GB to 450MB
    - Set CPU limits (4 cores) -> Prevented noisy neighbors -> Container throttling events dropped from 12/hr to 0
    - Switched to distroless base images -> Reduced attack surface -> CVE vulnerabilities decreased by 90%
  - Language (Ruby)
    - Enabled YJIT -> Faster execution -> Median request latency improved from 48ms to 29ms
    - Tuned GC (RUBY_GC_HEAP_GROWTH_MAX_SLOTS=300k) -> Fewer GC pauses -> GC time per request reduced from 8ms to 3ms
    - Adopted jemalloc -> Less fragmentation -> RSS memory stabilized at 1.2GB (was fluctuating 1-2GB)
  - Database (PostgreSQL)
    - Added partial indexes (WHERE status='active') -> Faster queries -> SELECT latency (p95) dropped from 120ms to 45ms
    - Tuned autovacuum (autovacuum_vacuum_scale_factor=0.1) -> Fewer dead tuples -> Vacuum runs decreased from 20/day to 5/day
    - Enabled parallel queries (max_parallel_workers=8) -> Improved analytics -> COUNT(*) runtime reduced from 12s to 3.2s
  - Server (Puma)
    - Adjusted workers:threads (4:8 -> 2:16) -> Better throughput -> Requests/sec increased from 850 to 1,100
    - Enabled socket activation -> Zero-downtime restarts -> Deployment downtime reduced from 8s to 0s
    - Set worker timeout (worker_timeout=30) -> Killed hung workers -> 5xx errors decreased by 75%
  - Framework (Rails)
    - Russian doll caching -> Fewer DB hits -> Cache hit rate improved from 65% to 92%
    - Optimized ActiveRecord (pluck vs. select) -> Less memory -> Allocations/request dropped from 45k to 12k objects
    - Enabled bootsnap -> Faster boots -> Application startup reduced from 12s to 4s
  - Background Jobs (Sidekiq)
    - Weighted queues (critical=5, default=1) -> Priority handling -> Critical job latency (p95) improved from 8s to 1.2s
    - Set job expiration (30m) -> Redis memory control -> Redis memory usage stabilized at 800MB (was spiking to 2GB)
    - Added idempotency keys -> Fewer duplicates -> Duplicate jobs dropped from 5% to 0.1% (pattern sketched below)
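A language-agnostic version of the idempotency-key pattern (the original jobs ran on Sidekiq/Ruby): claim a key atomically before doing the work, so a re-enqueued duplicate becomes a no-op. Key names and TTL are illustrative.

```python
# Idempotent job execution via Redis SET NX EX.
import redis

r = redis.Redis()

def process_once(idempotency_key: str, handler) -> bool:
    # Only the first worker wins the claim; the TTL bounds Redis memory.
    claimed = r.set(f"job:done:{idempotency_key}", 1, nx=True, ex=30 * 60)
    if not claimed:
        return False          # duplicate job, skip
    handler()
    return True
```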
  - GraphQL
    - Persisted queries -> Smaller payloads -> Network traffic reduced by 40%
    - Dataloader batching -> Eliminated N+1 queries -> Resolver calls/query decreased from 32 to 5 (see the sketch after this list)
    - Query complexity limits (max_depth=10) -> Blocked expensive queries -> Timeout errors dropped by 90%
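A hand-rolled sketch of the dataloader idea (production used a GraphQL dataloader library): collect the keys requested by all resolvers, then fetch them in one query instead of N.

```python
# Batch loader: one SELECT ... WHERE id IN (...) instead of a query per key.
class BatchLoader:
    def __init__(self, batch_fetch):
        self.batch_fetch = batch_fetch   # callable: keys -> {key: value}
        self.pending: set = set()
        self.cache: dict = {}

    def want(self, key):
        if key not in self.cache:
            self.pending.add(key)

    def flush(self):
        if self.pending:
            self.cache.update(self.batch_fetch(list(self.pending)))
            self.pending.clear()

    def get(self, key):
        self.flush()
        return self.cache[key]

loader = BatchLoader(lambda ids: {i: f"user-{i}" for i in ids})  # stand-in for a DB query
for author_id in [1, 2, 2, 3]:
    loader.want(author_id)
print([loader.get(i) for i in [1, 2, 3]])   # single batched fetch
```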
  - Frontend (Angular)
    - AOT compilation -> Faster rendering -> First Contentful Paint improved from 2.1s to 1.3s
    - Lazy-loaded modules -> Smaller bundles -> main.js size reduced from 1.4MB to 580KB
    - OnPush change detection -> Less CPU usage -> Animation jank decreased from 12% to 2% of frames dropped
  - Build & Deployment
    - Parallelized RSpec (--jobs 8) -> Faster CI -> Test-suite runtime reduced from 18m to 6m
    - Cached node_modules -> Fewer rebuilds -> Docker build time decreased from 5m to 90s
    - Canary deployments (5% traffic) -> Safer releases -> Rollback rate dropped from 8% to 1%
  - Scalability
    - Read replicas -> Offloaded primary DB -> Primary DB CPU reduced from 80% to 45%
    - HPA (CPU=70%) -> Auto-scaling pods -> Peak traffic capacity increased from 1k to 5k RPS
    - Database sharding (by region) -> Reduced contention -> Write latency (p99) improved from 250ms to 90ms
  - Observability
    - Distributed tracing -> Faster debugging -> MTTR (mean time to repair) reduced from 47m to 12m (minimal tracing sketch after this list)
    - SLO-based alerts -> Fewer false positives -> Alert volume decreased by 70%
    - Log sampling (10%) -> Cost control -> CloudWatch costs reduced by $1,200/month
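A minimal distributed-tracing sketch with OpenTelemetry's Python SDK. The tracing backend used in production isn't named above, so the exporter here just prints spans; the service name and attributes are illustrative.

```python
# Parent/child spans with OpenTelemetry (console exporter for demo purposes).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")    # hypothetical service name

with tracer.start_as_current_span("handle_request") as span:
    span.set_attribute("tenant.id", "acme")      # illustrative attribute
    with tracer.start_as_current_span("db.query"):
        pass   # nested span appears as a child in the trace tree
```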
Non-Tech Features:
- PR Reviews
- Hackathons
- Bi-Weekly Tech Talks
- Conferences
Drives:
- Better Service Design
- Sidekiq Pro Adoption
- NewRelic Adoption
- EC2 to ECS
- DB to S3 using AWS Glue