NoSQL (Not Only SQL) refers to a broad class of database management systems that differ from traditional relational databases (RDBMS). NoSQL databases are designed to handle:
- Large volumes of data: Capable of handling big data applications
- High velocity data: Real-time data processing
- Variety of data: Structured, semi-structured, and unstructured data
- Horizontal scalability: Scale out across multiple servers
Key Characteristics:
- Schema-less or flexible schema design
- Distributed architecture
- High availability and fault tolerance
- Eventual consistency (in most cases)
- Optimized for specific data models and access patterns
Use Cases for NoSQL:
- High Write Throughput: Applications requiring massive write operations (e.g., logging, IoT sensors)
- Flexible Schema: Applications with evolving data structures
- Horizontal Scalability: Need to scale across multiple servers
- Low Latency: Real-time applications requiring fast read/write operations
- Large Data Volumes: Big data applications with petabytes of data
- Geographically Distributed: Data needs to be replicated across regions
When NOT to use NoSQL:
- Complex transactions with ACID guarantees required
- Complex joins across multiple entities
- Strong consistency is critical
- Well-defined, stable schema
- Limited data growth expectations
The CAP theorem states that a distributed system can only guarantee two out of three properties simultaneously:
- Consistency (C): All nodes see the same data at the same time
- Availability (A): Every request receives a response (success or failure)
- Partition Tolerance (P): System continues to operate despite network partitions
Trade-offs:
- CA Systems: Sacrifice partition tolerance (traditional RDBMS in single-node setup)
- CP Systems: Sacrifice availability (MongoDB, HBase, Redis)
- AP Systems: Sacrifice consistency (Cassandra, CouchDB, DynamoDB)
MongoDB's Position: MongoDB is considered a CP system, prioritizing consistency and partition tolerance. However, it offers tunable consistency through read/write concerns.
ACID (Traditional RDBMS):
- Atomicity: All or nothing transactions
- Consistency: Database remains in valid state
- Isolation: Concurrent transactions don't interfere
- Durability: Committed data is permanent
BASE (NoSQL):
- Basically Available: System appears available most of the time
- Soft State: State may change over time without input
- Eventually Consistent: System will become consistent over time
MongoDB supports both models:
- Single document operations are ACID compliant
- Multi-document transactions (4.0+) provide ACID guarantees
- Tunable consistency through read/write concerns
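For instance, a single update that modifies several fields of one document either applies completely or not at all; a minimal mongosh sketch (the accounts collection and its fields are illustrative):
// Both modifications land atomically in the one document
db.accounts.updateOne(
  { _id: 1 },
  { $inc: { balance: -100 }, $push: { history: { amount: -100, at: new Date() } } }
)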
Store data in document format (JSON, BSON, XML).
Examples: MongoDB, CouchDB, RavenDB
Characteristics:
- Documents contain key-value pairs
- Documents can have nested structures
- Flexible schema within collection
- Rich query capabilities
Use Cases:
- Content management systems
- E-commerce product catalogs
- User profiles and preferences
- Mobile applications
Simplest NoSQL databases storing data as key-value pairs.
Examples: Redis, DynamoDB, Riak
Characteristics:
- Fast read/write operations
- Simple data model
- Limited query capabilities
- Highly scalable
Use Cases:
- Session management
- Caching layer
- Shopping carts
- Real-time recommendations
Store data in column families rather than rows.
Examples: Cassandra, HBase, ScyllaDB
Characteristics:
- Optimized for write-heavy workloads
- Efficient data compression
- Sparse data handling
- Linear scalability
Use Cases:
- Time-series data
- Event logging
- IoT sensor data
- Financial transactions
Designed to store and navigate relationships.
Examples: Neo4j, Amazon Neptune, ArangoDB
Characteristics:
- Nodes, edges, and properties
- Relationship traversal
- Pattern matching
- ACID transactions
Use Cases:
- Social networks
- Recommendation engines
- Fraud detection
- Knowledge graphs
MongoDB is a document-oriented NoSQL database that stores data in flexible, JSON-like documents called BSON (Binary JSON).
Key Features:
- Document-oriented storage
- Full index support
- Replication and high availability
- Horizontal scalability through sharding
- Rich query language
- Aggregation framework
- GridFS for large files
- Multi-document ACID transactions (4.0+)
Versions:
- MongoDB 4.0+: Multi-document transactions
- MongoDB 4.2+: Distributed transactions, field-level encryption
- MongoDB 5.0+: Time series collections, versioned API
- MongoDB 6.0+: Queryable encryption
- MongoDB 7.0+: Enhanced query performance
Components:
- mongod: The primary daemon process for the MongoDB server
  - Handles data requests
  - Manages data access
  - Performs background operations
- mongos: Query router for sharded clusters
  - Routes operations to appropriate shards
  - Merges results from shards
- mongo/mongosh: MongoDB shell for database interaction
  - Command-line interface
  - JavaScript environment
Storage Engine:
- WiredTiger (default since 3.2): Document-level concurrency control, compression
- In-Memory: For predictable latency
- Encrypted: For data-at-rest encryption
Hierarchy:
Database → Collections → Documents → Fields
Database:
- Container for collections
- Each database has separate files
- Multiple databases can exist on a single server
Collection:
- Group of MongoDB documents
- Equivalent to RDBMS table
- No fixed schema
- Documents in collection can have different fields
Document:
- Basic unit of data
- JSON-like structure (stored as BSON)
- Can contain nested documents and arrays
- Maximum size: 16MB
Example Document:
{
"_id": ObjectId("507f1f77bcf86cd799439011"),
"name": "John Doe",
"email": "john@example.com",
"age": 30,
"address": {
"street": "123 Main St",
"city": "New York",
"zipCode": "10001"
},
"hobbies": ["reading", "gaming", "coding"],
"createdAt": ISODate("2024-01-15T10:30:00Z")
}
BSON (Binary JSON):
- Binary-encoded serialization of JSON-like documents
- Extends JSON with additional data types
- Efficient for storage and traversal
- Supports embedded documents and arrays
Additional Data Types:
- ObjectId: Unique identifier (12 bytes)
- Date: 64-bit integer (milliseconds since Unix epoch)
- Binary Data: For storing binary data
- Regular Expression: For pattern matching
- Decimal128: High-precision decimal numbers
- Int32, Int64: Integer types
- MinKey, MaxKey: Comparison purposes
ObjectId Structure:
- 4 bytes: Timestamp
- 5 bytes: Random value
- 3 bytes: Incrementing counter
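Because the first 4 bytes encode a timestamp, the creation time can be read back from an _id in mongosh:
const id = ObjectId("507f1f77bcf86cd799439011")
id.getTimestamp()   // ISODate reconstructed from the leading 4 bytes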
Insert Single Document:
db.users.insertOne({
name: "Alice",
email: "alice@example.com",
age: 25
})
Insert Multiple Documents:
db.users.insertMany([
{ name: "Bob", email: "bob@example.com", age: 30 },
{ name: "Charlie", email: "charlie@example.com", age: 35 }
])
Options:
- ordered: If false, continues on error (default: true)
- writeConcern: Acknowledgment level
Find One Document:
db.users.findOne({ email: "alice@example.com" })
Find Multiple Documents:
db.users.find({ age: { $gte: 25 } })
Query Operators:
Comparison:
- $eq: Equal to
- $ne: Not equal to
- $gt: Greater than
- $gte: Greater than or equal
- $lt: Less than
- $lte: Less than or equal
- $in: Matches any value in array
- $nin: Matches none of the values in array
Logical:
- $and: Joins clauses with logical AND
- $or: Joins clauses with logical OR
- $not: Inverts effect of query
- $nor: Joins clauses with logical NOR
Element:
- $exists: Matches documents with field
- $type: Matches documents with field type
Array:
- $all: Matches arrays containing all elements
- $elemMatch: Matches documents with array element matching criteria
- $size: Matches arrays with specific length
Projection:
db.users.find(
{ age: { $gte: 25 } },
{ name: 1, email: 1, _id: 0 } // Include name and email, exclude _id
)
Cursor Methods:
db.users.find()
.sort({ age: -1 }) // Sort descending by age
.limit(10) // Limit to 10 documents
.skip(20) // Skip first 20 documents
Update One Document:
db.users.updateOne(
{ email: "alice@example.com" },
{ $set: { age: 26 } }
)
Update Multiple Documents:
db.users.updateMany(
{ age: { $lt: 18 } },
{ $set: { minor: true } }
)
Replace Document:
db.users.replaceOne(
{ email: "alice@example.com" },
{ name: "Alice Smith", email: "alice@example.com", age: 26 }
)
Update Operators:
Field Update:
- $set: Sets field value
- $unset: Removes field
- $rename: Renames field
- $inc: Increments field value
- $mul: Multiplies field value
- $min: Updates if less than current
- $max: Updates if greater than current
- $currentDate: Sets to current date
Array Update:
- $push: Adds element to array
- $pop: Removes first or last element
- $pull: Removes elements matching condition
- $addToSet: Adds element if not exists
- $each: Modifies $push and $addToSet
- $position: Specifies position for $push
- $: Positional operator for array elements
Upsert:
db.users.updateOne(
{ email: "newuser@example.com" },
{ $set: { name: "New User", age: 25 } },
{ upsert: true } // Creates document if not found
)
Delete One Document:
db.users.deleteOne({ email: "alice@example.com" })
Delete Multiple Documents:
db.users.deleteMany({ age: { $lt: 18 } })
Delete All Documents:
db.users.deleteMany({}) // Be careful!
Drop Collection:
db.users.drop()
1. Single Field Index:
db.users.createIndex({ email: 1 }) // Ascending
db.users.createIndex({ age: -1 }) // Descending
2. Compound Index:
db.users.createIndex({ lastName: 1, firstName: 1 })
3. Multikey Index:
// Automatically created for array fields
db.products.createIndex({ tags: 1 })
4. Text Index:
db.articles.createIndex({ content: "text", title: "text" })
// Search
db.articles.find({ $text: { $search: "mongodb tutorial" } })
5. Geospatial Index:
// 2dsphere for spherical geometry
db.locations.createIndex({ location: "2dsphere" })
// 2d for flat geometry
db.places.createIndex({ coordinates: "2d" })
6. Hashed Index:
db.users.createIndex({ userId: "hashed" }) // For sharding
7. Wildcard Index:
db.products.createIndex({ "attributes.$**": 1 }) // All fields under attributes
8. TTL Index:
db.sessions.createIndex(
{ createdAt: 1 },
{ expireAfterSeconds: 3600 } // Expire after 1 hour
)
Index Properties:
Unique:
db.users.createIndex({ email: 1 }, { unique: true })
Sparse:
db.users.createIndex(
{ phoneNumber: 1 },
{ sparse: true } // Only documents with phoneNumber
)
Partial:
db.orders.createIndex(
{ customerId: 1, orderDate: -1 },
{ partialFilterExpression: { status: "active" } }
)
Covered Queries: A query that can be satisfied entirely from an index, without examining documents.
db.users.createIndex({ email: 1, name: 1 })
// This query is covered
db.users.find(
{ email: "alice@example.com" },
{ email: 1, name: 1, _id: 0 }
)
Index Prefix: A compound index can support queries on its prefixes.
db.users.createIndex({ lastName: 1, firstName: 1, age: 1 })
// Supported queries:
// - { lastName: ... }
// - { lastName: ..., firstName: ... }
// - { lastName: ..., firstName: ..., age: ... }
// NOT supported efficiently:
// - { firstName: ... }
// - { age: ... }
Sort Order: Order matters for sort operations.
db.events.createIndex({ date: 1, priority: -1 })
// Efficient
db.events.find().sort({ date: 1, priority: -1 })
db.events.find().sort({ date: -1, priority: 1 })
// Inefficient (requires in-memory sort)
db.events.find().sort({ date: 1, priority: 1 })
Guidelines:
- ESR Rule (Equality, Sort, Range):
- Equality conditions first
- Sort conditions second
- Range conditions last
// Query: Find active users older than 25, sorted by lastName
// Good index:
db.users.createIndex({ status: 1, lastName: 1, age: 1 })
// Query pattern:
db.users.find({ status: "active", age: { $gt: 25 } }).sort({ lastName: 1 })-
Selectivity: Create indexes on fields with high cardinality
-
Index Size: Keep indexes in RAM for best performance
-
Too Many Indexes: Each index impacts write performance
-
Monitor: Use explain() to analyze query performance
-
Drop Unused Indexes: Regular maintenance
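A quick way to find candidates for removal is to compare index definitions against their usage counters; a minimal sketch (the index name is illustrative):
db.users.getIndexes()                         // list current index definitions
db.users.aggregate([{ $indexStats: {} }])     // per-index usage statistics
db.users.dropIndex("lastName_1_firstName_1")  // drop an index that is never used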
The aggregation pipeline processes documents through a sequence of stages.
Basic Structure:
db.collection.aggregate([
{ $stage1: { ... } },
{ $stage2: { ... } },
{ $stage3: { ... } }
])
$match: Filter documents
db.orders.aggregate([
{ $match: { status: "completed", total: { $gte: 100 } } }
])
$project: Reshape documents
db.orders.aggregate([
{ $project: {
orderId: 1,
totalAmount: "$total",
year: { $year: "$orderDate" },
_id: 0
}}
])
$group: Group by expression
db.orders.aggregate([
{ $group: {
_id: "$customerId",
totalOrders: { $sum: 1 },
totalAmount: { $sum: "$total" },
avgAmount: { $avg: "$total" }
}}
])
$sort: Sort documents
db.orders.aggregate([
{ $sort: { orderDate: -1 } }
])
$limit: Limit number of documents
db.orders.aggregate([
{ $limit: 10 }
])
$skip: Skip documents
db.orders.aggregate([
{ $skip: 20 }
])
$unwind: Deconstruct array field
db.orders.aggregate([
{ $unwind: "$items" }
])
// Input: { _id: 1, items: ["a", "b", "c"] }
// Output:
// { _id: 1, items: "a" }
// { _id: 1, items: "b" }
// { _id: 1, items: "c" }
$lookup: Left outer join
db.orders.aggregate([
{ $lookup: {
from: "customers",
localField: "customerId",
foreignField: "_id",
as: "customerInfo"
}}
])
$addFields: Add new fields
db.orders.aggregate([
{ $addFields: {
totalWithTax: { $multiply: ["$total", 1.08] }
}}
])
$replaceRoot: Replace document root
db.orders.aggregate([
{ $replaceRoot: { newRoot: "$customer" } }
])
$facet: Multiple pipelines
db.products.aggregate([
{ $facet: {
"categorizedByPrice": [
{ $bucket: {
groupBy: "$price",
boundaries: [0, 50, 100, 200],
default: "Other"
}}
],
"categorizedByTags": [
{ $unwind: "$tags" },
{ $sortByCount: "$tags" }
]
}}
])
Arithmetic:
$add, $subtract, $multiply, $divide, $mod, $abs, $ceil, $floor, $round, $pow, $sqrt, $exp, $log
String:
$concat, $substr, $toLower, $toUpper, $split, $trim, $ltrim, $rtrim, $strcasecmp, $strLenCP
Array:
$size, $arrayElemAt, $slice, $filter, $map, $reduce, $in, $concatArrays
Date:
$year, $month, $dayOfMonth, $hour, $minute, $dateToString, $dateToParts
Conditional:
$cond, $ifNull, $switch
Accumulator (in $group):
$sum, $avg, $min, $max, $first, $last, $push, $addToSet, $stdDevPop, $stdDevSamp
Example Pipeline:
db.sales.aggregate([
// Stage 1: Filter sales from 2024
{ $match: {
orderDate: {
$gte: ISODate("2024-01-01"),
$lt: ISODate("2025-01-01")
}
}},
// Stage 2: Unwind items array
{ $unwind: "$items" },
// Stage 3: Group by product and calculate metrics
{ $group: {
_id: "$items.product",
totalQuantity: { $sum: "$items.quantity" },
totalRevenue: { $sum: { $multiply: ["$items.quantity", "$items.price"] } },
avgPrice: { $avg: "$items.price" },
orderCount: { $sum: 1 }
}},
// Stage 4: Sort by revenue descending
{ $sort: { totalRevenue: -1 } },
// Stage 5: Limit to top 10
{ $limit: 10 },
// Stage 6: Format output
{ $project: {
product: "$_id",
totalQuantity: 1,
totalRevenue: { $round: ["$totalRevenue", 2] },
avgPrice: { $round: ["$avgPrice", 2] },
orderCount: 1,
_id: 0
}}
])
Optimization Tips:
- Place $match early: Reduce documents as soon as possible
- Place $project early: Reduce document size
- Use indexes: $match and $sort can use indexes if early in pipeline
- Avoid $lookup when possible: Can be expensive
- Use $limit: When you don't need all results
- allowDiskUse: Lets stages that exceed the 100MB per-stage memory limit spill to disk
db.collection.aggregate(pipeline, { allowDiskUse: true })
A replica set is a group of MongoDB instances that maintain the same data set, providing redundancy and high availability.
Components:
- Primary: Receives all write operations
- Secondary: Replicates primary's oplog and applies operations
- Arbiter: Participates in elections but doesn't hold data
Minimum Configuration:
- 1 Primary + 2 Secondaries
- 1 Primary + 1 Secondary + 1 Arbiter
Benefits:
- High availability (automatic failover)
- Data redundancy
- Read scaling (read from secondaries)
- Disaster recovery
- Zero-downtime maintenance
Oplog (Operations Log):
- Special capped collection: local.oplog.rs
- Records all write operations
- Secondaries tail and apply operations
- Size configurable (default with WiredTiger: 5% of free disk space)
Initial Sync:
- Clone all databases from primary
- Apply operations from oplog during clone
- Build indexes
- Pull and apply remaining operations
Steady State Replication:
- Secondary fetches oplog entries from primary
- Applies operations in batches
- Records the applied operations in its own local.oplog.rs
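In mongosh, the oplog window and replication lag can be inspected with the built-in helpers:
rs.printReplicationInfo()           // oplog size and the time window it covers
rs.printSecondaryReplicationInfo()  // how far each secondary lags behind the primary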
Election Process: Triggered when:
- Primary becomes unreachable
- Primary steps down
- New member added with higher priority
- Automatic maintenance
Election Protocol:
- Members vote for a new primary
- Requires majority (n/2 + 1) to elect
- Member with highest priority eligible
- Most up-to-date oplog wins
Factors Affecting Elections:
- Priority: Higher priority more likely (0-1000, default: 1)
- Votes: Each member has 0 or 1 vote
- Oplog: More recent oplog data preferred
- Network latency: Lower latency preferred
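Priorities are changed by editing the replica set configuration and reapplying it; a minimal mongosh sketch (the member index is illustrative):
// Favor the second member in future elections
const cfg = rs.conf()
cfg.members[1].priority = 2
rs.reconfig(cfg)   // may trigger an election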
Priority 0 Members:
// Cannot become primary
{
_id: 2,
host: "mongodb2.example.com:27017",
priority: 0
}
Hidden Members:
// Hidden from application, priority 0, used for backups
{
_id: 3,
host: "mongodb3.example.com:27017",
hidden: true,
priority: 0
}
Delayed Members:
// Maintains delayed copy, useful for recovery
{
_id: 4,
host: "mongodb4.example.com:27017",
priority: 0,
hidden: true,
secondaryDelaySecs: 3600 // 1 hour delay (called slaveDelay before MongoDB 5.0)
}
Controls where reads are directed in a replica set.
Modes:
- primary (default): All reads from primary
  - Strong consistency
  - No stale reads
  - Single point of read load
- primaryPreferred: Primary if available, otherwise secondary
  - Falls back to a secondary when the primary is unavailable
- secondary: All reads from secondaries
  - Distributes read load
  - May return stale data
  - Good for analytics
- secondaryPreferred: Secondary if available, otherwise primary
  - Reduces primary load
- nearest: Reads from lowest-latency member
  - Best for geographically distributed apps
Tag Sets:
db.collection.find().readPref("secondary", [
{ "datacenter": "west" },
{ "datacenter": "east" }
])
Write Concern:
db.collection.insertOne(
{ name: "Alice" },
{ writeConcern: { w: "majority", j: true, wtimeout: 5000 } }
)
- w: Number of members to acknowledge (1, 2, "majority")
- j: Wait for journal (true/false)
- wtimeout: Time limit in milliseconds
Read Concern:
db.collection.find().readConcern("majority")
- local: Default, returns most recent data
- available: No guarantee on data durability
- majority: Returns data acknowledged by majority
- linearizable: Strongest consistency
- snapshot: For transactions
Sharding is MongoDB's approach to horizontal scaling by distributing data across multiple machines.
Components:
- Shard: Holds subset of data (replica set)
- Config Servers: Store cluster metadata (replica set)
- Query Router (mongos): Routes operations to shards
Benefits:
- Horizontal scaling
- Increased storage capacity
- Higher throughput
- Geographic distribution
When to Shard:
- Data exceeds single server capacity
- Working set exceeds RAM
- Write throughput exceeds single server
- Need geographic distribution
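Connected to a mongos, sharding is enabled for a database and then for each collection on a chosen shard key; a minimal sketch with illustrative names:
sh.enableSharding("mydb")                                    // allow collections in mydb to be sharded
sh.shardCollection("mydb.orders", { customerId: "hashed" })  // distribute documents by hashed customerId
sh.status()                                                  // review shards, chunks, and the balancer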
The shard key determines how data is distributed across shards.
Characteristics of Good Shard Key:
- High Cardinality: Many distinct values
- Even Distribution: Data spread evenly
- Query Isolation: Queries target single shard
- Non-Monotonic: Avoid monotonically increasing keys, which funnel all inserts to one shard (hotspots)
Shard Key Strategies:
1. Hashed Shard Key:
sh.shardCollection("mydb.users", { userId: "hashed" })- Pros: Even distribution, no hotspots
- Cons: Range queries scatter across all shards
2. Range-Based Shard Key:
sh.shardCollection("mydb.orders", { customerId: 1, orderDate: 1 })- Pros: Range queries efficient, geographically targetable
- Cons: Potential hotspots with poor key choice
3. Compound Shard Key:
sh.shardCollection("mydb.events", { userId: 1, timestamp: 1 })- Balances distribution and query performance
Anti-Patterns:
- Monotonically increasing values (timestamps, ObjectIds)
- Low cardinality fields (boolean, status)
- Frequently updated fields
Chunk Distribution:
- Data divided into chunks (default chunk size: 64MB; 128MB since MongoDB 6.0)
- Chunks distributed across shards
- Splits occur when chunk exceeds size
- Balancer migrates chunks between shards
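How chunks ended up spread across shards can be checked from mongosh (collection name illustrative):
db.orders.getShardDistribution()  // documents, data size, and chunks per shard
sh.status()                       // cluster-wide chunk and balancer overview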
Zone Sharding: Associate tag with shard range for data locality.
// Add tags to shards
sh.addShardTag("shard0000", "US")
sh.addShardTag("shard0001", "EU")
// Associate ranges with tags
sh.addTagRange(
"mydb.users",
{ country: "US", userId: MinKey },
{ country: "US", userId: MaxKey },
"US"
)
Balancer: Background process that distributes chunks evenly.
Balancing Window:
db.settings.update(
{ _id: "balancer" },
{ $set: { activeWindow: { start: "23:00", stop: "06:00" } } },
{ upsert: true }
)
Disable Balancer:
sh.stopBalancer()
sh.setBalancerState(false)
Check Balancer Status:
sh.getBalancerState()
sh.isBalancerRunning()
Embedding (Denormalization): Store related data in a single document.
{
_id: ObjectId("..."),
name: "John Doe",
addresses: [
{ street: "123 Main St", city: "NYC", type: "home" },
{ street: "456 Work Ave", city: "NYC", type: "work" }
]
}
When to Embed:
- One-to-one relationships
- One-to-few relationships
- Data accessed together
- Limited data growth
- Strong data locality needed
Pros:
- Better read performance
- Single query retrieves all data
- Atomic updates
Cons:
- Document size limits (16MB)
- Data duplication
- Update complexity
Referencing (Normalization): Store related data in separate collections.
// Users collection
{
_id: ObjectId("507f1f77bcf86cd799439011"),
name: "John Doe"
}
// Addresses collection
{
_id: ObjectId("507f191e810c19729de860ea"),
userId: ObjectId("507f1f77bcf86cd799439011"),
street: "123 Main St",
city: "NYC"
}
When to Reference:
- One-to-many relationships
- Many-to-many relationships
- Data accessed independently
- Large subdocuments
- Unbounded data growth
Pros:
- No duplication
- Smaller documents
- Flexibility
Cons:
- Multiple queries or $lookup needed
- No atomicity across documents
1. Attribute Pattern: For documents with many similar fields.
// Instead of:
{
productId: 123,
color_red: true,
color_blue: false,
size_small: true,
size_large: false
}
// Use:
{
productId: 123,
attributes: [
{ key: "color", value: "red" },
{ key: "size", value: "small" }
]
}
// Index: db.products.createIndex({ "attributes.key": 1, "attributes.value": 1 })
2. Bucket Pattern: Group time-series data into buckets.
{
sensorId: "sensor1",
timestamp: ISODate("2024-01-01T00:00:00Z"),
measurements: [
{ time: 0, temp: 20.5 },
{ time: 60, temp: 20.7 },
{ time: 120, temp: 20.9 }
],
count: 3,
sum: 62.1,
min: 20.5,
max: 20.9
}
3. Subset Pattern: Store frequently accessed subset of data.
{
_id: ObjectId("..."),
title: "Movie Title",
director: "Director Name",
recentReviews: [ // Last 10 reviews
{ user: "Alice", rating: 5, date: "..." },
{ user: "Bob", rating: 4, date: "..." }
],
reviewCount: 1523,
avgRating: 4.2
}
// Full reviews in separate collection
4. Extended Reference Pattern: Store frequently accessed fields from referenced document.
{
_id: ObjectId("..."),
orderId: "ORD-001",
customerId: ObjectId("..."),
customerName: "John Doe", // Duplicated from customer
customerEmail: "john@example.com", // Duplicated
items: [...],
total: 250.00
}
5. Outlier Pattern: Handle documents with disproportionate size.
// Normal document
{
_id: ObjectId("..."),
productId: "P001",
reviews: [
{ user: "Alice", rating: 5 },
{ user: "Bob", rating: 4 }
],
hasOverflow: false
}
// Outlier document with overflow
{
_id: ObjectId("..."),
productId: "P002",
reviews: [/* first 100 reviews */],
hasOverflow: true
}
// Overflow document
{
_id: ObjectId("..."),
productId: "P002",
reviewsBatch: 2,
reviews: [/* reviews 101-200 */]
}
6. Schema Versioning Pattern: Handle schema evolution.
{
_id: ObjectId("..."),
schemaVersion: 2,
name: "Product",
// v2 fields
newField: "value"
}
Schema Design Anti-Patterns:
1. Massive Arrays: Unbounded array growth exceeds 16MB limit.
// Bad: Unbounded array
{
userId: 123,
posts: [/* thousands of posts */]
}
// Good: Reference or bucket pattern
2. Massive Number of Collections: Thousands of collections impact performance.
3. Bloated Documents: Storing unnecessary data in documents.
4. Unnecessary Indexes: Every index impacts write performance.
5. Case-Insensitive Queries Without Index:
// Bad: Slow regex
db.users.find({ email: /^alice@/i })
// Good: Store lowercase version
{
email: "Alice@Example.com",
emailLower: "alice@example.com"
}
db.users.createIndex({ emailLower: 1 })
6. Separating Data That's Accessed Together:
// Bad: Multiple queries
db.users.findOne({ _id: userId })
db.addresses.find({ userId: userId })
// Good: Embed if accessed together
{
_id: userId,
name: "...",
addresses: [...]
}
Guidelines:
- Design for Access Patterns: Model based on how data is queried
- Embed for Atomicity: Embed related data needing atomic updates
- Reference for Flexibility: Reference when data accessed independently
- Consider Document Growth: Avoid unbounded arrays
- Optimize for Reads or Writes: Balance based on workload
- Denormalize Strategically: Duplicate frequently accessed data
- Use Indexes Wisely: Index fields used in queries
- Monitor Document Size: Stay well below 16MB limit
- Plan for Evolution: Design for schema changes
Decision Framework:
- Identify entities and relationships
- Determine access patterns
- Decide embedding vs referencing
- Consider data lifecycle
- Evaluate performance requirements
- Plan for scale
- Validate with prototypes
MongoDB supports ACID transactions at multiple levels:
Single Document:
- Atomic by default
- All or nothing modifications
- Isolated from other operations
Multi-Document Transactions (4.0+):
- ACID guarantees across multiple documents
- Across multiple collections
- Distributed transactions across shards (4.2+)
Basic Transaction:
const session = db.getMongo().startSession()
session.startTransaction()
try {
const accounts = session.getDatabase("bank").accounts
accounts.updateOne(
{ account: "A" },
{ $inc: { balance: -100 } },
{ session }
)
accounts.updateOne(
{ account: "B" },
{ $inc: { balance: 100 } },
{ session }
)
session.commitTransaction()
} catch (error) {
session.abortTransaction()
throw error
} finally {
session.endSession()
}
With Retry Logic:
async function runTransactionWithRetry(txnFunc, session) {
while (true) {
try {
return await txnFunc(session)
} catch (error) {
if (error.hasErrorLabel("TransientTransactionError")) {
console.log("TransientTransactionError, retrying...")
continue
}
throw error
}
}
}
async function commitWithRetry(session) {
while (true) {
try {
await session.commitTransaction()
break
} catch (error) {
if (error.hasErrorLabel("UnknownTransactionCommitResult")) {
console.log("UnknownTransactionCommitResult, retrying...")
continue
}
throw error
}
}
}
Transaction Options:
session.startTransaction({
readConcern: { level: "snapshot" },
writeConcern: { w: "majority" },
readPreference: "primary"
})
Restrictions:
- Time Limit: Default 60 seconds (configurable)
- Oplog Size: Transaction size limited by oplog
- 16MB Document Limit: Still applies to individual documents
- DDL Operations: Most DDL is not allowed; creating collections and indexes inside a transaction is supported only in 4.4+
- No Mixed Operations: Can't mix sharded and unsharded collections
Operations Not Allowed:
- Creating indexes on existing collections
- Creating collections (before 4.4)
- Dropping collections
- Listing collections
- Listing indexes
Best Practices:
- Keep Transactions Short: Minimize transaction duration
- Use Appropriate Concerns: Balance consistency and performance
- Handle Errors: Implement retry logic
- Consider Alternatives: Use single document atomicity when possible
- Monitor Performance: Track transaction metrics
- Limit Scope: Include only necessary operations
When to Use Transactions:
- Financial transactions
- Multi-document consistency required
- Complex business logic requiring atomicity
When NOT to Use Transactions:
- Single document operations (already atomic)
- Read-heavy workloads
- High-throughput scenarios
- Simple operations
Use Explain:
db.users.find({ age: { $gt: 25 } }).explain("executionStats")
Execution Stages:
- COLLSCAN: Collection scan (no index)
- IXSCAN: Index scan
- FETCH: Retrieve documents
- SORT: In-memory sort
- LIMIT: Limit results
Key Metrics:
- executionTimeMillis: Query execution time
- totalDocsExamined: Documents scanned
- totalKeysExamined: Index keys scanned
- nReturned: Documents returned
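These metrics can also be pulled out of the explain output programmatically; a small sketch:
const stats = db.users.find({ age: { $gt: 25 } }).explain("executionStats").executionStats
printjson({
  timeMs: stats.executionTimeMillis,
  docsExamined: stats.totalDocsExamined,  // much larger than nReturned hints at a missing index
  keysExamined: stats.totalKeysExamined,
  returned: stats.nReturned
})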
Optimization Strategies:
- Use Covered Queries:
db.users.createIndex({ email: 1, name: 1 })
// Covered query
db.users.find(
{ email: "alice@example.com" },
{ email: 1, name: 1, _id: 0 }
)
- Project Only Needed Fields:
// Bad
db.users.find({ age: { $gt: 25 } })
// Good
db.users.find(
{ age: { $gt: 25 } },
{ name: 1, email: 1 }
)
- Use Limit with Sort:
db.users.find().sort({ age: -1 }).limit(10)
- Avoid Negation:
// Bad: Cannot use index efficiently
db.users.find({ status: { $ne: "inactive" } })
// Good: Use positive conditions
db.users.find({ status: { $in: ["active", "pending"] } })
- Use $in Instead of $or for Same Field:
// Bad
db.users.find({ $or: [
{ status: "active" },
{ status: "pending" }
]})
// Good
db.users.find({ status: { $in: ["active", "pending"] } })
Enable Profiling:
// 0 = off, 1 = slow operations (> slowms), 2 = all operations
db.setProfilingLevel(1, { slowms: 100 })
Query Profiler Output:
db.system.profile.find().sort({ ts: -1 }).limit(5).pretty()
Disable Profiling:
db.setProfilingLevel(0)
Analyze Slow Queries:
db.system.profile.find({
millis: { $gt: 100 }
}).sort({ millis: -1 })
Database Statistics:
db.stats()
db.collection.stats()
Server Status:
db.serverStatus()
Current Operations:
db.currentOp()
Kill Operation:
db.killOp(opId)
Connection Statistics:
db.serverStatus().connections
Replica Set Status:
rs.status()
Sharding Status:
sh.status()
Monitoring Tools:
- MongoDB Atlas (cloud monitoring)
- MongoDB Compass (GUI)
- MongoDB Ops Manager
- Prometheus + Grafana
- Third-party APM tools
WiredTiger Cache:
- Default: the larger of 50% of (RAM - 1GB) or 256MB
- Configurable via --wiredTigerCacheSizeGB
Working Set:
- Frequently accessed data
- Should fit in RAM for optimal performance
Application-Level Caching:
- Redis/Memcached: Cache frequently accessed queries
- Read from Secondaries: Distribute read load
- Materialized Views: Pre-aggregate data
- Connection Pooling: Reuse connections
Query Result Caching:
// Application-level cache
const cache = new Map()
async function getUser(userId) {
const cacheKey = `user:${userId}`
if (cache.has(cacheKey)) {
return cache.get(cacheKey)
}
const user = await db.users.findOne({ _id: userId })
cache.set(cacheKey, user)
return user
}
Cache Invalidation:
- Time-based expiration (TTL)
- Event-based invalidation
- Write-through cache
- Cache-aside pattern
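Time-based expiration can be layered onto the Map cache above by storing an expiry with each entry; a minimal sketch reusing the hypothetical users collection:
const cache = new Map()
const TTL_MS = 60 * 1000  // entries are considered stale after one minute

async function getUserCached(userId) {
  const key = `user:${userId}`
  const hit = cache.get(key)
  if (hit && hit.expiresAt > Date.now()) return hit.value  // fresh hit
  const user = await db.users.findOne({ _id: userId })     // fall through to the database
  cache.set(key, { value: user, expiresAt: Date.now() + TTL_MS })
  return user
}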
Authentication Methods:
- SCRAM (Salted Challenge Response Authentication Mechanism):
  - Default authentication mechanism
  - SCRAM-SHA-1, SCRAM-SHA-256
- x.509 Certificate Authentication:
  - Uses SSL/TLS certificates
- LDAP (Enterprise):
  - External authentication
- Kerberos (Enterprise):
  - External authentication
Create User:
db.createUser({
user: "appUser",
pwd: "securePassword",
roles: [
{ role: "readWrite", db: "mydb" },
{ role: "read", db: "reporting" }
]
})
Enable Authentication:
# In mongod.conf
security:
  authorization: enabled
Connect with Authentication:
mongosh mongodb://username:password@localhost:27017/admin
Built-in Roles:
Database User Roles:
- read: Read data from all non-system collections
- readWrite: Read and modify data
Database Admin Roles:
- dbAdmin: Schema and indexing operations
- dbOwner: Any action on the database
- userAdmin: Create and modify users
Cluster Admin Roles:
- clusterAdmin: Greatest cluster admin access
- clusterManager: Cluster management
- clusterMonitor: Monitoring tools
- hostManager: Monitor and manage servers
Backup/Restore Roles:
- backup: Backup data
- restore: Restore data
All-Database Roles:
- readAnyDatabase, readWriteAnyDatabase, userAdminAnyDatabase, dbAdminAnyDatabase
Superuser Roles:
- root: Full access to all resources
Custom Roles:
db.createRole({
role: "customRole",
privileges: [
{
resource: { db: "mydb", collection: "users" },
actions: ["find", "insert", "update"]
}
],
roles: []
})
Encryption at Rest:
- Storage Engine Encryption (Enterprise):
  - Encrypts data files
  - Key management via KMIP
- File System Encryption:
  - OS-level encryption (e.g., LUKS, BitLocker)
- Client-Side Field Level Encryption:
  - Application encrypts sensitive fields
  - Available in MongoDB 4.2+
Encryption in Transit:
Enable TLS/SSL:
# mongod.conf
net:
tls:
mode: requireTLS
certificateKeyFile: /path/to/mongodb.pem
CAFile: /path/to/ca.pem
Client-Side Field Level Encryption:
const { ClientEncryption } = require('mongodb-client-encryption')
const encryption = new ClientEncryption(client, {
keyVaultNamespace: 'encryption.__keyVault',
kmsProviders: {
local: {
key: Buffer.from(masterKey, 'base64')
}
}
})
// Encrypt field
const encryptedField = await encryption.encrypt(
'sensitive data',
{
algorithm: 'AEAD_AES_256_CBC_HMAC_SHA_512-Deterministic',
keyId: dataKeyId
}
)
Enable Auditing (Enterprise):
# mongod.conf
auditLog:
destination: file
format: JSON
path: /var/log/mongodb/audit.json
filter: '{ atype: { $in: ["authenticate", "createUser", "dropUser"] } }'
Audit Events:
- Authentication attempts
- User management operations
- Role management operations
- Database operations
- Collection operations
Filter Example:
// Audit failed authentication
{
atype: "authenticate",
"param.result": { $ne: 0 }
}
Security Best Practices:
- Enable Authentication: Always require authentication
- Use Role-Based Access: Principle of least privilege
- Enable TLS/SSL: Encrypt data in transit
- Regular Updates: Keep MongoDB updated
- Network Security: Use firewalls, VPNs
- Audit Logging: Monitor security events
- Strong Passwords: Enforce password policies
- Backup Encryption: Encrypt backups
- Limit Network Exposure: Bind to specific IPs
- Monitor Access: Track user activity
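For the last two points, exposure is usually limited in mongod.conf by binding only to known interfaces; an illustrative snippet (addresses are placeholders):
# mongod.conf
net:
  port: 27017
  bindIp: 127.0.0.1,10.0.0.12   # localhost plus one private interface only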
Q: What is MongoDB and why use it?
A: MongoDB is a document-oriented NoSQL database that stores data in flexible, JSON-like documents. Key advantages include:
- Flexible schema for evolving applications
- Horizontal scalability through sharding
- Rich query language and aggregation framework
- High availability through replica sets
- Better performance for certain use cases
- Native support for hierarchical data structures
Q: Explain the difference between SQL and NoSQL databases.
A:
- Data Model: SQL uses structured tables with fixed schema; NoSQL offers flexible schemas (documents, key-value, graphs, columns)
- Scalability: SQL typically scales vertically; NoSQL scales horizontally
- ACID vs BASE: SQL guarantees ACID; NoSQL often uses BASE
- Relationships: SQL uses joins; NoSQL uses embedding or references
- Schema: SQL requires predefined schema; NoSQL is schema-less or flexible
- Use Cases: SQL for complex transactions; NoSQL for big data, real-time applications
Q: What is the CAP theorem?
A: CAP theorem states distributed systems can guarantee only two of three properties: Consistency (all nodes see same data), Availability (every request gets response), and Partition Tolerance (system works despite network failures). MongoDB is a CP system, prioritizing consistency and partition tolerance, though it offers tunable consistency.
Q: Explain replica sets in MongoDB.
A: Replica sets provide redundancy and high availability through data replication across multiple servers. They consist of a primary node (receives writes), secondary nodes (replicate data), and optionally arbiters (voting only). Automatic failover occurs if primary becomes unavailable, with secondaries electing a new primary. Benefits include data redundancy, disaster recovery, read scaling, and zero-downtime maintenance.
Q: What is sharding and when to use it?
A: Sharding is horizontal partitioning that distributes data across multiple servers (shards). Use sharding when: data exceeds single server capacity, working set exceeds RAM, write throughput is too high for one server, or geographic distribution is needed. Components include shards (hold data), config servers (metadata), and mongos routers (query routing).
Q: Explain indexing in MongoDB.
A: Indexes improve query performance by creating data structures that allow efficient lookups. MongoDB supports various index types: single field, compound, multikey (arrays), text, geospatial, hashed, and wildcard. Indexes speed up queries but impact write performance and consume storage. Best practices include indexing frequently queried fields, using compound indexes for multiple conditions, and following the ESR rule (Equality, Sort, Range).
Q: What is the aggregation framework?
A: The aggregation framework processes documents through a pipeline of stages to transform and analyze data. Common stages include $match (filter), $group (aggregate), $project (reshape), $sort, $lookup (join), and $unwind (deconstruct arrays). It's more powerful than simple queries for complex data analysis, reporting, and data transformation tasks.
Q: Explain embedding vs referencing.
A: Embedding stores related data within a single document, providing better read performance and atomic updates but limited by 16MB document size. Referencing stores related data in separate collections, offering flexibility and avoiding duplication but requiring multiple queries or $lookup. Choose embedding for one-to-few relationships accessed together; choose referencing for one-to-many or many-to-many relationships with independent access patterns.
Q: Does MongoDB support transactions?
A: Yes, MongoDB supports ACID transactions. Single document operations are atomic by default. Multi-document transactions (4.0+) provide ACID guarantees across multiple documents and collections, with distributed transaction support across shards (4.2+). Transactions should be kept short and used when atomicity across multiple documents is required, such as financial transactions or complex business logic.
Q: How does MongoDB ensure high availability?
A: MongoDB ensures high availability through replica sets with automatic failover. When a primary node fails, secondaries automatically elect a new primary within seconds. Write concerns and read preferences allow tuning consistency and availability trade-offs. Additional features include rolling upgrades, backup and restore capabilities, and monitoring tools for proactive issue detection.
Q: Write a query to find all users older than 25 and sort by name.
db.users.find({ age: { $gt: 25 } }).sort({ name: 1 })
Q: How do you create a unique index on email field?
db.users.createIndex({ email: 1 }, { unique: true })
Q: Write an aggregation pipeline to count users by country.
db.users.aggregate([
{ $group: {
_id: "$country",
count: { $sum: 1 }
}},
{ $sort: { count: -1 } }
])
Q: How do you update multiple documents?
db.users.updateMany(
{ status: "inactive" },
{ $set: { archived: true } }
)
Q: Write a query to find users with at least one hobby in a list.
db.users.find({ hobbies: { $in: ["reading", "gaming"] } })
Q: How do you create a compound index?
db.orders.createIndex({ customerId: 1, orderDate: -1 })
Q: Write an aggregation to calculate average order value per customer.
db.orders.aggregate([
{ $group: {
_id: "$customerId",
avgOrderValue: { $avg: "$total" },
orderCount: { $sum: 1 }
}},
{ $sort: { avgOrderValue: -1 } }
])
Q: How do you perform a text search?
// Create text index
db.articles.createIndex({ title: "text", content: "text" })
// Search
db.articles.find({ $text: { $search: "mongodb tutorial" } })
Q: Write a query using $lookup to join collections.
db.orders.aggregate([
{ $lookup: {
from: "customers",
localField: "customerId",
foreignField: "_id",
as: "customer"
}},
{ $unwind: "$customer" }
])
Q: How do you implement pagination?
const page = 2
const pageSize = 10
db.users.find()
.sort({ createdAt: -1 })
.skip((page - 1) * pageSize)
.limit(pageSize)
Q: How would you design a schema for a blog platform?
A: Design considerations:
- Users Collection: Store user profiles with embedded preferences
- Posts Collection: Store posts with embedded comments (if limited) or referenced comments (if unbounded)
- Comments Collection: Separate if posts have many comments
- Tags: Array field in posts for many-to-many relationship
- Indexes: On authorId, tags, publishDate, slug
// Users
{
_id: ObjectId,
username: String,
email: String,
profile: { bio, avatar }
}
// Posts
{
_id: ObjectId,
title: String,
content: String,
authorId: ObjectId,
tags: [String],
publishDate: Date,
comments: [ // Embed first 10
{ author, text, date }
],
commentCount: Number
}
// Comments (for overflow)
{
_id: ObjectId,
postId: ObjectId,
author: String,
text: String,
date: Date
}
Q: Your application has slow queries. How do you diagnose and fix?
A: Diagnostic steps:
- Enable profiling: db.setProfilingLevel(1, { slowms: 100 })
- Analyze slow queries: db.system.profile.find()
- Use explain(): Check execution plan
- Check indexes: Ensure proper indexes exist
- Monitor server metrics: CPU, memory, disk I/O
Solutions:
- Create appropriate indexes
- Use covered queries
- Limit returned fields with projection
- Implement caching
- Consider sharding for large datasets
- Optimize query patterns
- Use aggregation pipeline efficiently
Q: How would you migrate from SQL to MongoDB?
A: Migration approach:
- Analyze Schema: Identify entities and relationships
- Design MongoDB Schema: Decide embedding vs referencing
- Create Indexes: Based on query patterns
- Data Migration:
- ETL tools or custom scripts
- Incremental migration if possible
- Validate data integrity
- Application Changes: Update queries and transactions
- Testing: Comprehensive testing of all functionality
- Gradual Rollout: Parallel run if possible
- Monitor: Watch performance and errors
Q: How do you handle a growing dataset that exceeds single server capacity?
A: Scaling strategy:
- Vertical Scaling: Upgrade server resources (temporary solution)
- Indexing: Optimize queries to reduce working set
- Archive Old Data: Move historical data to separate collection
- Sharding: Horizontal scaling
- Choose appropriate shard key
- Plan shard key early (hard to change)
- Consider hashed vs range-based sharding
- Implement gradually
- Read Replicas: Distribute read load
- Caching: Reduce database load
- Data Lifecycle: Implement TTL indexes for ephemeral data
Q: Your replica set primary keeps stepping down. How do you troubleshoot?
A: Investigation steps:
- Check Logs: Look for election triggers
- Network Issues: Check connectivity between nodes
- Resource Constraints: CPU, memory, disk I/O
- Replica Lag: Check if secondaries are far behind
- Priority Settings: Verify member priorities
- Write Concerns: Check if too strict
- Oplog Size: Ensure adequate oplog
Solutions:
- Fix network issues or latency
- Increase server resources
- Adjust replica set configuration
- Optimize queries reducing load
- Check application write patterns
- Review monitoring and alerts
Q: How would you implement real-time analytics on operational data?
A: Architecture options:
- Read from Secondaries:
  - Configure secondary with delayed replication
  - Direct analytics queries to secondary
  - Reduces primary load
- Change Streams:
  - Real-time data streaming (see the watch() sketch after this list)
  - Trigger analytics on data changes
  - Push to analytics platform
- Separate Analytics Database:
  - ETL pipeline from operational to analytics DB
  - Optimized schema for analytics
  - No impact on operational performance
- Time Series Collections (5.0+):
  - Optimized for time-series data
  - Efficient storage and queries
- Materialized Views:
  - Pre-aggregated data
  - Periodic updates via aggregation
  - Fast query performance
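A change stream can be opened straight from mongosh or any driver; a minimal sketch (collection name and filter are illustrative):
// Follow inserts on orders and hand each new document to the analytics side
const watchCursor = db.orders.watch([{ $match: { operationType: "insert" } }])
while (!watchCursor.isClosed()) {
  const change = watchCursor.tryNext()
  if (change) printjson(change.fullDocument)  // replace with a push to the analytics pipeline
}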
Q: How do you ensure data consistency across microservices?
A: Strategies:
- Saga Pattern:
  - Sequence of local transactions
  - Compensating transactions for rollback
  - Event-driven coordination
- Event Sourcing:
  - Store events, not state
  - Replay events to rebuild state
  - MongoDB as event store
- Two-Phase Commit (limited use):
  - Application-level coordination
  - Fallback mechanisms
- Eventual Consistency:
  - Accept temporary inconsistency
  - Design for idempotency
  - Conflict resolution strategies
- Shared Database (anti-pattern):
  - Single database for related services
  - Tight coupling (avoid if possible)
MongoDB is a powerful NoSQL database offering flexibility, scalability, and rich features. Key concepts for interviews include:
- Fundamentals: Document model, BSON, replica sets, sharding
- Operations: CRUD, indexing, aggregation framework
- Design: Schema design patterns, embedding vs referencing
- Scalability: Horizontal scaling via sharding, replica sets for HA
- Performance: Query optimization, indexing strategies, caching
- Advanced: Transactions, security, monitoring
Interview Tips:
- Understand core concepts thoroughly
- Practice hands-on with MongoDB shell
- Know when to use MongoDB vs RDBMS
- Explain trade-offs in design decisions
- Demonstrate problem-solving approach
- Discuss real-world scenarios
- Stay updated on latest features
- Emphasize best practices
Resources for Further Learning:
- MongoDB University (free courses)
- Official MongoDB documentation
- MongoDB blog and community forums
- Practice on MongoDB Atlas (free tier)
- Build projects to apply concepts
Good luck with your interviews!