NoSQL (Not Only SQL) refers to a broad class of database management systems that differ from traditional relational databases (RDBMS). NoSQL databases are designed to handle:
- Large volumes of data: Capable of handling big data applications
- High velocity data: Real-time data processing
- Variety of data: Structured, semi-structured, and unstructured data
- Horizontal scalability: Scale out across multiple servers
Key Characteristics:
- Schema-less or flexible schema design
- Distributed architecture
- High availability and fault tolerance
- Eventual consistency (in most cases)
- Optimized for specific data models and access patterns
Use Cases for NoSQL:
- High Write Throughput: Applications requiring massive write operations (e.g., logging, IoT sensors)
- Flexible Schema: Applications with evolving data structures
- Horizontal Scalability: Need to scale across multiple servers
- Low Latency: Real-time applications requiring fast read/write operations
- Large Data Volumes: Big data applications with petabytes of data
- Geographically Distributed: Data needs to be replicated across regions
When NOT to use NoSQL:
- Complex transactions with ACID guarantees required
- Complex joins across multiple entities
- Strong consistency is critical
- Well-defined, stable schema
- Limited data growth expectations
The CAP theorem states that a distributed system can only guarantee two out of three properties simultaneously:
- Consistency (C): All nodes see the same data at the same time
- Availability (A): Every request receives a response (success or failure)
- Partition Tolerance (P): System continues to operate despite network partitions
Trade-offs:
- CA Systems: Sacrifice partition tolerance (traditional RDBMS in single-node setup)
- CP Systems: Sacrifice availability (MongoDB, HBase, Redis)
- AP Systems: Sacrifice consistency (Cassandra, CouchDB, DynamoDB)
MongoDB's Position: MongoDB is considered a CP system, prioritizing consistency and partition tolerance. However, it offers tunable consistency through read/write concerns.
ACID (Traditional RDBMS):
- Atomicity: All or nothing transactions
- Consistency: Database remains in valid state
- Isolation: Concurrent transactions don't interfere
- Durability: Committed data is permanent
BASE (NoSQL):
- Basically Available: System appears available most of the time
- Soft State: State may change over time without input
- Eventually Consistent: System will become consistent over time
MongoDB supports both models:
- Single document operations are ACID compliant
- Multi-document transactions (4.0+) provide ACID guarantees
- Tunable consistency through read/write concerns
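For instance, a single update that modifies several fields of one document either applies completely or not at all; a minimal mongosh sketch (the accounts collection and its fields are illustrative):
// Both modifications land atomically in the one document
db.accounts.updateOne(
  { _id: 1 },
  { $inc: { balance: -100 }, $push: { history: { amount: -100, at: new Date() } } }
)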
Store data in document format (JSON, BSON, XML).
Examples: MongoDB, CouchDB, RavenDB
Characteristics:
- Documents contain key-value pairs
- Documents can have nested structures
- Flexible schema within collection
- Rich query capabilities
Use Cases:
- Content management systems
- E-commerce product catalogs
- User profiles and preferences
- Mobile applications
Simplest NoSQL databases storing data as key-value pairs.
Examples: Redis, DynamoDB, Riak
Characteristics:
- Fast read/write operations
- Simple data model
- Limited query capabilities
- Highly scalable
Use Cases:
- Session management
- Caching layer
- Shopping carts
- Real-time recommendations
Store data in column families rather than rows.
Examples: Cassandra, HBase, ScyllaDB
Characteristics:
- Optimized for write-heavy workloads
- Efficient data compression
- Sparse data handling
- Linear scalability
Use Cases:
- Time-series data
- Event logging
- IoT sensor data
- Financial transactions
Designed to store and navigate relationships.
Examples: Neo4j, Amazon Neptune, ArangoDB
Characteristics:
- Nodes, edges, and properties
- Relationship traversal
- Pattern matching
- ACID transactions
Use Cases:
- Social networks
- Recommendation engines
- Fraud detection
- Knowledge graphs
MongoDB is a document-oriented NoSQL database that stores data in flexible, JSON-like documents called BSON (Binary JSON).
Key Features:
- Document-oriented storage
- Full index support
- Replication and high availability
- Horizontal scalability through sharding
- Rich query language
- Aggregation framework
- GridFS for large files
- Multi-document ACID transactions (4.0+)
Versions:
- MongoDB 4.0+: Multi-document transactions
- MongoDB 4.2+: Distributed transactions, field-level encryption
- MongoDB 5.0+: Time series collections, versioned API
- MongoDB 6.0+: Queryable encryption
- MongoDB 7.0+: Enhanced query performance
Components:
- mongod: The primary daemon process for the MongoDB server
  - Handles data requests
  - Manages data access
  - Performs background operations
- mongos: Query router for sharded clusters
  - Routes operations to appropriate shards
  - Merges results from shards
- mongo/mongosh: MongoDB shell for database interaction
  - Command-line interface
  - JavaScript environment
Storage Engine:
- WiredTiger (default since 3.2): Document-level concurrency control, compression
- In-Memory: For predictable latency
- Encrypted: For data-at-rest encryption
Hierarchy:
Database → Collections → Documents → Fields
Database:
- Container for collections
- Each database has separate files
- Multiple databases can exist on a single server
Collection:
- Group of MongoDB documents
- Equivalent to RDBMS table
- No fixed schema
- Documents in collection can have different fields
Document:
- Basic unit of data
- JSON-like structure (stored as BSON)
- Can contain nested documents and arrays
- Maximum size: 16MB
Example Document:
{
"_id": ObjectId("507f1f77bcf86cd799439011"),
"name": "John Doe",
"email": "john@example.com",
"age": 30,
"address": {
"street": "123 Main St",
"city": "New York",
"zipCode": "10001"
},
"hobbies": ["reading", "gaming", "coding"],
"createdAt": ISODate("2024-01-15T10:30:00Z")
}
BSON (Binary JSON):
- Binary-encoded serialization of JSON-like documents
- Extends JSON with additional data types
- Efficient for storage and traversal
- Supports embedded documents and arrays
Additional Data Types:
- ObjectId: Unique identifier (12 bytes)
- Date: 64-bit integer (milliseconds since Unix epoch)
- Binary Data: For storing binary data
- Regular Expression: For pattern matching
- Decimal128: High-precision decimal numbers
- Int32, Int64: Integer types
- MinKey, MaxKey: Comparison purposes
ObjectId Structure:
- 4 bytes: Timestamp
- 5 bytes: Random value
- 3 bytes: Incrementing counter
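Because the first 4 bytes encode a timestamp, the creation time can be read back from an _id in mongosh:
const id = ObjectId("507f1f77bcf86cd799439011")
id.getTimestamp()   // ISODate reconstructed from the leading 4 bytes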
Insert Single Document:
db.users.insertOne({
name: "Alice",
email: "alice@example.com",
age: 25
})
Insert Multiple Documents:
db.users.insertMany([
{ name: "Bob", email: "bob@example.com", age: 30 },
{ name: "Charlie", email: "charlie@example.com", age: 35 }
])
Options:
- ordered: If false, continues on error (default: true)
- writeConcern: Acknowledgment level
Find One Document:
db.users.findOne({ email: "alice@example.com" })
Find Multiple Documents:
db.users.find({ age: { $gte: 25 } })
Query Operators:
Comparison:
- $eq: Equal to
- $ne: Not equal to
- $gt: Greater than
- $gte: Greater than or equal
- $lt: Less than
- $lte: Less than or equal
- $in: Matches any value in array
- $nin: Matches none of the values in array
Logical:
- $and: Joins clauses with logical AND
- $or: Joins clauses with logical OR
- $not: Inverts effect of query
- $nor: Joins clauses with logical NOR
Element:
- $exists: Matches documents with field
- $type: Matches documents with field type
Array:
- $all: Matches arrays containing all elements
- $elemMatch: Matches documents with array element matching criteria
- $size: Matches arrays with specific length
Projection:
db.users.find(
{ age: { $gte: 25 } },
{ name: 1, email: 1, _id: 0 } // Include name and email, exclude _id
)
Cursor Methods:
db.users.find()
.sort({ age: -1 }) // Sort descending by age
.limit(10) // Limit to 10 documents
.skip(20) // Skip first 20 documents
Update One Document:
db.users.updateOne(
{ email: "alice@example.com" },
{ $set: { age: 26 } }
)
Update Multiple Documents:
db.users.updateMany(
{ age: { $lt: 18 } },
{ $set: { minor: true } }
)
Replace Document:
db.users.replaceOne(
{ email: "alice@example.com" },
{ name: "Alice Smith", email: "alice@example.com", age: 26 }
)
Update Operators:
Field Update:
- $set: Sets field value
- $unset: Removes field
- $rename: Renames field
- $inc: Increments field value
- $mul: Multiplies field value
- $min: Updates if less than current
- $max: Updates if greater than current
- $currentDate: Sets to current date
Array Update:
- $push: Adds element to array
- $pop: Removes first or last element
- $pull: Removes elements matching condition
- $addToSet: Adds element if not exists
- $each: Modifies $push and $addToSet
- $position: Specifies position for $push
- $: Positional operator for array elements
Upsert:
db.users.updateOne(
{ email: "newuser@example.com" },
{ $set: { name: "New User", age: 25 } },
{ upsert: true } // Creates document if not found
)
Delete One Document:
db.users.deleteOne({ email: "alice@example.com" })
Delete Multiple Documents:
db.users.deleteMany({ age: { $lt: 18 } })
Delete All Documents:
db.users.deleteMany({}) // Be careful!
Drop Collection:
db.users.drop()
1. Single Field Index:
db.users.createIndex({ email: 1 }) // Ascending
db.users.createIndex({ age: -1 }) // Descending
2. Compound Index:
db.users.createIndex({ lastName: 1, firstName: 1 })
3. Multikey Index:
// Automatically created for array fields
db.products.createIndex({ tags: 1 })
4. Text Index:
db.articles.createIndex({ content: "text", title: "text" })
// Search
db.articles.find({ $text: { $search: "mongodb tutorial" } })
5. Geospatial Index:
// 2dsphere for spherical geometry
db.locations.createIndex({ location: "2dsphere" })
// 2d for flat geometry
db.places.createIndex({ coordinates: "2d" })
6. Hashed Index:
db.users.createIndex({ userId: "hashed" }) // For sharding
7. Wildcard Index:
db.products.createIndex({ "attributes.$**": 1 }) // All fields under attributes
8. TTL Index:
db.sessions.createIndex(
{ createdAt: 1 },
{ expireAfterSeconds: 3600 } // Expire after 1 hour
)
Index Properties:
Unique:
db.users.createIndex({ email: 1 }, { unique: true })
Sparse:
db.users.createIndex(
{ phoneNumber: 1 },
{ sparse: true } // Only documents with phoneNumber
)
Partial:
db.orders.createIndex(
{ customerId: 1, orderDate: -1 },
{ partialFilterExpression: { status: "active" } }
)
Covered Queries: A query that can be satisfied entirely from an index, without examining documents.
db.users.createIndex({ email: 1, name: 1 })
// This query is covered
db.users.find(
{ email: "alice@example.com" },
{ email: 1, name: 1, _id: 0 }
)
Index Prefix: A compound index can support queries on its prefixes.
db.users.createIndex({ lastName: 1, firstName: 1, age: 1 })
// Supported queries:
// - { lastName: ... }
// - { lastName: ..., firstName: ... }
// - { lastName: ..., firstName: ..., age: ... }
// NOT supported efficiently:
// - { firstName: ... }
// - { age: ... }
Sort Order: Order matters for sort operations.
db.events.createIndex({ date: 1, priority: -1 })
// Efficient
db.events.find().sort({ date: 1, priority: -1 })
db.events.find().sort({ date: -1, priority: 1 })
// Inefficient (requires in-memory sort)
db.events.find().sort({ date: 1, priority: 1 })
Guidelines:
- ESR Rule (Equality, Sort, Range):
- Equality conditions first
- Sort conditions second
- Range conditions last
// Query: Find active users older than 25, sorted by lastName
// Good index:
db.users.createIndex({ status: 1, lastName: 1, age: 1 })
// Query pattern:
db.users.find({ status: "active", age: { $gt: 25 } }).sort({ lastName: 1 })-
Selectivity: Create indexes on fields with high cardinality
-
Index Size: Keep indexes in RAM for best performance
-
Too Many Indexes: Each index impacts write performance
-
Monitor: Use explain() to analyze query performance
-
Drop Unused Indexes: Regular maintenance
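A quick way to find candidates for removal is to compare index definitions against their usage counters; a minimal sketch (the index name is illustrative):
db.users.getIndexes()                         // list current index definitions
db.users.aggregate([{ $indexStats: {} }])     // per-index usage statistics
db.users.dropIndex("lastName_1_firstName_1")  // drop an index that is never used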
The aggregation pipeline processes documents through a sequence of stages.
Basic Structure:
db.collection.aggregate([
{ $stage1: { ... } },
{ $stage2: { ... } },
{ $stage3: { ... } }
])
$match: Filter documents
db.orders.aggregate([
{ $match: { status: "completed", total: { $gte: 100 } } }
])
$project: Reshape documents
db.orders.aggregate([
{ $project: {
orderId: 1,
totalAmount: "$total",
year: { $year: "$orderDate" },
_id: 0
}}
])
$group: Group by expression
db.orders.aggregate([
{ $group: {
_id: "$customerId",
totalOrders: { $sum: 1 },
totalAmount: { $sum: "$total" },
avgAmount: { $avg: "$total" }
}}
])
$sort: Sort documents
db.orders.aggregate([
{ $sort: { orderDate: -1 } }
])
$limit: Limit number of documents
db.orders.aggregate([
{ $limit: 10 }
])
$skip: Skip documents
db.orders.aggregate([
{ $skip: 20 }
])
$unwind: Deconstruct array field
db.orders.aggregate([
{ $unwind: "$items" }
])
// Input: { _id: 1, items: ["a", "b", "c"] }
// Output:
// { _id: 1, items: "a" }
// { _id: 1, items: "b" }
// { _id: 1, items: "c" }
$lookup: Left outer join
db.orders.aggregate([
{ $lookup: {
from: "customers",
localField: "customerId",
foreignField: "_id",
as: "customerInfo"
}}
])
$addFields: Add new fields
db.orders.aggregate([
{ $addFields: {
totalWithTax: { $multiply: ["$total", 1.08] }
}}
])
$replaceRoot: Replace document root
db.orders.aggregate([
{ $replaceRoot: { newRoot: "$customer" } }
])
$facet: Multiple pipelines
db.products.aggregate([
{ $facet: {
"categorizedByPrice": [
{ $bucket: {
groupBy: "$price",
boundaries: [0, 50, 100, 200],
default: "Other"
}}
],
"categorizedByTags": [
{ $unwind: "$tags" },
{ $sortByCount: "$tags" }
]
}}
])
Arithmetic:
$add, $subtract, $multiply, $divide, $mod, $abs, $ceil, $floor, $round, $pow, $sqrt, $exp, $log
String:
$concat, $substr, $toLower, $toUpper, $split, $trim, $ltrim, $rtrim, $strcasecmp, $strLenCP
Array:
$size, $arrayElemAt, $slice, $filter, $map, $reduce, $in, $concatArrays
Date:
$year, $month, $dayOfMonth, $hour, $minute, $dateToString, $dateToParts
Conditional:
$cond, $ifNull, $switch
Accumulator (in $group):
$sum, $avg, $min, $max, $first, $last, $push, $addToSet, $stdDevPop, $stdDevSamp
Example Pipeline:
db.sales.aggregate([
// Stage 1: Filter sales from 2024
{ $match: {
orderDate: {
$gte: ISODate("2024-01-01"),
$lt: ISODate("2025-01-01")
}
}},
// Stage 2: Unwind items array
{ $unwind: "$items" },
// Stage 3: Group by product and calculate metrics
{ $group: {
_id: "$items.product",
totalQuantity: { $sum: "$items.quantity" },
totalRevenue: { $sum: { $multiply: ["$items.quantity", "$items.price"] } },
avgPrice: { $avg: "$items.price" },
orderCount: { $sum: 1 }
}},
// Stage 4: Sort by revenue descending
{ $sort: { totalRevenue: -1 } },
// Stage 5: Limit to top 10
{ $limit: 10 },
// Stage 6: Format output
{ $project: {
product: "$_id",
totalQuantity: 1,
totalRevenue: { $round: ["$totalRevenue", 2] },
avgPrice: { $round: ["$avgPrice", 2] },
orderCount: 1,
_id: 0
}}
])
Optimization Tips:
- Place $match early: Reduce documents as soon as possible
- Place $project early: Reduce document size
- Use indexes: $match and $sort can use indexes if early in pipeline
- Avoid $lookup when possible: Can be expensive
- Use $limit: When you don't need all results
- allowDiskUse: Lets stages that exceed the 100MB per-stage memory limit spill to disk
db.collection.aggregate(pipeline, { allowDiskUse: true })
A replica set is a group of MongoDB instances that maintain the same data set, providing redundancy and high availability.
Components:
- Primary: Receives all write operations
- Secondary: Replicates primary's oplog and applies operations
- Arbiter: Participates in elections but doesn't hold data
Minimum Configuration:
- 1 Primary + 2 Secondaries
- 1 Primary + 1 Secondary + 1 Arbiter
Benefits:
- High availability (automatic failover)
- Data redundancy
- Read scaling (read from secondaries)
- Disaster recovery
- Zero-downtime maintenance
Oplog (Operations Log):
- Special capped collection: local.oplog.rs
- Records all write operations
- Secondaries tail and apply operations
- Size configurable (default with WiredTiger: 5% of free disk space)
Initial Sync:
- Clone all databases from primary
- Apply operations from oplog during clone
- Build indexes
- Pull and apply remaining operations
Steady State Replication:
- Secondary fetches oplog entries from primary
- Applies operations in batches
- Records the applied operations in its own local.oplog.rs
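In mongosh, the oplog window and replication lag can be inspected with the built-in helpers:
rs.printReplicationInfo()           // oplog size and the time window it covers
rs.printSecondaryReplicationInfo()  // how far each secondary lags behind the primary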
Election Process: Triggered when:
- Primary becomes unreachable
- Primary steps down
- New member added with higher priority
- Automatic maintenance
Election Protocol:
- Members vote for a new primary
- Requires majority (n/2 + 1) to elect
- Member with highest priority eligible
- Most up-to-date oplog wins
Factors Affecting Elections:
- Priority: Higher priority more likely (0-1000, default: 1)
- Votes: Each member has 0 or 1 vote
- Oplog: More recent oplog data preferred
- Network latency: Lower latency preferred
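Priorities are changed by editing the replica set configuration and reapplying it; a minimal mongosh sketch (the member index is illustrative):
// Favor the second member in future elections
const cfg = rs.conf()
cfg.members[1].priority = 2
rs.reconfig(cfg)   // may trigger an election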
Priority 0 Members:
// Cannot become primary
{
_id: 2,
host: "mongodb2.example.com:27017",
priority: 0
}
Hidden Members:
// Hidden from application, priority 0, used for backups
{
_id: 3,
host: "mongodb3.example.com:27017",
hidden: true,
priority: 0
}
Delayed Members:
// Maintains delayed copy, useful for recovery
{
_id: 4,
host: "mongodb4.example.com:27017",
priority: 0,
hidden: true,
secondaryDelaySecs: 3600 // 1 hour delay (called slaveDelay before MongoDB 5.0)
}
Controls where reads are directed in a replica set.
Modes:
- primary (default): All reads from primary
  - Strong consistency
  - No stale reads
  - Single point of read load
- primaryPreferred: Primary if available, otherwise secondary
  - Falls back to a secondary when the primary is unavailable
- secondary: All reads from secondaries
  - Distributes read load
  - May return stale data
  - Good for analytics
- secondaryPreferred: Secondary if available, otherwise primary
  - Reduces primary load
- nearest: Reads from lowest-latency member
  - Best for geographically distributed apps
Tag Sets:
db.collection.find().readPref("secondary", [
{ "datacenter": "west" },
{ "datacenter": "east" }
])
Write Concern:
db.collection.insertOne(
{ name: "Alice" },
{ writeConcern: { w: "majority", j: true, wtimeout: 5000 } }
)
- w: Number of members to acknowledge (1, 2, "majority")
- j: Wait for journal (true/false)
- wtimeout: Time limit in milliseconds
Read Concern:
db.collection.find().readConcern("majority")
- local: Default, returns most recent data
- available: No guarantee on data durability
- majority: Returns data acknowledged by majority
- linearizable: Strongest consistency
- snapshot: For transactions
Sharding is MongoDB's approach to horizontal scaling by distributing data across multiple machines.
Components:
- Shard: Holds subset of data (replica set)
- Config Servers: Store cluster metadata (replica set)
- Query Router (mongos): Routes operations to shards
Benefits:
- Horizontal scaling
- Increased storage capacity
- Higher throughput
- Geographic distribution
When to Shard:
- Data exceeds single server capacity
- Working set exceeds RAM
- Write throughput exceeds single server
- Need geographic distribution
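Connected to a mongos, sharding is enabled for a database and then for each collection on a chosen shard key; a minimal sketch with illustrative names:
sh.enableSharding("mydb")                                    // allow collections in mydb to be sharded
sh.shardCollection("mydb.orders", { customerId: "hashed" })  // distribute documents by hashed customerId
sh.status()                                                  // review shards, chunks, and the balancer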
The shard key determines how data is distributed across shards.
Characteristics of Good Shard Key:
- High Cardinality: Many distinct values
- Even Distribution: Data spread evenly
- Query Isolation: Queries target single shard
- Non-Monotonic: Avoid monotonically increasing keys, which funnel all inserts to one shard (hotspots)
Shard Key Strategies:
1. Hashed Shard Key:
sh.shardCollection("mydb.users", { userId: "hashed" })- Pros: Even distribution, no hotspots
- Cons: Range queries scatter across all shards
2. Range-Based Shard Key:
sh.shardCollection("mydb.orders", { customerId: 1, orderDate: 1 })- Pros: Range queries efficient, geographically targetable
- Cons: Potential hotspots with poor key choice
3. Compound Shard Key:
sh.shardCollection("mydb.events", { userId: 1, timestamp: 1 })- Balances distribution and query performance
Anti-Patterns:
- Monotonically increasing values (timestamps, ObjectIds)
- Low cardinality fields (boolean, status)
- Frequently updated fields
Chunk Distribution:
- Data divided into chunks (default chunk size: 64MB; 128MB since MongoDB 6.0)
- Chunks distributed across shards
- Splits occur when chunk exceeds size
- Balancer migrates chunks between shards
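How chunks ended up spread across shards can be checked from mongosh (collection name illustrative):
db.orders.getShardDistribution()  // documents, data size, and chunks per shard
sh.status()                       // cluster-wide chunk and balancer overview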
Zone Sharding: Associate tag with shard range for data locality.
// Add tags to shards
sh.addShardTag("shard0000", "US")
sh.addShardTag("shard0001", "EU")
// Associate ranges with tags
sh.addTagRange(
"mydb.users",
{ country: "US", userId: MinKey },
{ country: "US", userId: MaxKey },
"US"
)
Balancer: Background process that distributes chunks evenly.
Balancing Window:
db.settings.update(
{ _id: "balancer" },
{ $set: { activeWindow: { start: "23:00", stop: "06:00" } } },
{ upsert: true }
)
Disable Balancer:
sh.stopBalancer()
sh.setBalancerState(false)
Check Balancer Status:
sh.getBalancerState()
sh.isBalancerRunning()
Embedding (Denormalization): Store related data in a single document.
{
_id: ObjectId("..."),
name: "John Doe",
addresses: [
{ street: "123 Main St", city: "NYC", type: "home" },
{ street: "456 Work Ave", city: "NYC", type: "work" }
]
}
When to Embed:
- One-to-one relationships
- One-to-few relationships
- Data accessed together
- Limited data growth
- Strong data locality needed
Pros:
- Better read performance
- Single query retrieves all data
- Atomic updates
Cons:
- Document size limits (16MB)
- Data duplication
- Update complexity
Referencing (Normalization): Store related data in separate collections.
// Users collection
{
_id: ObjectId("507f1f77bcf86cd799439011"),
name: "John Doe"
}
// Addresses collection
{
_id: ObjectId("507f191e810c19729de860ea"),
userId: ObjectId("507f1f77bcf86cd799439011"),
street: "123 Main St",
city: "NYC"
}
When to Reference:
- One-to-many relationships
- Many-to-many relationships
- Data accessed independently
- Large subdocuments
- Unbounded data growth
Pros:
- No duplication
- Smaller documents
- Flexibility
Cons:
- Multiple queries or $lookup needed
- No atomicity across documents
1. Attribute Pattern: For documents with many similar fields.
// Instead of:
{
productId: 123,
color_red: true,
color_blue: false,
size_small: true,
size_large: false
}
// Use:
{
productId: 123,
attributes: [
{ key: "color", value: "red" },
{ key: "size", value: "small" }
]
}
// Index: db.products.createIndex({ "attributes.key": 1, "attributes.value": 1 })
2. Bucket Pattern: Group time-series data into buckets.
{
sensorId: "sensor1",
timestamp: ISODate("2024-01-01T00:00:00Z"),
measurements: [
{ time: 0, temp: 20.5 },
{ time: 60, temp: 20.7 },
{ time: 120, temp: 20.9 }
],
count: 3,
sum: 62.1,
min: 20.5,
max: 20.9
}
3. Subset Pattern: Store frequently accessed subset of data.
{
_id: ObjectId("..."),
title: "Movie Title",
director: "Director Name",
recentReviews: [ // Last 10 reviews
{ user: "Alice", rating: 5, date: "..." },
{ user: "Bob", rating: 4, date: "..." }
],
reviewCount: 1523,
avgRating: 4.2
}
// Full reviews in separate collection
4. Extended Reference Pattern: Store frequently accessed fields from referenced document.
{
_id: ObjectId("..."),
orderId: "ORD-001",
customerId: ObjectId("..."),
customerName: "John Doe", // Duplicated from customer
customerEmail: "john@example.com", // Duplicated
items: [...],
total: 250.00
}
5. Outlier Pattern: Handle documents with disproportionate size.
// Normal document
{
_id: ObjectId("..."),
productId: "P001",
reviews: [
{ user: "Alice", rating: 5 },
{ user: "Bob", rating: 4 }
],
hasOverflow: false
}
// Outlier document with overflow
{
_id: ObjectId("..."),
productId: "P002",
reviews: [/* first 100 reviews */],
hasOverflow: true
}
// Overflow document
{
_id: ObjectId("..."),
productId: "P002",
reviewsBatch: 2,
reviews: [/* reviews 101-200 */]
}
6. Schema Versioning Pattern: Handle schema evolution.
{
_id: ObjectId("..."),
schemaVersion: 2,
name: "Product",
// v2 fields
newField: "value"
}
Schema Design Anti-Patterns:
1. Massive Arrays: Unbounded array growth exceeds 16MB limit.
// Bad: Unbounded array
{
userId: 123,
posts: [/* thousands of posts */]
}
// Good: Reference or bucket pattern
2. Massive Number of Collections: Thousands of collections impact performance.
3. Bloated Documents: Storing unnecessary data in documents.
4. Unnecessary Indexes: Every index impacts write performance.
5. Case-Insensitive Queries Without Index:
// Bad: Slow regex
db.users.find({ email: /^alice@/i })
// Good: Store lowercase version
{
email: "Alice@Example.com",
emailLower: "alice@example.com"
}
db.users.createIndex({ emailLower: 1 })
6. Separating Data That's Accessed Together:
// Bad: Multiple queries
db.users.findOne({ _id: userId })
db.addresses.find({ userId: userId })
// Good: Embed if accessed together
{
_id: userId,
name: "...",
addresses: [...]
}
Guidelines:
- Design for Access Patterns: Model based on how data is queried
- Embed for Atomicity: Embed related data needing atomic updates
- Reference for Flexibility: Reference when data accessed independently
- Consider Document Growth: Avoid unbounded arrays
- Optimize for Reads or Writes: Balance based on workload
- Denormalize Strategically: Duplicate frequently accessed data
- Use Indexes Wisely: Index fields used in queries
- Monitor Document Size: Stay well below 16MB limit
- Plan for Evolution: Design for schema changes
Decision Framework:
- Identify entities and relationships
- Determine access patterns
- Decide embedding vs referencing
- Consider data lifecycle
- Evaluate performance requirements
- Plan for scale
- Validate with prototypes
MongoDB supports ACID transactions at multiple levels:
Single Document:
- Atomic by default
- All or nothing modifications
- Isolated from other operations
Multi-Document Transactions (4.0+):
- ACID guarantees across multiple documents
- Across multiple collections
- Distributed transactions across shards (4.2+)
Basic Transaction:
const session = db.getMongo().startSession()
session.startTransaction()
try {
const accounts = session.getDatabase("bank").accounts
accounts.updateOne(
{ account: "A" },
{ $inc: { balance: -100 } },
{ session }
)
accounts.updateOne(
{ account: "B" },
{ $inc: { balance: 100 } },
{ session }
)
session.commitTransaction()
} catch (error) {
session.abortTransaction()
throw error
} finally {
session.endSession()
}
With Retry Logic:
async function runTransactionWithRetry(txnFunc, session) {
while (true) {
try {
return await txnFunc(session)
} catch (error) {
if (error.hasErrorLabel("TransientTransactionError")) {
console.log("TransientTransactionError, retrying...")
continue
}
throw error
}
}
}
async function commitWithRetry(session) {
while (true) {
try {
await session.commitTransaction()
break
} catch (error) {
if (error.hasErrorLabel("UnknownTransactionCommitResult")) {
console.log("UnknownTransactionCommitResult, retrying...")
continue
}
throw error
}
}
}
Transaction Options:
session.startTransaction({
readConcern: { level: "snapshot" },
writeConcern: { w: "majority" },
readPreference: "primary"
})
Restrictions:
- Time Limit: Default 60 seconds (configurable)
- Oplog Size: Transaction size limited by oplog
- 16MB Document Limit: Still applies to individual documents
- DDL Operations: Most DDL is not allowed; creating collections and indexes inside a transaction is supported only in 4.4+
- No Mixed Operations: Can't mix sharded and unsharded collections
Operations Not Allowed:
- Creating indexes on existing collections
- Creating collections (before 4.4)
- Dropping collections
- Listing collections
- Listing indexes
Best Practices:
- Keep Transactions Short: Minimize transaction duration
- Use Appropriate Concerns: Balance consistency and performance
- Handle Errors: Implement retry logic
- Consider Alternatives: Use single document atomicity when possible
- Monitor Performance: Track transaction metrics
- Limit Scope: Include only necessary operations
When to Use Transactions:
- Financial transactions
- Multi-document consistency required
- Complex business logic requiring atomicity
When NOT to Use Transactions:
- Single document operations (already atomic)
- Read-heavy workloads
- High-throughput scenarios
- Simple operations
Use Explain:
db.users.find({ age: { $gt: 25 } }).explain("executionStats")
Execution Stages:
- COLLSCAN: Collection scan (no index)
- IXSCAN: Index scan
- FETCH: Retrieve documents
- SORT: In-memory sort
- LIMIT: Limit results
Key Metrics:
- executionTimeMillis: Query execution time
- totalDocsExamined: Documents scanned
- totalKeysExamined: Index keys scanned
- nReturned: Documents returned
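These metrics can also be pulled out of the explain output programmatically; a small sketch:
const stats = db.users.find({ age: { $gt: 25 } }).explain("executionStats").executionStats
printjson({
  timeMs: stats.executionTimeMillis,
  docsExamined: stats.totalDocsExamined,  // much larger than nReturned hints at a missing index
  keysExamined: stats.totalKeysExamined,
  returned: stats.nReturned
})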
Optimization Strategies:
- Use Covered Queries:
db.users.createIndex({ email: 1, name: 1 })
// Covered query
db.users.find(
{ email: "alice@example.com" },
{ email: 1, name: 1, _id: 0 }
)
- Project Only Needed Fields:
// Bad
db.users.find({ age: { $gt: 25 } })
// Good
db.users.find(
{ age: { $gt: 25 } },
{ name: 1, email: 1 }
)
- Use Limit with Sort:
db.users.find().sort({ age: -1 }).limit(10)
- Avoid Negation:
// Bad: Cannot use index efficiently
db.users.find({ status: { $ne: "inactive" } })
// Good: Use positive conditions
db.users.find({ status: { $in: ["active", "pending"] } })
- Use $in Instead of $or for Same Field:
// Bad
db.users.find({ $or: [
{ status: "active" },
{ status: "pending" }
]})
// Good
db.users.find({ status: { $in: ["active", "pending"] } })
Enable Profiling:
// 0 = off, 1 = slow operations (> slowms), 2 = all operations
db.setProfilingLevel(1, { slowms: 100 })
Query Profiler Output:
db.system.profile.find().sort({ ts: -1 }).limit(5).pretty()
Disable Profiling:
db.setProfilingLevel(0)
Analyze Slow Queries:
db.system.profile.find({
millis: { $gt: 100 }
}).sort({ millis: -1 })
Database Statistics:
db.stats()
db.collection.stats()
Server Status:
db.serverStatus()
Current Operations:
db.currentOp()
Kill Operation:
db.killOp(opId)
Connection Statistics:
db.serverStatus().connections
Replica Set Status:
rs.status()
Sharding Status:
sh.status()
Monitoring Tools:
- MongoDB Atlas (cloud monitoring)
- MongoDB Compass (GUI)
- MongoDB Ops Manager
- Prometheus + Grafana
- Third-party APM tools
WiredTiger Cache:
- Default: the larger of 50% of (RAM - 1GB) or 256MB
- Configurable via --wiredTigerCacheSizeGB
Working Set:
- Frequently accessed data
- Should fit in RAM for optimal performance
Application-Level Caching:
- Redis/Memcached: Cache frequently accessed queries
- Read from Secondaries: Distribute read load
- Materialized Views: Pre-aggregate data
- Connection Pooling: Reuse connections
Query Result Caching:
// Application-level cache
const cache = new Map()
async function getUser(userId) {
const cacheKey = `user:${userId}`
if (cache.has(cacheKey)) {
return cache.get(cacheKey)
}
const user = await db.users.findOne({ _id: userId })
cache.set(cacheKey, user)
return user
}
Cache Invalidation:
- Time-based expiration (TTL)
- Event-based invalidation
- Write-through cache
- Cache-aside pattern
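Time-based expiration can be layered onto the Map cache above by storing an expiry with each entry; a minimal sketch reusing the hypothetical users collection:
const cache = new Map()
const TTL_MS = 60 * 1000  // entries are considered stale after one minute

async function getUserCached(userId) {
  const key = `user:${userId}`
  const hit = cache.get(key)
  if (hit && hit.expiresAt > Date.now()) return hit.value  // fresh hit
  const user = await db.users.findOne({ _id: userId })     // fall through to the database
  cache.set(key, { value: user, expiresAt: Date.now() + TTL_MS })
  return user
}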
Authentication Methods:
- SCRAM (Salted Challenge Response Authentication Mechanism):
  - Default authentication mechanism
  - SCRAM-SHA-1, SCRAM-SHA-256
- x.509 Certificate Authentication:
  - Uses SSL/TLS certificates
- LDAP (Enterprise):
  - External authentication
- Kerberos (Enterprise):
  - External authentication
Create User:
db.createUser({
user: "appUser",
pwd: "securePassword",
roles: [
{ role: "readWrite", db: "mydb" },
{ role: "read", db: "reporting" }
]
})
Enable Authentication:
# In mongod.conf
security:
  authorization: enabled
Connect with Authentication:
mongosh mongodb://username:password@localhost:27017/admin
Built-in Roles:
Database User Roles:
- read: Read data from all non-system collections
- readWrite: Read and modify data
Database Admin Roles:
- dbAdmin: Schema and indexing operations
- dbOwner: Any action on the database
- userAdmin: Create and modify users
Cluster Admin Roles:
- clusterAdmin: Greatest cluster admin access
- clusterManager: Cluster management
- clusterMonitor: Monitoring tools
- hostManager: Monitor and manage servers
Backup/Restore Roles:
- backup: Backup data
- restore: Restore data
All-Database Roles:
- readAnyDatabase, readWriteAnyDatabase, userAdminAnyDatabase, dbAdminAnyDatabase
Superuser Roles:
- root: Full access to all resources
Custom Roles:
db.createRole({
role: "customRole",
privileges: [
{
resource: { db: "mydb", collection: "users" },
actions: ["find", "insert", "update"]
}
],
roles: []
})
Encryption at Rest:
- Storage Engine Encryption (Enterprise):
  - Encrypts data files
  - Key management via KMIP
- File System Encryption:
  - OS-level encryption (e.g., LUKS, BitLocker)
- Client-Side Field Level Encryption:
  - Application encrypts sensitive fields
  - Available in MongoDB 4.2+
Encryption in Transit:
Enable TLS/SSL:
# mongod.conf
net:
tls:
mode: requireTLS
certificateKeyFile: /path/to/mongodb.pem
CAFile: /path/to/ca.pem
Client-Side Field Level Encryption:
const { ClientEncryption } = require('mongodb-client-encryption')
const encryption = new ClientEncryption(client, {
keyVaultNamespace: 'encryption.__keyVault',
kmsProviders: {
local: {
key: Buffer.from(masterKey, 'base64')
}
}
})
// Encrypt field
const encryptedField = await encryption.encrypt(
'sensitive data',
{
algorithm: 'AEAD_AES_256_CBC_HMAC_SHA_512-Deterministic',
keyId: dataKeyId
}
)
Enable Auditing (Enterprise):
# mongod.conf
auditLog:
destination: file
format: JSON
path: /var/log/mongodb/audit.json
filter: '{ atype: { $in: ["authenticate", "createUser", "dropUser"] } }'
Audit Events:
- Authentication attempts
- User management operations
- Role management operations
- Database operations
- Collection operations
Filter Example:
// Audit failed authentication
{
atype: "authenticate",
"param.result": { $ne: 0 }
}
Security Best Practices:
- Enable Authentication: Always require authentication
- Use Role-Based Access: Principle of least privilege
- Enable TLS/SSL: Encrypt data in transit
- Regular Updates: Keep MongoDB updated
- Network Security: Use firewalls, VPNs
- Audit Logging: Monitor security events
- Strong Passwords: Enforce password policies
- Backup Encryption: Encrypt backups
- Limit Network Exposure: Bind to specific IPs
- Monitor Access: Track user activity
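For the last two points, exposure is usually limited in mongod.conf by binding only to known interfaces; an illustrative snippet (addresses are placeholders):
# mongod.conf
net:
  port: 27017
  bindIp: 127.0.0.1,10.0.0.12   # localhost plus one private interface only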
Q: What is MongoDB and why use it?
A: MongoDB is a document-oriented NoSQL database that stores data in flexible, JSON-like documents. Key advantages include:
- Flexible schema for evolving applications
- Horizontal scalability through sharding
- Rich query language and aggregation framework
- High availability through replica sets
- Better performance for certain use cases
- Native support for hierarchical data structures
Q: Explain the difference between SQL and NoSQL databases.
A:
- Data Model: SQL uses structured tables with fixed schema; NoSQL offers flexible schemas (documents, key-value, graphs, columns)
- Scalability: SQL typically scales vertically; NoSQL scales horizontally
- ACID vs BASE: SQL guarantees ACID; NoSQL often uses BASE
- Relationships: SQL uses joins; NoSQL uses embedding or references
- Schema: SQL requires predefined schema; NoSQL is schema-less or flexible
- Use Cases: SQL for complex transactions; NoSQL for big data, real-time applications
Q: What is the CAP theorem?
A: CAP theorem states distributed systems can guarantee only two of three properties: Consistency (all nodes see same data), Availability (every request gets response), and Partition Tolerance (system works despite network failures). MongoDB is a CP system, prioritizing consistency and partition tolerance, though it offers tunable consistency.
Q: Explain replica sets in MongoDB.
A: Replica sets provide redundancy and high availability through data replication across multiple servers. They consist of a primary node (receives writes), secondary nodes (replicate data), and optionally arbiters (voting only). Automatic failover occurs if primary becomes unavailable, with secondaries electing a new primary. Benefits include data redundancy, disaster recovery, read scaling, and zero-downtime maintenance.
Q: What is sharding and when to use it?
A: Sharding is horizontal partitioning that distributes data across multiple servers (shards). Use sharding when: data exceeds single server capacity, working set exceeds RAM, write throughput is too high for one server, or geographic distribution is needed. Components include shards (hold data), config servers (metadata), and mongos routers (query routing).
Q: Explain indexing in MongoDB.
A: Indexes improve query performance by creating data structures that allow efficient lookups. MongoDB supports various index types: single field, compound, multikey (arrays), text, geospatial, hashed, and wildcard. Indexes speed up queries but impact write performance and consume storage. Best practices include indexing frequently queried fields, using compound indexes for multiple conditions, and following the ESR rule (Equality, Sort, Range).
Q: What is the aggregation framework?
A: The aggregation framework processes documents through a pipeline of stages to transform and analyze data. Common stages include $match (filter), $group (aggregate), $project (reshape), $sort, $lookup (join), and $unwind (deconstruct arrays). It's more powerful than simple queries for complex data analysis, reporting, and data transformation tasks.
Q: Explain embedding vs referencing.
A: Embedding stores related data within a single document, providing better read performance and atomic updates but limited by 16MB document size. Referencing stores related data in separate collections, offering flexibility and avoiding duplication but requiring multiple queries or $lookup. Choose embedding for one-to-few relationships accessed together; choose referencing for one-to-many or many-to-many relationships with independent access patterns.
Q: Does MongoDB support transactions?
A: Yes, MongoDB supports ACID transactions. Single document operations are atomic by default. Multi-document transactions (4.0+) provide ACID guarantees across multiple documents and collections, with distributed transaction support across shards (4.2+). Transactions should be kept short and used when atomicity across multiple documents is required, such as financial transactions or complex business logic.
Q: How does MongoDB ensure high availability?
A: MongoDB ensures high availability through replica sets with automatic failover. When a primary node fails, secondaries automatically elect a new primary within seconds. Write concerns and read preferences allow tuning consistency and availability trade-offs. Additional features include rolling upgrades, backup and restore capabilities, and monitoring tools for proactive issue detection.
Q: Write a query to find all users older than 25 and sort by name.
db.users.find({ age: { $gt: 25 } }).sort({ name: 1 })
Q: How do you create a unique index on email field?
db.users.createIndex({ email: 1 }, { unique: true })
Q: Write an aggregation pipeline to count users by country.
db.users.aggregate([
{ $group: {
_id: "$country",
count: { $sum: 1 }
}},
{ $sort: { count: -1 } }
])
Q: How do you update multiple documents?
db.users.updateMany(
{ status: "inactive" },
{ $set: { archived: true } }
)
Q: Write a query to find users with at least one hobby in a list.
db.users.find({ hobbies: { $in: ["reading", "gaming"] } })
Q: How do you create a compound index?
db.orders.createIndex({ customerId: 1, orderDate: -1 })
Q: Write an aggregation to calculate average order value per customer.
db.orders.aggregate([
{ $group: {
_id: "$customerId",
avgOrderValue: { $avg: "$total" },
orderCount: { $sum: 1 }
}},
{ $sort: { avgOrderValue: -1 } }
])
Q: How do you perform a text search?
// Create text index
db.articles.createIndex({ title: "text", content: "text" })
// Search
db.articles.find({ $text: { $search: "mongodb tutorial" } })
Q: Write a query using $lookup to join collections.
db.orders.aggregate([
{ $lookup: {
from: "customers",
localField: "customerId",
foreignField: "_id",
as: "customer"
}},
{ $unwind: "$customer" }
])
Q: How do you implement pagination?
const page = 2
const pageSize = 10
db.users.find()
.sort({ createdAt: -1 })
.skip((page - 1) * pageSize)
.limit(pageSize)
Q: How would you design a schema for a blog platform?
A: Design considerations:
- Users Collection: Store user profiles with embedded preferences
- Posts Collection: Store posts with embedded comments (if limited) or referenced comments (if unbounded)
- Comments Collection: Separate if posts have many comments
- Tags: Array field in posts for many-to-many relationship
- Indexes: On authorId, tags, publishDate, slug
// Users
{
_id: ObjectId,
username: String,
email: String,
profile: { bio, avatar }
}
// Posts
{
_id: ObjectId,
title: String,
content: String,
authorId: ObjectId,
tags: [String],
publishDate: Date,
comments: [ // Embed first 10
{ author, text, date }
],
commentCount: Number
}
// Comments (for overflow)
{
_id: ObjectId,
postId: ObjectId,
author: String,
text: String,
date: Date
}
Q: Your application has slow queries. How do you diagnose and fix?
A: Diagnostic steps:
- Enable profiling: db.setProfilingLevel(1, { slowms: 100 })
- Analyze slow queries: db.system.profile.find()
- Use explain(): Check execution plan
- Check indexes: Ensure proper indexes exist
- Monitor server metrics: CPU, memory, disk I/O
Solutions:
- Create appropriate indexes
- Use covered queries
- Limit returned fields with projection
- Implement caching
- Consider sharding for large datasets
- Optimize query patterns
- Use aggregation pipeline efficiently
Q: How would you migrate from SQL to MongoDB?
A: Migration approach:
- Analyze Schema: Identify entities and relationships
- Design MongoDB Schema: Decide embedding vs referencing
- Create Indexes: Based on query patterns
- Data Migration:
- ETL tools or custom scripts
- Incremental migration if possible
- Validate data integrity
- Application Changes: Update queries and transactions
- Testing: Comprehensive testing of all functionality
- Gradual Rollout: Parallel run if possible
- Monitor: Watch performance and errors
Q: How do you handle a growing dataset that exceeds single server capacity?
A: Scaling strategy:
- Vertical Scaling: Upgrade server resources (temporary solution)
- Indexing: Optimize queries to reduce working set
- Archive Old Data: Move historical data to separate collection
- Sharding: Horizontal scaling
- Choose appropriate shard key
- Plan shard key early (hard to change)
- Consider hashed vs range-based sharding
- Implement gradually
- Read Replicas: Distribute read load
- Caching: Reduce database load
- Data Lifecycle: Implement TTL indexes for ephemeral data
Q: Your replica set primary keeps stepping down. How do you troubleshoot?
A: Investigation steps:
- Check Logs: Look for election triggers
- Network Issues: Check connectivity between nodes
- Resource Constraints: CPU, memory, disk I/O
- Replica Lag: Check if secondaries are far behind
- Priority Settings: Verify member priorities
- Write Concerns: Check if too strict
- Oplog Size: Ensure adequate oplog
Solutions:
- Fix network issues or latency
- Increase server resources
- Adjust replica set configuration
- Optimize queries reducing load
- Check application write patterns
- Review monitoring and alerts
Q: How would you implement real-time analytics on operational data?
A: Architecture options:
- Read from Secondaries:
  - Configure secondary with delayed replication
  - Direct analytics queries to secondary
  - Reduces primary load
- Change Streams:
  - Real-time data streaming (see the watch() sketch after this list)
  - Trigger analytics on data changes
  - Push to analytics platform
- Separate Analytics Database:
  - ETL pipeline from operational to analytics DB
  - Optimized schema for analytics
  - No impact on operational performance
- Time Series Collections (5.0+):
  - Optimized for time-series data
  - Efficient storage and queries
- Materialized Views:
  - Pre-aggregated data
  - Periodic updates via aggregation
  - Fast query performance
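A change stream can be opened straight from mongosh or any driver; a minimal sketch (collection name and filter are illustrative):
// Follow inserts on orders and hand each new document to the analytics side
const watchCursor = db.orders.watch([{ $match: { operationType: "insert" } }])
while (!watchCursor.isClosed()) {
  const change = watchCursor.tryNext()
  if (change) printjson(change.fullDocument)  // replace with a push to the analytics pipeline
}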
Q: How do you ensure data consistency across microservices?
A: Strategies:
- Saga Pattern:
  - Sequence of local transactions
  - Compensating transactions for rollback
  - Event-driven coordination
- Event Sourcing:
  - Store events, not state
  - Replay events to rebuild state
  - MongoDB as event store
- Two-Phase Commit (limited use):
  - Application-level coordination
  - Fallback mechanisms
- Eventual Consistency:
  - Accept temporary inconsistency
  - Design for idempotency
  - Conflict resolution strategies
- Shared Database (anti-pattern):
  - Single database for related services
  - Tight coupling (avoid if possible)
MongoDB is a powerful NoSQL database offering flexibility, scalability, and rich features. Key concepts for interviews include:
- Fundamentals: Document model, BSON, replica sets, sharding
- Operations: CRUD, indexing, aggregation framework
- Design: Schema design patterns, embedding vs referencing
- Scalability: Horizontal scaling via sharding, replica sets for HA
- Performance: Query optimization, indexing strategies, caching
- Advanced: Transactions, security, monitoring
Interview Tips:
- Understand core concepts thoroughly
- Practice hands-on with MongoDB shell
- Know when to use MongoDB vs RDBMS
- Explain trade-offs in design decisions
- Demonstrate problem-solving approach
- Discuss real-world scenarios
- Stay updated on latest features
- Emphasize best practices
Resources for Further Learning:
- MongoDB University (free courses)
- Official MongoDB documentation
- MongoDB blog and community forums
- Practice on MongoDB Atlas (free tier)
- Build projects to apply concepts
Good luck with your interviews!