MongoDB Relations: Embedding vs Referencing (Scriptum)
This scriptum explains how to model common relationships in MongoDB—1:1, 1:n, and m:n—using either:
- Embedded documents (store related data inside the same document)
- References (store IDs and keep related data in separate collections)
It also shows how schema choices influence querying, performance, and when mixing approaches is a good idea.
0) Mental model: what “relations” mean in MongoDB
MongoDB is a document database. You can represent relations, but you choose how:
- Embedding = like “pre-joining” data into one document
- Referencing = like “foreign keys” (but MongoDB doesn’t enforce them automatically)
MongoDB’s key strength is reading one document quickly. Schema design often aims to answer:
“What does my application most often need to read together?”
1) Embedding vs Referencing: overview
Embedded documents
Idea: store related objects inside a parent document.
Pros
- Usually fast reads for “fetch parent with its children”
- Atomic updates within one document (a write to a single document is atomic, no transaction needed)
- Fewer round-trips / fewer $lookups
Cons
- Documents can grow large (MongoDB's limit is 16 MB per document)
- Updating duplicated embedded data is harder, because every copy must be changed (fan-out updates)
- Children can be awkward to query or update on their own, since access goes through the parent document
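To get a feel for the size limit, the following sketch estimates how an embedded array inflates a document. It is only an approximation: it uses the byte length of the JSON text as a stand-in for the BSON size (an assumption, real BSON sizes differ), it runs in Node.js rather than against a database, and the helper names and the sample comment shape are made up for illustration.

```javascript
// Rough size estimate for a document with an embedded array.
// Assumption: JSON byte length stands in for BSON size; the real size differs somewhat.
const MAX_DOC_BYTES = 16 * 1024 * 1024; // MongoDB's per-document limit

function approxDocBytes(doc) {
  return Buffer.byteLength(JSON.stringify(doc), "utf8");
}

// Hypothetical post with n embedded comments of 200 characters each.
function postWithComments(n) {
  const comments = [];
  for (let i = 0; i < n; i++) {
    comments.push({ author: "user" + i, text: "x".repeat(200) });
  }
  return { title: "MongoDB relations", comments };
}

const small = approxDocBytes(postWithComments(10));
const large = approxDocBytes(postWithComments(100000));
console.log(small < MAX_DOC_BYTES); // true
console.log(large > MAX_DOC_BYTES); // true
```

Ten comments are harmless, while a hundred thousand of this size would no longer fit into one document.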
References (IDs)
Idea: store related data in other collections and refer by _id.
Pros
- Avoid duplication; single source of truth
- Children can be queried independently and indexed independently
- Works well for large or unbounded sets of children (no unbounded arrays inside the parent)
Cons
- Reads can require multiple queries or an aggregation with $lookup
- Harder to guarantee consistency without transactions / application logic
- $lookup can be expensive if not indexed well
2) Modeling 1:1 relationships
Option A: Embed (common when “always together”)
Example: users with an embedded profile.
// users
{
  _id: ObjectId("..."),
  username: "alina",
  email: "alina@example.com",
  profile: {
    fullName: "Alina Berger",
    birthday: ISODate("2008-03-14"),
    address: { city: "Vienna", zip: "1010" }
  }
}
When it fits
- Profile is small
- You almost always load user + profile together
Query
db.users.find({ username: "alina" }, { username: 1, profile: 1 })
Option B: Reference (when the profile is optional, large, or managed separately)
// users ("U1" and "P1" are placeholders; real ObjectIds are 24-character hex strings)
{ _id: ObjectId("U1"), username: "alina", profileId: ObjectId("P1") }
// profiles
{ _id: ObjectId("P1"), fullName: "Alina Berger", address: { city: "Vienna" } }
Query with multiple queries
const user = db.users.findOne({ username: "alina" });
const profile = db.profiles.findOne({ _id: user.profileId });
Query with a single aggregation using $lookup
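db.users.aggregate([
  { $match: { username: "alina" } },
  { $lookup: {
      from: "profiles",
      localField: "profileId",
      foreignField: "_id",
      as: "profile"
  }},
  // $lookup returns an array; unwind it, but keep users that have no profile
  { $unwind: { path: "$profile", preserveNullAndEmptyArrays: true } }
])
Performance note: this $lookup is cheap because it matches on profiles._id, which MongoDB always indexes.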
3) Modeling 1:n relationships
Typical case: “one parent has many children”.
Option A: Embed an array (great for bounded sets)
Example: blog post with a small number of tags.
// posts
{
  _id: ObjectId("..."),
  title: "MongoDB relations",
  tags: ["mongodb", "schema", "nosql"]
}
Query
db.posts.find({ tags: "mongodb" })
Option B: Embed child documents (good for bounded “child” collections)
Example: order with line items.
// orders
{
  _id: ObjectId("O1"),
  customerId: ObjectId("C1"),
  createdAt: ISODate("2026-03-01"),
  items: [
    { productId: ObjectId("P1"), nameSnapshot: "SSD 1TB", priceSnapshot: 109, qty: 1 },
    { productId: ObjectId("P2"), nameSnapshot: "USB-C Cable", priceSnapshot: 9, qty: 2 }
  ]
}
Why snapshots? This is intentional duplication:
- If the product name/price changes later, the order should keep the historic “what was bought”.
Query
db.orders.find({ customerId: ObjectId("C1") }, { items: 1, createdAt: 1 })
When it fits
- Items are bounded (e.g. typical cart sizes)
- You often need order + items together
- You value atomicity of updating the order
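The snapshot idea can be sketched in application code as follows. This is a minimal illustration, not a prescribed implementation: buildOrder and the catalog Map are hypothetical, and plain strings stand in for ObjectId values so the snippet runs without a database.

```javascript
// Build an order document whose line items snapshot the product's
// name and price at purchase time.
// `buildOrder` and the shape of `catalog` are hypothetical helpers.
function buildOrder(customerId, cart, catalog, now = new Date()) {
  const items = cart.map(({ productId, qty }) => {
    const product = catalog.get(productId);
    if (!product) throw new Error("Unknown product: " + productId);
    return {
      productId,
      nameSnapshot: product.name,   // copied, not referenced
      priceSnapshot: product.price, // copied, not referenced
      qty,
    };
  });
  return { customerId, createdAt: now, items };
}

const catalog = new Map([
  ["P1", { name: "SSD 1TB", price: 109 }],
  ["P2", { name: "USB-C Cable", price: 9 }],
]);
const cart = [{ productId: "P1", qty: 1 }, { productId: "P2", qty: 2 }];
const order = buildOrder("C1", cart, catalog);

// A later price change in the catalog does not affect the stored order.
catalog.get("P1").price = 99;
console.log(order.items[0].priceSnapshot); // 109
```

In mongosh the resulting object could be passed to db.orders.insertOne(order), using real ObjectId values instead of the string placeholders.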
Option C: Reference children in another collection (best for unbounded sets)
Example: a post can collect an unbounded number of comments.
// posts
{ _id: ObjectId("Post1"), title: "..." }
// comments
{ _id: ObjectId("Cmt1"), postId: ObjectId("Post1"), text: "Nice!" }
{ _id: ObjectId("Cmt2"), postId: ObjectId("Post1"), text: "Question: ..." }
Query
const post = db.posts.findOne({ _id: ObjectId("Post1") });
const comments = db.comments.find({ postId: post._id }).toArray();
Single query with $lookup
db.posts.aggregate([
  { $match: { _id: ObjectId("Post1") } },
  { $lookup: {
      from: "comments",
      localField: "_id",
      foreignField: "postId",
      as: "comments"
  }}
])
Performance note: Index comments.postId to make this fast.
Option D: “Bucketing” / chunking (hybrid for large 1:n)
If a parent has very many children, you can store them in fixed-size buckets, so that no single document grows without bound.
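As a rough sketch of the idea, the helper below splits a list of comments into buckets of at most 200 entries. The function names toBuckets and bucketNoFor are hypothetical, the bucket size of 200 is just the example value, and IDs are plain strings so that the snippet runs without a database.

```javascript
// Split a list of comments into bucket documents of fixed capacity.
// `toBuckets` and `bucketNoFor` are hypothetical helpers.
function toBuckets(postId, comments, bucketSize = 200) {
  const buckets = [];
  for (let i = 0; i < comments.length; i += bucketSize) {
    buckets.push({
      postId,
      bucketNo: buckets.length, // 0, 1, 2, ...
      comments: comments.slice(i, i + bucketSize),
    });
  }
  return buckets;
}

// Which bucket holds the n-th comment (0-based)?
function bucketNoFor(commentIndex, bucketSize = 200) {
  return Math.floor(commentIndex / bucketSize);
}

const sampleComments = Array.from({ length: 450 }, (_, i) => ({ text: "comment " + i }));
const buckets = toBuckets("Post1", sampleComments);
console.log(buckets.map(b => b.comments.length)); // [ 200, 200, 50 ]
console.log(bucketNoFor(399)); // 1
```

Each element of the result corresponds to one bucket document of the following shape.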
// commentBuckets
{
  _id: ObjectId("B1"),
  postId: ObjectId("Post1"),
  bucketNo: 0,
  comments: [ { ... }, { ... }, ... ] // e.g. 200 comments per bucket
}
4) Modeling m:n (many-to-many)
Example: students enroll in courses; courses have many students.
Option A: Arrays of references (simple, but can grow)
// students
{ _id: ObjectId("S1"), name: "Alina", courseIds: [ObjectId("C1"), ObjectId("C2")] }
// courses
{ _id: ObjectId("C1"), title: "Networks", studentIds: [ObjectId("S1"), ObjectId("S9")] }
Pros
- Easy to read “student with courses” as long as the ID arrays stay small
Cons
- Arrays can become large
- Updates must maintain both sides (risk of inconsistency)
- Queries like “who is enrolled?” can be heavy if arrays are huge
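Maintaining both sides could look like the sketch below. The helper enroll is hypothetical; db is assumed to be the mongosh database handle, and the collection and field names follow the example above.

```javascript
// Enroll a student by updating BOTH sides of the relation.
// `enroll` is a hypothetical helper; `db` is the mongosh database handle.
// $addToSet avoids duplicate IDs if the call is repeated.
function enroll(db, studentId, courseId) {
  db.students.updateOne({ _id: studentId }, { $addToSet: { courseIds: courseId } });
  db.courses.updateOne({ _id: courseId }, { $addToSet: { studentIds: studentId } });
}

// In mongosh: enroll(db, studentId, courseId)
```

These are two separate writes. If the second one fails, the two arrays disagree, which is exactly the inconsistency risk named above; a multi-document transaction or a repair job would be needed to rule it out.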
Query with $lookup
db.students.aggregate([
  { $match: { _id: ObjectId("S1") } },
  { $lookup: {
      from: "courses",
      localField: "courseIds",
      foreignField: "_id",
      as: "courses"
  }}
])
Option B: Join/bridge collection (most scalable)
Create an enrollments collection.
// students
{ _id: ObjectId("S1"), name: "Alina" }
// courses
{ _id: ObjectId("C1"), title: "Networks" }
// enrollments
{ _id: ObjectId("E1"), studentId: ObjectId("S1"), courseId: ObjectId("C1"), since: ISODate("2025-09-10") }
Pros
- Scales to large sizes
- Easy to query and index in both directions
- You can store attributes on the relationship (grade, status, …)
Cons
- Requires joins (multiple queries or $lookup)
Single query to get student + enrolled courses (via enrollments)
db.students.aggregate([
  { $match: { _id: ObjectId("S1") } },
  { $lookup: { from: "enrollments", localField: "_id", foreignField: "studentId", as: "enrollments" } },
  { $lookup: { from: "courses", localField: "enrollments.courseId", foreignField: "_id", as: "courses" } },
  { $project: { name: 1, courses: { title: 1, teacherId: 1 } } }
])
Indexing tip: index enrollments.studentId and enrollments.courseId.
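One possible way to set up these indexes is sketched below. The helper name ensureEnrollmentIndexes is hypothetical and db is assumed to be the mongosh database handle. Making the first index a unique compound index is an extra assumption that goes beyond the tip: it covers lookups by studentId and also rejects a second enrollment of the same student in the same course.

```javascript
// Indexes for the enrollments bridge collection.
// `ensureEnrollmentIndexes` is a hypothetical helper; `db` is the mongosh handle.
function ensureEnrollmentIndexes(db) {
  // Serves "courses of a student" (studentId is the index prefix).
  // Assumption: unique, so the same (student, course) pair cannot be stored twice.
  db.enrollments.createIndex({ studentId: 1, courseId: 1 }, { unique: true });
  // Serves the opposite direction: "students of a course".
  db.enrollments.createIndex({ courseId: 1 });
}

// In mongosh: ensureEnrollmentIndexes(db)
```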
5) Querying across collections: multiple queries vs single query
A) Multiple queries (application-side join)
Pattern
- Query parent, extract IDs
- Query children by those IDs
const student = db.students.findOne({ _id: sid });
const enrolls = db.enrollments.find({ studentId: sid }).toArray();
const courseIds = enrolls.map(e => e.courseId);
const courses = db.courses.find({ _id: { $in: courseIds } }).toArray();
Pros
- Sometimes simpler to understand
- You can cache intermediate results
Cons
- More round-trips (network latency)
- Harder to get a consistent snapshot across the queries unless you use transactions / read concerns
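The combining step of an application-side join can be sketched without a database. In the snippet below, plain arrays stand in for the results of find(...).toArray(), the helper coursesByStudent is hypothetical, the sample data (including the course title "Databases") is made up, and IDs are plain strings.

```javascript
// Application-side join: combine already-fetched enrollments and courses.
// `coursesByStudent` is a hypothetical helper.
function coursesByStudent(enrollments, courses) {
  // Index courses by _id once, so each enrollment is resolved with one Map lookup.
  const courseById = new Map(courses.map(c => [c._id, c]));
  const result = new Map();
  for (const e of enrollments) {
    const course = courseById.get(e.courseId);
    if (!course) continue; // dangling reference: MongoDB does not enforce references
    if (!result.has(e.studentId)) result.set(e.studentId, []);
    result.get(e.studentId).push(course.title);
  }
  return result;
}

const enrollments = [
  { studentId: "S1", courseId: "C1" },
  { studentId: "S1", courseId: "C2" },
  { studentId: "S9", courseId: "C1" },
  { studentId: "S9", courseId: "C404" }, // refers to a course that does not exist
];
const courses = [
  { _id: "C1", title: "Networks" },
  { _id: "C2", title: "Databases" },
];
const joined = coursesByStudent(enrollments, courses);
console.log(joined.get("S1")); // [ 'Networks', 'Databases' ]
console.log(joined.get("S9")); // [ 'Networks' ]
```

With real ObjectId values, use a string form of the ID (for example id.toString()) as the Map key, because a Map compares object keys by identity rather than by value.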
B) Single query with aggregation $lookup
Pattern: do the join inside MongoDB.
db.students.aggregate([
  { $match: { _id: sid } },
  { $lookup: { from: "enrollments", localField: "_id", foreignField: "studentId", as: "enrollments" } },
  { $lookup: { from: "courses", localField: "enrollments.courseId", foreignField: "_id", as: "courses" } }
])
Pros
- One round-trip
- MongoDB does the joining work
- Can filter/project early in the pipeline
Cons
- $lookup can be expensive if not indexed or if it brings in lots of data
- Pipeline can become complex
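To illustrate filtering and projecting early, here is one possible variant that starts from the enrollments collection instead of students. It is a sketch under assumptions: studentCoursesPipeline is a hypothetical helper that only builds the pipeline array, and the output shape (since plus title) is chosen for illustration.

```javascript
// Build a pipeline that starts from enrollments and filters first.
// `studentCoursesPipeline` is a hypothetical helper.
function studentCoursesPipeline(studentId) {
  return [
    // 1. Filter early: only this student's enrollments (can use an index on studentId).
    { $match: { studentId: studentId } },
    // 2. Join each enrollment to its course via courses._id.
    { $lookup: { from: "courses", localField: "courseId", foreignField: "_id", as: "course" } },
    // 3. $lookup yields an array; flatten it to one course per enrollment.
    { $unwind: "$course" },
    // 4. Project only what the caller needs.
    { $project: { _id: 0, since: 1, title: "$course.title" } },
  ];
}

// In mongosh: db.enrollments.aggregate(studentCoursesPipeline(sid))
```

Because $match is the first stage, only the matching enrollments ever reach the $lookup, which keeps the amount of joined data small.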
6) How schema design affects performance
Reads
- Embedded: often fastest for “read everything together”
- Referenced: can be slower unless you query selectively or join efficiently
Writes / updates
- Embedded + duplication: may require updating many documents (fan-out)
- References: update once in the owning collection, but reads may need joins
Working set / memory
- Embedding lots of rarely used fields can make documents “heavy”, increasing I/O.
- Referencing can keep documents smaller so more fit in RAM.
Index usage
- $lookup becomes much cheaper when the foreign field being matched is indexed (e.g. enrollments.studentId or enrollments.courseId).
- For embedded arrays, indexes can be created on nested fields, e.g. {"items.productId": 1}.
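Whether an index is actually used can be checked with explain("executionStats"). The helper below is a hypothetical convenience for find() queries that pulls the most telling counters out of the result. The sample numbers are made up and only shaped like an unindexed scan.

```javascript
// Summarize the executionStats section of an explain() result for a find() query.
// `summarizeExplain` is a hypothetical helper.
function summarizeExplain(explainResult) {
  const s = explainResult.executionStats;
  return {
    returned: s.nReturned,
    keysExamined: s.totalKeysExamined,
    docsExamined: s.totalDocsExamined,
    millis: s.executionTimeMillis,
    // Many more documents examined than returned hints at a missing index.
    docsPerResult: s.nReturned > 0 ? s.totalDocsExamined / s.nReturned : null,
  };
}

// In mongosh (the numbers depend on your data):
//   summarizeExplain(db.comments.find({ postId: id }).explain("executionStats"))

// Made-up numbers, shaped like a scan over 10000 comments without an index:
const sample = {
  executionStats: { nReturned: 20, totalKeysExamined: 0, totalDocsExamined: 10000, executionTimeMillis: 12 },
};
console.log(summarizeExplain(sample).docsPerResult); // 500
```

A ratio close to 1 means the query reads roughly only the documents it returns.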
Big red flags
- Unbounded arrays inside one document (e.g. “all comments ever” in one post)
- Large documents when most queries only need a small subset of fields
7) Mixing approaches (very common in real systems)
Many real schemas use both:
- Reference for core entities (Users, Products, Courses)
- Embed small/bounded subdocuments (addresses, settings, line items)
- Store snapshots of important fields even if you reference the full entity
Example: enrollment references courseId, but also stores courseTitleSnapshot:
{ studentId, courseId, courseTitleSnapshot: "Networks 1", since: ... }
This is intentional duplication to speed up common queries and preserve history.
8) When duplication is desirable
Embedding or storing snapshots duplicates data. That’s not always “bad”; it can be correct:
- History / auditing: orders keep product name/price at purchase time
- Performance: avoid joining to fetch a frequently used label
- Availability: your “read model” must work even if related data changes
Rule of thumb:
- Duplicate stable fields or historic snapshots
- Reference mutable fields you must keep consistent everywhere
9) Tooling you should practice
- find(), distinct(), sort(), limit()
- Indexing: createIndex(), and explain("executionStats") to compare plans and performance
- Aggregation pipeline: $match, $project, $group, $sort, $limit, $lookup, $unwind, $addFields
- (optional) $facet for multi-result pipelines
10) Mini checklist for choosing embed vs reference
Embed when:
- Bounded size
- Read together often
- Need atomic update
Reference when:
- Child count grows without bound
- Need to query/update children independently
- Want a single source of truth
Mix when:
- You want reference for correctness, plus snapshots/embedded for speed/history