MongoDB Relations: Embedding vs Referencing (Scriptum)

This scriptum explains how to model common relationships in MongoDB—1:1, 1:n, and m:n—using either:

Embedded documents (store related data inside the same document)
References (store IDs and keep related data in separate collections)

It also shows how schema choices influence querying, performance, and when mixing approaches is a good idea.

0) Mental model: what “relations” mean in MongoDB

MongoDB is a document database. You can represent relations, but you choose how:

Embedding = like “pre-joining” data into one document
Referencing = like “foreign keys” (but MongoDB does not enforce referential integrity; your application has to)

MongoDB’s key strength is reading one document quickly. Schema design often aims to answer:

“What does my application most often need to read together?”

1) Embedding vs Referencing: overview

Embedded documents

Idea: store related objects inside a parent document.

Pros

Usually fast reads for “fetch parent with its children”
Atomic updates within one document (a write to a single document is always atomic, no transaction needed)
Fewer round-trips / fewer $lookups

Cons

Documents can grow large (the BSON document size limit is 16 MB)
Updating duplicated embedded data can be harder (fan-out updates)
Children are harder to query or update on their own, because every access goes through the parent document

References (IDs)

Idea: store related data in other collections and refer by _id.

Pros

Avoid duplication; single source of truth
Children can be queried independently and indexed independently
Works well for large or unbounded sets of children (no ever-growing array inside the parent)

Cons

Reads can require multiple queries or aggregation with $lookup
Harder to guarantee consistency without transactions / application logic
$lookup can be expensive if not indexed well

2) Modeling 1:1 relationships

Option A: Embed (common when “always together”)

Example: users with an embedded profile.

// users
{
  _id: ObjectId("..."),
  username: "alina",
  email: "alina@example.com",
  profile: {
    fullName: "Alina Berger",
    birthday: ISODate("2008-03-14"),
    address: { city: "Vienna", zip: "1010" }
  }
}

When it fits

Profile is small
You almost always load user + profile together

Query

db.users.find({ username: "alina" }, { username: 1, profile: 1 })

Option B: Reference (when profile is optional or large or separately managed)

// users
{ _id: ObjectId("U1"), username: "alina", profileId: ObjectId("P1") }
 
// profiles
{ _id: ObjectId("P1"), fullName: "Alina Berger", address: { city: "Vienna" } }

Query with multiple queries

const user = db.users.findOne({ username: "alina" });
const profile = db.profiles.findOne({ _id: user.profileId });

Query with a single aggregation using $lookup

db.users.aggregate([
  { $match: { username: "alina" } },
  { $lookup: {
      from: "profiles",
      localField: "profileId",
      foreignField: "_id",
      as: "profile"
  }},
  { $unwind: { path: "$profile", preserveNullAndEmptyArrays: true } } // keep users that have no profile
])

Performance note: this $lookup is cheap because it matches on profiles._id, which always has an index. The $unwind stage turns the one-element profile array into a single embedded object.

3) Modeling 1:n relationships

Typical case: “one parent has many children”.

Option A: Embed an array (great for bounded sets)

Example: blog post with a small number of tags.

// posts
{
  _id: ObjectId("..."),
  title: "MongoDB relations",
  tags: ["mongodb", "schema", "nosql"]
}

Query

db.posts.find({ tags: "mongodb" })

Option B: Embed child documents (good for bounded “child” collections)

Example: order with line items.

// orders
{
  _id: ObjectId("O1"),
  customerId: ObjectId("C1"),
  createdAt: ISODate("2026-03-01"),
  items: [
    { productId: ObjectId("P1"), nameSnapshot: "SSD 1TB", priceSnapshot: 109, qty: 1 },
    { productId: ObjectId("P2"), nameSnapshot: "USB-C Cable", priceSnapshot: 9, qty: 2 }
  ]
}

Why snapshots? This is intentional duplication:

If the product name/price changes later, the order should keep the historic “what was bought”.
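
The snapshot idea can be sketched as a small helper that copies the product fields into each line item when the order is created. This is a minimal sketch, not a fixed recipe: the function name buildOrderItems and the product fields name and price are assumptions made for illustration.

```javascript
// Build order line items that copy ("snapshot") the product name and price
// at purchase time. `products` is an array of product documents the caller
// has already loaded; the fields `name` and `price` are assumed here.
function buildOrderItems(cart, products) {
  // Index products by the string form of their _id for quick lookup.
  const byId = new Map(products.map(p => [String(p._id), p]));

  return cart.map(({ productId, qty }) => {
    const product = byId.get(String(productId));
    if (!product) throw new Error(`Unknown product: ${productId}`);
    return {
      productId,                    // reference to the live product
      nameSnapshot: product.name,   // copied value, frozen in the order
      priceSnapshot: product.price, // copied value, frozen in the order
      qty
    };
  });
}
```

Because the values are copied, a later change to the product document does not alter orders that were already placed.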

Query

db.orders.find({ customerId: ObjectId("C1") }, { items: 1, createdAt: 1 })

When it fits

Items are bounded (e.g. typical cart sizes)
You often need order + items together
You value atomicity of updating the order
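
To illustrate the atomicity point, here is a minimal sketch that adds a line item and adjusts a counter in one updateOne call, so both changes are applied together or not at all. The helper name addItemToOrder and the itemCount field are hypothetical and not part of the schema above. The call style assumes mongosh; with the Node.js driver the same call returns a promise.

```javascript
// Add a line item and bump a counter in ONE single-document write.
// `itemCount` is a hypothetical field used only for this illustration.
function addItemToOrder(db, orderId, item) {
  return db.orders.updateOne(
    { _id: orderId },
    {
      $push: { items: item },        // append the new line item
      $inc: { itemCount: item.qty }  // adjust the counter in the same write
    }
  );
}
```

In mongosh this could be called as addItemToOrder(db, orderId, { productId, nameSnapshot: "USB-C Cable", priceSnapshot: 9, qty: 2 }), with orderId and productId supplied by the caller.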

Option C: Reference children in another collection (best for unbounded sets)

Example: a post can collect an unbounded number of comments.

// posts
{ _id: ObjectId("Post1"), title: "..." }
 
// comments
{ _id: ObjectId("Cmt1"), postId: ObjectId("Post1"), text: "Nice!" }
{ _id: ObjectId("Cmt2"), postId: ObjectId("Post1"), text: "Question: ..." }

Query

const post = db.posts.findOne({ _id: ObjectId("Post1") });
const comments = db.comments.find({ postId: post._id }).toArray();

Single query with $lookup

db.posts.aggregate([
  { $match: { _id: ObjectId("Post1") } },
  { $lookup: {
      from: "comments",
      localField: "_id",
      foreignField: "postId",
      as: "comments"
  }}
])

Performance note: create an index on comments.postId. Without it, each lookup has to scan the whole comments collection.
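
As a sketch of how the index and a typical read could look in mongosh: the helpers below create a compound index that starts with postId and fetch only the newest comments of a post. The helper names and the choice to sort by _id (ObjectIds roughly follow creation time) are assumptions for illustration; a plain { postId: 1 } index would also serve the queries above.

```javascript
// Create an index that supports "comments of one post, newest first".
// The leading postId key also covers plain lookups by postId.
function ensureCommentIndex(db) {
  return db.comments.createIndex({ postId: 1, _id: -1 });
}

// Fetch only the newest `limit` comments instead of the whole set.
function latestComments(db, postId, limit = 20) {
  return db.comments
    .find({ postId })
    .sort({ _id: -1 })
    .limit(limit)
    .toArray();
}
```

Limiting the result keeps the read bounded even when a post has a very large number of comments.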

Option D: “Bucketing” / chunking (hybrid for large 1:n)

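If a parent has a very large number of children, you can group them into fixed-size buckets. Each bucket document stays bounded, and you read far fewer documents than with one document per child: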

// commentBuckets
{
  _id: ObjectId("B1"),
  postId: ObjectId("Post1"),
  bucketNo: 0,
  comments: [ { ... }, { ... }, ... ] // e.g. 200 comments per bucket
}
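
One way to fill such buckets, sketched under assumptions: the caller knows the 0-based position of the new comment within the post (for example from a counter it maintains), and that position decides the bucket number. The names BUCKET_SIZE, bucketNoFor and addCommentToBucket are hypothetical. The upsert creates the bucket document the first time a bucket number is used.

```javascript
const BUCKET_SIZE = 200; // matches the "200 comments per bucket" example

// Which bucket does the n-th comment (0-based) of a post belong to?
function bucketNoFor(commentIndex, bucketSize = BUCKET_SIZE) {
  return Math.floor(commentIndex / bucketSize);
}

// Append a comment to its bucket; create the bucket if it does not exist yet.
// `commentIndex` is assumed to be tracked by the caller.
function addCommentToBucket(db, postId, commentIndex, comment) {
  return db.commentBuckets.updateOne(
    { postId, bucketNo: bucketNoFor(commentIndex) },
    { $push: { comments: comment } },
    { upsert: true }
  );
}
```

Reading a page of comments then becomes a query for one bucket, for example by postId and bucketNo.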

4) Modeling m:n (many-to-many)

Example: students enroll in courses; courses have many students.

Option A: Arrays of references (simple, but can grow)

// students
{ _id: ObjectId("S1"), name: "Alina", courseIds: [ObjectId("C1"), ObjectId("C2")] }
 
// courses
{ _id: ObjectId("C1"), title: "Networks", studentIds: [ObjectId("S1"), ObjectId("S9")] }

Pros

Easy to read “student with courses” as long as the ID arrays stay small

Cons

Arrays can become large
Updates must maintain both sides (risk of inconsistency)
Queries like “who is enrolled?” can be heavy if arrays are huge
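
The "maintain both sides" problem can be made concrete with a small sketch. The helper name enroll is hypothetical; it issues two separate writes in mongosh style, one per collection. Each write is atomic on its own, but the pair is not: if the second write fails, the two arrays disagree unless the application retries or wraps both in a transaction.

```javascript
// Enroll a student in a course by updating BOTH documents.
// $addToSet avoids duplicate IDs if the function is called twice.
function enroll(db, studentId, courseId) {
  const student = db.students.updateOne(
    { _id: studentId },
    { $addToSet: { courseIds: courseId } }
  );
  // If this second write fails, the student lists the course
  // but the course does not list the student.
  const course = db.courses.updateOne(
    { _id: courseId },
    { $addToSet: { studentIds: studentId } }
  );
  return { student, course };
}
```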

Query with $lookup

db.students.aggregate([
  { $match: { _id: ObjectId("S1") } },
  { $lookup: {
      from: "courses",
      localField: "courseIds",
      foreignField: "_id",
      as: "courses"
  }}
])

Option B: Join/bridge collection (most scalable)

Create an enrollments collection.

// students
{ _id: ObjectId("S1"), name: "Alina" }
 
// courses
{ _id: ObjectId("C1"), title: "Networks" }
 
// enrollments
{ _id: ObjectId("E1"), studentId: ObjectId("S1"), courseId: ObjectId("C1"), since: ISODate("2025-09-10") }

Pros

Scales to large sizes
Easy to query and index in both directions
You can store attributes on the relationship (grade, status, …)

Cons

Requires joins (multiple queries or $lookup)

Single query to get student + enrolled courses (via enrollments)

db.students.aggregate([
  { $match: { _id: ObjectId("S1") } },
  { $lookup: { from: "enrollments", localField: "_id", foreignField: "studentId", as: "enrollments" } },
  { $lookup: { from: "courses", localField: "enrollments.courseId", foreignField: "_id", as: "courses" } },
  { $project: { name: 1, courses: { title: 1, teacherId: 1 } } }
])

Indexing tip: index both fields so that lookups are fast in either direction:

enrollments.studentId
enrollments.courseId
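
A possible way to set up these indexes in mongosh is sketched below. The helper name is hypothetical, and making the compound index unique is an extra assumption (that a student may enroll in a given course only once); drop the option if repeated enrollments are valid. The compound index starts with studentId, so it also serves queries by studentId alone.

```javascript
// Create indexes for both directions of the enrollment relationship.
function ensureEnrollmentIndexes(db) {
  return [
    // student -> courses; unique pair is an ASSUMPTION of this sketch
    db.enrollments.createIndex({ studentId: 1, courseId: 1 }, { unique: true }),
    // course -> students
    db.enrollments.createIndex({ courseId: 1 })
  ];
}
```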

5) Querying across collections: multiple queries vs single query

A) Multiple queries (application-side join)

Pattern

Query parent, extract IDs
Query children by those IDs

const student = db.students.findOne({ _id: sid });
const enrolls = db.enrollments.find({ studentId: sid }).toArray();
const courseIds = enrolls.map(e => e.courseId);
const courses = db.courses.find({ _id: { $in: courseIds } }).toArray();
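
After the queries return, the application still has to stitch the results together. A minimal sketch of that step, using a hypothetical helper attachCourses that works on plain arrays such as the enrolls and courses results above:

```javascript
// Attach the matching course document to every enrollment.
// IDs are compared by their string form so ObjectIds and strings both work.
function attachCourses(enrollments, courses) {
  const courseById = new Map(courses.map(c => [String(c._id), c]));
  return enrollments.map(e => ({
    ...e,
    // null marks a dangling reference (course missing or deleted)
    course: courseById.get(String(e.courseId)) ?? null
  }));
}
```

The null case matters here because nothing in the database guarantees that a referenced course still exists.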

Pros

Sometimes simpler to understand
You can cache intermediate results

Cons

More round-trips (network latency)
Harder to get a consistent snapshot across the queries unless you use a transaction (with a suitable read concern)

B) Single query with aggregation $lookup

Pattern: do the join inside MongoDB.

db.students.aggregate([
  { $match: { _id: sid } },
  { $lookup: { from: "enrollments", localField: "_id", foreignField: "studentId", as: "enrollments" } },
  { $lookup: { from: "courses", localField: "enrollments.courseId", foreignField: "_id", as: "courses" } }
])

Pros

One round-trip
MongoDB does the joining work
Can filter/project early in the pipeline

Cons

$lookup can be expensive if not indexed or if it brings lots of data
Pipeline can become complex

6) How schema design affects performance

Reads

Embedded: often fastest for “read everything together”
Referenced: needs extra queries or a $lookup, so it is often slower unless the join fields are indexed and you fetch only the documents and fields you need

Writes / updates

Embedded + duplication: may require updating many documents (fan-out)
References: update once in the owning collection, but reads may need joins

Working set / memory

Embedding lots of rarely used fields can make documents “heavy”, increasing I/O.
Referencing can keep documents smaller so more fit in RAM.

Index usage

$lookup becomes much cheaper when the foreignField of the joined collection is indexed (e.g. enrollments.studentId when joining students to enrollments).
For embedded arrays, you can index nested fields with a multikey index, e.g. {"items.productId": 1}.

Big red flags

Unbounded arrays inside one document (e.g. “all comments ever” in one post)
Large documents when most queries only need a small subset of fields

7) Mixing approaches (very common in real systems)

Many real schemas use both:

Reference for core entities (Users, Products, Courses)
Embed small/bounded subdocuments (addresses, settings, line items)
Store snapshots of important fields even if you reference the full entity

Example: enrollment references courseId, but also stores courseTitleSnapshot:

{ studentId, courseId, courseTitleSnapshot: "Networks 1", since: ... }

This is intentional duplication to speed up common queries and preserve history.

8) When duplication is desirable

Embedding or storing snapshots duplicates data. That’s not always “bad”; it can be correct:

History / auditing: orders keep product name/price at purchase time
Performance: avoid joining to fetch a frequently used label
Availability: your “read model” keeps working even if the related data changes or is deleted

Rule of thumb:

Duplicate stable fields or historic snapshots
Reference mutable fields you must keep consistent everywhere

9) Tooling you should practice

find(), distinct(), sort(), limit()
Indexing: createIndex()
explain("executionStats") to compare plans and performance
Aggregation pipeline:
- $match, $project, $group, $sort, $limit
- $lookup, $unwind, $addFields
- (optional) $facet for multi-result pipelines
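
When comparing plans with explain("executionStats"), a few numbers usually tell most of the story. The sketch below is a hypothetical helper that pulls them out of the explain result of a find() query on an unsharded deployment; the helper name and the derived ratio are illustrative assumptions, while the field names come from the executionStats section of the explain output.

```javascript
// Reduce an explain("executionStats") result to the key numbers.
function summarizeExplain(explainResult) {
  const stats = explainResult.executionStats;
  return {
    returned: stats.nReturned,
    keysExamined: stats.totalKeysExamined,
    docsExamined: stats.totalDocsExamined,
    millis: stats.executionTimeMillis,
    // Close to 1 suggests a selective index; a large value suggests a scan.
    docsExaminedPerReturned:
      stats.nReturned > 0 ? stats.totalDocsExamined / stats.nReturned : null
  };
}
```

In mongosh it could be used as summarizeExplain(db.comments.find({ postId: somePostId }).explain("executionStats")), once before and once after creating an index, to compare the two runs.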

10) Mini checklist for choosing embed vs reference

Embed when:

Bounded size
Read together often
Need atomic update

Reference when:

Child count grows without bound
Need to query/update children independently
Want a single source of truth

Mix when:

You want reference for correctness, plus snapshots/embedded for speed/history
