MongoDB Relations: Embedding vs Referencing (Scriptum)

This scriptum explains how to model common relationships in MongoDB (1:1, 1:n, and m:n) using either:

  • Embedded documents (store related data inside the same document)
  • References (store IDs and keep related data in separate collections)

It also shows how schema choices influence querying and performance, and when mixing approaches is a good idea.


0) Mental model: what “relations” mean in MongoDB

MongoDB is a document database. You can represent relations, but you choose how:

  • Embedding = like “pre-joining” data into one document
  • Referencing = like “foreign keys” (but MongoDB doesn’t enforce them automatically)

MongoDB’s key strength is reading one document quickly. Schema design often aims to answer:

“What does my application most often need to read together?”
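
To make the two ideas concrete, here is a minimal in-memory sketch in plain JavaScript. No MongoDB is involved: the Maps stand in for collections, and the string IDs and function names are hypothetical. It only illustrates that the embedded shape needs one read, while the referenced shape needs a second read that may find nothing.

```javascript
// Hypothetical in-memory "collections": Maps keyed by _id (no MongoDB involved).
const usersEmbedded = new Map([
  ["U1", { _id: "U1", username: "alina", profile: { fullName: "Alina Berger" } }],
]);

const usersReferenced = new Map([
  ["U1", { _id: "U1", username: "alina", profileId: "P1" }],
]);
const profiles = new Map([
  ["P1", { _id: "P1", fullName: "Alina Berger" }],
]);

// Embedded: one read returns the user together with the profile.
function readEmbedded(userId) {
  const user = usersEmbedded.get(userId); // read 1
  return { fullName: user.profile.fullName, reads: 1 };
}

// Referenced: a second read resolves the stored ID; nothing guarantees it exists.
function readReferenced(userId) {
  const user = usersReferenced.get(userId);     // read 1
  const profile = profiles.get(user.profileId); // read 2
  return { fullName: profile ? profile.fullName : null, reads: 2 };
}

console.log(readEmbedded("U1").reads, readReferenced("U1").reads); // 1 2
```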


1) Embedding vs Referencing: overview

Embedded documents

Idea: store related objects inside a parent document.

Pros

  • Usually fast reads for “fetch parent with its children”
  • Atomic updates within one document (a write to a single document is always atomic, no multi-document transaction needed)
  • Fewer round-trips / fewer $lookups

Cons

  • Documents can grow large (the maximum document size in MongoDB is 16 MB)
  • If the same embedded data is duplicated across documents, a change must be written to every copy (fan-out updates)
  • Child data is harder to query or update on its own, because every access goes through the parent document

References (IDs)

Idea: store related data in other collections and refer by _id.

Pros

  • Avoid duplication; single source of truth
  • Children can be queried independently and indexed independently
  • Works well for large or unbounded “child” sets (no ever-growing array in the parent)

Cons

  • Reads can require multiple queries or aggregation with $lookup
  • Harder to guarantee consistency without transactions / application logic
  • $lookup can be expensive if not indexed well
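
Because MongoDB does not enforce references, a stored ID can point at a document that no longer exists. The following sketch shows one way an application could detect such orphans. It works on plain arrays with string IDs; the function name findOrphans and the sample data are assumptions made for illustration.

```javascript
// Returns the children whose reference field points at no existing parent.
// `refField` is the name of the field holding the parent's _id.
// Note: real ObjectIds are objects, so they would need String(...) conversion
// before being used as Set members; plain strings are used here.
function findOrphans(children, parents, refField) {
  const parentIds = new Set(parents.map((p) => p._id));
  return children.filter((c) => !parentIds.has(c[refField]));
}

const posts = [{ _id: "Post1", title: "MongoDB relations" }];
const comments = [
  { _id: "Cmt1", postId: "Post1", text: "Nice!" },
  { _id: "Cmt2", postId: "Post2", text: "Orphan" }, // Post2 does not exist
];

const orphans = findOrphans(comments, posts, "postId");
console.log(orphans.map((c) => c._id)); // [ 'Cmt2' ]
```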

2) Modeling 1:1 relationships

Option A: Embed (common when “always together”)

Example: users with an embedded profile.

// users
{
  _id: ObjectId("..."),
  username: "alina",
  email: "alina@example.com",
  profile: {
    fullName: "Alina Berger",
    birthday: ISODate("2008-03-14"),
    address: { city: "Vienna", zip: "1010" }
  }
}

When it fits

  • Profile is small
  • You almost always load user + profile together

Query

db.users.find({ username: "alina" }, { username: 1, profile: 1 })

Option B: Reference (when profile is optional or large or separately managed)

// users
{ _id: ObjectId("U1"), username: "alina", profileId: ObjectId("P1") }
 
// profiles
{ _id: ObjectId("P1"), fullName: "Alina Berger", address: { city: "Vienna" } }

Note: ObjectId("U1"), ObjectId("P1") and similar values in this scriptum are readable placeholders; a real ObjectId is built from a 24-character hex string.

Read with two separate queries

const user = db.users.findOne({ username: "alina" });
const profile = db.profiles.findOne({ _id: user.profileId });

Query with a single aggregation using $lookup

db.users.aggregate([
  { $match: { username: "alina" } },
  { $lookup: {
      from: "profiles",
      localField: "profileId",
      foreignField: "_id",
      as: "profile"
  }},
  { $unwind: { path: "$profile", preserveNullAndEmptyArrays: true } } // keep users that have no profile
])

Performance note: this $lookup is cheap because it matches on profiles._id, and every collection has an index on _id by default.
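
The shape of the result is worth spelling out: $lookup always attaches an array, even for a 1:1 relation, which is why the pipeline above follows it with $unwind. The sketch below imitates both stages on plain arrays. The functions lookup and unwind are simplified stand-ins written for this scriptum, not MongoDB APIs, and they ignore many details of the real operators.

```javascript
// Simplified stand-in for $lookup with a scalar localField: attaches an ARRAY
// of matching foreign documents under `as`.
function lookup(docs, foreignDocs, localField, foreignField, as) {
  return docs.map((d) => ({
    ...d,
    [as]: foreignDocs.filter((f) => f[foreignField] === d[localField]),
  }));
}

// Simplified stand-in for $unwind: one output document per array element.
// With `preserveEmpty` false, documents whose array is empty are dropped;
// with true, they are kept and the empty field is omitted.
function unwind(docs, field, preserveEmpty = false) {
  return docs.flatMap((d) => {
    if (d[field].length === 0) {
      if (!preserveEmpty) return [];
      const { [field]: _dropped, ...rest } = d;
      return [rest];
    }
    return d[field].map((item) => ({ ...d, [field]: item }));
  });
}

const users = [
  { _id: "U1", username: "alina", profileId: "P1" },
  { _id: "U2", username: "ben" }, // hypothetical user without a profile
];
const profiles = [{ _id: "P1", fullName: "Alina Berger" }];

const joined = lookup(users, profiles, "profileId", "_id", "profile");
const strict = unwind(joined, "profile");        // only alina survives
const lenient = unwind(joined, "profile", true); // alina and ben
console.log(strict.length, lenient.length); // 1 2
```

The difference between the two calls mirrors the preserveNullAndEmptyArrays option of $unwind: without it, users that have no profile silently disappear from the result.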


3) Modeling 1:n relationships

Typical case: “one parent has many children”.

Option A: Embed an array (great for bounded sets)

Example: blog post with a small number of tags.

// posts
{
  _id: ObjectId("..."),
  title: "MongoDB relations",
  tags: ["mongodb", "schema", "nosql"]
}

Query

db.posts.find({ tags: "mongodb" })

Option B: Embed child documents (good for bounded “child” sets)

Example: order with line items.

// orders
{
  _id: ObjectId("O1"),
  customerId: ObjectId("C1"),
  createdAt: ISODate("2026-03-01"),
  items: [
    { productId: ObjectId("P1"), nameSnapshot: "SSD 1TB", priceSnapshot: 109, qty: 1 },
    { productId: ObjectId("P2"), nameSnapshot: "USB-C Cable", priceSnapshot: 9, qty: 2 }
  ]
}

Why snapshots? This is intentional duplication:

  • If the product name/price changes later, the order should keep the historic “what was bought”.
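
A minimal sketch of how snapshots could be taken when an order is created. It uses an in-memory Map as a hypothetical product catalog; buildOrder and orderTotal are illustrative names, not part of any MongoDB API. The point is that a later price change in the catalog leaves the stored order untouched.

```javascript
// Hypothetical product catalog, keyed by product _id.
const products = new Map([
  ["P1", { _id: "P1", name: "SSD 1TB", price: 109 }],
  ["P2", { _id: "P2", name: "USB-C Cable", price: 9 }],
]);

// Copies name and price into each line item at order time.
function buildOrder(customerId, cart) {
  return {
    customerId,
    createdAt: new Date(),
    items: cart.map(({ productId, qty }) => {
      const product = products.get(productId);
      if (!product) throw new Error(`Unknown product: ${productId}`);
      return {
        productId,
        nameSnapshot: product.name,
        priceSnapshot: product.price,
        qty,
      };
    }),
  };
}

// Totals are computed from the snapshots, never from the live catalog.
const orderTotal = (order) =>
  order.items.reduce((sum, item) => sum + item.priceSnapshot * item.qty, 0);

const order = buildOrder("C1", [
  { productId: "P1", qty: 1 },
  { productId: "P2", qty: 2 },
]);

products.get("P1").price = 129; // later price change in the catalog
console.log(orderTotal(order)); // 127, still the price at purchase time
```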

Query

db.orders.find({ customerId: ObjectId("C1") }, { items: 1, createdAt: 1 })

When it fits

  • Items are bounded (e.g. typical cart sizes)
  • You often need order + items together
  • You value atomicity of updating the order

Option C: Reference children in another collection (best for unbounded sets)

Example: a post can collect an unbounded number of comments.

// posts
{ _id: ObjectId("Post1"), title: "..." }
 
// comments
{ _id: ObjectId("Cmt1"), postId: ObjectId("Post1"), text: "Nice!" }
{ _id: ObjectId("Cmt2"), postId: ObjectId("Post1"), text: "Question: ..." }

Query

const post = db.posts.findOne({ _id: ObjectId("Post1") });
const comments = db.comments.find({ postId: post._id }).toArray();

Single query with $lookup

db.posts.aggregate([
  { $match: { _id: ObjectId("Post1") } },
  { $lookup: {
      from: "comments",
      localField: "_id",
      foreignField: "postId",
      as: "comments"
  }}
])

Performance note: index comments.postId, e.g. db.comments.createIndex({ postId: 1 }); without it every lookup has to scan the whole comments collection.


Option D: “Bucketing” / chunking (hybrid for large 1:n)

If a parent can have very many children, you can group them into fixed-size buckets so that no single document grows without bound:

// commentBuckets
{
  _id: ObjectId("B1"),
  postId: ObjectId("Post1"),
  bucketNo: 0,
  comments: [ { ... }, { ... }, ... ] // e.g. 200 comments per bucket
}
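
The bucket logic can be sketched in plain JavaScript as follows. This is an in-memory illustration only: addComment and bucketNoFor are hypothetical helpers, and a real implementation would perform the same decision inside MongoDB (one common way is an upsert, which is not shown here).

```javascript
const BUCKET_SIZE = 200; // matches the "e.g. 200 comments per bucket" above

// Appends a comment to the newest bucket of a post, or opens a new bucket
// when none exists yet or the newest one is full.
function addComment(buckets, postId, comment, bucketSize = BUCKET_SIZE) {
  const own = buckets.filter((b) => b.postId === postId);
  const last = own.reduce((a, b) => (a && a.bucketNo > b.bucketNo ? a : b), null);
  if (last && last.comments.length < bucketSize) {
    last.comments.push(comment);
    return last;
  }
  const bucket = { postId, bucketNo: last ? last.bucketNo + 1 : 0, comments: [comment] };
  buckets.push(bucket);
  return bucket;
}

// Bucket holding the n-th comment (0-based), assuming buckets are filled in order.
const bucketNoFor = (n, bucketSize = BUCKET_SIZE) => Math.floor(n / bucketSize);

// Usage with a tiny bucket size so the rollover is visible.
const buckets = [];
for (let i = 0; i < 5; i++) {
  addComment(buckets, "Post1", { text: `comment ${i}` }, 2);
}
console.log(buckets.map((b) => b.comments.length)); // [ 2, 2, 1 ]
```

Reading "page k" of the comments then means fetching the single bucket with bucketNo k instead of one large post document.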

4) Modeling m:n (many-to-many)

Example: students enroll in courses; courses have many students.

Option A: Arrays of references (simple, but can grow)

// students
{ _id: ObjectId("S1"), name: "Alina", courseIds: [ObjectId("C1"), ObjectId("C2")] }
 
// courses
{ _id: ObjectId("C1"), title: "Networks", studentIds: [ObjectId("S1"), ObjectId("S9")] }

Pros

  • Easy to read “student with courses” while the ID arrays stay small

Cons

  • Arrays can become large
  • Updates must maintain both sides (risk of inconsistency)
  • Queries like “who is enrolled?” can be heavy if arrays are huge
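
The inconsistency risk is easiest to see in code. The sketch below keeps both sides in sync on plain objects and adds a check that reports enrollments recorded on only one side. The names enroll and findMismatches and the course "Databases" are hypothetical; in MongoDB the two pushes would be two separate writes, so one could succeed without the other unless a transaction is used.

```javascript
// Adds the enrollment on BOTH sides; skips IDs that are already present
// (similar in spirit to $addToSet).
function enroll(student, course) {
  if (!student.courseIds.includes(course._id)) student.courseIds.push(course._id);
  if (!course.studentIds.includes(student._id)) course.studentIds.push(student._id);
}

// Lists enrollments that are recorded on only one side.
function findMismatches(students, courses) {
  const courseById = new Map(courses.map((c) => [c._id, c]));
  const studentById = new Map(students.map((s) => [s._id, s]));
  const problems = [];
  for (const s of students) {
    for (const cid of s.courseIds) {
      const c = courseById.get(cid);
      if (!c || !c.studentIds.includes(s._id)) {
        problems.push({ studentId: s._id, courseId: cid, missingOn: "course" });
      }
    }
  }
  for (const c of courses) {
    for (const sid of c.studentIds) {
      const s = studentById.get(sid);
      if (!s || !s.courseIds.includes(c._id)) {
        problems.push({ studentId: sid, courseId: c._id, missingOn: "student" });
      }
    }
  }
  return problems;
}

const students = [{ _id: "S1", name: "Alina", courseIds: ["C1"] }];
const courses = [
  { _id: "C1", title: "Networks", studentIds: ["S1", "S9"] }, // S9 has no student document here
  { _id: "C2", title: "Databases", studentIds: [] },
];

enroll(students[0], courses[1]); // S1 <-> C2, written on both sides
const problems = findMismatches(students, courses);
console.log(problems.length); // 1: S9 is listed in C1 but has no student side
```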

Query with $lookup

db.students.aggregate([
  { $match: { _id: ObjectId("S1") } },
  { $lookup: {
      from: "courses",
      localField: "courseIds",
      foreignField: "_id",
      as: "courses"
  }}
])

Option B: Join/bridge collection (most scalable)

Create an enrollments collection.

// students
{ _id: ObjectId("S1"), name: "Alina" }
 
// courses
{ _id: ObjectId("C1"), title: "Networks" }
 
// enrollments
{ _id: ObjectId("E1"), studentId: ObjectId("S1"), courseId: ObjectId("C1"), since: ISODate("2025-09-10") }

Pros

  • Scales to large sizes
  • Easy to query and index in both directions
  • You can store attributes on the relationship (grade, status, …)

Cons

  • Requires joins (multiple queries or $lookup)

Single query to get student + enrolled courses (via enrollments)

db.students.aggregate([
  { $match: { _id: ObjectId("S1") } },
  { $lookup: { from: "enrollments", localField: "_id", foreignField: "studentId", as: "enrollments" } },
  { $lookup: { from: "courses", localField: "enrollments.courseId", foreignField: "_id", as: "courses" } },
  { $project: { name: 1, courses: { title: 1, teacherId: 1 } } } // teacherId is returned only if course documents have that field
])

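Indexing tip: index both foreign-key fields so the join is cheap in either direction.

  • enrollments.studentId (student → courses)
  • enrollments.courseId (course → students)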
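
In mongosh the two indexes would be created with db.enrollments.createIndex({ studentId: 1 }) and db.enrollments.createIndex({ courseId: 1 }). Why this makes both directions cheap can be shown with a rough in-memory analogue, where a Map plays the role of an index. buildIndex is a hypothetical helper, and real MongoDB indexes are B-tree based rather than hash maps, so this is only a conceptual sketch.

```javascript
// Builds a lookup table from one field's value to the documents having it,
// a rough in-memory analogue of a single-field index.
function buildIndex(docs, field) {
  const index = new Map();
  for (const doc of docs) {
    const key = doc[field];
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(doc);
  }
  return index;
}

const enrollments = [
  { _id: "E1", studentId: "S1", courseId: "C1" },
  { _id: "E2", studentId: "S1", courseId: "C2" },
  { _id: "E3", studentId: "S9", courseId: "C1" },
];

const byStudent = buildIndex(enrollments, "studentId");
const byCourse = buildIndex(enrollments, "courseId");

// Both directions are direct lookups instead of scans over all enrollments.
const coursesOfS1 = (byStudent.get("S1") ?? []).map((e) => e.courseId);
const studentsOfC1 = (byCourse.get("C1") ?? []).map((e) => e.studentId);
console.log(coursesOfS1, studentsOfC1); // C1, C2 and S1, S9
```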

5) Querying across collections: multiple queries vs single query

A) Multiple queries (application-side join)

Pattern

  1. Query the parent and extract the IDs
  2. Query the children by those IDs

// sid: the _id of the student to load
const student = db.students.findOne({ _id: sid });
const enrolls = db.enrollments.find({ studentId: sid }).toArray();
const courseIds = enrolls.map(e => e.courseId);
const courses = db.courses.find({ _id: { $in: courseIds } }).toArray();
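
One detail of the application-side join: a query with $in does not return documents in the order of the ID list, and IDs without a matching document are simply absent from the result. The following sketch shows how the application could put the fetched documents back into request order and report the missing IDs. resolveInOrder is a hypothetical helper that works on plain arrays, and the sample course data is made up.

```javascript
// Re-associates fetched documents with the requested IDs, in request order.
// IDs without a matching document are reported separately instead of being
// silently dropped. String(...) lets the same code work with ObjectId values.
function resolveInOrder(ids, docs) {
  const byId = new Map(docs.map((d) => [String(d._id), d]));
  const found = [];
  const missing = [];
  for (const id of ids) {
    const doc = byId.get(String(id));
    if (doc) found.push(doc);
    else missing.push(id);
  }
  return { found, missing };
}

// courseIds as extracted from the enrollments; C7 has no course document.
const requestedIds = ["C2", "C1", "C7"];
// Documents as the database might return them (arbitrary order).
const fetched = [
  { _id: "C1", title: "Networks" },
  { _id: "C2", title: "Databases" },
];

const { found, missing } = resolveInOrder(requestedIds, fetched);
console.log(found.map((c) => c.title), missing); // Databases, Networks and C7
```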

Pros

  • Sometimes simpler to understand
  • You can cache intermediate results

Cons

  • More round-trips (network latency)
  • Harder to read a consistent snapshot across the queries unless you use a transaction or a suitable read concern

B) Single query with aggregation $lookup

Pattern: do the join inside MongoDB.

db.students.aggregate([
  { $match: { _id: sid } },
  { $lookup: { from: "enrollments", localField: "_id", foreignField: "studentId", as: "enrollments" } },
  { $lookup: { from: "courses", localField: "enrollments.courseId", foreignField: "_id", as: "courses" } }
])

Pros

  • One round-trip
  • MongoDB does the joining work
  • Can filter/project early in the pipeline

Cons

  • $lookup can be expensive if not indexed or if it brings lots of data
  • Pipeline can become complex

6) How schema design affects performance

Reads

  • Embedded: often fastest for “read everything together”
  • Referenced: can be slower unless you query selectively or join efficiently

Writes / updates

  • Embedded + duplication: may require updating many documents (fan-out)
  • References: update once in the owning collection, but reads may need joins
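
The write-cost difference can be counted directly. This sketch assumes a hypothetical blog where the author's name is either copied into every post (embedded duplication) or stored once in an authors collection (reference). It runs on plain arrays; in MongoDB the first case would correspond to an updateMany over the posts, the second to a single updateOne.

```javascript
// Embedded duplication: the author's name is copied into every post.
function renameAuthorEmbedded(posts, authorId, newName) {
  let written = 0;
  for (const post of posts) {
    if (post.author.id === authorId) {
      post.author.name = newName;
      written += 1;
    }
  }
  return written; // number of documents touched
}

// Referenced: the name lives only in the author's own document.
function renameAuthorReferenced(authors, authorId, newName) {
  const author = authors.find((a) => a._id === authorId);
  if (!author) return 0;
  author.name = newName;
  return 1;
}

const embeddedPosts = [
  { _id: "Post1", author: { id: "A1", name: "Alina" } },
  { _id: "Post2", author: { id: "A1", name: "Alina" } },
  { _id: "Post3", author: { id: "A1", name: "Alina" } },
  { _id: "Post4", author: { id: "A2", name: "Ben" } },
];
const authors = [{ _id: "A1", name: "Alina" }, { _id: "A2", name: "Ben" }];

const fanOutWrites = renameAuthorEmbedded(embeddedPosts, "A1", "Alina B.");
const singleWrite = renameAuthorReferenced(authors, "A1", "Alina B.");
console.log(fanOutWrites, singleWrite); // 3 1
```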

Working set / memory

  • Embedding lots of rarely used fields can make documents “heavy”, increasing I/O.
  • Referencing can keep documents smaller so more fit in RAM.

Index usage

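  • $lookup becomes much cheaper when the foreignField in the joined collection is indexed (e.g. enrollments.studentId when joining from students, enrollments.courseId when joining from courses).
  • For embedded arrays, you can index nested fields, e.g. {"items.productId": 1}; MongoDB then builds a multikey index with one entry per array element.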

Big red flags

  • Unbounded arrays inside one document (e.g. “all comments ever” in one post)
  • Large documents when most queries only need a small subset of fields
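
A back-of-the-envelope estimate shows why unbounded arrays are a red flag. The sketch below approximates document size by the byte length of the JSON text, which is an assumption: BSON sizes differ somewhat, so the result is only an order of magnitude. estimateMaxChildren is a hypothetical helper for Node.js (it uses Buffer).

```javascript
const MAX_DOC_BYTES = 16 * 1024 * 1024; // MongoDB's 16 MB document limit

// Rough size proxy: bytes of the JSON text. BSON sizes differ somewhat,
// so treat the result as an order-of-magnitude estimate only.
const approxBytes = (value) => Buffer.byteLength(JSON.stringify(value), "utf8");

// Estimates how many copies of `sampleChild` fit into one parent document.
function estimateMaxChildren(parentWithoutChildren, sampleChild) {
  const free = MAX_DOC_BYTES - approxBytes(parentWithoutChildren);
  return Math.floor(free / (approxBytes(sampleChild) + 1)); // +1 for the separating comma
}

// A post whose comments each carry about 200 characters of text.
const maxComments = estimateMaxChildren(
  { _id: "Post1", title: "x" },
  { text: "a".repeat(200) }
);
console.log(maxComments); // on the order of 79,000 under these assumptions
```

Long before the hard limit is reached, every read and write of the post has to move the whole array, which is the practical reason to reference or bucket such children.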

7) Mixing approaches (very common in real systems)

Many real schemas use both:

  • Reference for core entities (Users, Products, Courses)
  • Embed small/bounded subdocuments (addresses, settings, line items)
  • Store snapshots of important fields even if you reference the full entity

Example: enrollment references courseId, but also stores courseTitleSnapshot:

{ studentId, courseId, courseTitleSnapshot: "Networks 1", since: ... }

This is intentional duplication to speed up common queries and preserve history.


8) When duplication is desirable

Embedding or storing snapshots duplicates data. That’s not always “bad”; it can be correct:

  • History / auditing: orders keep product name/price at purchase time
  • Performance: avoid joining to fetch a frequently used label
  • Availability: your “read model” keeps working even if the related data later changes or is deleted

Rule of thumb:

  • Duplicate stable fields or historic snapshots
  • Reference mutable fields you must keep consistent everywhere

9) Tooling you should practice

  • find(), distinct(), sort(), limit()
  • Indexing: createIndex()
  • explain("executionStats") to compare plans and performance
  • Aggregation pipeline:
    • $match, $project, $group, $sort, $limit
    • $lookup, $unwind, $addFields
    • (optional) $facet for multi-result pipelines

10) Mini checklist for choosing embed vs reference

Embed when:

  • Bounded size
  • Read together often
  • Need atomic update

Reference when:

  • Child count grows without bound
  • Need to query/update children independently
  • Want a single source of truth

Mix when:

  • You want reference for correctness, plus snapshots/embedded for speed/history