The Query You Actually Want to Write

You need to group documents by date, filter out inactive users, compute an average per group, and sort by that average. One query. That is what the aggregation pipeline is for.

If your application pulls thousands of documents over the network just to filter and sum them in JavaScript, you are doing the database's job in your application code. The aggregation pipeline pushes that work to MongoDB itself. The performance difference is not 2x. It is often 50x or more.

This article covers the three stages everyone uses ($match, $group, $sort), then the three that are actually interesting ($lookup, $unwind, $facet). Plus $setWindowFields, because window functions are absurdly useful once you know they exist.

Pipeline Basics: How Stages Flow

Array of stage objects. Pass it to aggregate(). Output of stage one becomes input of stage two.

JavaScript (MongoDB Shell)
// A basic aggregation pipeline with two stages
db.orders.aggregate([
  // Stage 1: Filter documents
  { $match: { status: "completed" } },
  // Stage 2: Shape the output
  { $project: {
    customerName: 1,
    total: 1,
    orderDate: 1,
    _id: 0
  }}
]);

Order matters. $match first means fewer documents for every downstream stage. Stages can repeat -- $match early, group, $match again on the grouped results. And the pipeline never modifies your collection. Returns a cursor, same as find().
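Repeating $match after $group works like SQL's HAVING clause. A sketch, assuming the same orders collection: filter completed orders, group by customer, then $match again to keep only customers who spent more than 1000.

JavaScript
db.orders.aggregate([
  { $match: { status: "completed" } },
  { $group: {
    _id: "$customerId",
    totalSpent: { $sum: "$total" }
  }},
  // A second $match, now filtering the grouped results
  { $match: { totalSpent: { $gt: 1000 } } }
]);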

Filtering with $match and $project

$match uses the same query syntax as find(). Same operators. Same patterns. Nothing new to learn.

$project is where it gets interesting. Not just showing and hiding fields -- computed fields, renames, inline expressions. Here is a products collection query that filters in-stock electronics and adds a formatted price label:

JavaScript
db.products.aggregate([
  // Only in-stock electronics
  { $match: {
    category: "Electronics",
    stock: { $gt: 0 }
  }},
  // Reshape the output with computed fields
  { $project: {
    _id: 0,
    name: 1,
    priceLabel: {
      // A bare "$" string would be parsed as a field path;
      // $literal keeps it a plain string
      $concat: [{ $literal: "$" }, { $toString: "$price" }]
    },
    discountedPrice: {
      $round: [{ $multiply: ["$price", 0.9] }, 2]
    },
    inStock: "$stock"
  }}
]);

priceLabel concatenates a dollar sign with the price. discountedPrice applies a 10% discount and rounds to two decimals. All inside the database. Your application code receives the finished result.

Want computed fields without dropping existing ones? Use $addFields (aliased as $set) instead of $project. Keeps every field, appends the new ones.
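A sketch of the same discount computation with $addFields, reusing the field names from the example above:

JavaScript
db.products.aggregate([
  { $match: { category: "Electronics" } },
  // Every existing field survives; discountedPrice is appended
  { $addFields: {
    discountedPrice: {
      $round: [{ $multiply: ["$price", 0.9] }, 2]
    }
  }}
]);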

Grouping and Accumulating with $group

$group is SQL's GROUP BY with SUM(), AVG(), and COUNT() built in. Here is a monthly sales summary:

JavaScript
db.orders.aggregate([
  { $match: {
    status: "completed",
    orderDate: {
      $gte: ISODate("2026-01-01"),
      $lt: ISODate("2027-01-01")
    }
  }},
  { $group: {
    _id: {
      year: { $year: "$orderDate" },
      month: { $month: "$orderDate" }
    },
    totalRevenue: { $sum: "$total" },
    averageOrderValue: { $avg: "$total" },
    orderCount: { $sum: 1 },
    largestOrder: { $max: "$total" }
  }},
  // Sort by year and month ascending
  { $sort: { "_id.year": 1, "_id.month": 1 } }
]);

Accumulators worth knowing: $sum (set to 1 to count documents), $avg, $min/$max, $first/$last (useful after a $sort), $push (collects into array), $addToSet (unique values only).

_id: null treats everything as one group. Global totals across whatever survived the $match.
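For example, a single grand total over all completed orders (a sketch against the same orders collection):

JavaScript
db.orders.aggregate([
  { $match: { status: "completed" } },
  // _id: null collapses everything into one group
  { $group: {
    _id: null,
    totalRevenue: { $sum: "$total" },
    orderCount: { $sum: 1 }
  }}
]);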

Joining Collections with $lookup

$lookup is basically a join. But worse in specific ways.

It is a left outer join only, and it only works within the same database -- no cross-database joins. The syntax requires four fields and you will forget which is which for the first dozen times: from (the other collection), localField (the field in your current documents), foreignField (the matching field in the foreign collection), as (the name for the resulting array). Yes, array -- it always produces an array, even for one-to-one relationships. You almost always need $unwind right after it to flatten that array into a single object.

But be careful. If a reference is broken and the array is empty, $unwind drops the entire document. Add preserveNullAndEmptyArrays: true if you need to keep unmatched documents.
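In the long form of $unwind, the path becomes one option among several. A sketch of the stage with unmatched documents preserved:

JavaScript
// Keep documents even when customerInfo is empty or missing
{ $unwind: {
  path: "$customerInfo",
  preserveNullAndEmptyArrays: true
}}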

Orders enriched with customer name and email:

JavaScript
db.orders.aggregate([
  { $match: { status: "completed" } },
  // Join with the customers collection
  { $lookup: {
    from: "customers",
    localField: "customerId",
    foreignField: "_id",
    as: "customerInfo"
  }},
  // $lookup returns an array; unwind to get a single object
  { $unwind: "$customerInfo" },
  // Clean up the output
  { $project: {
    orderId: "$_id",
    total: 1,
    orderDate: 1,
    customerName: "$customerInfo.name",
    customerEmail: "$customerInfo.email",
    _id: 0
  }}
]);

For more complex join logic, use the pipeline form of $lookup. Instead of simple field matching, you define a sub-pipeline that filters, projects, and sorts the joined documents before they get embedded:

JavaScript
// Pipeline form of $lookup for complex joins
db.orders.aggregate([
  { $lookup: {
    from: "reviews",
    let: { orderId: "$_id" },
    pipeline: [
      { $match: {
        $expr: {
          $and: [
            { $eq: ["$orderId", "$$orderId"] },
            { $gte: ["$rating", 4] }
          ]
        }
      }},
      { $project: { rating: 1, comment: 1, _id: 0 } }
    ],
    as: "positiveReviews"
  }}
]);

let passes variables from the outer pipeline into the sub-pipeline. Especially useful when the foreign collection is large -- narrow the join results before they get embedded instead of pulling every matching document and filtering afterward. On a collection with millions of reviews, the difference between the simple form and the pipeline form can be the difference between a query that finishes and one that times out.

Array Operations: $unwind, $addToSet, $push

MongoDB documents are full of arrays. Order items, tags, roles, activity logs.

$unwind explodes an array field into one document per element. Five items in the array? Five documents out. Then you can group, sort, and aggregate on individual elements.

Top-selling products across all orders, starting from a lineItems array:

JavaScript
db.orders.aggregate([
  { $match: { status: "completed" } },
  // Flatten the lineItems array
  { $unwind: "$lineItems" },
  // Group by product and calculate totals
  { $group: {
    _id: "$lineItems.productId",
    productName: { $first: "$lineItems.name" },
    totalQuantity: { $sum: "$lineItems.quantity" },
    totalRevenue: {
      $sum: {
        $multiply: ["$lineItems.price", "$lineItems.quantity"]
      }
    },
    orderIds: { $addToSet: "$_id" }
  }},
  // Sort by revenue descending
  { $sort: { totalRevenue: -1 } },
  // Top 10 products
  { $limit: 10 }
]);

$push vs $addToSet: push keeps duplicates, addToSet does not. Use addToSet for distinct order IDs, unique tags. Use push when duplicates matter -- timestamps of repeated events, for instance.
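The difference is easy to see outside the database. A plain Node-style JavaScript sketch of the two accumulators' semantics (an illustration, not MongoDB code):

```javascript
// Simulate $push and $addToSet over a stream of values
function push(values) {
  // $push: every value is appended, duplicates included
  return values.reduce((acc, v) => [...acc, v], []);
}

function addToSet(values) {
  // $addToSet: duplicates are dropped
  return [...new Set(values)];
}

const orderIds = ["A1", "B2", "A1", "C3", "B2"];
console.log(push(orderIds));     // ["A1", "B2", "A1", "C3", "B2"]
console.log(addToSet(orderIds)); // ["A1", "B2", "C3"]
```

One caveat: MongoDB does not guarantee element order in $addToSet output; the Set-based sketch happens to preserve insertion order.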

Sometimes you do not want to flatten an array at all. $filter trims array elements in place:

JavaScript
// Keep only line items above $50 without unwinding
db.orders.aggregate([
  { $project: {
    customerName: 1,
    orderDate: 1,
    expensiveItems: {
      $filter: {
        input: "$lineItems",
        as: "item",
        cond: { $gt: ["$$item.price", 50] }
      }
    }
  }}
]);

Document structure stays intact. Fewer elements in the array, nothing else changes.

Window Functions and $setWindowFields

$setWindowFields arrived in MongoDB 5.0. The key difference from $group: window functions keep every document. They add computed fields without collapsing rows.

Running totals, moving averages, rankings. All without losing individual documents.

JavaScript
db.orders.aggregate([
  { $match: { status: "completed" } },
  { $setWindowFields: {
    partitionBy: "$customerId",
    sortBy: { orderDate: 1 },
    output: {
      runningTotal: {
        $sum: "$total",
        window: { documents: ["unbounded", "current"] }
      },
      orderRank: { $rank: {} },
      movingAvg: {
        $avg: "$total",
        window: { documents: [-2, 0] }
      }
    }
  }}
]);

partitionBy groups by customer. Within each partition, documents sort by date.

runningTotal sums from the first document to the current one. Cumulative total per customer. orderRank assigns a rank within each partition. movingAvg averages the current document and the two before it -- a 3-order moving average from the [-2, 0] window.

Every original document survives. New fields just appear alongside the existing ones. And if you have been reaching for application code to compute running totals or rolling averages, stop. The database does this faster, and the logic stays where the data lives instead of scattered across your service layer.
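The window bounds can be sketched in plain JavaScript -- an illustration of what ["unbounded", "current"] and [-2, 0] mean, not MongoDB code:

```javascript
// Totals for one customer's orders, already sorted by date
const totals = [100, 50, 200, 25];

// ["unbounded", "current"]: sum from the first document to this one
const runningTotal = totals.map(
  (_, i) => totals.slice(0, i + 1).reduce((a, b) => a + b, 0)
);

// [-2, 0]: average of this document and up to two before it
const movingAvg = totals.map((_, i) => {
  const windowVals = totals.slice(Math.max(0, i - 2), i + 1);
  return windowVals.reduce((a, b) => a + b, 0) / windowVals.length;
});

console.log(runningTotal); // [100, 150, 350, 375]
console.log(movingAvg);    // [100, 75, ~116.67, ~91.67]
```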

Performance Optimization

A correct pipeline that takes 30 seconds is not finished. Works fine on test data, falls apart on production volumes. Every time.

Index Your First $match

MongoDB can only use indexes on the initial $match and sometimes a following $sort. Everything after that is in-memory. If your first stage filters on { status: "completed", orderDate: { $gte: ... } }, you need a compound index on { status: 1, orderDate: 1 }. Verify with .explain("executionStats"). No index on the first $match means a full collection scan before the pipeline even starts.

Limit Early, Project Early

Do not push a million documents through $group and $lookup only to take the top 10 at the end. Move $limit up. Use $project early to drop fields. Fewer fields means less memory at every stage.

Enable allowDiskUse

100 MB of RAM per pipeline stage. Exceed that on a $group or $sort and the operation fails outright. For large datasets:

JavaScript
// Allow disk use for large aggregations
db.orders.aggregate([
  { $match: { status: "completed" } },
  { $group: {
    _id: "$productCategory",
    totalRevenue: { $sum: "$total" },
    allCustomers: { $addToSet: "$customerId" }
  }},
  { $sort: { totalRevenue: -1 } }
], { allowDiskUse: true });

Index the $lookup Foreign Field

$lookup on an unindexed foreign collection is a collection scan per document. Millions of documents, no index? Every single join does a full scan. Always index the foreignField. This is the single most common performance mistake with aggregation pipelines.

Materialized Views with $merge

Same expensive pipeline running on every dashboard load? Write the results to a separate collection with $merge. Query that collection directly. Rerun the pipeline when source data changes. $out works too but replaces the entire output collection. $merge can upsert.
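A sketch of the pattern, writing monthly revenue into a target collection (the monthlySales name is an assumption):

JavaScript
db.orders.aggregate([
  { $match: { status: "completed" } },
  { $group: {
    _id: { $dateToString: { format: "%Y-%m", date: "$orderDate" } },
    totalRevenue: { $sum: "$total" }
  }},
  // Upsert into monthlySales instead of returning a cursor
  { $merge: {
    into: "monthlySales",
    on: "_id",
    whenMatched: "replace",
    whenNotMatched: "insert"
  }}
]);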

Profile First

Run .explain("executionStats") on the pipeline. Check executionTimeMillis and totalDocsExamined. If docs examined far exceeds docs returned, you need a better index or an earlier $match. Do not guess. The explain output tells you exactly which stages are expensive.

$facet: Multiple Pipelines in One Query

$facet runs multiple sub-pipelines against the same set of input documents, all inside one stage. That is it.

You define named facets, each containing its own pipeline stages. One facet computes total revenue. Another computes top products. A third computes customer counts by region. All of them run against the same input documents, all in a single round trip to the database. The result is one document with one field per facet.

Most teams discover $facet after building three separate aggregation queries for a dashboard and realizing they could be one. If your API handler runs multiple aggregations against the same collection with overlapping $match conditions, $facet collapses them into a single query. Network round trips drop. Index scans drop. The query planner only reads the collection once.
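A sketch of a dashboard query collapsed into a single $facet, against the orders collection from earlier (the facet names and the region field are assumptions):

JavaScript
db.orders.aggregate([
  { $match: { status: "completed" } },
  { $facet: {
    // Facet 1: overall revenue
    revenue: [
      { $group: { _id: null, total: { $sum: "$total" } } }
    ],
    // Facet 2: top five products by quantity sold
    topProducts: [
      { $unwind: "$lineItems" },
      { $group: {
        _id: "$lineItems.productId",
        quantity: { $sum: "$lineItems.quantity" }
      }},
      { $sort: { quantity: -1 } },
      { $limit: 5 }
    ],
    // Facet 3: order counts by region
    byRegion: [
      { $group: { _id: "$region", orders: { $sum: 1 } } }
    ]
  }}
]);

The result is one document with revenue, topProducts, and byRegion fields, each holding that facet's output.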

Anurag Sinha

Full Stack Developer & Technical Writer

Anurag is a full stack developer and technical writer. He covers web technologies, backend systems, and developer tools for the Codertronix community.