Lecture 7: Aggregation operations in MongoDB#

Learning objectives#

By the end of this lecture, students should be able to:

  • Define aggregation and the components of an aggregation pipeline.

  • Build an aggregation pipeline that uses the $match and $group stages.

  • Build an aggregation pipeline that uses the $sort and $project stages.

Connect to MongoDB database#

from pymongo import MongoClient # import mongo client to connect
import json # import json to load credentials
import urllib.parse

# load credentials from json file
with open('credentials_mongodb.json') as f:
    login = json.load(f)

# assign credentials to variables
username = login['username']
password = urllib.parse.quote(login['password'])
host = login['host']
url = "mongodb+srv://{}:{}@{}/?retryWrites=true&w=majority".format(username, password, host)

# connect to the database
client = MongoClient(url)

# drop the library database if it exists
client.drop_database('library')
# create database and collection
db = client["library"]
collection = db["books"]
# Sample book data
sample_books = [
    {"title": "To Kill a Mockingbird", "author": "Harper Lee", "pages": 281, "year": 1960},
    {"title": "Go Set a Watchman", "author": "Harper Lee", "pages": 278, "year": 2015},
    {"title": "Mockingbird Songs: My Friendship with Harper Lee", "author": "Harper Lee", "pages": 288, "year": 2017},
    
    {"title": "1984", "author": "George Orwell", "pages": 328, "year": 1949},
    {"title": "Animal Farm", "author": "George Orwell", "pages": 112, "year": 1945},
    
    {"title": "Pride and Prejudice", "author": "Jane Austen", "pages": 279, "year": 1813},
    {"title": "Sense and Sensibility", "author": "Jane Austen", "pages": 226, "year": 1811},
    {"title": "Emma", "author": "Jane Austen", "pages": 474, "year": 1815},
    
    {"title": "The Great Gatsby", "author": "F. Scott Fitzgerald", "pages": 180, "year": 1925},
    {"title": "Tender Is the Night", "author": "F. Scott Fitzgerald", "pages": 317, "year": 1934},
    {"title": "This Side of Paradise", "author": "F. Scott Fitzgerald", "pages": 305, "year": 1920},
    
    {"title": "Bartleby, the Scrivener", "author": "Herman Melville", "pages": 64, "year": 1853},
    {"title": "Billy Budd, Sailor", "author": "Herman Melville", "pages": 192, "year": 1924},
]

# Insert sample data
collection.insert_many(sample_books)
InsertManyResult([ObjectId('66eb9baaddbe7989b91b279a'), ObjectId('66eb9baaddbe7989b91b279b'), ObjectId('66eb9baaddbe7989b91b279c'), ObjectId('66eb9baaddbe7989b91b279d'), ObjectId('66eb9baaddbe7989b91b279e'), ObjectId('66eb9baaddbe7989b91b279f'), ObjectId('66eb9baaddbe7989b91b27a0'), ObjectId('66eb9baaddbe7989b91b27a1'), ObjectId('66eb9baaddbe7989b91b27a2'), ObjectId('66eb9baaddbe7989b91b27a3'), ObjectId('66eb9baaddbe7989b91b27a4'), ObjectId('66eb9baaddbe7989b91b27a5'), ObjectId('66eb9baaddbe7989b91b27a6')], acknowledged=True)

MongoDB Aggregation#

Introduction#

MongoDB’s aggregation framework is a powerful tool for performing data processing and analysis directly within the database. It allows you to transform and combine documents in a collection to produce aggregated results. PyMongo, the Python driver for MongoDB, provides a way to interact with MongoDB’s aggregation framework.

Aggregation Pipeline#

The aggregation pipeline is a framework for data aggregation modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms the documents into aggregated results.

The syntax for an aggregation pipeline in MongoDB is a sequence of stages, where each stage transforms the documents as they pass through the pipeline. Each stage is represented as a document in an array, and the stages are processed in the order they appear.

General Syntax#

[
    { <stage1>: { <stage-specific-operator1>: <expression1>, ... } },
    { <stage2>: { <stage-specific-operator2>: <expression2>, ... } },
    ...
]

Common Stages#

  1. $match: Filters documents.

  2. $group: Groups documents by a specified key.

  3. $sort: Sorts documents.

  4. $project: Reshapes documents by including, excluding, or adding fields.

  5. $limit: Limits the number of documents.

  6. $skip: Skips a specified number of documents.

  7. $unwind: Deconstructs an array field from the input documents to output a document for each element.
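
Most of these stages appear in the examples later in this lecture. $skip gets less attention, so here is a minimal sketch (run against the books collection created above) that combines $sort, $skip, and $limit for a simple paging pattern.

# A minimal paging sketch: sort by year, skip the first 5 books,
# then keep only the next 3. $sort comes first so the paging is deterministic.
pipeline = [
    {"$sort": {"year": 1}},   # oldest books first
    {"$skip": 5},             # skip the first 5 documents
    {"$limit": 3}             # keep only the next 3 documents
]

for doc in collection.aggregate(pipeline):
    print(doc)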

Example Pipeline#

Let’s create an example pipeline that:

  1. Filters books with more than 150 pages.

  2. Groups the books by author and calculates the total number of pages for each author.

  3. Sorts the authors by the total number of pages in descending order.

  4. Projects the author and total pages fields, excluding the _id field.

# Define each stage separately
match_stage = {
    "$match": {
        "pages": { "$gt": 150 }
    }
}

group_stage = {
    "$group": {
        "_id": "$author",
        "totalPages": { "$sum": "$pages" }
    }
}

sort_stage = {
    "$sort": {
        "totalPages": -1
    }
}

project_stage = {
    "$project": {
        "_id": 0,
        "author": "$_id",
        "totalPages": 1
    }
}

# Combine stages into a pipeline
pipeline = [
    match_stage,
    group_stage,
    sort_stage,
    project_stage
]

# Execute the pipeline
result = list(collection.aggregate(pipeline))
for doc in result:
    print(doc)
{'totalPages': 979, 'author': 'Jane Austen'}
{'totalPages': 847, 'author': 'Harper Lee'}
{'totalPages': 802, 'author': 'F. Scott Fitzgerald'}
{'totalPages': 328, 'author': 'George Orwell'}
{'totalPages': 192, 'author': 'Herman Melville'}

Here is a table comparing SQL clauses (SELECT, WHERE, GROUP BY, ORDER BY) with their MongoDB aggregation framework equivalents.

| SQL Clause | MongoDB Aggregation Stage | Description |
|------------|---------------------------|-------------|
| SELECT     | $project                  | Specifies the fields to include or exclude in the output documents. |
| WHERE      | $match                    | Filters the documents to pass only those that match the specified condition(s). |
| GROUP BY   | $group                    | Groups input documents by a specified identifier expression and applies the accumulator expressions. |
| ORDER BY   | $sort                     | Sorts all input documents and returns them in the specified order. |

Explanation of MongoDB Stages#

  • $match: Filters books whose pages value is greater than 150.

  • $group: Groups the books by author and calculates the total number of pages for each author.

  • $sort: Sorts the grouped documents by totalPages in descending order.

  • $project: Projects the author and totalPages fields, excluding the _id field.

This table and the example pipeline above illustrate how common SQL clauses map to MongoDB aggregation stages, highlighting the similarities and differences between SQL and MongoDB queries.

This pipeline demonstrates how to use multiple stages to filter, group, sort, and project documents in MongoDB using the aggregation framework.
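
To make the mapping concrete, here is a hedged sketch of the same books query written both ways. The SQL in the comment is only an illustration of what an equivalent relational query might look like (assuming a books table with author and pages columns); the pipeline below it is the MongoDB version already shown above.

# Illustrative SQL equivalent (assumes a relational table named books):
#   SELECT author, SUM(pages) AS totalPages
#   FROM books
#   WHERE pages > 150
#   GROUP BY author
#   ORDER BY totalPages DESC;
pipeline = [
    {"$match": {"pages": {"$gt": 150}}},                                # WHERE
    {"$group": {"_id": "$author", "totalPages": {"$sum": "$pages"}}},   # GROUP BY + SUM
    {"$sort": {"totalPages": -1}},                                      # ORDER BY ... DESC
    {"$project": {"_id": 0, "author": "$_id", "totalPages": 1}}         # SELECT
]

for doc in collection.aggregate(pipeline):
    print(doc)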

Key Aggregation Operators#

  1. $match

  2. $group

  3. $sort

  4. $project

1. $match#

The $match stage filters documents to pass only the documents that match the specified condition(s) to the next pipeline stage.

Syntax:

{
    "$match": { <query> }
}

Example: Filter books that have more than 250 pages.

First, recall that we could use the find() function to filter the data, as shown in the previous lecture.

for doc in collection.find({"pages": {"$gt": 250}}):
    print(doc)
{'_id': ObjectId('66eb9baaddbe7989b91b279a'), 'title': 'To Kill a Mockingbird', 'author': 'Harper Lee', 'pages': 281, 'year': 1960}
{'_id': ObjectId('66eb9baaddbe7989b91b279b'), 'title': 'Go Set a Watchman', 'author': 'Harper Lee', 'pages': 278, 'year': 2015}
{'_id': ObjectId('66eb9baaddbe7989b91b279c'), 'title': 'Mockingbird Songs: My Friendship with Harper Lee', 'author': 'Harper Lee', 'pages': 288, 'year': 2017}
{'_id': ObjectId('66eb9baaddbe7989b91b279d'), 'title': '1984', 'author': 'George Orwell', 'pages': 328, 'year': 1949}
{'_id': ObjectId('66eb9baaddbe7989b91b279f'), 'title': 'Pride and Prejudice', 'author': 'Jane Austen', 'pages': 279, 'year': 1813}
{'_id': ObjectId('66eb9baaddbe7989b91b27a1'), 'title': 'Emma', 'author': 'Jane Austen', 'pages': 474, 'year': 1815}
{'_id': ObjectId('66eb9baaddbe7989b91b27a3'), 'title': 'Tender Is the Night', 'author': 'F. Scott Fitzgerald', 'pages': 317, 'year': 1934}
{'_id': ObjectId('66eb9baaddbe7989b91b27a4'), 'title': 'This Side of Paradise', 'author': 'F. Scott Fitzgerald', 'pages': 305, 'year': 1920}

Now let's use the $match stage within an aggregation pipeline.

pipeline = [
    {
        "$match": {
            "pages": { "$gt": 250 }
        }
    }
]

result = list(collection.aggregate(pipeline))
for doc in result:
    print(doc)
{'_id': ObjectId('66eb9baaddbe7989b91b279a'), 'title': 'To Kill a Mockingbird', 'author': 'Harper Lee', 'pages': 281, 'year': 1960}
{'_id': ObjectId('66eb9baaddbe7989b91b279b'), 'title': 'Go Set a Watchman', 'author': 'Harper Lee', 'pages': 278, 'year': 2015}
{'_id': ObjectId('66eb9baaddbe7989b91b279c'), 'title': 'Mockingbird Songs: My Friendship with Harper Lee', 'author': 'Harper Lee', 'pages': 288, 'year': 2017}
{'_id': ObjectId('66eb9baaddbe7989b91b279d'), 'title': '1984', 'author': 'George Orwell', 'pages': 328, 'year': 1949}
{'_id': ObjectId('66eb9baaddbe7989b91b279f'), 'title': 'Pride and Prejudice', 'author': 'Jane Austen', 'pages': 279, 'year': 1813}
{'_id': ObjectId('66eb9baaddbe7989b91b27a1'), 'title': 'Emma', 'author': 'Jane Austen', 'pages': 474, 'year': 1815}
{'_id': ObjectId('66eb9baaddbe7989b91b27a3'), 'title': 'Tender Is the Night', 'author': 'F. Scott Fitzgerald', 'pages': 317, 'year': 1934}
{'_id': ObjectId('66eb9baaddbe7989b91b27a4'), 'title': 'This Side of Paradise', 'author': 'F. Scott Fitzgerald', 'pages': 305, 'year': 1920}

Note

The primary difference between find and $match in MongoDB lies in their usage and context within the MongoDB query language:

find

  • Context: Used as a standalone query method.

  • Purpose: Retrieves documents from a collection that match the specified query criteria.

  • Usage: Directly used on a collection to fetch documents.

  • Example:

    result = collection.find({"pages": {"$gt": 250}})
    for doc in result:
        print(doc)
    
    
    

$match

  • Context: Used within an aggregation pipeline.

  • Purpose: Filters documents to pass only those that match the specified condition(s) to the next stage in the pipeline.

  • Usage: Part of the aggregation framework, typically used in conjunction with other stages like $group, $sort, $project, etc.

  • Example:

    pipeline = [
        {"$match": {"pages": {"$gt": 250}}}
    ]
    result = collection.aggregate(pipeline)
    for doc in result:
        print(doc)
    
    
    

In summary, use find for straightforward document retrieval and $match within an aggregation pipeline for more complex data processing tasks.

2. $group#

The $group stage groups input documents by the specified _id expression and for each distinct grouping, outputs a document. The output documents can also contain computed fields that hold the values of some accumulator expressions.

Syntax:

{
    "$group": {
        "_id": <expression>,
        <field1>: { <accumulator1> : <expression1> },
        ...
    }
}

Explanation#

  • $group: This is the stage operator that specifies the grouping operation.

  • _id: This field specifies the key by which to group the documents. The value of _id can be any valid expression that resolves to a value. Documents with the same _id value are grouped together. If you want to group all documents together, you can set _id to null.

  • <field1>: This is the name of the field in the output documents that will hold the result of the aggregation. You can specify multiple fields to perform different aggregations on the grouped data.

  • <accumulator1>: This specifies the accumulator operator to use for the aggregation. Common accumulator operators include:

    • $sum: Calculates the sum of numeric values.

    • $avg: Calculates the average of numeric values.

    • $min: Finds the minimum value.

    • $max: Finds the maximum value.

    • $push: Appends values to an array.

    • $addToSet: Adds values to an array, but only unique values.

    • $first: Returns the first value in a group.

    • $last: Returns the last value in a group.

  • <expression1>: This is the expression that specifies the field or value to be aggregated. It can be a field path, a constant value, or a more complex expression.

Example#

Let’s consider an example where we group books by their author and calculate the total number of pages for each author.

pipeline = [
    {
        "$group": {
            "_id": "$author",  # Group by author
            "totalPages": { "$sum": "$pages" }  # Calculate the sum of pages for each author
        }
    }
]

result = list(collection.aggregate(pipeline))
for doc in result:
    print(doc)
{'_id': 'Herman Melville', 'totalPages': 256}
{'_id': 'George Orwell', 'totalPages': 440}
{'_id': 'Jane Austen', 'totalPages': 979}
{'_id': 'Harper Lee', 'totalPages': 847}
{'_id': 'F. Scott Fitzgerald', 'totalPages': 802}

In this example:

  • _id is set to $author, so the documents are grouped by the author field.

  • totalPages is the name of the field in the output documents.

  • { "$sum": "$pages" } specifies that the totalPages field should contain the sum of the pages field for each group.

Example: Group books by author to:

  • count the number of books for each author

  • determine the earliest publication year

  • determine the latest publication year.

pipeline = [
    {
        "$group": {
            "_id": "$author",
            "count": { "$sum": 1 },
            "min_year": { "$min": "$year" },
            "max_year": { "$max": "$year" }
        }
    }
]

result = list(collection.aggregate(pipeline))
for doc in result:
    print(doc)
{'_id': 'Herman Melville', 'count': 2, 'min_year': 1853, 'max_year': 1924}
{'_id': 'Harper Lee', 'count': 3, 'min_year': 1960, 'max_year': 2017}
{'_id': 'F. Scott Fitzgerald', 'count': 3, 'min_year': 1920, 'max_year': 1934}
{'_id': 'Jane Austen', 'count': 3, 'min_year': 1811, 'max_year': 1815}
{'_id': 'George Orwell', 'count': 2, 'min_year': 1945, 'max_year': 1949}
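
Two other patterns from the accumulator list are worth a quick sketch. Setting _id to None puts every document in a single group, which is handy for collection-wide totals, and $push collects a value from each document into an array. The field names bookCount, avgPages, and titles below are illustrative, not required names.

# Collection-wide totals: _id set to None groups ALL books together
pipeline_all = [
    {"$group": {
        "_id": None,                      # a single group for the whole collection
        "bookCount": {"$sum": 1},
        "avgPages": {"$avg": "$pages"}
    }}
]
print(list(collection.aggregate(pipeline_all)))

# Collect the titles written by each author into an array with $push
pipeline_titles = [
    {"$group": {
        "_id": "$author",
        "titles": {"$push": "$title"}     # one list of titles per author
    }}
]
for doc in collection.aggregate(pipeline_titles):
    print(doc)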

3. $sort#

The $sort stage sorts all input documents and returns them to the pipeline in sorted order.

Syntax:

{
    "$sort": { <field1>: <sort order>, <field2>: <sort order>, ... }
}

Example: Sort books by the number of pages in descending order.

pipeline = [
    {
        "$sort": {
            "pages": -1
        }
    }
]

result = list(collection.aggregate(pipeline))
for doc in result:
    print(doc)
{'_id': ObjectId('66eb9baaddbe7989b91b27a1'), 'title': 'Emma', 'author': 'Jane Austen', 'pages': 474, 'year': 1815}
{'_id': ObjectId('66eb9baaddbe7989b91b279d'), 'title': '1984', 'author': 'George Orwell', 'pages': 328, 'year': 1949}
{'_id': ObjectId('66eb9baaddbe7989b91b27a3'), 'title': 'Tender Is the Night', 'author': 'F. Scott Fitzgerald', 'pages': 317, 'year': 1934}
{'_id': ObjectId('66eb9baaddbe7989b91b27a4'), 'title': 'This Side of Paradise', 'author': 'F. Scott Fitzgerald', 'pages': 305, 'year': 1920}
{'_id': ObjectId('66eb9baaddbe7989b91b279c'), 'title': 'Mockingbird Songs: My Friendship with Harper Lee', 'author': 'Harper Lee', 'pages': 288, 'year': 2017}
{'_id': ObjectId('66eb9baaddbe7989b91b279a'), 'title': 'To Kill a Mockingbird', 'author': 'Harper Lee', 'pages': 281, 'year': 1960}
{'_id': ObjectId('66eb9baaddbe7989b91b279f'), 'title': 'Pride and Prejudice', 'author': 'Jane Austen', 'pages': 279, 'year': 1813}
{'_id': ObjectId('66eb9baaddbe7989b91b279b'), 'title': 'Go Set a Watchman', 'author': 'Harper Lee', 'pages': 278, 'year': 2015}
{'_id': ObjectId('66eb9baaddbe7989b91b27a0'), 'title': 'Sense and Sensibility', 'author': 'Jane Austen', 'pages': 226, 'year': 1811}
{'_id': ObjectId('66eb9baaddbe7989b91b27a6'), 'title': 'Billy Budd, Sailor', 'author': 'Herman Melville', 'pages': 192, 'year': 1924}
{'_id': ObjectId('66eb9baaddbe7989b91b27a2'), 'title': 'The Great Gatsby', 'author': 'F. Scott Fitzgerald', 'pages': 180, 'year': 1925}
{'_id': ObjectId('66eb9baaddbe7989b91b279e'), 'title': 'Animal Farm', 'author': 'George Orwell', 'pages': 112, 'year': 1945}
{'_id': ObjectId('66eb9baaddbe7989b91b27a5'), 'title': 'Bartleby, the Scrivener', 'author': 'Herman Melville', 'pages': 64, 'year': 1853}

Example 1: Sort by Author Name in Ascending Order

Sort the grouped results by author name in ascending order.

pipeline = [
    {
        "$group": {
            "_id": "$author",
            "count": { "$sum": 1 },
            "min_year": { "$min": "$year" },
            "max_year": { "$max": "$year" }
        }
    },
    {
        "$sort": {
            "_id": 1  # Sort by author name in ascending order
        }
    }
]

result = list(collection.aggregate(pipeline))
for doc in result:
    print(doc)
{'_id': 'F. Scott Fitzgerald', 'count': 3, 'min_year': 1920, 'max_year': 1934}
{'_id': 'George Orwell', 'count': 2, 'min_year': 1945, 'max_year': 1949}
{'_id': 'Harper Lee', 'count': 3, 'min_year': 1960, 'max_year': 2017}
{'_id': 'Herman Melville', 'count': 2, 'min_year': 1853, 'max_year': 1924}
{'_id': 'Jane Austen', 'count': 3, 'min_year': 1811, 'max_year': 1815}

Example 2: Sort by Count of Books in Descending Order

Sort the grouped results by the count of books in descending order.

pipeline = [
    {
        "$group": {
            "_id": "$author",
            "count": { "$sum": 1 },
            "min_year": { "$min": "$year" },
            "max_year": { "$max": "$year" }
        }
    },
    {
        "$sort": {
            "count": -1  # Sort by count of books in descending order
        }
    }
]

result = list(collection.aggregate(pipeline))
for doc in result:
    print(doc)
{'_id': 'Harper Lee', 'count': 3, 'min_year': 1960, 'max_year': 2017}
{'_id': 'F. Scott Fitzgerald', 'count': 3, 'min_year': 1920, 'max_year': 1934}
{'_id': 'Jane Austen', 'count': 3, 'min_year': 1811, 'max_year': 1815}
{'_id': 'Herman Melville', 'count': 2, 'min_year': 1853, 'max_year': 1924}
{'_id': 'George Orwell', 'count': 2, 'min_year': 1945, 'max_year': 1949}

Example 3: Sort by Count of Books in Descending Order and then by Author Name in Ascending Order

Sort the grouped results first by the count of books in descending order, and then by author name in ascending order.

pipeline = [
    {
        "$group": {
            "_id": "$author",
            "count": { "$sum": 1 },
            "min_year": { "$min": "$year" },
            "max_year": { "$max": "$year" }
        }
    },
    {
        "$sort": {
            "count": -1,  # Sort by count of books in descending order
            "_id": 1      # Then sort by author name in ascending order
        }
    }
]

result = list(collection.aggregate(pipeline))
for doc in result:
    print(doc)
{'_id': 'F. Scott Fitzgerald', 'count': 3, 'min_year': 1920, 'max_year': 1934}
{'_id': 'Harper Lee', 'count': 3, 'min_year': 1960, 'max_year': 2017}
{'_id': 'Jane Austen', 'count': 3, 'min_year': 1811, 'max_year': 1815}
{'_id': 'George Orwell', 'count': 2, 'min_year': 1945, 'max_year': 1949}
{'_id': 'Herman Melville', 'count': 2, 'min_year': 1853, 'max_year': 1924}

4. $project#

The $project stage reshapes each document in the stream, such as by adding new fields or removing existing fields.

Syntax:

{
    "$project": {
        <field1>: <expression1>,
        <field2>: <expression2>,
        ...
    }
}

Example: Project only the title and author fields of the books.

pipeline = [
    {
        "$project": {
            "title": 1,
            "author": 1,
            "_id": 0
        }
    }
]

result = list(collection.aggregate(pipeline))
for doc in result:
    print(doc)
{'title': 'To Kill a Mockingbird', 'author': 'Harper Lee'}
{'title': 'Go Set a Watchman', 'author': 'Harper Lee'}
{'title': 'Mockingbird Songs: My Friendship with Harper Lee', 'author': 'Harper Lee'}
{'title': '1984', 'author': 'George Orwell'}
{'title': 'Animal Farm', 'author': 'George Orwell'}
{'title': 'Pride and Prejudice', 'author': 'Jane Austen'}
{'title': 'Sense and Sensibility', 'author': 'Jane Austen'}
{'title': 'Emma', 'author': 'Jane Austen'}
{'title': 'The Great Gatsby', 'author': 'F. Scott Fitzgerald'}
{'title': 'Tender Is the Night', 'author': 'F. Scott Fitzgerald'}
{'title': 'This Side of Paradise', 'author': 'F. Scott Fitzgerald'}
{'title': 'Bartleby, the Scrivener', 'author': 'Herman Melville'}
{'title': 'Billy Budd, Sailor', 'author': 'Herman Melville'}
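
$project can also add computed fields rather than only include or exclude existing ones. The sketch below is illustrative (the field names longBook and decade are just examples, not standard names): it keeps the title and derives two new fields with aggregation expression operators.

pipeline = [
    {"$project": {
        "_id": 0,
        "title": 1,
        # True when the book has more than 300 pages
        "longBook": {"$gt": ["$pages", 300]},
        # Round the publication year down to the start of its decade
        "decade": {"$multiply": [{"$floor": {"$divide": ["$year", 10]}}, 10]}
    }}
]

for doc in collection.aggregate(pipeline):
    print(doc)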

Combining stages into a pipeline#

You can combine multiple stages to form a more complex aggregation pipeline.

Example: Find the top 2 authors with the most books, sorted by the number of books in descending order.

pipeline = [
    {
        "$group": {
            "_id": "$author",
            "count": { "$sum": 1 }
        }
    },
    {
        "$sort": {
            "count": -1
        }
    },
    {
        "$limit": 2
    }
]

result = list(collection.aggregate(pipeline))
for doc in result:
    print(doc)
{'_id': 'Jane Austen', 'count': 3}
{'_id': 'Harper Lee', 'count': 3}

Here’s another example that combines multiple stages in an aggregation pipeline. This example will:

  1. Filter books published after the year 1900.

  2. Group the books by author and calculate the total number of pages and the average number of pages for each author.

  3. Sort the authors by the total number of pages in descending order.

  4. Project the author, total pages, and average pages fields, excluding the _id field.

# Define each stage separately
match_stage = {
    "$match": {
        "year": { "$gt": 1900 }
    }
}

group_stage = {
    "$group": {
        "_id": "$author",
        "totalPages": { "$sum": "$pages" },
        "averagePages": { "$avg": "$pages" }
    }
}

sort_stage = {
    "$sort": {
        "totalPages": -1
    }
}

project_stage = {
    "$project": {
        "_id": 0,
        "author": "$_id",
        "totalPages": 1,
        "averagePages": { "$round": ["$averagePages", 2] }
    }
}

# Combine stages into a pipeline
pipeline = [
    match_stage,
    group_stage,
    sort_stage,
    project_stage
]

# Execute the pipeline
result = list(collection.aggregate(pipeline))
for doc in result:
    print(doc)
{'totalPages': 847, 'author': 'Harper Lee', 'averagePages': 282.33}
{'totalPages': 802, 'author': 'F. Scott Fitzgerald', 'averagePages': 267.33}
{'totalPages': 440, 'author': 'George Orwell', 'averagePages': 220.0}
{'totalPages': 192, 'author': 'Herman Melville', 'averagePages': 192.0}

Conclusion#

MongoDB’s aggregation framework is a powerful tool for data analysis and transformation. By using stages like $match, $group, $sort, and $project, you can perform complex queries and data manipulations directly within the database. PyMongo provides a convenient way to interact with this framework from Python.

Practice exercises#

We are going to use the restaurants collection in the sample_restaurants database.

db = client.sample_restaurants
collection = db.restaurants

Here's a sample document in the restaurants collection:

collection.find_one()
{'_id': ObjectId('5eb3d668b31de5d588f4292c'),
 'address': {'building': '2206',
  'coord': [-74.1377286, 40.6119572],
  'street': 'Victory Boulevard',
  'zipcode': '10314'},
 'borough': 'Staten Island',
 'cuisine': 'Jewish/Kosher',
 'grades': [{'date': datetime.datetime(2014, 10, 6, 0, 0),
   'grade': 'A',
   'score': 9},
  {'date': datetime.datetime(2014, 5, 20, 0, 0), 'grade': 'A', 'score': 12},
  {'date': datetime.datetime(2013, 4, 4, 0, 0), 'grade': 'A', 'score': 12},
  {'date': datetime.datetime(2012, 1, 24, 0, 0), 'grade': 'A', 'score': 9}],
 'name': 'Kosher Island',
 'restaurant_id': '40356442'}
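
Note that grades is an array of embedded documents, which makes this collection a natural place to try the $unwind stage mentioned earlier. The pipeline below is a hedged sketch (the field name avgScore is illustrative): it unwinds grades so that each inspection becomes its own document, then averages grades.score by borough.

pipeline = [
    {"$unwind": "$grades"},                     # one output document per inspection
    {"$group": {
        "_id": "$borough",
        "avgScore": {"$avg": "$grades.score"}   # mean inspection score per borough
    }},
    {"$sort": {"avgScore": -1}}
]

for doc in collection.aggregate(pipeline):
    print(doc)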

Exercise 1: Count Restaurants by Borough#

Objective: Count the number of restaurants in each borough.

  1. Group: By borough.

  2. Count: The number of restaurants in each borough.

  3. Sort: By the count in descending order.

  4. Project: Borough and count.

group_stage = {
    "$group": {
        "_id": "$borough",
        "count": { "$sum": 1 }
    }
}

sort_stage = {
    "$sort": {
        "count": -1
    }
}

project_stage = {
    "$project": {
        "_id": 0,
        "borough": "$_id",
        "count": 1
    }
}

pipeline = [
    group_stage,
    sort_stage,
    project_stage
]

result = list(collection.aggregate(pipeline))
for doc in result:
    print(doc)
{'count': 10259, 'borough': 'Manhattan'}
{'count': 6086, 'borough': 'Brooklyn'}
{'count': 5656, 'borough': 'Queens'}
{'count': 2338, 'borough': 'Bronx'}
{'count': 969, 'borough': 'Staten Island'}
{'count': 51, 'borough': 'Missing'}

Exercise 2: Find the Top 3 Cuisines with the Most Restaurants#

Objective: Identify the top 3 cuisines with the highest number of restaurants.

  1. Group: By cuisine.

  2. Count: The number of restaurants for each cuisine.

  3. Sort: By count in descending order.

  4. Limit: To the top 3 cuisines.

  5. Project: Cuisine and count.

Example Code#

# Define each stage separately
group_stage = {
    "$group": {
        "_id": "$cuisine",
        "count": { "$sum": 1 }
    }
}

sort_stage = {
    "$sort": {
        "count": -1
    }
}

limit_stage = {
    "$limit": 3
}

project_stage = {
    "$project": {
        "_id": 0,
        "cuisine": "$_id",
        "count": 1
    }
}

# Combine stages into a pipeline
pipeline = [
    group_stage,
    sort_stage,
    limit_stage,
    project_stage
]

# Execute the pipeline
result = list(collection.aggregate(pipeline))
for doc in result:
    print(doc)
{'count': 6183, 'cuisine': 'American'}
{'count': 2418, 'cuisine': 'Chinese'}
{'count': 1214, 'cuisine': 'Café/Coffee/Tea'}

Summary#

These exercises provide practice with the following stages:

  • $group: Grouping documents by a specified key and performing aggregations.

  • $sort: Sorting documents based on a specified field.

  • $project: Reshaping documents by including or excluding fields.

  • $limit: Limiting the number of documents passed to the next stage in the pipeline.

By completing these exercises, students will gain hands-on experience with MongoDB’s aggregation framework and learn how to perform complex data transformations and analyses.

Supplemental materials#

Read more here: https://www.mongodb.com/docs/manual/aggregation/

MongoDB University lesson: https://learn.mongodb.com/learn/course/mongodb-aggregation-in-python/lesson-1-building-a-mongodb-aggregation-pipeline-in-python-applications/learn