Worksheet 2: Data modelling with MongoDB#

Exercise 1: Identify documents and collections#

Case Study: Online Bookstore#

You are tasked with designing a MongoDB database for an online bookstore. The bookstore sells books, and each book can have multiple authors. Customers can purchase books, and each purchase can contain multiple books. Additionally, customers can leave reviews for books they have purchased.

Objectives:#

  1. Design the data model for the online bookstore.

  2. Define the relationships between different entities.

  3. Create sample documents for each collection.

Step-by-Step Instructions:#

  1. Identify Entities and Relationships:

    • Books: Each book has a title, ISBN, publication date, and a list of authors.

    • Authors: Each author has a name and a list of books they have written.

    • Customers: Each customer has a name, email, and a list of purchases.

    • Purchases: Each purchase has a date, customer reference, and a list of books purchased.

    • Reviews: Each review has a rating, comment, customer reference, and book reference.

  2. Design the Data Model:

    • Books Collection:

      {
          "_id": ObjectId,
          "title": String,
          "ISBN": String,
          "publication_date": Date,
          "authors": [ObjectId]  // References to Authors
      }
      
    • Authors Collection:

      {
          "_id": ObjectId,
          "name": String,
          "books": [ObjectId]  // References to Books
      }
      
    • Customers Collection:

      {
          "_id": ObjectId,
          "name": String,
          "email": String,
          "purchases": [ObjectId]  // References to Purchases
      }
      
    • Purchases Collection:

      {
          "_id": ObjectId,
          "date": Date,
          "customer_id": ObjectId,  // Reference to Customer
          "books": [ObjectId]  // References to Books
      }
      
    • Reviews Collection:

      {
          "_id": ObjectId,
          "rating": Number,
          "comment": String,
          "customer_id": ObjectId,  // Reference to Customer
          "book_id": ObjectId  // Reference to Book
      }
      
  3. Create Sample Documents:

    • Books Collection:

      {
          "_id": ObjectId("60c72b2f9b1d8b3a4c8e4d1a"),
          "title": "MongoDB Basics",
          "ISBN": "1234567890",
          "publication_date": ISODate("2021-01-01T00:00:00Z"),
          "authors": [ObjectId("60c72b2f9b1d8b3a4c8e4d1b")]
      }
      
    • Authors Collection:

      {
          "_id": ObjectId("60c72b2f9b1d8b3a4c8e4d1b"),
          "name": "John Doe",
          "books": [ObjectId("60c72b2f9b1d8b3a4c8e4d1a")]
      }
      
    • Customers Collection:

      {
          "_id": ObjectId("60c72b2f9b1d8b3a4c8e4d1c"),
          "name": "Jane Smith",
          "email": "jane.smith@example.com",
          "purchases": [ObjectId("60c72b2f9b1d8b3a4c8e4d1d")]
      }
      
    • Purchases Collection:

      {
          "_id": ObjectId("60c72b2f9b1d8b3a4c8e4d1d"),
          "date": ISODate("2021-06-01T00:00:00Z"),
          "customer_id": ObjectId("60c72b2f9b1d8b3a4c8e4d1c"),
          "books": [ObjectId("60c72b2f9b1d8b3a4c8e4d1a")]
      }
      
    • Reviews Collection:

      {
          "_id": ObjectId("60c72b2f9b1d8b3a4c8e4d1e"),
          "rating": 5,
          "comment": "Great book on MongoDB!",
          "customer_id": ObjectId("60c72b2f9b1d8b3a4c8e4d1c"),
          "book_id": ObjectId("60c72b2f9b1d8b3a4c8e4d1a")
      }
      

Exercise 2: Identify database workload#

In this exercise, you will extend your data modeling skills by identifying entities and attributes, quantifying entities, and analyzing read and write operations for different types of application users. You will also map these operations to application flows and quantify them.

Quantifying Reads and Writes in the Online Bookstore Example#

To quantify the read and write operations in the online bookstore example, we need to consider the different types of operations performed by various users (customers and admins) and the frequency of these operations.

Fill out the table below with possible operations by customers and admins and specify the estimated frequency, as well as whether it’s a read or write operations. You can make your own assumptions.

I have added one operation as an example.

Operation

User Type

Read/Write

Frequency (per day)

Description

Browse Books

Customer

Read

100,000

Customers browsing the list of available books.

View Book Details

Customer

Read

50,000

Customers viewing details (e.g., title, author, reviews) for specific books.

Add a new book

Admin

Write

5

Admin adding new books to the inventory

Leave a Book Review

Customer

Write

5,000

Customers leaving reviews and ratings for purchased books.

Search for a Book

Customer

Read

80,000

Customers using search functionality to find books.

Exercise 3: Identifying and Modeling Relationships in MongoDB#

Case Study: Online Bookstore (Continued)#

In this exercise, you will identify one-to-one, one-to-many, and many-to-many relationships between entities in an online bookstore. You will analyze these entities to determine whether to embed or reference them using common guidelines. Finally, you will model these relationships using both embedded and referenced approaches.

Objectives:#

  1. Identify one-to-one, one-to-many, and many-to-many relationships between entities.

  2. Analyze entities to determine whether to embed or reference using common guidelines.

  3. Model embedded and referenced one-to-one, one-to-many, and many-to-many relationships.

Step-by-Step Instructions:#

For each of the following pair of entities, identify:

  • What type of relationships is there

  • Determine whether to use embedding or referencing

  • Give an example in JSON format

Example:#

Entities: Books and Reviews

  1. Type of Relationship: One-to-Many (e.g., one book has many reviews)

  2. Embed or Reference: Embed

  3. Explanation: Reviews are typically small in size and are frequently accessed together with the book details. Embedding reviews within the book document ensures that all related data can be retrieved in a single read operation, improving read performance.

JSON Example:#

{
    "_id": ObjectId("bookId1"),
    "title": "MongoDB Basics",
    "ISBN": "1234567890",
    "reviews": [
        {
            "rating": 5,
            "comment": "Great book on MongoDB!",
            "customer_id": ObjectId("customerId1")
        },
        {
            "rating": 4,
            "comment": "Very informative.",
            "customer_id": ObjectId("customerId2")
        }
    ]
}

YOUR TURN

For each of the following pairs of entities, identify the type of relationship, decide whether to use embedding or referencing, explain why, and provide a JSON example.

  1. Entities: Books and Authors

    • Type of Relationship:

    • Embed or Reference:

    • Explanation:

    • JSON Example:

YOUR ANSWER HERE

  1. Entities: Books and Authors

    • Type of Relationship: Many-to-Many

    • Embed or Reference: Embed for books, and embed for authors

    • Explanation: Each book can have multiple authors, but each author can also write multiple books.

    • For the books collection, since the number of authors of each book is relatively small, we can directly embed authors information in the books collection. This will help users to retrieve information on authors while browsing a book faster.

    • For the authors collection, each authors can have multiple books, but the number of books that each author wrote over time is likely to be limited (e.g. each author has less than 20 books). Also, when people click on authors name, they will likely want to browse all the books written by that authors. So embedding here would make sense. Data that are accessed together, should stay together!

    • JSON Example:

Books collection (embedded authors)

{
    "book_id": "book_id_3",
    "title": "Good Omens",
    "ISBN": "978-0-06-085398-3",
    "publication_date": "1990-05-01",
    "authors": [
        {
            "author_id": "3",
            "name": "Neil Gaiman"
        },
        {
            "author_id": "4",
            "name": "Terry Pratchett"
        }
    ]
}

Authors collection (embed books)

{
    "author_id": "2",
    "name": "George Orwell",
    "books": [
        {
            "book_id": "book_id_2",
            "title": "1984",
            "ISBN": "978-0-452-28423-4",
            "publication_date": "1949-06-08"
        },
        {
            "book_id": "book_id_3",
            "title": "Animal Farm",
            "ISBN": "978-0-452-28424-1",
            "publication_date": "1945-08-17"
        }
    ]
}
  1. Entities: Customers and Purchases

    • Type of Relationship:

    • Embed or Reference:

    • Explanation:

    • JSON Example:

YOUR ANSWER HERE

  1. Entities: Customers and Purchases

    • Type of Relationship: One-to-many

    • Embed or Reference: We would reference purchase ID in the customers collection. The reason is that each customer can have a large number of purchases, embedding it would result in unbounded documents, extremly long json file and slow down querry time. So referencing would be a better choice here.

    • Explanation: Each customer can have multiple purchases, but each purchase order belongs to only one customer

    • JSON Example:

{
    "customer_id": "customer_id_2",
    "name": "Jane Smith",
    "email": "jane.smith@example.com",
    "purchases": [
        "purchase_id_1",
        "purchase_id_2",
        ...,
        "purchase_id_100"
    ]
}
  1. Entities: Books and Purchases

    • Type of Relationship:

    • Embed or Reference:

    • Explanation:

    • JSON Example:

YOUR ANSWER HERE

  1. Entities: Books and Purchases

    • Type of Relationship:: Many-to-many

    • Embed or Reference:

    • Explanation: Each book can be purchased multiple times, each purchase order can have multiple books

    • For the books collection, we can reference the purchased id since we can have hundreds to millions of purchase orders for a single popular book (e.g. Harry Potter). This will also help with reducing data redundancy when we need to update some information about the purcharse order.

    • For the purchases collection, we can use embedding as we assume that most purchase order will contain only a small number of books. When we querry data about purchases, we most likely will always want to see the book information within the purchase order as well. Books information will unlikely to be changed/updated frequently, so there’s no concerns for data redundancy. Data that are accessed together, should stay together!

    • JSON Example: books collection (referencing purchase ids)

{
    "book_id": "book_id_1",
    "title": "Harry Potter and the Sorcerer's Stone",
    "ISBN": "978-0-590-35340-3",
    "publication_date": "1997-06-26",
    "purchases": [
        "purchase_id_1",
        "purchase_id_2",
        ...,
        "purchase_id_100"
    ]
}

purchase collection (embed books info)

{
    "purchase_id": "purchase_id_1",
    "date": "2023-10-01",
    "customer_id": "customer_id_1",
    "books": [
        {
            "book_id": "book_id_1",
            "title": "Harry Potter and the Sorcerer's Stone",
            "price": 8.99
        },
        {
            "book_id": "book_id_2",
            "title": "The Hobbit",
            "price": 12.99
        }
    ]
}
  1. Entities: Customers and Reviews

    • Type of Relationship:

    • Embed or Reference:

    • Explanation:

    • JSON Example:

YOUR ANSWER HERE

  1. Entities: Customers and Reviews

    • Type of Relationship: One-to-many

    • Embed or Reference: Reference reviews in customer collection

    • Explanation: Each customer can write multiples reviews, each review belongs to a single customer. We should reference the reviews id in the customer collection because the number of reviews could get quite large, hence referencing would be suitable here. Also, if we think about the kind of operations that would querry the customer collection, it’s unlikely that we would need to show all the reviews that each customer made.

    • JSON Example:

{
    "customer_id": "customer_id_1",
    "name": "John Doe",
    "email": "john.doe@example.com",
    "reviews": [
        "review_id_1",
        "review_id_2",
        "review_id_3",
        "review_id_4",
        "review_id_5",
        "review_id_6",
        "review_id_7",
        "review_id_8",
        "review_id_9",
        "review_id_10"
    ]
}

Submission instructions#

{rubric: mechanics = 5}

  • Make sure the notebook can run from top to bottom without any error. Restart the kernel and run all cells.

  • Commit and push your notebook to the github repo

  • Double check your notebook is rendered properly on Github and you can see all the outputs clearly

  • Submit a URL to the github repo that contain this worksheet to Moodle