Worksheet 1: Introduction to noSQL databases and document model with MongoDB#
Welcome! This week, we will start off easy with some exercises to help you set up the environment that you will be using for the upcoming assignment.
First off, you will need to install Miniconda, if you haven't already. We use the Miniforge installer for this.
Install miniconda#
macOS
You can find the Mac ARM and Intel download links here: https://conda-forge.org/miniforge/. Make sure you use the Miniforge3 installers, not the other ones listed. We will assume you downloaded the file into your Downloads folder.
Once downloaded, open up a terminal and run the following command
bash ${HOME}/Downloads/Miniforge3.sh -b -p "${HOME}/miniforge3"
After installation run the following commands
source "${HOME}/miniforge3/etc/profile.d/conda.sh"
conda activate
conda init
After installation, restart the terminal. If the installation was successful, you will see (base) prepended to your prompt string. To confirm that conda is working, you can ask it which version was installed:
conda --version
which should return something like this (it doesn't have to be the exact same version):
conda 24.7.1
Note: If you see zsh: command not found: conda, follow the steps below to set your default Terminal shell to Bash instead of Zsh.
You can change the default shell in your Terminal to Bash by opening the Terminal and typing:
chsh -s /bin/bash
You will have to quit all instances of open Terminals and then restart the Terminal for this to take effect.
Windows
You can find the Windows download link here: https://conda-forge.org/miniforge/. Make sure you use the Miniforge3 installer, not the other ones listed. We will assume you downloaded the file into your Downloads folder.
Once downloaded, run the installer.
Use all the default options in the installer.
The install location should look something like: C:\Users\YOUR_USER_NAME\miniforge3
Note: Do not add miniforge to PATH. We will set this up later.
After installation, open the Start Menu and search for the program called "Miniforge Prompt". When this opens you will see a prompt similar to (base) C:\Users\your_name.
Type the following to check that your Python installation is working:
python --version
which should return Python 3.11.0 or greater:
Python 3.11.0
If not, confirm in the Miniforge Prompt that you are in the (base) environment, then update the base Python with:
conda install python=3.11
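The same version check can also be done from inside Python. The snippet below is a minimal sketch; the helper name meets_minimum is hypothetical and the 3.11 threshold is taken from the text above.

```python
import sys

MINIMUM_PYTHON = (3, 11)  # minimum version stated in this worksheet


def meets_minimum(version_info, minimum=MINIMUM_PYTHON):
    """Return True if version_info (e.g. (3, 11, 0)) is at least `minimum`."""
    # Compare only the (major, minor) pair, e.g. (3, 12) >= (3, 11).
    return tuple(version_info[:2]) >= tuple(minimum)


# Report the interpreter that is currently running this code.
print("Running Python", sys.version.split()[0])
if meets_minimum(sys.version_info):
    print("Python version is recent enough")
else:
    print("Python is older than 3.11; update it as described above")
```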
Integrating Python with the Git Bash terminal
To avoid having to open the separate Miniforge Prompt every time we want to use Python, we can make it available from the (Git Bash) terminal, which is what we will be using most of the time. To set this up, open the "Miniforge Prompt" again and type:
conda init bash
You will see that this modified a few configuration files, which makes conda visible to the terminal. Close all open terminal windows and launch a new one. You should now see that the prompt string has changed to include the word (base).
Set up the ADSC 3610 environment#
First off, clone the worksheet 1 GitHub repo to your local computer.
If you don't know how to clone a GitHub repo yet, you can just open a terminal and type
git clone [insert your github URL to worksheet1 here]
For example, if I were a student, I would run git clone https://github.com/TRU-PBADS/week1-intro-nosql-quan3010
Second, open a terminal, and navigate to the git folder that you just cloned locally
cd [PATH to your git repo locally]
Third, install the conda environment based on the provided adsc_3610_env.yaml file
conda env create -f adsc_3610_env.yaml
Setting up the environment will take a while. You might be prompted to answer y/n a few times. Once it's finished, you can open an IDE (e.g., VS Code or JupyterLab) and select the adsc_3610 environment.
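To double-check which environment is active, you can inspect the CONDA_DEFAULT_ENV variable that conda activate sets in the shell. This is a minimal sketch with hypothetical helper names; a notebook kernel that was not started from an activated shell may not have this variable set, so treat the result as a hint rather than proof.

```python
import os

EXPECTED_ENV = "adsc_3610"  # environment name used in this worksheet


def active_conda_env(environ=None):
    """Return the active conda environment name, or None if none is recorded."""
    environ = os.environ if environ is None else environ
    return environ.get("CONDA_DEFAULT_ENV")


def is_course_env(environ=None):
    """Return True if the recorded environment is the course environment."""
    return active_conda_env(environ) == EXPECTED_ENV


print("Active conda environment:", active_conda_env())
print("Is this the course environment?", is_course_env())
```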
Exercise 1#
{rubric: accuracy = 5}
Follow the installation guide above. If successful, you should be able to run the following code chunk. We will try to import pymongo (to connect to MongoDB), pyspark (which we will learn later), and otter (an autograder package).
import pymongo
print(pymongo.__version__)
4.8.0
import pyspark
print(pyspark.__version__)
3.5.1
import otter
print(otter.__version__)
5.5.0
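If one of the imports above fails, it can help to see which packages are missing without stopping at the first error. The sketch below uses the standard library's importlib.metadata; the helper name installed_versions is hypothetical, and it assumes the otter module is distributed under the name otter-grader.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names (not import names); "otter" is assumed to be
# installed from the "otter-grader" distribution.
COURSE_PACKAGES = ["pymongo", "pyspark", "otter-grader"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(COURSE_PACKAGES).items():
    print(f"{name}: {ver if ver is not None else 'NOT INSTALLED'}")
```

Your version numbers do not need to match the ones printed in this worksheet exactly.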
Exercise 2#
{rubric: accuracy = 5}
In this course, there will be some autograded questions. Now I would like you to test one of them to see if it's working.
Write a simple Python function to sum two numbers
def sum(a, b):
    # BEGIN SOLUTION
    return a + b
    # END SOLUTION
Run the cell below to test whether your sum function passes the autograded test. You should see a message like "q1 passed! 🎉"
assert sum(1,2) == 3, "Your function is not implemented correctly"
If you run into an error like this
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[2], line 1
----> 1 grader.check("q1")
NameError: name 'grader' is not defined
Then make sure you run the first cell, at the top of this notebook, to initialize otter-grader.
Exercise 3: Create a cluster on mongoDB Atlas#
In this exercise, we will create a free cluster on mongoDB Atlas and load sample datasets.
3.1#
{rubric: completion = 5}
Please watch the following tutorial and create your own cluster
3.2#
{rubric: accuracy = 5}
After loading the data, open the sample_restaurants database and answer the following questions:
How many collections are present in the sample_restaurants database?
List the number of documents and the storage size of each collection in the sample_restaurants database.
Hint: You can read more about the sample_restaurants database in its documentation: https://www.mongodb.com/docs/atlas/sample-data/sample-restaurants/
SOLUTION
There are 2 collections in the sample_restaurants database.
The neighborhoods collection has 195 documents and 1.83MB storage.
The restaurants collection has 25359 documents and 4.72MB storage.
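The same numbers can be cross-checked from Python once you have a client connection (see Exercise 4). This is a minimal sketch, not the official solution: the helper name summarize_collections is hypothetical, it assumes the collStats server command is available on your cluster tier, and the rounding may differ slightly from what the Atlas interface displays. Newer server versions may prefer the $collStats aggregation stage instead.

```python
def summarize_collections(db):
    """Return {collection_name: (document_count, storage_size_in_MB)}.

    `db` is expected to behave like a pymongo Database object.
    """
    summary = {}
    for name in db.list_collection_names():
        # collStats reports sizes in bytes; convert to megabytes.
        stats = db.command("collStats", name)
        size_mb = round(stats["storageSize"] / 1_000_000, 2)
        summary[name] = (stats["count"], size_mb)
    return summary
```

With the client from Exercise 4, the call would look like summarize_collections(client.sample_restaurants).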
3.3#
{rubric: accuracy = 5}
Open the restaurants collection. Filter for this ObjectId:
{_id: ObjectId('5eb3d668b31de5d588f4292c')}
Answer following questions:
What are the coordinates of that restaurant?
Which borough is this restaurant located in?
How many reviews did this restaurant get?
What is the lowest rating score?
What is the highest rating score?
_id : 5eb3d668b31de5d588f4292c
coord : Array (2) 0 -74.1377286 1 40.6119572
borough : "Staten Island"
cuisine : "Jewish/Kosher"
reviews : 4
lowest score : 9
highest score : 12
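These answers can also be derived programmatically from the document itself. The sketch below copies the relevant fields of this restaurant by hand (dates omitted) and assumes that each entry in the grades array counts as one review; the helper name summarize_grades is hypothetical.

```python
def summarize_grades(restaurant):
    """Extract location and score statistics from a restaurant document."""
    scores = [grade["score"] for grade in restaurant.get("grades", [])]
    return {
        "coord": restaurant["address"]["coord"],
        "borough": restaurant["borough"],
        "n_grades": len(scores),
        "lowest": min(scores) if scores else None,
        "highest": max(scores) if scores else None,
    }


# Values copied from the restaurant document in this worksheet.
kosher_island = {
    "address": {"coord": [-74.1377286, 40.6119572]},
    "borough": "Staten Island",
    "cuisine": "Jewish/Kosher",
    "grades": [
        {"grade": "A", "score": 9},
        {"grade": "A", "score": 12},
        {"grade": "A", "score": 12},
        {"grade": "A", "score": 9},
    ],
}

print(summarize_grades(kosher_island))
```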
Exercise 4: Connect to your mongoDB using pymongo#
What you need:
The host URL of your MongoDB connection (should look something like cluster0.lqirl.mongodb.net)
Your MongoDB username
Your MongoDB password
Modify the credentials_mongodb.json file with the appropriate information above.
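The starter code reads three keys from this file: username, password, and host. The sketch below shows the expected shape with placeholder values and a hypothetical helper, missing_credential_keys, that reports any key that is absent or empty. Replace the placeholders with your own details, and avoid committing real passwords to a public repository.

```python
import json

REQUIRED_KEYS = ("username", "password", "host")  # keys read by the starter code


def missing_credential_keys(login):
    """Return the required keys that are absent or empty in `login`."""
    return [key for key in REQUIRED_KEYS if not login.get(key)]


# Placeholder content illustrating the layout of credentials_mongodb.json.
example_text = """
{
    "username": "your_username",
    "password": "your_password",
    "host": "cluster0.lqirl.mongodb.net"
}
"""
example = json.loads(example_text)
print("Missing keys:", missing_credential_keys(example))
```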
Below is the starter code to connect to your MongoDB database, provided that you have a credentials_mongodb.json file with the correct information.
from pymongo import MongoClient  # import mongo client to connect
import json  # import json to load credentials
import urllib.parse  # percent-escape special characters in the credentials
# load credentials from json file
with open('credentials_mongodb.json') as f:
    login = json.load(f)
# assign credentials to variables (escaped so characters like @ or / do not break the URL)
username = urllib.parse.quote_plus(login['username'])
password = urllib.parse.quote_plus(login['password'])
host = login['host']
url = "mongodb+srv://{}:{}@{}/?retryWrites=true&w=majority".format(username, password, host)
# connect to the database
client = MongoClient(url)
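Creating a MongoClient does not by itself prove that the credentials work, so it can be useful to separate the URL-building step from a quick connectivity check. This is a minimal sketch with hypothetical helper names (build_mongo_uri, can_ping); it assumes quote_plus escaping for both username and password, and uses the server's ping command to test the connection.

```python
from urllib.parse import quote_plus


def build_mongo_uri(username, password, host):
    """Build a mongodb+srv URL with percent-escaped credentials."""
    return "mongodb+srv://{}:{}@{}/?retryWrites=true&w=majority".format(
        quote_plus(username), quote_plus(password), host
    )


def can_ping(client):
    """Return True if the server answers a ping; `client` is a MongoClient."""
    try:
        client.admin.command("ping")
        return True
    except Exception as exc:  # report any failure instead of raising
        print("Connection check failed:", exc)
        return False


# Placeholder credentials, only to show how special characters are escaped.
print(build_mongo_uri("student", "p@ss/word", "cluster0.lqirl.mongodb.net"))
```

After connecting, can_ping(client) returning False usually points to a wrong password, a wrong host, or an IP address that is not allowed in the Atlas network settings.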
4.1#
{rubric: accuracy = 5}
List all databases in the client server
Hint: See lecture 3 for an example
# list all databases
client.list_database_names()
['bookstore',
'library',
'sample_airbnb',
'sample_analytics',
'sample_geospatial',
'sample_guides',
'sample_mflix',
'sample_restaurants',
'sample_supplies',
'sample_training',
'sample_weatherdata',
'school',
'admin',
'local']
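The list above mixes your own databases with MongoDB's built-in ones (admin and local here; config is another). If you only want the databases that hold data, a small filter like the following sketch can help; the helper name user_databases is hypothetical.

```python
SYSTEM_DATABASES = {"admin", "local", "config"}  # MongoDB's built-in databases


def user_databases(names):
    """Return the database names that are not built-in, sorted alphabetically."""
    return sorted(name for name in names if name not in SYSTEM_DATABASES)


# A shortened version of the output shown above.
print(user_databases(["sample_restaurants", "school", "admin", "local"]))
```

With a live connection, the call would be user_databases(client.list_database_names()).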
4.2#
{rubric: accuracy = 5}
List all collections in the sample_restaurants database
Hint: See lecture 3 for an example
# list all collections in the sample_restaurants database
client.sample_restaurants.list_collection_names()
['sample', 'restaurants', 'neighborhoods']
4.3#
{rubric: accuracy = 5}
Display the first document in the restaurants collection in the sample_restaurants database
Hint: See lecture 3 for an example
# show the first document
client.sample_restaurants.restaurants.find_one()
{'_id': ObjectId('5eb3d668b31de5d588f4292c'),
'address': {'building': '2206',
'coord': [-74.1377286, 40.6119572],
'street': 'Victory Boulevard',
'zipcode': '10314'},
'borough': 'Staten Island',
'cuisine': 'Jewish/Kosher',
'grades': [{'date': datetime.datetime(2014, 10, 6, 0, 0),
'grade': 'A',
'score': 9},
{'date': datetime.datetime(2014, 5, 20, 0, 0), 'grade': 'A', 'score': 12},
{'date': datetime.datetime(2013, 4, 4, 0, 0), 'grade': 'A', 'score': 12},
{'date': datetime.datetime(2012, 1, 24, 0, 0), 'grade': 'A', 'score': 9}],
'name': 'Kosher Island',
'restaurant_id': '40356442'}
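The returned document contains ObjectId and datetime values, which the standard json module cannot serialize directly. If you want a more readable printout, one option is to convert such values to strings while dumping. The sketch below uses a hand-copied fragment of the document above; the helper name to_pretty_json is hypothetical.

```python
import json
from datetime import datetime


def to_pretty_json(document):
    """Render a document as indented JSON, converting unknown types with str()."""
    return json.dumps(document, indent=2, default=str)


# Fragment of the document above, copied by hand for illustration.
sample = {
    "name": "Kosher Island",
    "borough": "Staten Island",
    "grades": [{"date": datetime(2014, 10, 6, 0, 0), "grade": "A", "score": 9}],
}

print(to_pretty_json(sample))
```

With a live connection, you could pass the result of find_one() to the same function.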
Submission instructions#
{rubric: mechanics = 5}
Make sure the notebook can run from top to bottom without any error. Restart the kernel and run all cells.
Commit and push your notebook to the GitHub repo.
Double-check that your notebook is rendered properly on GitHub and that you can see all the outputs clearly.