A Complete Guide on Using MongoDB

Ritika Last Updated : 08 Jun, 2022

9 min read

This article was published as a part of the Data Science Blogathon.

MongoDB is a type of NoSQL database which is open-sourced and widely used in data science and machine learning in form of a database. Hence, knowledge of it is important in the long run, especially in the field of data science and analytics.

In this article, we will explore how to use MongoDB for our data science applications. We shall learn about its powerful aggregation pipeline feature.

For this guide tutorial, we shall be using the Nobel Prize Dataset API which is a database of all Nobel prize winners. One can check the documentation of their API. In this tutorial, we shall use two datasets “laureates”(which contains detailed information about each of the laureate who won the Nobel prize) and “prize” (which contains information about Nobel prize winners).

Introduction to MongoDB

MongoDB was founded in the year 2007 by Dwight Merriman, Eliot Horowitz and Kevin Ryan. MongoDB is a scalable and flexible NoSQL document-based database platform designed to overcome the shortcomings of relational databases. It is known for its horizontal scaling & load balancing capabilities, which have given application developers an unprecedented level of scalability & flexibility

MongoDB features

MongoDB has many useful features they are:

Document-oriented

It is a document-oriented database, which is a great feature itself. As is known in relational databases, there are tables and rows of the data. Every row has a fixed number of columns which is non-dynamic.

This is where the unstructured nature of NoSQL comes where there are fields instead of tables and rows. In MongoDB, we can have many Databases(analogous to the schema of SQL). Each database consists of several collections(analogous to tables in SQL). There are collections of similar documents. The collection consists of documents and each document contains fields and unique keys and indexes such as object id which can be user or system defined.

Create Database and Collection in MongoDB

As discussed earlier our first step is to create a database name “nobel“.

Mongo shell command : use

use nobel

Pymongo command

In mongo, we can use the `dict` notation on the client. If the db already exists then mongo returns a reference to that db otherwise creates new db. To insert a number of documents we can use the `insert_many()` method on the collection instance and pass an iterable python object such as a list of dictionaries.

import pymongo
import requests
#Connect to local mongo server through client
mongo_client = pymongo.MongoClient("mongodb://localhost:27017/")
#Create database
nobel_prize_db = mongo_client['nobel']
#fetch prize json data from api request
response = requests.get("http://api.nobelprize.org/v1/prize.json")
#convert response to json data
prize_data = response.json()['prizes']
#insert the list of documents into collection ; we can pass any iterable python elemnent to insert many
nobel_prize_db['prize'].insert_many(prize_data)

Similarly, do for laureates’ data

#fetch laureate json data from api request
response = requests.get("http://api.nobelprize.org/v1/laureate.json")
#convert response to json data
laureate_data = response.json()['laureates']
#insert the list of documents into collection ; we can pass any iterable python elemnent to insert many
nobel_prize_db['laureates'].insert_many(laureate_data)

Now if we check the compass and refresh we will see our database within it two collections and each collection consists of respective documents.

Create Database and Collection in MongoDB

NOTE: MongoDB does not allow us to create an empty database . This means even if we run the use db or create client[‘nobel’] then also no db is created unless we create any collection in it.

Common MongoDB Operations

List database and collection names

Mongo Shell :

Pymongo

nobel_prize_db.list_database_names()
prize_coll = nobel_prize_db['prize']
prize_coll.list_collection_names()

Drop Collection or Database

We can drop any collection or database by simply switching to that Db to drop a collection within that db.

Mongo Shell

Pymongo

nobel_prize_db.drop()
prize_coll = nobel_prize_db['prize']
prize_coll.drop()

Find Documents based on condition query

Pymongo

We can search and retrieve documents based on a condition query. We have two methods in collection .find() and .find_one() . To retrieve a single document we use .find_one() and for getting multiple documents we can use .find() which returns a cursor like object using next() method we can get each document through cursor. This method accepts an optional filter argument that specifies the pattern that the document must match.

Find Nobel prize winner of chemistry in the 2021 year. This will return a single document so we use the find_one() method.

#to find nobel prize winners of chemistry in 2021
single_docu = prize_coll.find_one({"year":'2021', "category":"chemistry"})
print(single_docu)

Output

{‘_id’: ObjectId(‘629427daf79936a9106eb2be’), ‘year’: ‘2021’, ‘category’: ‘chemistry’, ‘laureates’: [{‘id’: ‘1002’, ‘firstname’: ‘Benjamin’, ‘surname’: ‘Li

st’, ‘motivation’: ‘”for the development of asymmetric organocatalysis”‘, ‘share’: ‘2’}, {‘id’: ‘1003’, ‘firstname’: ‘David’, ‘surname’: ‘MacMillan’, ‘moti

vation’: ‘”for the development of asymmetric organocatalysis”‘, ‘share’: ‘2’}]}

Count Number of Documents in Collection

Pymongo

Here we need to pass an empty query filter({}) to the count_documents method of find. If we pass any query then it will filter documents based on that query and return the count of documents based on the filter

prize_coll = nobel_prize_db['prize']
#to find number of documents in a collection
print(prize_coll.count_documents({}))

Update documents in a collection

Pymongo

We can update data in a collection using update_one() method and update_many() method. We need to pass a filter query to select a document to update and then the second argument is an update operation. There are a set of update operators that can be used for example here we set a new field to an existing document using the `$set` operator.

#update single document
val =prize_coll.update_one({"year":'2021', "category":"chemistry"}, {"$set":{"test_update": "update"}})
print(val.modified_count , val.matched_count)

Output

1 1

Dot Notation and Projection

Dot notation is how MongoDB allows us to query document substructure. MongoDB allows us to query document substructure using dot notation. In the laureates’ collection, each laureate has won a prize and for each prize, they are affiliated with a particular university/college during the time of winning.

Let us find the affiliated college where “Amartya Sen won the Nobel prize”.

Projections allow us to select which field we want to display in the returned documents from a query. Syntax is {: 0 or 1} 0 means hide and 1 means display and default is 0. If no projection is mentioned the whole document is returned as-is.

laureates_coll = nobel_prize_db['laureates']
affiliated_college = laureates_coll.find({"firstname":"Amartya"},{"prizes.affiliations.name": 1})
print(list(affiliated_college))

Output
[{'_id': ObjectId('629427e1f79936a9106eb808'), 'prizes': [{'affiliations': [{'name': 'Trinity College'}]}]}]

Composing Query Filters

There are a number of ways in which we can query the document and MongoDB provides a number of query selectors. We will go through some important ones here.

We can use a query filter document which uses the query operators to specify conditions in the following form:

{ : { :  }, ... }

$exists Operator

#Exists
no_bornCountry = laureates_coll.find({"bornCountry": {"$exists": False}}, {"firstname": 1})
list_no_bornCountry = list(no_bornCountry)
print("Count of winners with no born country:", len(list_no_bornCountry))
for v in list_no_bornCountry:

   print(v['firstname'])

Output

Count of winners with no born country: 26

Institute of International Law

Permanent International Peace Bureau

International Committee of the Red Cross

Nansen International Office for Refugees

Friends Service Council

American Friends Service Committee

Office of the United Nations High Commissioner for Refugees

League of Red Cross Societies

….

We can see there are 26 winners with no born country they are mainly organizations.

In and Greater Than Operator

Let us find winners of the chemistry and physics categories with a prize share value greater than equal to 4.

#In
counts = laureates_coll.count_documents({"prizes.category":{"$in":["physics","chemistry"]}, "prizes.share":{"$gte":"4"}})
print(counts)

Output
56

Introduction Aggregation Pipelines

There are cases where you may want to avoid having to fetch and iterate over lots of data client-side. MongoDB can do a good chunk of our data analysis and aggregation for us using the Aggregation Pipeline. In the aggregation pipeline, we define a series of stages which are transformations that needs to be done on the collection. It takes in an input collection performs the stages and returns the processed final results of documents.

Each stage of the pipeline is executed before transferring the output to the next stage for processing. They are extremely fast and performant.

Source: https://www.codeproject.com/Articles/1149682/Aggregation-in-MongoDB

Nobel Prize Data Analysis using Pipelines

We shall now apply the various aggregation stages to our data and see their usage. To read more about various aggregation pipelines visit :

Grouping Stage

This stage is similar to Group By in SQL wherein we can group based on any field and then perform operations on each group.

Lets us try to find the count of Nobel prizes in each category in the prize collection.

Mongo Compass

In Mongo Compass we can easily apply aggregations and view results in its GUI.

Pymongo

category_wise_count = list(prize_coll.aggregate([{
    "$group":{"_id": "$category", "categorywise_count": {"$count": {}}}
            }]))
print(category_wise_count)

Output

[{‘_id’: ‘peace’, ‘categorywise_count’: 121}, {‘_id’: ‘physics’, ‘categorywise_count’: 121}, {‘_id’: ‘literature’, ‘categorywise_count’: 121}, {‘_id’: ‘med

icine’, ‘categorywise_count’: 121}, {‘_id’: ‘chemistry’, ‘categorywise_count’: 121}, {‘_id’: ‘economics’, ‘categorywise_count’: 53}]

In the output we can see there are 6 categories and almost all have 121 prize counts except economics with a prize count of 53.

Unwind, Set, Sort, Project Stages

Now we will find the count of laureates grouped by category and country. For this, we have to perform 5 stages. First is the $unwind stage which unfolds each array element specified for the document. Then we use $group stage to group based on multiple fields this time. Next, we use the $set operator which allows us to create new fields for each document. Next, we sort the documents in descending order based on the count. Finally we $project the results.

Pymongo

results = list(laureates_coll.aggregate([
    #1st Stage Unwind prize Array
    {"$unwind":{"path": "$prizes", "preserveNullAndEmptyArrays": False}},
    #2nd Stage: Group based on category and borncountry and perform count operation
    {"$group":{"_id": {"category":"$prizes.category","country":"$bornCountry"}, "count": {"$count":{}}}},
    #3rd Stage Set new fields on documents from previous stage COuntry and category
    {"$set": {"category": "$_id.category","country":"$_id.country"}},
    #4th Stage : Sort the document in descending order of count from prev stage
    {"$sort":{"count": -1}},
    #5th Stage Project the country , category and count fields
    {"$project":{"_id" : 0,"category":1,"country":1,"count":1}}

]))

for i in results[:10]:
    print(i)

Output
{'count': 79, 'category': 'medicine', 'country': 'USA'} 
{'count': 70, 'category': 'physics', 'country': 'USA'}  
{'count': 55, 'category': 'chemistry', 'country': 'USA'}
{'count': 50, 'category': 'economics', 'country': 'USA'}
{'count': 28, 'category': 'peace'}                      
{'count': 25, 'category': 'medicine', 'country': 'United Kingdom'}
{'count': 25, 'category': 'chemistry', 'country': 'United Kingdom'}
{'count': 23, 'category': 'physics', 'country': 'United Kingdom'}
{'count': 22, 'category': 'chemistry', 'country': 'Germany'}
{'count': 19, 'category': 'peace', 'country': 'USA'}

Thus, we can observe that for the categories “medicine,physics ,chemistry,economics” highest number of winners are from the USA country followed by the UK.

Match, Date Handling

Let us now try to find the category-wise average age of laureates. For this, we use the match operator which acts as the select operator of SQL.

Pymongo

results = list(laureates_coll.aggregate([
    #1st Stage Unwind prize Array

{“$unwind”:{“path”: “$prizes”, “preserveNullAndEmptyArrays”: False}},

    #2nd Stage: Select documents which have born and prizes.year field
    {"$match": {"born": {"$exists": True},"prizes.year":{"$exists": True}}},
    #3rd Stage Set new fields on documents extract born year using substring
    {"$set": {"bornYear": {"$toLong" :{"$substr":["$born",0,4]}}, "winningYear":{"$toLong":"$prizes.year"} }},
    #4th Stage : Set new field which is calculated age of laureates at the time of winning nobel prize
    {"$set":{"age":{"$toInt":{"$subtract":["$winningYear","$bornYear"]} }}},
    #5th Stage Group the fields by category and find average age per category and count
    {"$group":{"_id" : "$prizes.category","average_age":{"$avg": "$age"},"category_wise_laureates_count":{"$count":{}} }}
]))

for i in results[:10]:
print(i)

Output

{‘_id’: ‘chemistry’, ‘average_age’: 58.755319148936174, ‘category_wise_laureates_count’: 188}

{‘_id’: ‘medicine’, ‘average_age’: 58.59375, ‘category_wise_laureates_count’: 224}

{‘_id’: ‘physics’, ‘average_age’: 56.821917808219176, ‘category_wise_laureates_count’: 219}

{‘_id’: ‘literature’, ‘average_age’: 64.86440677966101, ‘category_wise_laureates_count’: 118}

{‘_id’: ‘peace’, ‘average_age’: 57.68382352941177, ‘category_wise_laureates_count’: 136}

{‘_id’: ‘economics’, ‘average_age’: 66.7752808988764, ‘category_wise_laureates_count’: 89}

As we can see the average is highest for economics and the lowest for physics.

Sharding

Sharding is the process of dividing a dataset and storing them in multiple machines this improves the efficiency of operations for very large data.

A sharded cluster consists of :

Shards
Mongos
Config Server

The shard is a subset of the dataset.

The mongos serves as a query router for client requests, handling both read and write operations. Clients do not connect to individual shards instead they connect to a mongos. It dispatches client requests to appropriate shards & aggregates the result from shards to consistent client response.

Config servers are the handle source of sharding metadata. The metadata consists of various information such as sharded collections, routing information, etc.

Shard Key

MongoDB performs sharding at the collection level. MongoDB uses the shard key as a strategy to distribute collection documents across shards. MongoDB first splits data into “chunks”, by dividing the span of shard key values into non-overlapping ranges. MongoDB then tries to distribute those chunks evenly among the shards in the cluster.

Indexes

Indexes are data structures that store a small portion of the collection’s data set so that it can be traversed easily. The index stores the value of a specific field or set of fields, which are sorted by the value of the field. Thus, ordering index entries supports speedy equality matches and range-based query operations. Apart from this, MongoDB can return sorted results by using order in the index.

MongoDB defines indexes at the collection level.

https://www.mongodb.com/docs/manual/indexes/

By default, the object _id is the default index.

Creating of the Index

One can easily create an index in Mongo Compass.

Select the collection
Navigate to the Indexes Tab at the top
Click on create Index and then add fields.
Choose the strategy sort in ascending or descending.

Conclusion

Thus, we saw how MongoDB is becoming an indispensable NoSQL database with several applications in Data Science. We learned about installing and setting up Mongo server locally and went through all its essential and common operations with examples from the Nobel prize dataset. We then saw how to use the powerful Aggregation Pipeline in Python using pymongo.

As a further learning endeavour, one could explore MongoDB as a service through MongoDB Atlas.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Ritika

I am a professional working as data scientist after finishing my MBA in Business Analytics and Finance. A keen learner who loves to explore and understand and simplify stuff! I am currently learning about advanced ML and NLP techniques and reading up on various topics related to it including research papers .

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Intoduction to Python

Variables and data types

OOPs Concepts

Conditional statement

Looping Constructs

Data Structures

String Manipulation

Functions

Modules, Packages and Standard Libraries

Python Libraries for Data Science

Reading Data Files in Python

Preprocessing, Subsetting and Modifying Pandas Dataframes

Sorting and Aggregating Data in Pandas

Visualizing Patterns and Trends in Data

Programming

A Complete Guide on Using MongoDB

Introduction to MongoDB

MongoDB features

Document-oriented

Create Database and Collection in MongoDB

Common MongoDB Operations

Dot Notation and Projection

Composing Query Filters

Introduction Aggregation Pipelines

Nobel Prize Data Analysis using Pipelines

Grouping Stage

Mongo Compass

Unwind, Set, Sort, Project Stages

Match, Date Handling

Sharding

Shard Key

Indexes

Creating of the Index

Conclusion

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or

JSESSIONID

li_rm

AnalyticsSyncHistory

lms_analytics

liap

visit

li_at