Here’s a piece of advice I wish someone had given me when I was starting out in data science – learn as much as you can about working with databases.
Here’s a quick look at where your database knowledge will come into play:
And a whole lot more!
The incontrovertible truth is that we are generating data at an unprecedented pace and scale right now. More than 8,500 Tweets and 900 Instagram photos are uploaded every single second, which blows my mind. How are modern-day databases coping with such volumes of data?
To handle this much data, we need a distributed database system that runs on multiple nodes and is partition tolerant as well. This means that even if one of the nodes goes down, or the nodes cannot reach each other for any reason, the system should keep working. So partition tolerance is a must-have. But according to the CAP theorem, we cannot have partition tolerance, availability, and consistency all at the same time.
We have to trade off availability against consistency. For example, in a banking application, a customer should see the correct balance regardless of where they access it from. The results can be a few seconds late, but they should be highly consistent.
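To make the trade-off concrete, here is a minimal sketch of a toy store with two replicas. The class name `ReplicatedStore`, the `mode` flag, and the `partitioned` switch are all made up for illustration; real databases use far more elaborate replication protocols than this.

```python
class ReplicatedStore:
    """Toy store with two replicas, used only to illustrate the CAP trade-off."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = [{}, {}]   # two nodes, each holding its own copy
        self.partitioned = False   # True when the nodes cannot talk to each other

    def write(self, node, key, value):
        if not self.partitioned:
            # Healthy network: every replica receives the write.
            for replica in self.replicas:
                replica[key] = value
            return True
        if self.mode == "CP":
            # Consistency first: refuse the write rather than let replicas diverge.
            return False
        # Availability first: accept the write on the reachable node only.
        self.replicas[node][key] = value
        return True

    def read(self, node, key):
        return self.replicas[node].get(key)


# A bank-style store prefers consistency: the write is rejected during a partition.
bank = ReplicatedStore("CP")
bank.write(0, "balance", 100)
bank.partitioned = True
accepted = bank.write(0, "balance", 50)
print(accepted, bank.read(0, "balance"), bank.read(1, "balance"))

# An availability-first store accepts the write, so the two nodes now disagree.
feed = ReplicatedStore("AP")
feed.write(0, "likes", 10)
feed.partitioned = True
feed.write(0, "likes", 11)
print(feed.read(0, "likes"), feed.read(1, "likes"))
```

In the first case the system stays consistent but stops accepting writes. In the second it stays available, but the two nodes return different answers until the partition heals.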
In this article, we will see different types of NoSQL databases, their features, and when to use each database type.
So what is a NoSQL database?
You might have heard people say that a NoSQL database is any non-relational database that doesn’t store any relationships between the data. Well, that’s not completely true. NoSQL databases can also store relationships between the data, but in a different way.
We can say that “NoSQL” stands for “Not Only SQL”. Here, data does not have to be split into multiple tables, since all the data that is related can be kept together in a single data structure. When you work with a huge amount of data, this helps you avoid the performance lag of expensive joins at query time. NoSQL databases are highly scalable and reliable, and are designed to work in a distributed environment.
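Here is a rough sketch of that difference in plain Python, with no database involved. The customer and order records are invented for illustration: the first half mimics two relational tables linked by an id, and the second half keeps everything in a single nested structure.

```python
# Relational style: related data lives in separate tables linked by an id.
customers = [{"id": 1, "name": "Asha"}]
orders = [
    {"id": 101, "customer_id": 1, "item": "keyboard"},
    {"id": 102, "customer_id": 1, "item": "monitor"},
]


def items_with_join(customer_id):
    """Look up a customer's items by matching rows across the two tables."""
    return [o["item"] for o in orders if o["customer_id"] == customer_id]


# NoSQL style: the orders are stored inside the customer record itself.
customer_doc = {
    "id": 1,
    "name": "Asha",
    "orders": [
        {"id": 101, "item": "keyboard"},
        {"id": 102, "item": "monitor"},
    ],
}


def items_without_join(doc):
    """Read the items straight from the single nested structure."""
    return [o["item"] for o in doc["orders"]]


print(items_with_join(1))
print(items_without_join(customer_doc))
```

Both functions return the same items. The second one never has to match rows across tables, which is the work a join does.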
Now that we know what a NoSQL database is, let’s explore the different types of NoSQL databases in this section.
Document-based databases store the data in JSON-like objects. Each document is made up of key-value pairs.
Document-based databases are easy for developers to work with because a document maps directly to an object in code, and JSON is a very common data format among web developers. They are also very flexible and allow us to modify the structure at any time.
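As a small sketch of what this looks like in practice, the snippet below uses plain Python dictionaries and the standard `json` module rather than a real database. The `users` collection and its fields are hypothetical.

```python
import json

# Two documents in the same hypothetical "users" collection.
# They do not have to share the same set of fields.
users = [
    {"name": "Asha", "email": "asha@example.com"},
    {"name": "Ravi", "skills": ["python", "sql"], "address": {"city": "Pune"}},
]

# Changing the structure later is just a matter of adding a key.
users[0]["newsletter"] = True

# Each document maps directly to a JSON string and back to a Python object.
as_json = json.dumps(users[1])
restored = json.loads(as_json)

print(as_json)
print(restored["address"]["city"])
```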
Some examples of document-based databases are MongoDB, OrientDB, and BaseX.
As the name suggests, a key-value database stores the data as key-value pairs. Here, keys and values can be anything: strings, integers, or even complex objects. These databases are highly partitionable and are among the best at horizontal scaling. They can be really useful in session-oriented applications where we try to capture the behavior of the customer in a particular session.
Some of the examples are DynamoDB, Redis, and Aerospike.
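To see why key-value data is so easy to partition, here is a minimal in-memory sketch. The `PartitionedKVStore` class and the session keys are invented for illustration, and the simple hash-modulo placement is an assumption of this sketch, not a description of how any of the databases above place their data.

```python
import hashlib


class PartitionedKVStore:
    """Toy key-value store that spreads keys across several in-memory nodes."""

    def __init__(self, num_nodes):
        self.nodes = [{} for _ in range(num_nodes)]

    def _node_for(self, key):
        # Hash the key so the same key always lands on the same node.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.nodes)

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key, default=None):
        return self.nodes[self._node_for(key)].get(key, default)


store = PartitionedKVStore(num_nodes=3)

# Session data: the key is a session id, the value is any object we like.
store.put("session:abc123", {"user": "asha", "pages_viewed": ["/home", "/cart"]})
store.put("session:xyz789", {"user": "ravi", "pages_viewed": ["/home"]})

print(store.get("session:abc123"))
print(store.get("session:missing"))
```

Because every lookup needs only the key, each request touches exactly one node, and adding nodes spreads the load further.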
A wide-column database stores the data in records, similar to a relational database, but it can store a very large number of dynamic columns. It groups the columns logically into column families.
For example, where a relational database has multiple tables, a wide-column database has multiple column families.
Here is a good resource to learn more about column-based databases:
Popular examples of these types of databases are Cassandra and HBase.
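The data model can be pictured as a nested map: row key, then column family, then column. The sketch below is plain Python with made-up row keys and column names. It only illustrates the shape of the data, not how Cassandra or HBase store it on disk.

```python
from collections import defaultdict

# row key -> column family -> column name -> value
table = defaultdict(lambda: defaultdict(dict))


def put(row_key, family, column, value):
    """Store one cell; columns are created on the fly, per row."""
    table[row_key][family][column] = value


def get_family(row_key, family):
    """Return all columns of one family for one row."""
    return dict(table[row_key][family])


put("user1", "profile", "name", "Asha")
put("user1", "profile", "city", "Pune")
put("user1", "activity", "last_login", "2020-01-15")

# A second row can have a completely different set of columns.
put("user2", "profile", "name", "Ravi")
put("user2", "activity", "clicks_2020_01", 42)

print(get_family("user1", "profile"))
print(get_family("user2", "activity"))
```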
Graph databases store the data in the form of nodes and edges. The nodes store information about the main entities such as people, places, and products, and the edges store the relationships between them. These work best when you need to find relationships or patterns among your data points, as in social networks and recommendation engines.
Some examples are Neo4j and Amazon Neptune.
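Here is a minimal sketch of the nodes-and-edges idea in plain Python, using a tiny invented social network. The `friends_of` and `suggest_friends` helpers are hypothetical. A real graph database would express the same traversal in its own query language.

```python
# Nodes: the main entities, keyed by a hypothetical id.
nodes = {
    "asha": {"type": "person"},
    "ravi": {"type": "person"},
    "meera": {"type": "person"},
    "john": {"type": "person"},
}

# Edges: relationships between nodes, stored as (source, relation, target).
edges = [
    ("asha", "FRIEND", "ravi"),
    ("ravi", "FRIEND", "meera"),
    ("ravi", "FRIEND", "john"),
    ("asha", "FRIEND", "john"),
]


def friends_of(person):
    """Treat FRIEND edges as undirected and collect direct friends."""
    found = set()
    for source, relation, target in edges:
        if relation != "FRIEND":
            continue
        if source == person:
            found.add(target)
        elif target == person:
            found.add(source)
    return found


def suggest_friends(person):
    """Suggest friends of friends who are not already direct friends."""
    direct = friends_of(person)
    suggestions = set()
    for friend in direct:
        suggestions |= friends_of(friend)
    return suggestions - direct - {person}


print(suggest_friends("asha"))
```

This friend-of-a-friend lookup is the kind of pattern query that becomes awkward with joins but is natural when relationships are stored as edges.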
Now, let’s have a look at some of the NoSQL databases and their features.
MongoDB is the most widely used document-based database. It stores documents in BSON, a binary form of JSON.
According to the website stackshare.io, more than 3,400 companies are using MongoDB in their tech stack. Uber, Google, eBay, Nokia, and Coinbase are some of them.
If you want to start with MongoDB, I highly recommend going through the below articles:
Cassandra is an open-source, distributed database system that was initially built by Facebook (and inspired by Google’s Bigtable). It is highly available and quite scalable. It can handle petabytes of information and thousands of concurrent requests per second.
Again, according to stackshare.io, more than 400 companies are using Cassandra in their tech stack. Facebook, Instagram, Netflix, Spotify, and Coursera are some of them.
Elasticsearch is also an open-source, distributed NoSQL database system. It is highly scalable and consistent. You can also call it an analytics engine. It can easily analyze, store, and search huge volumes of data.
If full-text search is a part of your use case, Elasticsearch will be the best fit for your tech stack. It even allows search with fuzzy matching.
More than 3,000 companies are using Elasticsearch in their tech stack, including Slack, Udemy, Medium, and Stack Overflow.
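To give a feel for what fuzzy matching means, here is a pure-Python sketch based on Levenshtein edit distance, the measure that Elasticsearch's fuzzy queries rely on. The `edit_distance` and `fuzzy_search` functions and the list of terms are my own illustration, not Elasticsearch code or its API.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def fuzzy_search(query, terms, max_edits=1):
    """Return the indexed terms within max_edits of the query."""
    query = query.lower()
    return [t for t in terms if edit_distance(query, t.lower()) <= max_edits]


indexed_terms = ["database", "dataset", "document", "dynamo"]
print(fuzzy_search("databse", indexed_terms))
```

The misspelled query still finds the intended term because it is only one edit away. Comparing the query against every term like this would be far too slow at scale, so treat it purely as an illustration of the idea.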
DynamoDB is a key-value based distributed database system created by Amazon, and it is highly scalable. Unfortunately, it is not open-source. Still, it can handle 10 trillion requests per day, so you can see why companies use it anyway!
More than 700 companies are using DynamoDB in their tech stack, including Snapchat, Lyft, and Samsung.
HBase is also an open-source, highly scalable distributed database system. It is written in Java and runs on top of the Hadoop Distributed File System (HDFS).
More than 70 companies are using HBase in their tech stack, such as Hike, Pinterest, and HubSpot.
This is by no means an exhaustive list. There are more NoSQL databases out there but these are the most widely used in the industry.
If you have worked with any of these databases or any other NoSQL database, let me know in the comments section below. I would love to hear about your experience!