Every day, more than 2.5 quintillion bytes of data are created on the internet. This data keeps growing in variety, velocity, and volume, hence the term big data. To analyze it, one has to collect it, store it in a safe place, clean it, and then perform the analysis. One of the major problems big data engineers face is dealing with useless or redundant data. A lot of time and memory is spent storing and analyzing this extra data, which turns out to be fruitless in the end. Removing duplicate data is therefore essential to cut analysis costs and reduce redundancy.
Data cleaning can be done using various techniques, but before cleaning the data it is necessary to know how much useful data the dataset contains. Therefore, before removing duplicates from a data stream or database, it helps to know how many distinct or unique elements are present. One way to do so is by hashing the elements of the universal set using the Flajolet-Martin algorithm. The FM algorithm is used in database queries, big data analytics, spectrum sensing in cognitive radio sensor networks, and many other areas, and it performs well compared with many other methods for counting distinct elements in a stream of data.
This article was published as a part of the Data Science Blogathon
The Flajolet-Martin algorithm, also known as the FM algorithm, approximates the number of unique elements in a data stream or database in one pass. The highlight of this algorithm is that it uses very little memory while executing.
Pseudo Code-Stepwise Solution:
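The stepwise logic can be sketched as a small Python function. The helper names (trailing_zeros, fm_estimate) and the default hash coefficients a = 1, b = 6, c = 32 mirror the worked example later in the article; they are illustrative choices, not a prescribed implementation:

```python
def trailing_zeros(n: int) -> int:
    # Step 3 helper: count trailing zero bits in the binary form of n.
    # (r(0) is taken as 0 here to keep the sketch simple.)
    if n == 0:
        return 0
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count

def fm_estimate(stream, a=1, b=6, c=32):
    # Step 1: initialise the maximum tail length R to 0.
    R = 0
    for x in stream:
        # Step 2: hash each element with h(x) = (a*x + b) mod c.
        h = (a * x + b) % c
        # Steps 3-4: keep the longest run of trailing zeros seen so far.
        R = max(R, trailing_zeros(h))
    # Step 5: estimate the number of distinct elements as 2^R.
    return 2 ** R

stream = [1, 2, 3, 4, 5, 6, 4, 2, 5, 9, 1, 6, 3, 7, 1, 2, 2, 4, 2, 1]
print(fm_estimate(stream))  # 8
```

With these coefficients the element 2 hashes to 8 (binary 1000, three trailing zeros), so the estimate is 2^3 = 8, which happens to equal the true distinct count of this stream.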
The Flajolet-Martin algorithm is a potent tool for estimating the count of distinct elements in a database or data stream. Its utility shines when the dataset is very large and processing it exactly in memory is impractical.
Let us compare this algorithm with a conventional algorithm using Python code. Assume we have an array (stream in the code) of length 20 with 8 unique elements. The brute-force approach to counting unique elements examines each element in turn. A second array (st_unique in the code) holds the unique elements. Initially the new array is empty (st_unique has length zero), so the first element cannot already be present in it; it is therefore considered unique, and a copy of it is inserted into the new array (1 is appended to st_unique).
Similarly, every element is checked: if it is already present in the new array, it is not counted as unique; otherwise a copy of it is inserted into the new array. Running the brute-force algorithm on our array, we end up with 8 elements in the new array. If each element takes 20 bytes, the new array alone needs 8 * 20 = 160 bytes of memory while the algorithm runs.
import time

stream = [1, 2, 3, 4, 5, 6, 4, 2, 5, 9, 1, 6, 3, 7, 1, 2, 2, 4, 2, 1]
print('Using conventional Algorithm:')
start_time = time.time()
st_unique = []
for i in stream:
    if i in st_unique:
        continue
    else:
        st_unique.append(i)
print('distinct elements', len(st_unique))
print("--- %s seconds ---" % (time.time() - start_time))
If we use the FM algorithm for the same array, we define a variable (maxnum in the code) that stores the maximum number of trailing zeros seen so far. For each value in the array (stream in the code), we compute its hash in the form (a*x + b) mod c (here a = 1, b = 6, and c = 32) and convert the result to binary (the slice [2:] removes the '0b' prefix Python adds when converting a number to binary). A second loop then checks whether the number of trailing zeros of this hash exceeds the current maximum.
In this case, if each variable occupies 20 bytes, the whole program takes 4 * 20 = 80 bytes, i.e., half the memory used in the brute-force case above. Here we count the variables maxnum, sum, val, and j; we do not count i, time, and stream, as they are common to both programs. More importantly, this working memory stays constant no matter how long the stream grows, whereas the brute-force array grows with the number of unique elements.
import time

stream = [1, 2, 3, 4, 5, 6, 4, 2, 5, 9, 1, 6, 3, 7, 1, 2, 2, 4, 2, 1]
print('Using Flajolet Martin Algorithm:')
start_time = time.time()
maxnum = 0
for i in range(0, len(stream)):
    # hash h(x) = (1*x + 6) mod 32, converted to binary without the '0b' prefix
    val = bin((1 * stream[i] + 6) % 32)[2:]
    sum = 0
    # count trailing zeros of the hash
    for j in range(len(val) - 1, -1, -1):
        if val[j] == '0':
            sum += 1
        else:
            break
    if sum > maxnum:
        maxnum = sum
print('distinct elements', 2 ** maxnum)
print("--- %s seconds ---" % (time.time() - start_time))
For small values of m (where m is the number of unique elements), the brute-force approach works, but for large datasets or data streams, where m is very large, it needs a lot of space; in some cases the program may even run out of memory.
This is where the Flajolet-Martin algorithm can be used: it occupies far less memory. We also timed both programs with time.time() in Python, but on a stream of only 20 elements the measured wall-clock times fluctuate from run to run, so the printed seconds are only indicative. The algorithm's real advantage, in both memory and speed, shows up on large streams.
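The memory gap can be made concrete with Python's sys.getsizeof. Exact byte counts are implementation-dependent, so treat this only as an illustration of the growth pattern, not as precise figures:

```python
import sys

stream = [1, 2, 3, 4, 5, 6, 4, 2, 5, 9, 1, 6, 3, 7, 1, 2, 2, 4, 2, 1]

# Brute force: the auxiliary list grows with the number of distinct elements m.
st_unique = []
for x in stream:
    if x not in st_unique:
        st_unique.append(x)
brute_bytes = sys.getsizeof(st_unique)

# Flajolet-Martin: the auxiliary state is one integer, regardless of stream length.
maxnum = 0
for x in stream:
    val = bin((x + 6) % 32)[2:]
    maxnum = max(maxnum, len(val) - len(val.rstrip('0')))
fm_bytes = sys.getsizeof(maxnum)

print(brute_bytes, fm_bytes)  # the list's footprint grows with m; the integer's does not
```

Appending a million more distinct elements would inflate brute_bytes further, while fm_bytes would stay the size of a single small integer.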
It is important to choose the hash parameters wisely when implementing this algorithm, as it has been observed in practice that the FM algorithm is very sensitive to them. The hash function used in the code above has the form (a*x + b) mod c, where x is an element of the stream and the parameters a, b, c are 1, 6, and 32 respectively. For other values of a, b, and c the algorithm may not give the same result, so it is important to observe the stream and then assign suitable values to a, b, and c.
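To see this sensitivity in action, the sketch below re-runs the estimate with the article's parameters and with one alternative triple. The alternative (3, 1, 32) is an arbitrary choice for illustration, not a recommended setting:

```python
def fm_estimate(stream, a, b, c):
    # Flajolet-Martin estimate using the hash h(x) = (a*x + b) mod c.
    maxnum = 0
    for x in stream:
        val = bin((a * x + b) % c)[2:]
        # trailing zeros = total length minus length with trailing '0's stripped
        maxnum = max(maxnum, len(val) - len(val.rstrip('0')))
    return 2 ** maxnum

stream = [1, 2, 3, 4, 5, 6, 4, 2, 5, 9, 1, 6, 3, 7, 1, 2, 2, 4, 2, 1]
print(fm_estimate(stream, 1, 6, 32))  # 8  -> matches the true distinct count
print(fm_estimate(stream, 3, 1, 32))  # 16 -> same stream, overestimated
```

With (3, 1, 32), the element 5 hashes to 16 (binary 10000, four trailing zeros), pushing the estimate to 2^4 = 16 for a stream that only has 8 distinct values.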
This approach is used to maintain a count of distinct values seen so far over a large number of values, for example, approximating the number of distinct URLs a person has visited on the web. Many companies want to know how many unique users logged in to their website the previous day, to judge whether their advertisement was successful. The FM algorithm is an excellent solution for these companies.
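Because a single hash function makes the estimate fragile, practical deployments typically run several independent hash functions and combine their estimates, for example by taking the median. Here is a minimal sketch under that idea; the function names and the random odd-coefficient scheme are illustrative assumptions, not a specific production design:

```python
import random
import statistics

def fm_single(stream, a, b, c=2**16):
    # One FM pass with hash h(x) = (a*x + b) mod c.
    maxnum = 0
    for x in stream:
        val = bin((a * x + b) % c)[2:]
        maxnum = max(maxnum, len(val) - len(val.rstrip('0')))
    return 2 ** maxnum

def fm_median(stream, trials=15, seed=42):
    # Median over several randomly drawn hash functions damps outlier estimates
    # from any single unlucky (a, b) pair; odd 'a' avoids degenerate hashes.
    rng = random.Random(seed)
    estimates = [fm_single(stream, rng.randrange(1, 2**16, 2), rng.randrange(2**16))
                 for _ in range(trials)]
    return statistics.median(estimates)

stream = [1, 2, 3, 4, 5, 6, 4, 2, 5, 9, 1, 6, 3, 7, 1, 2, 2, 4, 2, 1]
print(fm_median(stream))
```

Each single-pass estimate is a power of two and can swing widely; the median over many trials is far more stable, which is why refinements along these lines led to modern sketches such as LogLog and HyperLogLog.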
If you want to learn more of these data science concepts, enroll in our Blackbelt program that offers exclusive modules for data science enthusiasts. Explore the program now.
Q. What is the Flajolet-Martin algorithm?
A. The Flajolet-Martin algorithm is a probabilistic method to estimate the cardinality, or number of distinct elements, in a dataset or data stream. It employs hash functions and bit manipulation for efficient approximation.
Q. How does the Flajolet-Martin algorithm work on a data stream?
A. The Flajolet-Martin algorithm uses hash-based techniques to estimate unique elements in a data stream. It leverages bit patterns to approximate cardinality, making it efficient for large-scale datasets with minimal memory usage.
Q. What is the role of the Flajolet-Martin algorithm in data analytics?
A. In data analytics, the Flajolet-Martin algorithm aids in estimating distinct elements within large datasets. It offers a memory-efficient way to understand dataset diversity, which is crucial for insights and decision-making.
Q. Is the Flajolet-Martin algorithm used in machine learning?
A. In machine learning, the Flajolet-Martin algorithm isn't a central component. However, its probabilistic approach might find applications in tasks where estimating the cardinality of distinct features is relevant, such as feature engineering or data preprocessing.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
Actually, in your experiment FM appears to be slower: note that the running time of "conventional" is multiplied by e-05. 20 elements is just too small to appreciate the difference.
Insightful read. Fascinating to even think about the possible applications for this Algorithm. This info can be leveraged in a big way .
Your code doesn't implement bloom filter algorithm at all. first study the algorithm and then implement it all you have done is just iterate the list and store the distinct elements. I'm surprised this article is present on such a good platform.