This article was published as a part of the Data Science Blogathon.
Sharing data has become easy: a few clicks are enough to hand over personal details. To access services, we share essential details such as email IDs, phone numbers, and social security numbers. These details can leak if the service provider doesn’t follow a robust data protection methodology. Many data breaches happen due to negligent or accidental exposure, which may impact the user personally, professionally, or economically. Email IDs, phone numbers, and government-issued ID numbers are sensitive and confidential, and we must protect them so they can’t get into the wrong hands.
In this article, we will work on two different methods to encrypt these data so that they can’t get into the hands of unauthorized users. We will see how we can encrypt and decrypt the sensitive data using PySpark.
Data encryption is essential in several contexts. Suppose an organization works with many clients and has to share data to provide services to them. Clients share confidential details with such firms, like their databases, customer information, and the products they sell or purchase.
All these details are sensitive and must be protected so they can’t get into the wrong hands. If unauthorized individuals access these data, it can lead to severe consequences such as financial loss, reputational damage, or even legal liabilities.
So data encryption helps us to protect sensitive and confidential information. It is a very crucial aspect of data security.
To perform encryption and decryption, we need sample data with essential information like user email ID, phone number, social security number, address, etc. Before these details are sent to anyone, they need to be encrypted. So, we will create a sample dataframe that has this information. The dataframe has four columns named ‘customer_name’, ‘mail_id’, ‘mobile_num’, and ‘social_security_number’, holding the customer’s name, email ID, mobile number, and social security number, respectively.
# import necessary libs
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

# create a SparkSession
spark = SparkSession.builder.appName("demo").getOrCreate()

# define the schema for the DataFrame
schema = StructType([
    StructField("customer_name", StringType(), True),
    StructField("mail_id", StringType(), True),
    StructField("mobile_num", LongType(), True),
    StructField("social_security_number", StringType(), True)
])

# create the sample data
data = [
    ("Max", '[email protected]', 9789457864, '7548-8546-4512'),
    ("Michael", '[email protected]', 9089848243, '7845-8745-8756'),
    ("Alex", '[email protected]', 9589848643, '3245-6547-9854'),
    ("Hector", '[email protected]', 9189648245, '6547-7845-2150')
]

# create the DataFrame
df = spark.createDataFrame(data, schema)
df.show()
The output of the above dataframe will be:-
+-------------+-------------------+----------+----------------------+
|customer_name|            mail_id|mobile_num|social_security_number|
+-------------+-------------------+----------+----------------------+
|          Max|  [email protected]|9789457864|        7548-8546-4512|
|      Michael|  [email protected]|9089848243|        7845-8745-8756|
|         Alex|  [email protected]|9589848643|        3245-6547-9854|
|       Hector|  [email protected]|9189648245|        6547-7845-2150|
+-------------+-------------------+----------+----------------------+
The above data, like name, email, mobile number, and social security number, are the user’s personal information and can’t be shared directly with any other person or organization. To share these details, we must encrypt and send this data. On the other end, the receiver can decrypt this data with the key.
We will work with PySpark’s built-in functions to encrypt the above dataframe. We will use the aes_encrypt function to encrypt the ‘mail_id’, ‘mobile_num’, and ‘social_security_number’ columns, and later the aes_decrypt function to decrypt the encrypted data. These functions are available from Spark 3.3 onward. Finally, we will compare the decrypted values with the original values to confirm that decryption succeeded.
aes_encrypt() – This function encrypts plain text using AES and returns the encrypted value as binary data.
Syntax of this function is aes_encrypt(expr, key[, mode[, padding]]). It takes the following arguments as input:-
expr: the column or value to encrypt.
key: the secret key used for encryption. The same key is needed later for decryption.
mode: the block cipher mode, such as ECB or GCM. This argument is optional.
padding: the padding scheme, such as PKCS. This argument is optional.
This function supports key lengths of 16, 24, and 32 bytes (128, 192, and 256 bits). The default mode is GCM.
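AES key lengths are counted in bytes: 16, 24, and 32 bytes correspond to AES-128, AES-192, and AES-256. As a minimal sketch in plain Python, a key could be checked before it is passed to aes_encrypt. The helper name `aes_variant` is hypothetical and is not part of PySpark.

```python
# Hypothetical helper (not part of PySpark): map a key to the AES variant
# implied by its length in bytes.
AES_VARIANTS = {16: "AES-128", 24: "AES-192", 32: "AES-256"}


def aes_variant(key: str) -> str:
    """Return the AES variant for a key, or raise if the length is unsupported."""
    n_bytes = len(key.encode("utf-8"))  # length in bytes, not characters
    if n_bytes not in AES_VARIANTS:
        raise ValueError(f"key is {n_bytes} bytes; expected 16, 24, or 32")
    return AES_VARIANTS[n_bytes]


# '1234567890abcdef' is the 16-character key used for mail_id in this article.
print(aes_variant("1234567890abcdef"))
```

Measuring the encoded length rather than the character count matters for keys that contain non-ASCII characters, which take more than one byte each in UTF-8.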
Now we will pass the column names inside the expr function to encrypt the data values. The columns we will encrypt are ‘mail_id’, ‘mobile_num’, and ‘social_security_number’. We will store the encrypted data in a new dataframe.
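from pyspark.sql.functions import expr

# encrypt each sensitive column with its own 16-byte key, then
# base64-encode the binary ciphertext so it can be shown as text
enc_df = (
    df.withColumn('encrypted_mail', expr("base64(aes_encrypt(mail_id, '1234567890abcdef', 'ECB', 'PKCS'))"))
    # mobile_num is a LongType column, so cast it to string explicitly before encrypting
    .withColumn('encrypted_mobile_num', expr("base64(aes_encrypt(cast(mobile_num as string), '1234567890abcdgh', 'ECB', 'PKCS'))"))
    .withColumn('encrypted_ssn', expr("base64(aes_encrypt(social_security_number, '1234567890abcdij', 'ECB', 'PKCS'))"))
)
enc_df.show()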
Here, we created the new columns using the ‘withColumn’ function and passed the encryption call to the expr function as a SQL expression. We used ‘1234567890abcdef’ as the key to encrypt the ‘mail_id’ data, ECB as the mode, and PKCS as the padding scheme. The same goes for the other two columns; only the keys are different. Since aes_encrypt returns binary data, we also used ‘base64’ to convert the bytes into a text string. Note that ECB encrypts identical inputs to identical outputs, so the default GCM mode is generally the safer choice outside a demo.
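The base64 step can be illustrated without Spark. The following is a small sketch using Python’s standard library, with made-up bytes standing in for real ciphertext.

```python
import base64

# Made-up bytes standing in for AES ciphertext; real ciphertext is arbitrary
# binary and usually contains bytes that are not printable text.
ciphertext = bytes([0xB2, 0x4D, 0xF7, 0x26, 0xF4, 0x71, 0x4D, 0x5F])

# base64 maps every 3 bytes to 4 printable ASCII characters.
encoded = base64.b64encode(ciphertext).decode("ascii")
print(encoded)

# Decoding restores the exact original bytes, which decryption needs.
decoded = base64.b64decode(encoded)
assert decoded == ciphertext
```

This is encoding, not encryption: anyone can reverse base64 without a key, so it only changes how the ciphertext is displayed and stored.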
Now we get the encrypted data which looks like this.
+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+
|customer_name|            mail_id|mobile_num|social_security_number|      encrypted_mail|encrypted_mobile_num|       encrypted_ssn|
+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+
|          Max|  [email protected]|9789457864|        7548-8546-4512|sk33JvRxTV9PU11qw...|4DF70TSV5/k2f7XDy...|kruqxwUhDD582Q4mf...|
|      Michael|  [email protected]|9089848243|        7845-8745-8756|RzIRtA7ihZG7YlRj9...|eaMgFEdzEkqz7b6+Q...|QvfthH7TQqL6aJNp6...|
|         Alex|  [email protected]|9589848643|        3245-6547-9854|ZahqBXBlprhgNfTyU...|msPEyWULCkIhbtel0...|1Majk18XVhQIJ10J5...|
|       Hector|  [email protected]|9189648245|        6547-7845-2150|O3JpFSx0DGqs+XSIO...|647cANlvcGS4rwwVU...|cMH3zNTAgq8RmHL5R...|
+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+
So, we have our encrypted data, and now we will see how to decrypt this data and get our original data back.
aes_decrypt() – We use this function to decrypt the data values. We pass it the column whose data needs to be decrypted, along with the same key, mode, and padding that were used for encryption. It returns the decrypted values as binary data.
Syntax of this function is aes_decrypt(expr, key[, mode[, padding]]). The output of this function is the decrypted original data. This function supports key lengths of 16, 24, and 32 bytes.
Now we will pass the encrypted data columns in this function and compare the results with the original data.
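One way the decryption step could look is sketched below. It assumes the `enc_df` dataframe and the three keys from the encryption step, and the helper names `decrypt_expr` and `decrypt_columns` are hypothetical. Because the ciphertext was base64-encoded, it is decoded with unbase64 first, and because aes_decrypt returns binary data, the result is cast back to a string.

```python
# Hypothetical helpers, assuming enc_df and the keys from the encryption step.

def decrypt_expr(column: str, key: str) -> str:
    """Build a SQL expression that reverses base64(aes_encrypt(...))."""
    return (
        f"cast(aes_decrypt(unbase64({column}), '{key}', 'ECB', 'PKCS') "
        "as string)"
    )


def decrypt_columns(enc_df):
    """Add decrypted columns to a dataframe produced by the encryption step."""
    # Imported here so that decrypt_expr can be used without Spark installed.
    from pyspark.sql.functions import expr

    return (
        enc_df
        .withColumn("decrypted_mail", expr(decrypt_expr("encrypted_mail", "1234567890abcdef")))
        .withColumn("decrypted_mobile_num", expr(decrypt_expr("encrypted_mobile_num", "1234567890abcdgh")))
        .withColumn("decrypted_ssn", expr(decrypt_expr("encrypted_ssn", "1234567890abcdij")))
    )


# Show the SQL expression that would be generated for the mail column.
print(decrypt_expr("encrypted_mail", "1234567890abcdef"))
```

In a Spark session, calling `decrypt_columns(enc_df).show()` should then display the decrypted columns next to the encrypted ones. The mode and padding must match the values used during encryption, otherwise decryption fails.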
# original data
+-------------+-------------------+----------+----------------------+
|customer_name|            mail_id|mobile_num|social_security_number|
+-------------+-------------------+----------+----------------------+
|          Max|  [email protected]|9789457864|        7548-8546-4512|
|      Michael|  [email protected]|9089848243|        7845-8745-8756|
|         Alex|  [email protected]|9589848643|        3245-6547-9854|
|       Hector|  [email protected]|9189648245|        6547-7845-2150|
+-------------+-------------------+----------+----------------------+

# encrypted data
+-------------+--------------------+--------------------+--------------------+
|customer_name|      encrypted_mail|encrypted_mobile_num|       encrypted_ssn|
+-------------+--------------------+--------------------+--------------------+
|          Max|sk33JvRxTV9PU11qw...|4DF70TSV5/k2f7XDy...|kruqxwUhDD582Q4mf...|
|      Michael|RzIRtA7ihZG7YlRj9...|eaMgFEdzEkqz7b6+Q...|QvfthH7TQqL6aJNp6...|
|         Alex|ZahqBXBlprhgNfTyU...|msPEyWULCkIhbtel0...|1Majk18XVhQIJ10J5...|
|       Hector|O3JpFSx0DGqs+XSIO...|647cANlvcGS4rwwVU...|cMH3zNTAgq8RmHL5R...|
+-------------+--------------------+--------------------+--------------------+

# decrypted data
+-------------+-------------------+--------------------+--------------+
|customer_name|     decrypted_mail|decrypted_mobile_num| decrypted_ssn|
+-------------+-------------------+--------------------+--------------+
|          Max|  [email protected]|          9789457864|7548-8546-4512|
|      Michael|  [email protected]|          9089848243|7845-8745-8756|
|         Alex|  [email protected]|          9589848643|3245-6547-9854|
|       Hector|  [email protected]|          9189648245|6547-7845-2150|
+-------------+-------------------+--------------------+--------------+
Now we will use the cryptography library to perform encryption and decryption. In this, we will create a user-defined function (udf) that will take data and complete the encryption and decryption.
Encrypting –
# import necessary libs
from pyspark.sql.functions import udf, lit, col
from pyspark.sql.types import StringType
from cryptography.fernet import Fernet

# encrypt func
def encrypt_data(plain_text, KEY):
    # convert the key to bytes, as decrypt_data does
    f = Fernet(bytes(KEY))
    # cast to str first so numeric columns such as mobile_num can be encoded
    encrypted_text = f.encrypt(str(plain_text).encode()).decode()
    return encrypted_text

encrypt_udf = udf(encrypt_data, StringType())

# generate the encryption key
Key = Fernet.generate_key()

# encrypt the 'mail_id', 'mobile_num', and 'social_security_number' cols
enc_df = (
    df.withColumn("encrypted_mail_id", encrypt_udf(col('mail_id'), lit(Key)))
    .withColumn("encrypted_mobile_num", encrypt_udf(col('mobile_num'), lit(Key)))
    .withColumn("encrypted_ssn", encrypt_udf(col('social_security_number'), lit(Key)))
)
enc_df.show()
Here, we generate the encryption key using the cryptography library, then pass each column that we want to encrypt to the udf along with the key. Now we will see the encrypted results.
+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+
|customer_name|            mail_id|mobile_num|social_security_number|   encrypted_mail_id|encrypted_mobile_num|       encrypted_ssn|
+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+
|          Max|  [email protected]|9789457864|        7548-8546-4512|gAAAAABjpED66V3Xw...|gAAAAABjpED6oaixb...|gAAAAABjpED6TWeAg...|
|      Michael|  [email protected]|9089848243|        7845-8745-8756|gAAAAABjpED7nVl6j...|gAAAAABjpED77xy8P...|gAAAAABjpED7D73yg...|
|         Alex|  [email protected]|9589848643|        3245-6547-9854|gAAAAABjpED7Iuq5N...|gAAAAABjpED73BQYd...|gAAAAABjpED7OjE8W...|
|       Hector|  [email protected]|9189648245|        6547-7845-2150|gAAAAABjpED7sT3Tz...|gAAAAABjpED7lH29J...|gAAAAABjpED7SXANT...|
+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+
So, the ‘mail_id’, ‘mobile_num’, and ‘social_security_number’ columns get encrypted.
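Every encrypted value above starts with the same gAAAAAB prefix. That follows from the Fernet token format, which begins with a version byte (0x80) followed by an 8-byte creation timestamp, so the prefix carries no secret data. As a small sketch using only the standard library, the first 12 characters of the first displayed value can be decoded like this:

```python
import base64
from datetime import datetime, timezone

# First 12 characters of the first encrypted_mail_id value shown above.
# 12 base64 characters decode to exactly 9 bytes:
# 1 version byte followed by 8 timestamp bytes.
token_prefix = "gAAAAABjpED6"

raw = base64.urlsafe_b64decode(token_prefix)
version = raw[0]                             # Fernet version marker
timestamp = int.from_bytes(raw[1:9], "big")  # seconds since the Unix epoch

print(hex(version))
print(datetime.fromtimestamp(timestamp, tz=timezone.utc))
```

After the timestamp, a Fernet token carries a random initialization vector, so encrypting the same value twice gives different tokens. This is why the encrypted values differ between runs even when the key and the input stay the same.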
Now we will see how to decrypt these encrypted columns to get the original values back.
Decrypting –
# decrypt func
def decrypt_data(encrypted_text, KEY):
    # convert the key to bytes before building the Fernet object
    f = Fernet(bytes(KEY))
    decoded_val = f.decrypt(encrypted_text.encode()).decode()
    return decoded_val

decrypt_udf = udf(decrypt_data, StringType())

# decrypt the data
dec_df = (
    enc_df.withColumn("decrypted_mail_id", decrypt_udf(col('encrypted_mail_id'), lit(Key)))
    .withColumn("decrypted_mobile_num", decrypt_udf(col('encrypted_mobile_num'), lit(Key)))
    .withColumn("decrypted_ssn", decrypt_udf(col('encrypted_ssn'), lit(Key)))
    .drop('mail_id', 'mobile_num', 'social_security_number')
)
dec_df.show()
Here, we successfully decrypted the data and got back our original values. We can now see the result and compare it with the actual data.
# original and encrypted data
+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+
|customer_name|            mail_id|mobile_num|social_security_number|   encrypted_mail_id|encrypted_mobile_num|       encrypted_ssn|
+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+
|          Max|  [email protected]|9789457864|        7548-8546-4512|gAAAAABjpED66V3Xw...|gAAAAABjpED6oaixb...|gAAAAABjpED6TWeAg...|
|      Michael|  [email protected]|9089848243|        7845-8745-8756|gAAAAABjpED7nVl6j...|gAAAAABjpED77xy8P...|gAAAAABjpED7D73yg...|
|         Alex|  [email protected]|9589848643|        3245-6547-9854|gAAAAABjpED7Iuq5N...|gAAAAABjpED73BQYd...|gAAAAABjpED7OjE8W...|
|       Hector|  [email protected]|9189648245|        6547-7845-2150|gAAAAABjpED7sT3Tz...|gAAAAABjpED7lH29J...|gAAAAABjpED7SXANT...|
+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+

# decrypted data
+-------------+--------------------+--------------------+--------------------+-------------------+--------------------+--------------+
|customer_name|   encrypted_mail_id|encrypted_mobile_num|       encrypted_ssn|  decrypted_mail_id|decrypted_mobile_num| decrypted_ssn|
+-------------+--------------------+--------------------+--------------------+-------------------+--------------------+--------------+
|          Max|gAAAAABjpEE9TcrVL...|gAAAAABjpEE907red...|gAAAAABjpEE92mIuZ...|  [email protected]|          9789457864|7548-8546-4512|
|      Michael|gAAAAABjpEE9UXJF6...|gAAAAABjpEE9OlqYJ...|gAAAAABjpEE9TV8rm...|  [email protected]|          9089848243|7845-8745-8756|
|         Alex|gAAAAABjpEE93b3z_...|gAAAAABjpEE9knvQ7...|gAAAAABjpEE9rXc4g...|  [email protected]|          9589848643|3245-6547-9854|
|       Hector|gAAAAABjpEE9bbV1Z...|gAAAAABjpEE9DfOWj...|gAAAAABjpEE9Lvw6g...|  [email protected]|          9189648245|7845-8745-8756|
+-------------+--------------------+--------------------+--------------------+-------------------+--------------------+--------------+
Note:- Encryption and hashing are different things. Hashing is one-way: once done, it cannot be reverted to the original data. Encryption, by contrast, is reversible: we can decrypt the encrypted values later with the key to get the actual data back.
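The one-way nature of hashing can be seen with Python’s standard hashlib module. This is only an illustrative sketch; SHA-256 is chosen here as an example hash function and is not used elsewhere in this article.

```python
import hashlib

ssn = "7548-8546-4512"  # sample value from the dataframe above

# SHA-256 always yields a 32-byte digest (64 hex characters),
# whatever the length of the input.
digest = hashlib.sha256(ssn.encode("utf-8")).hexdigest()
print(digest)

# Hashing is deterministic: the only way to check a value is to
# hash it again and compare the digests.
assert hashlib.sha256(ssn.encode("utf-8")).hexdigest() == digest

# There is no key and no inverse function; a different input
# simply gives a different digest.
other = hashlib.sha256(b"7845-8745-8756").hexdigest()
assert other != digest
```

A hash is therefore suitable when a value only needs to be matched, while encryption is needed when the original value must be recovered later.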
In this article, we have covered two methods to encrypt and decrypt data while sharing. By doing so, we can ensure that our data is kept secure and protected from unauthorized access. In PySpark, we can achieve this by following the above two methods and efficiently safeguarding our data.
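Key takeaways from this article are:-
PySpark’s built-in aes_encrypt and aes_decrypt functions can encrypt and decrypt dataframe columns directly.
A user-defined function built on the cryptography library’s Fernet class is a second way to do the same.
Encryption is reversible with the key, whereas hashing is not.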
This article helps you to perform encryption and decryption in PySpark. If you have any opinions or questions, then comment down below. Connect with me on LinkedIn for further discussion.
Keep Learning!!!
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.