This article was published as a part of the Data Science Blogathon.
In my previous article, I discussed three Python projects with code, explained them in detail, and gave you some examples to try. All of those projects were beginner-friendly. This time, we will look at three more Python projects with code. The more projects you build, the better you get at programming and at the language.
Image Source: https://realpython.com
Let’s get started!
OpenCV is a library of programming functions used mainly for computer vision tasks. With it, you can process images, resize them, detect objects, and more. We will see how to extract text from an image using contours.
Install these:
pip install pytesseract
pip install opencv-python
Python-tesseract (pytesseract) is a Python wrapper for Google's Tesseract-OCR engine, which extracts text from images. pytesseract calls the Tesseract executable, so you need to download and install Tesseract itself separately.
Now let’s begin with the text extractions step by step:
1. Convert the image to Gray using cv2.COLOR_BGR2GRAY.
cv2.cvtColor(input_image, cv2.COLOR_BGR2GRAY)
2. Finding contours in the image:
To find contours, use cv2.findContours(). It takes three parameters: the source image, the contour retrieval mode, and the contour approximation method. In OpenCV 4, it returns two values: a Python sequence of all contours and their hierarchy. A contour is simply a NumPy array of the (x, y) coordinates of the boundary points of an object.
3. Apply OCR.
Loop through each contour and get its x, y, width, and height using the cv2.boundingRect() function. Then draw a rectangle on the image using cv2.rectangle(). It takes five parameters: the input image, the top-left corner (x, y), the bottom-right corner (x+w, y+h), the boundary colour of the rectangle, and the thickness of the boundary.
4. Crop the rectangular region and pass that to tesseract to extract text. Save your content in a file by opening it in append mode.
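To make the (x, y, w, h) values in steps 3 and 4 concrete, here is a minimal pure-Python sketch of a bounding rectangle and the matching crop. The helper names bounding_rect and crop are hypothetical and this is not OpenCV's implementation; it assumes integer pixel coordinates and assumes both end pixels are counted in the width and height, which is how I understand cv2.boundingRect to behave.

```python
def bounding_rect(points):
    """Return (x, y, w, h) of the smallest upright box around integer points.

    Assumption: both end pixels are counted, so w = max_x - min_x + 1.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x, y = min(xs), min(ys)
    w = max(xs) - x + 1
    h = max(ys) - y + 1
    return x, y, w, h


def crop(image_rows, rect):
    """Crop a list-of-rows image the same way image[y:y + h, x:x + w] does."""
    x, y, w, h = rect
    return [row[x:x + w] for row in image_rows[y:y + h]]


# A toy 5x6 "image" whose pixel value encodes its position (row * 10 + column).
image = [[r * 10 + c for c in range(6)] for r in range(5)]

# A toy contour: the four corner points of a small square.
contour = [(2, 1), (4, 1), (4, 3), (2, 3)]

rect = bounding_rect(contour)
block = crop(image, rect)
```

Here rect is (2, 1, 3, 3), so the crop keeps rows 1 to 3 and columns 2 to 4 of the toy image, and (x + w, y + h) is the corner just past the box that the rectangle-drawing step uses.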
For more details, go through the code comments as well.
Code:
import cv2
import pytesseract

# path to the Tesseract-OCR executable on your computer
pytesseract.pytesseract.tesseract_cmd = 'path_to_tesseract.exe'

img = cv2.imread("input.png")  # input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # convert the image to grayscale

# perform Otsu thresholding
ret, img_thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_OTSU | cv2.THRESH_BINARY_INV)

# give the structuring element a shape and a kernel size;
# the kernel size increases or decreases the area of the rectangle to be detected
rect_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (18, 18))

# dilation on the threshold image
dilation = cv2.dilate(img_thresh, rect_kernel, iterations=1)

img_contours, hierarchy = cv2.findContours(dilation, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)

im2 = img.copy()  # copy used for drawing the rectangles

# create (or empty) the text file that will hold the results
file = open("Output.txt", "w+")
file.write("")
file.close()

# loop through each contour
for contour in img_contours:
    x, y, w, h = cv2.boundingRect(contour)
    rect = cv2.rectangle(im2, (x, y), (x + w, y + h), (0, 255, 0), 2)
    # crop the text block from the original image so the drawn border is not included
    cropped_image = img[y:y + h, x:x + w]
    file = open("Output.txt", "a")
    text = pytesseract.image_to_string(cropped_image)  # apply OCR
    file.write(text)
    file.write("\n")
    file.close()
Input image:
Output image:
Say you have a book as a PDF, but you are feeling too lazy to scroll through it. How good would it be if that PDF could be converted to an audiobook? So, let's implement this using Python.
We will need these two packages:
pyttsx3: It is for Text to Speech, and it will help the machine speak.
PyPDF2: It is a PDF toolkit. It is capable of extracting document information, merging documents, etc.
Install them using these commands:
pip install pyttsx3
pip install PyPDF2
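Steps:
1. Open the PDF file in binary read mode.
2. Create a PDF reader object from the file.
3. Get the page you want to start from and extract its text.
4. Initialise the pyttsx3 engine and let it speak the extracted text.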
Code:
# import the modules
import PyPDF2
import pyttsx3

# open your PDF file in binary read mode
with open('Book.pdf', 'rb') as path:
    # PdfReader object (older PyPDF2 releases used PdfFileReader, getPage()
    # and extractText(), which were removed in version 3.0)
    pdfReaderObj = PyPDF2.PdfReader(path)
    # the page with which you want to start (the index is zero-based)
    from_page = pdfReaderObj.pages[12]
    content = from_page.extract_text()

# reading the text
speak = pyttsx3.init()
speak.say(content)
speak.runAndWait()
That's it! It will do the job. This small script is handy when you don't feel like reading and would rather listen.
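The script above speaks a single page. A minimal sketch of reading from a start page through to the end of the book could look like the following. The function name read_aloud_from is hypothetical, and it assumes a reader object exposing pages and extract_text() (the PyPDF2 3.x naming) and an engine exposing say() and runAndWait() as pyttsx3 does. Both objects are passed in as arguments, so the function itself imports nothing.

```python
def read_aloud_from(reader, engine, start_page):
    """Queue every page from start_page (zero-based) to the end, then speak.

    reader: assumed to offer `pages` and `page.extract_text()` (PyPDF2 3.x style).
    engine: assumed to offer `say(text)` and `runAndWait()` (pyttsx3 style).
    Returns the number of pages that contained text.
    """
    spoken_pages = 0
    for index in range(start_page, len(reader.pages)):
        text = reader.pages[index].extract_text() or ""
        if text.strip():  # skip pages without extractable text
            engine.say(text)
            spoken_pages += 1
    engine.runAndWait()  # blocks until all queued text has been spoken
    return spoken_pages
```

With the real libraries, it could be called as read_aloud_from(PyPDF2.PdfReader(path), pyttsx3.init(), 12) while the PDF file is still open.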
Next, you can give this project a GUI using Tkinter or any other toolkit. The GUI could let you enter the PDF path and the page number to start from, and offer a stop button. Try this!
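As a starting point for that exercise, here is a rough sketch of a tiny Tkinter form. Everything in it is an assumption for illustration: the names parse_start_page, build_window, and on_start are hypothetical, and the stop button is left out because it would need the speech to run outside the GUI thread.

```python
def parse_start_page(text, page_count):
    """Turn the 1-based page number typed by the user into a zero-based index."""
    number = int(text.strip())  # raises ValueError for non-numeric input
    if not 1 <= number <= page_count:
        raise ValueError(f"page must be between 1 and {page_count}")
    return number - 1


def build_window(on_start):
    """Build the form; on_start(path_text, page_text) is a hypothetical callback."""
    import tkinter as tk  # imported here so parse_start_page works without a display

    root = tk.Tk()
    root.title("PDF to audiobook")

    tk.Label(root, text="PDF path").pack()
    path_entry = tk.Entry(root, width=40)
    path_entry.pack()

    tk.Label(root, text="Start page").pack()
    page_entry = tk.Entry(root, width=10)
    page_entry.pack()

    # Pass the raw field contents to the callback when the button is clicked.
    tk.Button(
        root,
        text="Read aloud",
        command=lambda: on_start(path_entry.get(), page_entry.get()),
    ).pack()
    return root
```

Calling build_window(print).mainloop() opens the form and prints whatever was typed when the button is clicked; a real callback would validate the page with parse_start_page and then start the speech.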
Let’s move to the next project.
Let's understand the benefit of reading a mailbox with Python. Suppose we are working on a project where some data arrives daily by mail as a Word or Excel file, and a script or a machine learning model needs it as input. Downloading this file and feeding it in by hand every day would be tedious. If we can automate this step, reading the mail and downloading the required attachment, it would be a great help. So, let's implement this.
We will use pywin32 to download attachments automatically from a particular mail. pywin32 can access Windows applications like Excel, PowerPoint, Word, Outlook, etc., to perform actions in them. We will focus on Outlook and download attachments from the Outlook mailbox.
Note: This does not need authentication such as an email ID or password. It accesses the Outlook account that is already logged in on your machine. (Keep the Outlook app open while running the script.)
We are not using smtplib here because it can only send emails and cannot download attachments. So, we will go with pywin32 to download attachments from Outlook, and it will be pretty straightforward. Let's look at the code.
Command to install: pip install pywin32
Import module
import win32com.client
Now, establish a connection to Outlook.
outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
Let’s try to access Inbox:
inbox = outlook.GetDefaultFolder(number)
This function takes an integer that identifies which default folder (Inbox, Outbox, Calendar, and so on) of the Outlook app to return.
To check the index of all folders, just run this code snippet:
import win32com.client

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")

for i in range(50):
    try:
        box = outlook.GetDefaultFolder(i)
        name = box.Name
        print(i, name)
    except Exception:
        pass  # no default folder with this number
Output:
3 Deleted Items
4 Outbox
5 Sent Items
6 Inbox
9 Calendar
As you can see in the output, the Inbox index is 6, so we will pass 6 to the function.
inbox = outlook.GetDefaultFolder(6)
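To avoid a bare 6 in the script, the folder numbers printed above can be given names. The sketch below is only an illustration: the OutlookFolder enum and the folder_label helper are hypothetical names, and the values are the ones from the output above.

```python
from enum import IntEnum


class OutlookFolder(IntEnum):
    """Names for the default-folder numbers printed by the snippet above."""
    DELETED_ITEMS = 3
    OUTBOX = 4
    SENT_ITEMS = 5
    INBOX = 6
    CALENDAR = 9


def folder_label(number):
    """Return the enum name for a folder number, or 'UNKNOWN' if it is not listed."""
    try:
        return OutlookFolder(number).name
    except ValueError:
        return "UNKNOWN"


inbox_number = int(OutlookFolder.INBOX)  # plain int, ready to pass to GetDefaultFolder
```

The call then reads outlook.GetDefaultFolder(inbox_number), which says more about the intent than the literal number.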
If you want to print the subject of all the emails in the inbox, use this:
messages = inbox.Items
# get the first email
message = messages.GetFirst()
# loop through all the emails in the inbox; GetNext() returns None after the last one
while message is not None:
    try:
        print(message.Subject)  # get the subject of the email
    except Exception:
        pass  # skip items that have no subject
    message = messages.GetNext()
Other properties, such as message.Subject and message.SentOn, can be used in the same way.
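The GetFirst()/GetNext() traversal can be wrapped in a small generator so the rest of the script can use a plain for loop. This is a sketch under the assumption that GetNext() returns None once the collection is exhausted; iter_items is a hypothetical helper, and FakeItems is a stand-in written only so the example runs without Outlook.

```python
from types import SimpleNamespace


def iter_items(items):
    """Yield every entry of a collection that offers GetFirst() and GetNext()."""
    item = items.GetFirst()
    while item is not None:  # assumption: None marks the end of the collection
        yield item
        item = items.GetNext()


class FakeItems:
    """Stand-in for inbox.Items, used only to demonstrate the generator."""

    def __init__(self, subjects):
        self._messages = [SimpleNamespace(Subject=s) for s in subjects]
        self._position = 0

    def GetFirst(self):
        self._position = 0
        return self._messages[0] if self._messages else None

    def GetNext(self):
        self._position += 1
        if self._position < len(self._messages):
            return self._messages[self._position]
        return None


subjects = [m.Subject for m in iter_items(FakeItems(["Data Report", "Hello"]))]
```

With Outlook, the same loop would be written as for message in iter_items(inbox.Items).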
If you want to print all the names of attachments in a mail:
for attachment in message.Attachments:
    print(attachment.FileName)
Let's download an attachment (an Excel file with the extension .xlsx) from a specific sender.
import os
import win32com.client

# folder where the attachments will be saved (replace with your own path)
download_folder_path = 'path_to_download_folder'

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
inbox = outlook.GetDefaultFolder(6)
messages = inbox.Items
message = messages.GetFirst()

# GetNext() returns None after the last item, which ends the loop
while message is not None:
    try:
        # compare in lower case so the match is case-insensitive
        subject = str(message.Subject).lower()
        sender = str(message.Sender).lower()
        if 'data report' in subject and 'abc prasad' in sender:
            for attachment in message.Attachments:
                attachment_name = str(attachment.FileName).lower()
                if attachment_name.endswith('.xlsx'):
                    attachment.SaveAsFile(os.path.join(download_folder_path, attachment_name))
    except Exception:
        pass  # skip items that are not regular mails
    message = messages.GetNext()
This is the complete code to download an attachment from the Outlook inbox. You can change the conditions inside the try block. For example, I am searching for mails whose subject contains Data Report and whose sender name is "ABC prasad". The script iterates from the first mail in the inbox, and when the condition is true, it checks whether that mail has an attachment with the extension .xlsx or .XLSX. You can change the subject, the sender, and the file type to download the file you want. Once the script finds the file, it saves it to the path given as "download_folder_path".
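To make those conditions easy to change, the matching logic can be pulled out into one small function. The sketch below is only an illustration: wants_attachment and its parameter names are hypothetical, and it compares everything in lower case so that both Data Report and .XLSX match.

```python
def wants_attachment(subject, sender, filename,
                     subject_key="data report",
                     sender_key="abc prasad",
                     extension=".xlsx"):
    """Return True when a mail and one of its attachments match the filter.

    All comparisons are case-insensitive; the keys are expected in lower case.
    """
    return (
        subject_key in subject.lower()
        and sender_key in sender.lower()
        and filename.lower().endswith(extension)
    )


match = wants_attachment("Daily Data Report", "ABC prasad", "Sales.XLSX")
no_match = wants_attachment("Daily Data Report", "ABC prasad", "notes.docx")
```

Switching to another sender or file type then means changing only the keyword arguments, for example wants_attachment(subject, sender, name, extension=".csv").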
We discussed three projects in a previous article and three in this article. I hope these Python projects with code helped you polish your skill set. Get some hands-on practice and try them; you will enjoy coding them. I hope you find this article helpful. Let's connect on LinkedIn.
Thanks for reading 🙂
Happy coding!
The media shown in this article are not owned by Analytics Vidhya and are used at the Author's discretion.