A strong resume is every student's ticket to their dream company. The moment a company opens its recruitment process, thousands of candidates apply through platforms like LinkedIn, naukri.com, and others, and interviewing every applicant is simply impossible. This is where an AI-powered resume screener built on Word2Vec comes in: it identifies good resumes and shortlists them for interviews.
After cleaning the data with NLP methods such as tokenization and stopword removal, I used gensim's Word2Vec to generate word embeddings. The K-Means algorithm then groups these embeddings into K clusters, some of which contain skills (technical, non-technical, and soft skills).
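As a rough sketch of that clustering step (assuming a trained gensim Word2Vec model named w2v_model; the cluster count here is illustrative):

from sklearn.cluster import KMeans
import numpy as np

# Collect every word in the vocabulary together with its embedding
words = w2v_model.wv.index_to_key
vectors = np.array([w2v_model.wv[w] for w in words])

# Group the embeddings into K clusters; some clusters gather skill terms
kmeans = KMeans(n_clusters=10, random_state=42, n_init=10).fit(vectors)
clusters = {}
for word, label in zip(words, kmeans.labels_):
    clusters.setdefault(label, []).append(word)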
Learning Objectives
In this article, you will:
- Understand how a resume screener works and why skill matching matters.
- Learn how Word2Vec embeddings capture relationships between skills.
- Build an end-to-end pipeline that reads resumes, extracts skills, and matches them against a job description's mandatory skills.
A resume screener usually includes the following steps:
- Reading and parsing the resume (PDF, image, or document formats)
- Extracting the candidate's skill set
- Matching the extracted skills against the job description's (JD's) mandatory skills
Skill set extraction means identifying the technical skills present in the resume and matching them against the JD's mandatory skills. The simplest approach is to check each resume skill against a backend dictionary of technical skills. Since a JD usually specifies skills at the domain level, each skill in the dictionary also needs to be mapped to its domain.
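To see why the dictionary approach is brittle, here is a minimal sketch (the dictionary and resume tokens are made up):

# Naive lookup: a skill counts only if it appears verbatim in the dictionary
skills_dict = {"python": "programming", "tensorflow": "machine_learning",
               "mysql": "databases"}
resume_tokens = ["python", "keras", "postgresql"]

matched = [t for t in resume_tokens if t in skills_dict]
print(matched)   # ['python'] -- 'keras' and 'postgresql' are missed entirely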
What if a skill mentioned in the resume is missing from the dictionary? What if a resume skill is not mapped to its domain? Simple: the resume gets rejected!
To solve this problem, it is more robust to check for the presence of a skill or its relevant skills, rather than relying on exact dictionary matches. This article introduces a deep learning approach to match resume skills with JD skills efficiently.
Word2Vec is a word embedding architecture that transforms text into numeric vectors. It differs from representations like bag-of-words (BOW), one-hot encoding, and TF-IDF in that it captures semantic and syntactic relationships between words using a simple neural network with one hidden layer. In short, related words are placed close to each other in the vector space. The weights of the hidden layer after the model converges are the embeddings. Word2Vec comes in two architectures, CBOW (predicting a word from its context) and Skip-gram (predicting the context from a word), so it can also be used for tasks like next-word prediction.
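As a quick illustration (a toy corpus; the words and scores are placeholders), gensim exposes both architectures through the sg flag:

from gensim.models import Word2Vec

# Toy corpus; a real model needs far more text to learn useful relationships
corpus = [["machine", "learning", "models"], ["deep", "learning", "networks"]]
model = Word2Vec(corpus, vector_size=50, min_count=1, sg=1)  # sg=1: Skip-gram, sg=0: CBOW

# Related words end up close together in the vector space
print(model.wv.most_similar("learning", topn=2))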
How is Word2Vec useful in matching resume skills with a JD? The solution is just three simple steps:
1. Train a Word2Vec model on skills data and build its vocabulary.
2. Read the resume and prepare its tokens.
3. Compare JD skill embeddings with resume token embeddings using cosine similarity.
Importing all the necessary libraries
import gensim
from gensim.models.phrases import Phrases, Phraser
from gensim.models import Word2Vec
import pandas as pd
import joblib
Data Collection:
Stemming and lemmatization are not performed, to avoid losing vocabulary. For example, when "Machine Learning" is stemmed or lemmatized, the words "machine" and "learning" are processed separately, so the combined skill "machine_learning" can no longer be formed, and the skill is lost.
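A quick check with NLTK's PorterStemmer shows the problem:

from nltk.stem import PorterStemmer

# Stemming works per token, so the bigram 'machine_learning' can never form
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in "machine learning".split()])  # ['machin', 'learn']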
Here’s our sample data
N-gram phrases are created using gensim's Phrases class. The tokenized data is passed to Phrases, which returns a phrase-detection object; this object can be saved locally and reused whenever required.
df = pd.read_csv('/content/data_100.csv')
sent = [row.split() for row in df['data']]
phrases = Phrases(sent, min_count=30, progress_per=10000)
phrases.save('phrases.pkl')   # save the phrase detector for reuse (path is illustrative)
sentences = phrases[sent]
Vocabulary Building using Gensim library:
Word2Vec requires us to build a vocabulary table first: it simply digests all the words, filters out the unique ones, and performs some basic counts on them.
Training the model:
The Word2Vec model is trained using the gensim library and saved locally for reuse whenever required.
w2v_model = Word2Vec(min_count=20,
                     window=3,
                     vector_size=300,   # called 'size' in gensim < 4.0
                     sample=6e-5,
                     alpha=0.03,
                     min_alpha=0.0007,
                     negative=20)

# Building the vocabulary
w2v_model.build_vocab(sentences)

# Saving the built vocabulary locally
pd.Series(w2v_model.wv.index_to_key).to_csv('vocabulary.csv')

# Training the model
w2v_model.train(sentences,
                total_examples=w2v_model.corpus_count,
                epochs=30,
                report_delay=1)

# Saving the model (path is illustrative)
path = "/content/drive/MyDrive/Model.joblib"
joblib.dump(w2v_model, path)

print(w2v_model.wv.similarity('neural_network', 'machine_learning'))
Output:
0.65735245
Reading a resume
A resume can come in different formats, such as PDF, DOCX, or image, and different tools are used to extract text from each. For example:
PDF – using pdfplumber
Image – using OCR
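For instance, text extraction from a text-based PDF with pdfplumber might look like this (the file path is a placeholder):

import pdfplumber

# Works only for PDFs with embedded text; scanned resumes need the OCR route
with pdfplumber.open("resume.pdf") as pdf:
    text = "\n".join(page.extract_text() or "" for page in pdf.pages)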
Data preparation
After extracting the text, the next step is preprocessing: cleaning, tokenization, and creating n-grams, as sketched below.
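A rough sketch of that step, assuming the Phrases object saved during training (the path and variable names are illustrative):

import re
from gensim.models.phrases import Phrases

phrases = Phrases.load("phrases.pkl")                 # assumed path
text = re.sub(r"[^\x00-\x7f]", " ", raw_resume_text)  # raw_resume_text: output of the OCR step
tokens = text.lower().split()
resume_tokens = phrases[tokens]   # e.g., 'machine learning' -> 'machine_learning'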
Here comes the final step. After performing the first two steps, we have a trained Word2Vec model (with its vocabulary and saved Phrases object) and the preprocessed resume tokens.
The JD's skills are entered manually. Now we need to find the similarity between JD skills and resume tokens: if a JD skill has at least one relevant skill among the resume tokens, it is considered "present" in the resume; otherwise, it is "absent".
How do we check for relevant skills? The answer is cosine similarity. A skill is considered relevant if the cosine similarity between the two embeddings is greater than a certain threshold (0.38 in the code below).
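In symbols, cos(A, B) = A·B / (‖A‖ ‖B‖). A tiny numpy check (the vectors are made up):

import numpy as np

a = np.array([0.2, 0.8, 0.1])
b = np.array([0.25, 0.7, 0.05])
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim, cos_sim > 0.38)   # ~0.99, above the 0.38 threshold used below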
We create two arrays, one of JD skill embeddings and one of resume token embeddings, so that the numerator of the cosine similarity, the dot product A·B, can be computed for all pairs simultaneously; dividing by the product of the vector norms then gives the full cosine similarity.
What if a JD skill is absent from the vocabulary used to build the model? The model will have no embedding for it; such words are called out-of-vocabulary (OOV) words, and they are a major drawback of Word2Vec. Character-level (subword) embeddings solve this issue, and FastText works at exactly this level.
The major difference between Word2Vec and FastText is that Word2Vec feeds individual words into the neural network to find the embeddings, whereas FastText breaks each word into several character n-grams (sub-words). The embedding vector for a word is then the sum of the embeddings of its n-grams, as illustrated below.
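A minimal gensim FastText sketch (toy corpus; the out-of-vocabulary word is illustrative):

from gensim.models import FastText

corpus = [["machine", "learning"], ["deep", "learning"]]  # toy corpus
ft = FastText(corpus, vector_size=50, window=3, min_count=1)

print("learnings" in ft.wv.key_to_index)  # False: never seen in training
print(ft.wv["learnings"][:5])             # still gets a vector built from its n-grams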
Installing Necessary Packages
!pip install pdfplumber
!pip install pytesseract
!sudo apt install tesseract-ocr
!pip install pdf2image
!sudo apt-get update
!sudo apt-get install poppler-utils   # poppler backend required by pdf2image
!pip install PyMuPDF
!pip install Aspose.Email-for-Python-via-NET
!pip install aspose-words
Importing Necessary Libraries
import pandas as pd
import os
import os.path
import sys
import string
import re
import subprocess
import logging
import warnings
warnings.filterwarnings(action='ignore')
import numpy as np
from itertools import groupby, count
import joblib
import gensim
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser
import pytesseract
import cv2
from tqdm import tqdm   # used in _pdf_to_png below
from pdf2image import convert_from_path
from PIL import Image
Image.MAX_IMAGE_PIXELS = 1000000000
import aspose.words as aw
import fitz
logger_watchtower = logging.getLogger(__name__)
from pandas.core.common import SettingWithCopyWarning
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)
Functions for reading a resume
def _skills_in_box(image_gray, threshold=60):
    '''
    Identifies boxes in a resume image and extracts the skills inside them.
    Given a grayscale image, returns a string with the text found in boxes.
    Parameters:
        image_gray: Grayscale image of the resume page
        threshold: Pixel intensity below which values are zeroed out
    '''
    img = image_gray.copy()
    thresh_inv = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    # Blur the image
    blur = cv2.GaussianBlur(thresh_inv, (1, 1), 0)
    thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    # Find contours
    contours = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[0]
    mask = np.ones(img.shape[:2], dtype="uint8") * 255
    available = 0
    for c in contours:
        # Get the bounding rectangle of each contour and keep the large ones
        x, y, w, h = cv2.boundingRect(c)
        if w * h > 1000:
            cv2.rectangle(mask, (x + 5, y + 5), (x + w - 5, y + h - 5), (0, 0, 255), -1)
            available = 1
    res = ''
    if available == 1:
        res_final = cv2.bitwise_and(img, img, mask=cv2.bitwise_not(mask))
        res_final[res_final <= threshold] = 0
        # Sharpen the boxed regions before running OCR
        kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
        res_fin = cv2.filter2D(src=res_final, ddepth=-1, kernel=kernel)
        vt = pytesseract.image_to_data(255 - res_final, output_type='data.frame')
        vt = vt[vt.conf != -1]
        res = ''
        for i in vt[vt['conf'] >= 43]['text']:
            res = res + str(i) + ' '
        print(res)
    return res


def _image_to_string(img):
    '''
    Converts an image to grayscale and extracts its text with OCR.
    Given an image, returns the text in it.
    Parameters:
        img: Image as a numpy array
    '''
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    res = ''
    string1 = pytesseract.image_to_data(img, output_type='data.frame')
    string1 = string1[string1['conf'] != -1]
    for i in string1[string1['conf'] >= 43]['text']:
        res = res + str(i) + ' '
    string3 = _skills_in_box(img)
    return res + string3


def _pdf_to_png(pdf_path):
    '''
    Converts a PDF to images page by page and extracts the text from each page.
    Parameters:
        pdf_path: Path of the PDF
    '''
    string = ''
    images = convert_from_path(pdf_path)
    for j in tqdm(range(len(images))):
        # Process each page of the PDF as an image
        image = np.array(images[j])
        string += _image_to_string(image)
        string += '\n'
    return string


def ocr(paths):
    '''
    Checks whether the PDF is image-based or text-based.
    If the file is a .doc, it is first converted to .pdf; if the PDF is
    image-based, it is converted to images and read with OCR.
    Parameters:
        paths: Path of the resume file
    '''
    text = ""
    res = ""
    try:
        doc = fitz.open(paths)
        for page in doc:
            text += page.get_text()
        if len(text) <= 10:
            # Almost no embedded text: treat as a scanned (image) PDF
            res = _pdf_to_png(paths)
        else:
            res = text
    except:
        # Not a readable PDF: convert (e.g., .doc/.docx) to PDF first
        doc = aw.Document(paths)
        doc.save("Document.pdf")
        doc = fitz.open("Document.pdf")
        for page in doc:
            text += page.get_text()
        if len(text) <= 10:
            res = _pdf_to_png("Document.pdf")
        else:
            res = text
        os.remove("Document.pdf")
    return res
Function for finding Cosine Similarity
def to_la(L):
    # Reshape a 1-D sequence into a column vector
    k = list(L)
    l = np.array(k)
    return l.reshape(-1, 1)


def cos(A, B):
    # Vectorized cosine similarity: entry (i, j) is the cosine similarity
    # between row i of A and row j of B
    dot_prod = np.matmul(A, B.T)
    norm_a = np.reciprocal(np.sum(np.abs(A) ** 2, axis=-1) ** (1. / 2))
    norm_b = np.reciprocal(np.sum(np.abs(B) ** 2, axis=-1) ** (1. / 2))
    norm_a = to_la(norm_a)
    norm_b = to_la(norm_b)
    k = np.matmul(norm_a, norm_b.T)
    return list(np.multiply(dot_prod, k))
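A quick sanity check on toy arrays, where each row is an embedding:

A = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
B = np.array([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
print(cos(A, B))   # entry (i, j) = similarity between A[i] and B[j]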
Function for finding the similarities and returning the final matched skills
def check(path, skills, l2, w2v_model1, phrases, pattern):
    # Read the resume, clean the text, and tokenize
    text = ocr(path)
    text = re.sub(r'[^\x00-\x7f]', r' ', text)
    text = text.lower()
    text = re.sub(r"\\|,|/|:|\)|\(", " ", text)
    t2 = text.split()
    l_2 = l2.copy()
    # Direct regex matches of JD skills in the raw text
    match = list(set(re.findall(pattern, text)))
    sentences = phrases[t2]
    resume_skills_dict = {}
    if len(match) != 0:
        for k in match:
            k = k.replace(' ', '_')
            resume_skills_dict[k] = 1
            try:
                l_2.remove(k)
            except:
                continue
    # JD skills in the model vocabulary that were not matched directly
    l6 = list(set(l_2).intersection(skills['0']))
    # JD skills outside the vocabulary are marked absent
    l6_minus_skills = list(set(l_2).difference(skills['0']))
    for i in l6_minus_skills:
        resume_skills_dict[i] = 0
    if len(l6) == 0:
        return resume_skills_dict
    # Resume tokens that are in the model vocabulary
    l4 = list(set(sentences).intersection(skills['0']))
    arr1 = np.array([w2v_model1.wv[i] for i in l6])
    arr2 = np.array([w2v_model1.wv[i] for i in l4])
    similarity_values = cos(arr1, arr2)
    # A JD skill is present if at least one resume token scores above 0.38
    count = 0
    for i in similarity_values:
        k = list(filter(lambda x: x < 0.38, list(i)))
        if len(k) == len(i):
            resume_skills_dict[l6[count]] = 0
        else:
            resume_skills_dict[l6[count]] = 1
        count += 1
    return resume_skills_dict
Functions required for performing JD skills preprocessing
def Convert(string):
    # Split a string on whitespace and deduplicate the tokens
    li = list(string.split())
    return list(set(li))


def preprocess(string):
    # Replace commas and quotes with spaces, then tokenize and deduplicate
    string = string.replace(",", ' ')
    string = string.replace("'", ' ')
    string = Convert(string)
    return string
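For instance (illustrative input; set order may vary):

print(preprocess("python, mysql, scikit_learn"))
# e.g. ['python', 'mysql', 'scikit_learn']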
Main Function
if __name__ == "__main__":
    # Arg 1 = Word2Vec model, Arg 2 = vocabulary, Arg 3 = phrases object,
    # Arg 4 = JD's mandatory skills, Arg 5 = resume path
    argv = sys.argv[1:]
    w2v_model1 = joblib.load(argv[0])
    skills = pd.read_csv(argv[1])
    mapper = {}
    underscore = []
    jd_skills = argv[3]
    jd_skills = " ".join(jd_skills.strip().split())
    jd_skills = jd_skills.replace(', ', ',')
    # Regex pattern for direct matches of JD skills in the resume text
    pattern = jd_skills.replace(',', '|').lower()
    for i in jd_skills.split(','):
        if '_' in i:
            underscore.append(i)
            mapper[i.lower().replace('_', ' ')] = i
    jd_skills = jd_skills.replace(' ', '_')
    jd_skills = jd_skills.replace(',', ', ')
    # Map the normalized (underscored) skill names back to their original form
    for i in jd_skills.split(', '):
        if i not in underscore:
            if '_' in i:
                mapper[i.lower().replace('_', ' ')] = i.replace('_', ' ')
            elif '-' in i:
                mapper[i.lower().replace('-', ' ')] = i
            else:
                mapper[i.lower()] = i
    jd_skills = jd_skills.replace('-', '_')
    phrases = Phrases.load(argv[2])
    lines = [preprocess(jd_skills.lower().rstrip())]
    path = argv[4]
    res = check(path, skills, lines[0], w2v_model1, phrases, pattern)
    # Report each JD skill as present (1) or absent (0) under its original name
    res_dict = {}
    for i in res.keys():
        j = i.replace('_', ' ')
        res_dict[mapper[j]] = res[i]
    print('skills_matched :', res_dict)
Command Line Argument
!python3 demo1.py '/content/drive/MyDrive/Skill_Matching_Files/Model(cbow).joblib' '/content/drive/MyDrive/Skill_Matching_Files/vocab_split.csv' '/content/drive/MyDrive/Skill_Matching_Files/phrases_split.pkl' 'julia, kaggle, ml, mysql, oracle, python, pytorch, r, scikit learn, snowflake, sql, tensorflow' '/content/drive/MyDrive/Skill_Matching_Files/TESTING RESUME/Copy of 0_A.a.aa.pdf'
Output
skills_matched : {'python': 1, 'r': 1, 'oracle': 0, 'snowflake': 1, 'pytorch': 1, 'tensorflow': 1, 'ml': 1, 'sql': 1, 'kaggle': 1, 'mysql': 1, 'julia': 1, 'scikit learn': 1}
I hope this article gave you useful insights into extracting skills from resumes. You learned how the Word2Vec word embedding technique is used by several companies in the recruitment industry to vet resumes.
If you have any queries or feedback, please comment below or connect with me on LinkedIn.