The H1B visa program opens doors for skilled individuals worldwide to bring their expertise to the United States. Thousands of talented professionals enter the US through this program each year, contributing to various industries and driving innovation. Let’s dive into the fascinating world of H1B visa data from the Office of Foreign Labor Certification (OFLC) and explore the stories behind the numbers. This article reveals H1B Visa data analysis and we get insights and interesting stories from the data. Through feature engineering, we enhance the dataset with additional information from external sources. Use the meticulous data wrangling to organize data carefully so that we can better understand and analyze it. Finally, data visualizations unveil fascinating trends and untold insights about skilled workers in the US in-between the years 2014 and 2016.
Note:🔗 Please find the complete code and dataset for this analysis on Kaggle to explore the whole process and code behind the analysis: H1B Analysis on Kaggle
This article was published as a part of the Data Science Blogathon.
The H1B visa program is a key component of U.S. immigration policy, aimed at attracting highly skilled foreign workers to fill specialized positions in various industries. It addresses skill shortages, promotes innovation, and drives economic growth.
To obtain an H1B visa, a person must follow these key steps:
The process involves meeting specific requirements, such as holding a bachelor’s degree or equivalent, and navigating additional considerations, such as prevailing wage determinations and documentation of the employer-employee relationship. Compliance and thorough preparation are crucial for a successful H1B visa application.
Join me on a fascinating data transformation journey! I convert the xlsx file to CSV, rename columns for consistency since each column contains the same data but has different names across each year., and combine three years of data. The result? A vast dataset with 1,786,160 rows and 17 columns.
# convert date columns to datetime format and assign nan to invalid data
final_df['LCA_CASE_EMPLOYMENT_START_DATE'] = pd.to_datetime(final_df['LCA_CASE_EMPLOYMENT_START_DATE'], errors='coerce')
final_df['LCA_CASE_EMPLOYMENT_END_DATE'] = pd.to_datetime(final_df['LCA_CASE_EMPLOYMENT_END_DATE'], errors='coerce')
# subtract the LCA_CASE_EMPLOYMENT_END_DATE from LCA_CASE_EMPLOYMENT_START_DATEto find employment period
LCA_CASE_EMPLOYMENT_PERIOD=final_df["LCA_CASE_EMPLOYMENT_END_DATE"]-final_df["LCA_CASE_EMPLOYMENT_START_DATE"]
# create a new column with LCA_CASE_EMPLOYMENT_PERIOD value
final_df.insert(7, 'LCA_CASE_EMPLOYMENT_PERIOD', LCA_CASE_EMPLOYMENT_PERIOD)
# converting LCA_CASE_EMPLOYMENT_PERIOD into days format
final_df['LCA_CASE_EMPLOYMENT_PERIOD'] = final_df['LCA_CASE_EMPLOYMENT_PERIOD'].dt.days
# delete the outlier value, i.e employment days less than 0.
final_df = final_df[final_df['LCA_CASE_EMPLOYMENT_PERIOD'] > 0]
final_df['LCA_CASE_EMPLOYMENT_PERIOD'].describe()
# the employment period is converted into months
final_df['LCA_CASE_EMPLOYMENT_PERIOD'] = (round(final_df['LCA_CASE_EMPLOYMENT_PERIOD'] / 30.44))
#filled the missing value with 0 and converted the column type to int
final_df['LCA_CASE_EMPLOYMENT_PERIOD']=final_df['LCA_CASE_EMPLOYMENT_PERIOD'].fillna(0).astype(int)
# Convert the LCA_CASE_NAICS_CODE column to string data type
final_df['LCA_CASE_NAICS_CODE'] = final_df['LCA_CASE_NAICS_CODE'].astype(str)
# Extract the first two digits of each string value
final_df['LCA_CASE_NAICS_CODE'] = final_df['LCA_CASE_NAICS_CODE'].str[:2]
# reading the NAICS_data to cross check and create a new column for employer sector
NAICS_data=pd.read_csv("/kaggle/input/h1b-visa/NAICS_data.csv")
NAICS_data.head()
# loop through all the NAICS in the naics_unique_values
for i in naics_unique_values:
try:
# assuming your dataframe is called 'df'
NAICS_data_code = NAICS_data.loc[NAICS_data['NAICS_CODE'] == i, 'NAICS_TITLE'].iloc[0]
except:
#if there is no index with the particular soc code the occupation name will be null
NAICS_data_code = "Unknown"
# create a boolean mask for the conditions
mask = (final_df['LCA_CASE_NAICS_CODE'] == i)
# update the LCA_CASE_SOC_NAME column for the filtered rows
final_df.loc[mask, 'EMPLOYER_SECTOR'] = NAICS_data_code
# extract the year component from the datetime column LCA_CASE_SUBMIT and store it in a new column year
final_df['year'] = final_df['LCA_CASE_SUBMIT'].dt.year
# remove numbers after "." period in 'LCA_CASE_SOC_CODE' column
final_df['LCA_CASE_SOC_CODE'] = final_df['LCA_CASE_SOC_CODE'].astype(str).apply(lambda x: x.split('.')[0])
# function to correct the LCA_CASE_SOC_CODE
def preprocess_column(column):
pattern = r"^\d{2}-\d{4}$" # regex pattern for "XX-XXXX" format
def preprocess_value(value):
if ("-" not in value) and len(value) < 6:
cleaned_value=np.nan
elif "-" in value :
value=value.replace('-','')
cleaned_value=value[0:2]+"-"+value[2:6]
if len(cleaned_value) != 7:
cleaned_value=np.nan
elif ("-" not in value) and len(value) > 5:
value=value.replace('/', '')
cleaned_value=value[0:2]+"-"+value[2:6]
return cleaned_value
cleaned_column = column.apply(lambda x: np.nan if pd.isna(x) else (x if re.search(pattern, str(x)) else preprocess_value(x)))
return cleaned_column
final_df["LCA_CASE_SOC_CODE"] = preprocess_column(final_df["LCA_CASE_SOC_CODE"])
# Replace the values in the 'LCA_CASE_WAGE_RATE_FROM' column
# define a custom function to preprocess the wage_rate column
def preprocess_wage_rate(cell_value):
if isinstance(cell_value, float):
return cell_value
elif '-' in cell_value:
return cell_value.split('-')[0].strip()
else:
return cell_value
# apply the custom function to the wage_rate column
final_df['LCA_CASE_WAGE_RATE_FROM'] = final_df['LCA_CASE_WAGE_RATE_FROM'].apply(lambda x: preprocess_wage_rate(x))
# initialize webdriver
driver = webdriver.Chrome()
# navigate to webpage
driver.get('https://www.bls.gov/oes/current/oes_stru.htm#29-0000')
# find all li elements
li_elements = driver.find_elements("xpath","//li")
# create empty list to store data
data = []
# loop through li elements
for li in li_elements:
text = li.text
if "-" in text:
# use regular expression to extract SOC code and occupation name
words = text.split()
soc=words[0]
name = (" ".join(words[1::])).replace('"', '').strip()
name_list=(words[1::])
if "-" in name:
for i, word in enumerate(name_list):
if ("-" in word) and (len(word) > 1):
name =(' '.join(name_list[:i])).replace('"', '').strip()
break
data.append({'SOC Code': soc, 'Occupation Name': name})
# close webdriver
driver.quit()
# create dataframe
occupation_data = pd.DataFrame(data)
# save dataframe as CSV
occupation_data.to_csv('occupations.csv', index=False)
Between 2014 and 2015, the number of H1B visa applications skyrocketed, growing by an impressive 17.7%. Skilled individuals from around the world were eager to seize opportunities in the United States. However, things took an unexpected turn in 2016 when there was a sudden 9% drop in applications. This decline left us wondering: What caused this significant change?
Surprisingly the rejection rate for the visa has decreased significantly from 5.41% to 3.4% over the years. On the other hand, the acceptance rate has been steadily increasing every year. It may also suggest employers have become more adept at submitting strong applications, thereby reducing the rejection rate.
The decreasing rejection rate reflects a positive trend and signifies a more favourable environment for H1B visa applicants. The increasing acceptance rate indicates a growing demand for highly skilled foreign workers in the United States.
These positive trends could be attributed, to the US government’s efforts to foster a welcoming environment for skilled immigrants. The government may have implemented more favourable policies, streamlining the visa process and removing unnecessary barriers. This, in turn, has likely contributed to the decrease in the rejection rate. Therefore, the decline in applications cannot be solely attributed to a higher rejection rate.
After conducting our analysis, we have found that a significant portion of H1B visa applications, approximately 72.4%, with the professional, scientific, and technical services sectors. This sector encompasses various fields, including computer programming, scientific research, engineering, and consulting services. The surge in applications are escalating demand for skilled professionals in these domains, given the specialized expertise and knowledge required.
Furthermore, influence by larger companies actively contributing to the increase in H1B visa sponsorships for their employees, within the professional, scientific, and technical services sectors. These companies rely on highly skilled workers to uphold their competitive edge and sustain growth in the industry.
Based on comprehensive analysis, we found that the professional, scientific, and technical services sectors are held with 9 out of the top 10 employers with the highest number of H1B visa applications. Known for its consistent demand for skilled professionals and encompasses diverse fields such as computer programming, scientific research, engineering, and consulting services.
Infosys stands out as the leading employer, with a staggering 82,271 approved applications and the lowest number of denied applications among the top 10 employers. This dominance in the H1B visa application count surpasses the combined numbers of the second-ranked TCS and the third-ranked Wipro.
The outstanding performance of Infosys in the H1B visa application process raises intriguing questions about the company’s approach and the specific job roles they are recruiting for.
After analyzing the data, we created a chart to visually represent the contribution of the top 10 H1B visa-sponsoring employers to the top 10 job positions. The remaining employers were grouped as “other employers.” The chart highlights that while “other employers” hold a substantial portion, the top 10 employers have made a significant impact on the top 10 job positions.
For instance, Infosys has played a major role in the computer systems analyst position, while Microsoft has made a notable contribution to the Software developers, application position. Similarly, IBM has significantly influenced the computer programmer, applications position.
This chart emphasizes the significant influence of the top 10 employers in the H1B visa application process and the specific job positions.
After conducting data analysis, we saw that there is no significant correlation between the salary range and the application status. The observation holds true for both the top 10 accepted and denied job positions, they had the same salary range.
To gain further insight into the relationship between application status and salary range, we divided the salary range into four quantiles: low, average, above-average, and higher pay salary levels. We then analyzed the data for each category. The findings reveal that the majority of approved H1B visa applications fell into (Q1) and average (Q2) salary range categories.
The low (Q1) and average (Q2) salary range categories encompassed the majority of approved H1B visa applications. Nevertheless, no clear trends were seen between the salary range and the application status. Factors other than salary have played a significant role in determining the outcome of H1B visa applications.
Upon analyzing the data, we discovered that there was no substantial correlation between the employment period and the visa decision. Surprisingly, both the top approved and top denied job positions showcased an average employment period of 33 months. The duration of employment doesn’t appear to be a determining factor in the visa decision-making process.
We divided the applicants into 2 groups based on the average duration of their employment period: below 33 months and above 33 months.
Despite thorough analysis, we couldn’t identify trend regarding the employment period and its influence on the outcome of H1B visa applications. The absence of a discernible pattern suggests that the duration of employment might not have played a pivotal role.
Ta-da! finally, the data, in its infinite wisdom, has given us a virtual high-five, confirming that we were right finally.
it becomes apparent that specific states within the US exhibit a higher concentration of employers who applied for H1B visas. Notably, Texas, California, and New Jersey emerge as the top three states with the greatest number of employers, accounting for approximately 684,118 applications combined. This observation suggests that these states likely experience a heightened demand for skilled workers, attracting a larger pool of employers to seek H1B visas.
The analysis uncovers distinct hotspots across the US where the majority of H1B employers are available. These regions are indicative of areas with significant demand for skilled professionals.
In conclusion, this blog delves into the significance of H1B visa data from the OFLC, providing a comprehensive exploration of its role in attracting skilled foreign workers to the United States. The article focuses on various data preprocessing techniques, including cleaning, feature engineering, and transformation, to ensure accurate analysis. Through an examination of H1B visa acceptance and rejection rates, the blog uncovers influential factors impacting these rates, presenting valuable insights through data visualization.
By equipping yourself with the right skills, adopting a targeted approach to states and employers, and embracing the possibilities that lie ahead, you can confidently embark on your H1B visa adventure and thrive in a world that values your talents.
A. visit their website at https://www.foreignlaborcert.doleta.gov/performancedata.cfm. Navigate to the “Disclosure Data” tab and select “LCA Programs (H-1B, H-1B1, E-3)” to download the data file in .xlsx format. Additionally, you can find a PDF file on the website that provides a description of the data included in the .xlsx file.
A. It is crucial to exercise caution when considering the removal of null value columns. Dropping these columns can result in information loss and the elimination of valuable data. Furthermore, it may introduce bias and distort the data representation, affecting the accuracy of the analysis. As an alternative approach, we scraped data from different websites and compared the corresponding data columns from two datasets to impute the missing values effectively.
Plotly is over Matplotlib or Seaborn for data analysis due to its interactivity, and extensive customization options. It offers high-quality charts and graphs, advanced features like animations and 3D visualizations, and the ability to create interactive dashboards.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.