What Were You Thinking When You Chose Your Data Scientist Profile?

Mrinal Singh Last Updated : 07 Mar, 2022


This article was published as a part of the Data Science Blogathon.

Source: Unsplash.com

Data science spans many fields, and in each of them you will have to work closely with the business to identify issues. You will find many articles offering tips for building a solid profile, but few will tell you which profile to pick, even though your entire professional career depends on that choice.

In today’s article, I want to share four significant reasons for going through different data scientist profiles before you settle on one for yourself:

Reason 1: Cultivating Self-Awareness

https://unsplash.com/@jareddrice

- I want you to think about who you are now when it comes to data science.
- I want you to think about your goals regarding data science and how you would like your data scientist profile to change over the next 6 months.

Do you want to become a specialist in one thing, a generalist, or some mix? Each has career benefits and disadvantages, regardless of whether you’re in academia or industry.

Reason 2: Illustrate the Importance of Standardization in Visualization

https://unsplash.com/@goumbik

I wanted to show how to standardize visualizations of users as a mix of characteristics. (You should think about how you would do it, and then also ask yourself whether you think a standardized visualization has any value.)

In this particular case:

(a) Standardizing The X-Axis: I used the main buckets that I thought roughly covered the skills one needs as a data scientist. I’m not tied to these buckets, but they seemed helpful as a starting point, and we can revise them going forward.

The chosen buckets (“Data Viz,” “Software Engineer,” “Math,” “Statistics,” “Machine Learning (ML),” “Communication skills,” and “Field expertise”) are convenient but contestable.

Also, I said, “maybe software engineer should be CS, I don’t know,” and then didn’t really make a decision, and you didn’t seem to mind (thanks!), but it did result in some people having different labels than others.
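To make the labeling problem concrete, here is a minimal sketch of how differently labeled profiles could be mapped onto one shared x-axis. The bucket names come from the list above; the `ALIASES` table, the function names, and the choice to treat a missing bucket as 0.0 are my own assumptions for illustration, not something the original exercise prescribed.

```python
# Canonical x-axis buckets, in the (arbitrary) order listed in the article.
BUCKETS = [
    "Data Viz", "Software Engineer", "Math", "Statistics",
    "Machine Learning", "Communication skills", "Field expertise",
]

# Hypothetical alias table: variant labels people might have written.
ALIASES = {
    "cs": "Software Engineer",
    "computer science": "Software Engineer",
    "ml": "Machine Learning",
    "machine learning(ml)": "Machine Learning",
}


def canonical_label(label: str) -> str:
    """Map a free-form skill label to one of the canonical buckets."""
    key = label.strip().lower()
    if key in ALIASES:
        return ALIASES[key]
    for bucket in BUCKETS:
        if bucket.lower() == key:
            return bucket
    raise ValueError(f"unknown skill label: {label!r}")


def to_vector(profile: dict[str, float]) -> list[float]:
    """Return scores in BUCKETS order; missing buckets become 0.0 (an assumption)."""
    scores = {canonical_label(name): value for name, value in profile.items()}
    return [float(scores.get(bucket, 0.0)) for bucket in BUCKETS]
```

With this in place, someone who wrote "CS" and someone who wrote "Software Engineer" end up in the same column, so their profiles can be compared position by position.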

I pointed out that we had to decide whether the labels would be ordered or not. One way would be to go from left to right, from harder to softer skills. But I felt that claiming Software Engineering was a harder (more technical) skill than ML or Mathematics was problematic.

Alternatively, we could consider ordering according to the “data science pipeline,” starting with engineering, moving on to analysis with math, statistics, and ML (we would have to choose an order), and then moving into visualization, reporting, storytelling, and communication.

The complexity of the pipeline makes a left-to-right ordering non-obvious. So rather than resolve this in the moment, because I could see it going any of several ways, I decided not to think of the buckets as ordered.

So once we decide not to interpret them as ordered, we have to be careful not to see patterns that aren’t there but are just an artifact of the (arbitrarily) selected order.

Also, some people in the industry might feel that I wasn’t being granular or broad enough, depending on their frame of reference. So I believe this is flawed, but again you have to start somewhere, usually somewhere reasonably simple, and that’s part of EDA!

(b) Standardizing The Y-Axis: I drew my profile on the panel, showing my data scientist profile when I completed my bachelor’s and how it changed after working on a great data science team, learning from my collaborators and colleagues.

Here the comparison is before and after. I decided not to label the scale because I didn’t want my notion of expertise to influence you. One person’s expert is another person’s poser.

A student just learning this stuff has a different scale than someone who has been doing this for years. Each would have a different interpretation of “expertise,” reflecting over- or under-confidence.

So we have to accept that our scales will be subjective if we label them. (We should think about what it would mean to standardize the scale. How would we do it? What would the consequences of it be? How do we define “expert”?)
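One possible (and debatable) answer to “how would we standardize the scale?” is to give up on absolute levels and compare only the shape of each profile. The sketch below rescales a person’s scores by their own highest score; the function name and the handling of an all-zero profile are my assumptions.

```python
def rescale_to_own_max(scores: list[float]) -> list[float]:
    """Divide each score by the person's own highest score.

    The result says which skills are relatively strong or weak for this
    person, not how they compare with anyone else in absolute terms.
    """
    top = max(scores)
    if top <= 0:
        # Nothing to scale against; treat the profile as flat.
        return [0.0 for _ in scores]
    return [s / top for s in scores]


print(rescale_to_own_max([2, 4, 8]))  # [0.25, 0.5, 1.0]
```

The consequence is worth noticing: a cautious beginner and an overconfident veteran with the same relative strengths now look identical, which may or may not be what you want.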

Reason 3: Our First Step to Thinking about Data Science Teams

I want you to join a data science community. One way to approach it would be to combine complementary profiles. It helps you understand the role, meet like-minded people, and learn beforehand.

Reason 4: Demonstrate your Thought Process before you do EDA

It’s a mix of intuition and math/stats know-how. I first came up with a simple, standardized visualization, which I could then use to compare different profiles. The lack of a standardized scale means I would focus on relative differences rather than absolute values. Did I know what I would see before I did it? No. But I had a hunch that some of the following would happen:
(a) I’d discover something new
(b) I’d witness natural clusters of profiles. Some people are similar to each other. (Think: what does “similar” mean? What is the “distance” between two profiles? How do I measure similarity?)
(c) I’d obtain a sense of the distribution across profiles
(d) I’d begin getting an intuition for joining a data science community.
(e) I’d begin thinking of machine learning or analysis problems I could potentially work on with this data set or a generalized version of it.
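For hunch (b), “similar” needs a definition before clusters can mean anything. A minimal sketch, assuming each profile is a list of scores with one entry per skill bucket in the same order: Euclidean distance is sensitive to how high the scores are, while cosine distance looks only at the shape. Both are standard choices, not the one the exercise mandates.

```python
import math


def euclidean(a: list[float], b: list[float]) -> float:
    """Straight-line distance; sensitive to the absolute level of scores."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of buckets")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; ignores overall level, compares shape only."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of buckets")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for an all-zero profile")
    return 1.0 - dot / (norm_a * norm_b)
```

Two people with the same relative strengths but different confidence levels are far apart under `euclidean` and almost identical under `cosine_distance`, so the choice of distance is itself a statement about how much you trust the unlabeled y-axis.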

Just let your imagination go here as a data scientist. How would you use these profiles, or something along these lines, as a method to think about or construct functional teams?

My Meta-thoughts And Analysis Before You Show The Results

My thoughts about this, who I am as a data scientist, my strengths relative to others, and what I contribute to a team have been shaped and influenced by many conversations I’ve had with my collaborators, mentors, and friends.

Final Things for you to Think About

Thought experiment: Generalize this problem by visualizing a team rather than a person.
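As a starting point for this thought experiment, a team can be summarized bucket by bucket from its members’ profiles. The two summaries below are my own suggestions: the per-bucket maximum asks “does anyone on the team cover this?”, while the per-bucket mean asks “how strong is the team on average?”. Each profile is assumed to be a list of scores in the same bucket order.

```python
def team_coverage(profiles: list[list[float]]) -> list[float]:
    """Best score available on the team for each bucket."""
    return [max(column) for column in zip(*profiles)]


def team_average(profiles: list[list[float]]) -> list[float]:
    """Mean score across team members for each bucket."""
    return [sum(column) / len(column) for column in zip(*profiles)]


team = [
    [1, 5, 2],  # member A
    [4, 1, 2],  # member B
]
print(team_coverage(team))  # [4, 5, 2]
print(team_average(team))   # [2.5, 3.0, 2.0]
```

Complementary profiles show up as high coverage with a modest average; a team of near-identical generalists shows coverage and average close together.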

Thought experiment: Some data sets could contain millions of users (though a set of millions of potential data scientists is unlikely!). So how would you think about scaling this process? Is there a difference in what you would do if the numbers were self-reported vs. logged user actions on a website?

Think of a social networking or online dating website to get concrete about this. How would you explore a data set of users and their attributes? If the attributes were self-reported attributes like “how happy are you on a scale of 1-10”, how would you handle the subjectivity of “10”? How would you visualize it, cluster it, represent the distribution over it?

Scaling also suggests that you start by sampling and inspecting the data by eye yourself to gain intuition, and then build an algorithm to automate the process. (This is one place where machine learning comes in.)
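The sampling step can be as small as the sketch below, which draws a reproducible random subset of user IDs to inspect by hand. The function name and the default seed are assumptions; fixing the seed simply means you and a colleague look at the same users.

```python
import random


def sample_for_eyeballing(user_ids, k: int, seed: int = 0) -> list:
    """Pick up to k distinct users at random, reproducibly."""
    pool = list(user_ids)
    rng = random.Random(seed)  # local generator, leaves global state alone
    return rng.sample(pool, min(k, len(pool)))


subset = sample_for_eyeballing(range(1_000_000), k=20)
print(len(subset))  # 20
```

Once looking at a few dozen profiles has suggested what “similar” should mean, that definition is what you would hand to an automated method over the full data set.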

Also, remind yourself that I asked you to question standardization and think about how having un-standardized input might impact all this. Does the importance of standardization change for you when we are dealing with smaller data sets vs millions?

Final Words

I hope this article was helpful for you to understand the importance of visualization and EDA before you select any Data Scientist Profile.

Thanks for reading my article on data scientist profile, and have a good day 🙂


About Author

I am a Data Scientist with a Bachelor’s degree in computer science, specializing in Machine Learning, Artificial Intelligence, and Computer Vision. I am also a freelance blogger, author, and geek with five years of experience. With a background spanning most areas of computer science, I am currently pursuing a Master’s in Applied Computing with a specialization in AI at the University of Windsor, and I work as a freelance content writer and content analyst.

Connect with me on my social media profiles and follow me for a quick virtual cup of coffee.

LinkedIn | Github | Email | Medium | Instagram | Facebook | Portfolio

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

