Common Mistakes Data Engineers Make in Their Learning Path

Saurav Last Updated : 13 Apr, 2021


‘Data scientist’ has famously been called the sexiest job of the 21st century. Now, however, data engineering is giving data science tough competition, and data engineering jobs are becoming more popular than data science jobs.


So once you’ve decided that data engineering is the field for you, understand that becoming a great data engineer is a journey, not a destination. Everyone talks about success stories and what to do to achieve them, but few talk about what not to do and where not to waste time.

It does not come easy. Industry experts often complain about the large gap between the skills of self-taught data engineers and the real-world work of data engineering.

In this article, I will discuss the common mistakes data engineers make in their learning path (I have made some of them myself). Wherever applicable, I have also provided tips to help you avoid these pitfalls on your data engineering journey.

Mistake #1: Not making data fundamentals strong
Mistake #2: Learning outdated/ legacy skills/technologies
Mistake #3: Missing the required depth/ breadth of topics
Mistake #4: Not doing ample hands-on practice
Mistake #5: Unable to visualize and understand the end to end picture

Mistake #1: Not making data fundamentals strong


The first and foremost mistake data engineers make is not building a strong enough foundation. A data engineer is expected to be reasonably good at coding or scripting, and at SQL as well. A data engineer who cannot yet write simple programs but jumps directly into a complex data pipeline will almost certainly end up with messy code.

A data engineer should also be conversant with the basics of databases and relational database management systems. Not understanding the difference between a primary key and a surrogate key will cause problems even when defining a simple data model.
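To make the key distinction concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (customer, customer_sk, email) are hypothetical and chosen only for illustration. A surrogate key is a system-generated identifier with no business meaning, while a natural key comes from the business data itself; either one can be declared as the primary key. In this sketch the surrogate key is the primary key and the natural key is kept unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# customer_sk is a surrogate key: generated by the database, no business meaning.
# email is a natural key: it comes from the business data and must stay unique.
conn.execute(
    """
    CREATE TABLE customer (
        customer_sk INTEGER PRIMARY KEY AUTOINCREMENT,
        email       TEXT NOT NULL UNIQUE,
        name        TEXT NOT NULL
    )
    """
)

insert_sql = "INSERT INTO customer (email, name) VALUES (?, ?)"
conn.execute(insert_sql, ("asha@example.com", "Asha"))
conn.execute(insert_sql, ("ravi@example.com", "Ravi"))

# A business value can change without touching the surrogate key, so any
# table that references customer_sk is unaffected by the update.
conn.execute(
    "UPDATE customer SET email = ? WHERE name = ?",
    ("asha@new.example.com", "Asha"),
)

rows = conn.execute(
    "SELECT customer_sk, email FROM customer ORDER BY customer_sk"
).fetchall()

# The UNIQUE constraint still protects the natural key from duplicates.
try:
    conn.execute(insert_sql, ("ravi@example.com", "Someone Else"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(rows)
print("duplicate natural key rejected:", duplicate_rejected)
```

The design choice shown here (surrogate primary key plus a unique constraint on the natural key) is one common pattern, not the only valid one.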

Mistake #2: Learning outdated/ legacy skills/technologies

The second common mistake is learning outdated technologies in too much depth, for example MapReduce, data warehousing concepts from Kimball or Inmon, or DWBI (Data Warehousing and Business Intelligence) tools that are no longer widely used in the industry. Time is precious, and learners can’t afford to lose focus on their learning priorities. It’s better to look at job descriptions and pick the most commonly requested skills, such as Spark, Kafka, NoSQL, and Flink, rather than spending time and effort on outdated tools and techniques. But do learn how to create data models on NoSQL and data lake systems.

Mistake #3: Missing the required depth/ breadth of topics

I agree that there are too many topics to study. There is Spark or Hive. Then there are Kafka and NoSQL databases like HBase or MongoDB. In stream analytics, we have Spark Streaming or Flink. On the cloud side, we have AWS, Azure, and GCP. So is it mandatory to be thorough in all of these tools and technologies? Absolutely not.

What you need is proficiency in the fundamental concepts behind these data processing tools, e.g. how Spark works internally, how the Kafka pub-sub mechanism works, and how NoSQL differs from SQL and when to use which. Where several tools solve the same problem, it is preferable to pick one of them rather than trying to cover everything.

My personal recommendation is to learn just one programming language (Scala or Python), plus Kafka, Spark, MongoDB or HBase, and finally AWS for the cloud. When you don’t have a choice, it is sometimes better to go with the tools used in your current projects.
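As an example of what "fundamental concepts" means for pub-sub, the toy model below sketches the core idea behind a Kafka-style topic: an append-only log of messages, with each consumer group tracking its own read offset. This is a simplified in-memory illustration, not the Kafka client API; the class ToyTopic and its methods are hypothetical names, and real Kafka adds partitions, replication, and durable storage on top.

```python
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for a single-partition topic (illustration only)."""

    def __init__(self):
        self._log = []                    # append-only list of messages
        self._offsets = defaultdict(int)  # next offset to read, per consumer group

    def publish(self, message):
        """Append a message and return the offset it was stored at."""
        self._log.append(message)
        return len(self._log) - 1

    def poll(self, group, max_messages=10):
        """Return the next unread messages for a group and advance its offset."""
        start = self._offsets[group]
        batch = self._log[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch


topic = ToyTopic()
for event in ["signup", "login", "purchase"]:
    topic.publish(event)

# Each consumer group reads the same log independently, at its own pace.
analytics_first = topic.poll("analytics", max_messages=2)
analytics_second = topic.poll("analytics", max_messages=2)
billing_all = topic.poll("billing")

print(analytics_first, analytics_second, billing_all)
```

The point of the sketch is that publishing does not remove anything: messages stay in the log, and what differs between consumers is only the offset each group has reached.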

Mistake #4: Not doing ample hands-on practice

This is of paramount importance. Many learners cover the theory by reading documentation and watching videos, but few do the hard work of actually writing an end-to-end pipeline themselves. This not only leads to surprises and hiccups on actual projects, but also exposes shallow knowledge when an interviewer starts to grill you on the hands-on part of a project.

The recommendation is to start with a public dataset and a real-time API (e.g. Twitter). Ingest the data into a system such as HDFS or Kafka. Process it using Spark SQL/DS, and Spark Streaming for the real-time API data. Finally, presenting the insights in visual form with a tool like Tableau is the icing on the cake.

Optimizing the performance of your initial pipeline build can further increase your chances of cracking interviews.
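Before reaching for the full stack, it can help to see the shape of such a pipeline in miniature. The sketch below uses only the Python standard library as a stand-in for the real tools: a CSV string plays the role of the ingested dataset, plain functions play the role of the Spark processing step, and a printed report plays the role of the dashboard. The sample data and all function names are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Made-up sample data standing in for a public dataset.
RAW_CSV = """city,temperature_c
Pune,31.0
Delhi,38.5
Pune,29.0
Delhi,not_available
Mumbai,33.0
"""


def ingest(raw_text):
    """Ingestion stage: parse raw text into a list of dict records."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def clean(records):
    """Cleaning stage: keep only rows whose temperature is numeric."""
    cleaned = []
    for record in records:
        try:
            temperature = float(record["temperature_c"])
        except ValueError:
            continue  # skip malformed readings instead of failing the run
        cleaned.append({"city": record["city"], "temperature_c": temperature})
    return cleaned


def aggregate(records):
    """Processing stage: average temperature per city."""
    totals = defaultdict(lambda: [0.0, 0])  # city -> [sum, count]
    for record in records:
        totals[record["city"]][0] += record["temperature_c"]
        totals[record["city"]][1] += 1
    return {city: total / count for city, (total, count) in totals.items()}


def present(averages):
    """Presentation stage: format the results as sorted report lines."""
    return [f"{city}: {avg:.1f} C" for city, avg in sorted(averages.items())]


report = present(aggregate(clean(ingest(RAW_CSV))))
for line in report:
    print(line)
```

Once the stages are clear at this scale, each function can be swapped for its real counterpart (for instance, a Spark job in place of aggregate) without changing the overall structure.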

Mistake #5: Unable to visualize and understand the end to end picture

Finally, a data engineer who focuses only on ingestion, storage, or processing, without knowing the end-to-end pipeline, will not understand how his or her work fits in. Apart from knowing the business impact, data engineers should also understand the technical architecture and system design of the data pipelines and supporting frameworks.

Things like DevOps, platform infrastructure, and networking are often completely ignored by data engineers. These are critical supporting frameworks for understanding the end-to-end picture. At least a basic overview of them is important, even if not in-depth knowledge.

I hope you had a good time reading about the 5 common mistakes data engineers make. Do share your experiences and any questions on the above.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
