During the meetups we conduct, we get a mixed audience. From complete beginners in data science to experts in the field, everyone attacks the problem under a single roof. However, one thing stands out when we interact with these people: a large proportion of them (including some experts) do not have their machines set up and tuned for data science. A lot of them never took the time to set themselves up for their data science journey. As a result, they came across some of the industry resources only by chance.
No one told them which blogs to follow, which newsletters to subscribe to, or where to read industry news. They also never tuned their machines, or did not have the necessary hardware or software. This leads to lower productivity and even frustration in some cases, when they should actually be loving the experience.
Still can't relate to it? Think of visiting a website that takes more than 10 seconds to load. You will likely get bored in that time, open a new tab for another site, or just steer away from what you meant to do. The same thing happens with data science. The longer your code runs, the higher the chances of you steering away from your work!
This is how we came across this unspoken problem people face in the industry, and hence we decided to create a guide to help people get ready for data science.
As mentioned above, this guide is meant for anyone in the data science industry who has not tuned their machine for performance. I think it will be of more use to beginners than to experts, but I have seen experts benefit from these tips equally well.
The first thing to ensure is that you are on the right hardware for data science. There is not much anyone can do if your hardware does not have what you need. Since laptops are the mainstream computing device nowadays, my recommendations below are for laptops. If you use a desktop / iMac, you can go with an even better configuration.
While this choice will ultimately boil down to how much you can shell out for a machine, I would recommend a machine with a quad-core processor, preferably an i7 (in the case of Intel chips). Make sure you check that the processor you choose is quad-core and not dual-core. Lately, it has been really difficult to find good quad-core chips. You can compare the benchmark performance of various chips in your budget using sites like cpuboss.
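If you want a quick sanity check on a machine you already own, here is a minimal sketch using only the Python standard library. The function name `logical_cpu_count` is my own, for illustration. Keep in mind that `os.cpu_count()` reports logical CPUs, so a dual-core chip with hyper-threading can also report 4; treat the number as a hint and confirm the physical core count on the manufacturer's spec sheet.

```python
import os


def logical_cpu_count() -> int:
    """Return the number of logical CPUs the OS reports (at least 1).

    Note: this counts logical CPUs (hardware threads), not physical cores.
    """
    # os.cpu_count() can return None when the count cannot be determined,
    # so fall back to 1 in that case.
    return os.cpu_count() or 1


if __name__ == "__main__":
    print(f"Logical CPUs reported by the OS: {logical_cpu_count()}")
```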
Next, it is always recommended to maximize your RAM to the extent possible. A lot of tools use RAM for computations and you don’t want to run out of RAM while doing them (you eventually will in some cases!).
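To get a rough feel for how much RAM a dataset needs, you can do the arithmetic before you buy. The sketch below is a back-of-the-envelope estimate, not a precise measurement: it assumes every value is stored as an 8-byte number, and the `headroom` factor of 2 is my own assumption to leave room for the copies that tools often make while working. The function names are hypothetical.

```python
def estimated_size_gb(rows: int, cols: int, bytes_per_value: int = 8) -> float:
    """Estimate the in-memory size of a purely numeric table, in GiB."""
    return rows * cols * bytes_per_value / 1024**3


def fits_in_ram(rows: int, cols: int, ram_gb: float,
                headroom: float = 2.0, bytes_per_value: int = 8) -> bool:
    """Return True if the estimated size, times a safety factor, fits in RAM.

    headroom is an assumed multiplier to account for intermediate copies.
    """
    needed = estimated_size_gb(rows, cols, bytes_per_value) * headroom
    return needed <= ram_gb


if __name__ == "__main__":
    # 50 million rows x 20 numeric columns, as an illustrative example.
    size = estimated_size_gb(50_000_000, 20)
    print(f"Estimated size: {size:.2f} GiB")
    print("Fits in 16 GB:", fits_in_ram(50_000_000, 20, ram_gb=16))
    print("Fits in 8 GB:", fits_in_ram(50_000_000, 20, ram_gb=8))
```

Text columns and tool-specific overhead can push the real footprint well above this estimate, so treat the result as a lower bound.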
If your budget allows, you should upgrade to an SSD, as read / write operations on your datasets will take a fraction of the time compared to a conventional SATA hard disk drive. For those who are really serious about learning machine learning and deep learning, I recommend an NVIDIA GPU, so that you can run intense computations using CUDA.
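If you are not sure whether a machine already has a usable NVIDIA setup, one rough check is to see whether the `nvidia-smi` command-line tool (which ships with the NVIDIA driver) is on your PATH. This is only a sketch: the function name is my own, and finding the tool suggests the driver is installed but does not prove that CUDA is configured correctly.

```python
import shutil


def nvidia_smi_available() -> bool:
    """Return True if the nvidia-smi executable can be found on PATH."""
    # shutil.which returns the full path to the executable, or None.
    return shutil.which("nvidia-smi") is not None


if __name__ == "__main__":
    if nvidia_smi_available():
        print("nvidia-smi found: an NVIDIA driver appears to be installed.")
    else:
        print("nvidia-smi not found on PATH.")
```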
Here are a few good recommendations available currently:
A few additional notes:
People might argue that you don't need to invest in such an advanced machine, and that you might be better off pairing a mediocre machine with the cloud. I personally like the accessibility a personal machine provides, and the fact that I can start working anywhere without connecting to the internet.
Once you have selected your machine, the next most important choice would be your OS.
Once you have finalized the OS, make sure you tune it for high performance. For example, in Windows you can disable the transition effects and animations (run sysdm.cpl, go to the Advanced tab -> Performance section -> Settings, and then disable the visual effects), remove unnecessary startup programs, and switch the power plan to Performance.
Here is a list of software you will need apart from the analytics / data science tools (which are discussed in the coming points).
This section will vary depending on the main tool you choose for data mining. If you are yet to choose your main tool, check out this comparison: SAS vs. R vs. Python. If you already have a tool of choice, select the one that applies to you:
Other options include MATLAB / Octave / RapidMiner.
In addition to the software mentioned above, it makes sense to have a tool specifically for data visualization. Such tools help a lot during data exploration and when you present the data story to your customers at the end of every project. Again, there are a lot of options here, and comprehensive coverage of them would be an article in itself. If you just want one, I would recommend QlikView: it is easy to use, has a personal version which is free to download, and handles large data really well. Tableau is another popular choice, which is very intuitive to use, but is not as effective on large datasets in my experience.
If you know JavaScript, you can also use libraries like D3.js.
At times, when the dataset is huge or you are building an application for end users, you will need to use databases, with SQL databases being the most common. You can use MySQL or PostgreSQL. SQLite, which comes bundled with Python as the sqlite3 module, can be an effective option for small applications as well. If you work frequently on huge datasets, you will most likely end up setting up a Hadoop cluster. If you work on real-time streams of data, you will likely need Spark as well.
In addition to these databases, you should also keep a couple of NoSQL databases handy in case you need them. I would recommend MongoDB and Neo4j.
By this time, your machine has almost all the resources you need for your data science journey. Now, let us look at a few other resources you should use along the way.
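To show how little setup SQLite needs, here is a minimal sketch using the `sqlite3` module from the Python standard library. The table name, column names and sample rows are made up for illustration, and the database lives in memory so nothing is written to disk.

```python
import sqlite3


def average_amount_by_city(rows):
    """Load (city, amount) rows into an in-memory SQLite table and
    return the average amount per city, sorted by city name."""
    conn = sqlite3.connect(":memory:")  # temporary database, no file created
    try:
        conn.execute("CREATE TABLE sales (city TEXT, amount REAL)")
        # Parameterized inserts avoid building SQL strings by hand.
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT city, AVG(amount) FROM sales GROUP BY city ORDER BY city"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [("Delhi", 100.0), ("Delhi", 300.0), ("Mumbai", 50.0)]
    print(average_amount_by_city(sample))
```

For a small application you would pass a file path instead of `":memory:"` so the data persists between runs.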
What if you want to work on a dataset which is 400 GB in size? Even the machines I recommended above would fail to load this into memory while using R! It is in scenarios like this that a cloud account comes in handy. You can use either of two cloud services: Amazon Web Services (popularly known as AWS) or Microsoft Azure. Both of them provide highly scalable solutions. The Azure platform in its new avatar is probably more user friendly, but Amazon is still the king of cloud services. You can sign up for accounts on both of them and give them a try.
I am assuming you already have a subscription to Analytics Vidhya articles. If not, please go ahead and subscribe. Apart from Analytics Vidhya, you should follow KDNuggets and DataScienceCentral.
On the newsletter front, I would recommend O’Reilly, DataScienceWeekly and Data Elixir newsletters.
I use my mobile to read a lot of content on the go. Whether I am travelling on the metro or just have 5 minutes to sneak a look at the latest publications, I rely a lot on my mobile. I use a combination of Prismatic and Flipboard to find new content. Together, they bring me all the latest gems published in the industry.
In addition, I have Termux, a fully functional Linux terminal, just in case I need to ssh into a server while on the go. I use it occasionally to play around in a Python shell for quick prototyping as well.
You can look out for meetups happening in your area. They give you an opportunity to interact with like-minded people. Analytics Vidhya conducts its hackathons in several cities in India. DataKind has several meetups as well.
For a start, you can look at this discussion on Analytics Vidhya. Apart from this, KDNuggets maintains a list of open datasets, and UCI provides a lot of datasets for machine learning.
You can also look at data.gov to find data from open sources.
If you have not done so already, sign up for our discussion portals. You will not only interact with other data scientists from the community, but can also participate in the various hackathons we conduct. In addition to this, you should check out Kaggle competitions, and DataTau for Hacker News-style industry news.
You can also find the data science community on Twitter, LinkedIn, GitHub, Facebook and Reddit, and you can subscribe to YouTube channels.
I think you are all set now. You now have a machine with all the necessary software, tuned for performance. You are also part of multiple communities and portals to stay in tune with the industry.
If you have done all of this, you might be wondering what comes next. Stay tuned: we are coming up with a resource finder shortly, which will assume you have done all of this and will point you to the resources you need to master various concepts, tools and techniques in data science.
In the meantime, if you think there are more steps or resources I have missed, please feel free to add them here. I hope this article proves helpful to everyone who works with non-optimized machines and resources, which leads to frustration and loss of productivity.
Hi, I think the information about Tableau is inaccurate. There is an OS X version of Tableau. Thanks
Thanks Edwin for pointing it out. I will update the information accordingly. I was not up to speed. Regards, Kunal
I'm looking to buy a new laptop for analytics. Which laptop should I go for, keeping cost in mind as well?
How much are you willing to spend on a machine? Also, how urgent is the requirement? 6th Generation Intel chips are coming out right now, and you might be able to get more value out of your spend in a couple of months.
Hi Kunal, what would be a good laptop recommendation considering the models/configurations available in the market today?