So You Want To Be A Data Scientist

Wed 24 October 2018

With all the hype around data science as a profession (21st century's sexiest job!), many people are clamoring to take online courses, attend in-person boot camps, and enroll in newly created master's programs at different universities. While some of that hype is indeed deserved (much of what is going on is awesome and fascinating), it is important, especially for newcomers, to understand the challenges and reality of starting a data science career and of the actual work itself. As someone who completed a boot camp and has worked 9 months as a professional data scientist, I thought it would be worthwhile to share some of my experiences and observations to help others see through a lot of that hype.

The hype

Sexy Data Science Fun Times!

By Libertinus via Wikimedia Commons

The reality

Get in there and clean those data toilets, data janitors

Data Science is a highly technical field, but don't let that daunt you

This may be obvious to some, but data scientists are in short supply because the job requires both mathematical and computer programming skills. Granted, you do not need to be a master of either to start learning or even to start working as a professional data scientist. However, if your skills in these areas are weak, you may want to consider brushing up on math and programming before jumping into that online data science course or boot camp. Beyond those two, several related skills also matter when working as a data scientist, such as being comfortable with the command line and with version control systems like Git, and communicating with audiences of varied technical backgrounds.

This may be somewhat disheartening, but even if you didn't do so well in math in high school or college, it's never too late to learn. I've done volunteer work teaching prison inmates community college courses. Many of them suffer from "math trauma" from bad experiences they had in math classes growing up (among other bad experiences). They manage to overcome that trauma through a lot of hard work and emotional maturity, and some of them have left and gone on to do four-year degrees in science and engineering. If they can do that in a challenging environment like prison, I'm sure that whatever roadblocks you have are surmountable as well.

Do side projects, not just course work

In that regard, online courses on Coursera, Udemy, or DataCamp are often a good starting point if you do not know a particular package or field very well. They are just the first step, though. Coming up with a project idea that forces you to learn these things is often the best approach, particularly for programming and statistics. Curious about the latest economic trends? Get statistics from the St. Louis Fed ( https://fred.stlouisfed.org/ ) via its API and plot them yourself. Even with a simple project like that, you can learn several Python packages and technologies (requests, matplotlib, numpy, pandas, APIs, JSON) in a more real-world context than a class. A class can get you started understanding how to use these packages and technologies better than reading the documentation, but nothing beats trying to immediately apply them yourself! Here, I wouldn't worry about coming up with the most creative ideas in the world. Just find something that interests you (economics, finance, beer, sports, movies, etc.) and think of a project you might do to apply your new skills! (Learning how to web scrape is also your friend here.)
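To make the St. Louis Fed idea a bit more concrete, here is a minimal sketch of what such a project might start from. It is not the only way to do it: it assumes you have registered for a free FRED API key and stored it in an environment variable (the name FRED_API_KEY is my own choice), and it uses UNRATE, the unemployment rate series, purely as an example. The fetch uses the standard library's urllib so the parsing part runs without extra installs; requests would work just as well, and pandas is the natural next step once you want to do more than plot.

```python
import json
import os
import urllib.parse
import urllib.request
from datetime import date

FRED_URL = "https://api.stlouisfed.org/fred/series/observations"


def parse_observations(payload):
    """Turn a FRED observations payload into a list of (date, float) pairs.

    FRED reports values as strings and marks missing ones with ".",
    so those rows are skipped here rather than guessed at.
    """
    points = []
    for obs in payload.get("observations", []):
        if obs["value"] == ".":
            continue
        points.append((date.fromisoformat(obs["date"]), float(obs["value"])))
    return points


def fetch_series(series_id, api_key):
    """Download one series as JSON (needs network access and a FRED API key)."""
    query = urllib.parse.urlencode(
        {"series_id": series_id, "api_key": api_key, "file_type": "json"}
    )
    with urllib.request.urlopen(f"{FRED_URL}?{query}", timeout=30) as response:
        return json.load(response)


def plot_series(points, title):
    """Draw a simple line plot; matplotlib is imported here so parsing works without it."""
    import matplotlib.pyplot as plt

    dates, values = zip(*points)
    plt.plot(dates, values)
    plt.title(title)
    plt.xlabel("Date")
    plt.show()


if __name__ == "__main__":
    # FRED_API_KEY is a hypothetical variable name; use whatever you prefer.
    key = os.environ.get("FRED_API_KEY")
    if key:
        plot_series(parse_observations(fetch_series("UNRATE", key)), "UNRATE")
    else:
        print("Set FRED_API_KEY to fetch and plot a series.")
```

Notice that even this toy example already has to deal with missing values (the "." rows), which is a small taste of what real data looks like.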

In doing projects like this, you will also get exposed to the reality of the work as a data scientist. Namely, data is rarely packaged in a neat form for you to immediately throw into a machine learning or deep learning algorithm like on Kaggle. In fact, while modeling receives lots of hype, you will probably spend very little time on the actual modeling part of a project. I can attest to that from my own side projects and projects at work, where I spent more time collecting and cleaning the data than actually modeling. (That's usually the "easy" part!) Without high-quality data, even the coolest, latest and greatest model will produce garbage predictions. As such, much of the work as a data scientist is spent on the following:

  • Finding a way to fill in missing values without introducing bias
  • Verifying the quality of data, accounting for any possible sources of error
  • Dealing with inevitable spelling mistakes, inconsistencies between data sets, etc.
  • Communicating with various stakeholders who help generate or use that data
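As a rough illustration of the first and third items, here is a small sketch using only the Python standard library. Everything in it is hypothetical: the list of known city names, the similarity cutoff, and the choice of median imputation are assumptions for the example, not a recommendation. Median filling in particular can introduce bias of its own, which is why the sketch also returns a flag for every filled value so you never lose track of what was real.

```python
import difflib
import statistics

# Hypothetical reference list of valid values for a "city" column.
KNOWN_CITIES = ["Chicago", "New York", "San Francisco", "Seattle"]


def normalize_city(raw, known=KNOWN_CITIES, cutoff=0.8):
    """Map a messy spelling onto a known value, or return None if nothing is close."""
    # Collapse stray whitespace and normalize capitalization before matching.
    cleaned = " ".join(raw.split()).title()
    matches = difflib.get_close_matches(cleaned, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def impute_median(values):
    """Fill None entries with the median of the observed values.

    Returns the filled list plus a parallel list of booleans marking which
    entries were missing. Raises statistics.StatisticsError if no value
    was observed at all.
    """
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing
```

For example, normalize_city("  chicgo ") resolves to "Chicago", while a value that is not close to anything in the list comes back as None so a human can look at it. That manual review step is where the second and fourth items on the list come in.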

Given all that cleanup work, a more appropriate title for the profession might be "data janitor," not "data scientist" (see "the reality" above). If you have a low tolerance for tedious tasks and unresolved ambiguity, data science is probably not a career for you. A lot of the difficulty in data science lies not only in learning the programming or statistical skills involved but also in picking up what my coworkers and I like to call "the data paranoia". The best way to describe it is never fully trusting the data or your own analyses or models of it. And I'm not really sure how one could truly learn that aspect of data science without going out and looking at real data!

Carefully consider your motivations for wanting to enter data science

Are you just following the hype? Looking to make a lot of money? Or do you actually have an interest in working with data? Given what I discussed above, I think it's important to be really honest with yourself about why you're interested in data science. Data scientists are in short supply because the job requires highly technical skills. But, because of the inevitable tedium involved, the job also requires persistence in the face of adversity. Working with real-world data is hard. You will make mistakes, whether in your analysis or programming. Before you take the leap into a boot camp or degree program, think really hard about whether you're actually interested. The suggestions I mentioned above about taking online classes and working on projects will help you figure that out. In my experience of going through a boot camp and the job search afterwards, observing others who have done the same, and interviewing people, you really need to be persistent and not let minor setbacks get to you.

Get involved in the community (Network, network, network!)

Compared to my previous profession (academia), the data science community is very open and welcoming. Depending on where you are (SF, NYC, Chicago, Seattle, etc.), there are often several Meetups where you can see technical talks about various aspects of data science (there are many!) and meet actual data scientists. Don't forget about that part! It's a great way to get a better sense of what being a data scientist is actually like and to build your community. In my experience, people are often very willing to talk about their work at these events and potentially meet for coffee or lunch or help you out with a project you're working on. I've definitely met with people and been willing to give advice myself. This is often a better way to find out about opportunities and learn about the profession than sitting by yourself reading rambling posts on the Internet. One caveat: don't approach meeting people in a transactional "get me a job" way unless someone directly offers, or you will burn bridges fast!

Stay positive!

I'm almost writing this to myself! It's true, though: switching careers to data science will be a really difficult and challenging road, even if you are truly interested. I definitely did not realize how stressed out I was during that transition until it was over. The fun and interesting things you can do with data, even on your own, made it a worthwhile transition, at least for me. That, more than the money or the hype, is what made me want to make the switch. Ten months after getting my first data science position, I definitely do not regret that choice. However, if you are interested in making that leap or switch yourself, I only ask that you make sure you're aware of the reality of what a data science career actually entails.
