As I mention in my About section, I consider myself a data scientist in training. I worked as a data analyst and did about as much as a non-programmer can do with Excel, including creating macros, writing in-cell formulas, and pushing Pivot Tables to the upper limit of their functionality. I applied to graduate school in order to take my skills to the next level as a programmer, and it’s like drinking from a fire hydrant when it comes to all the skills and knowledge necessary to really claim the mantle of data scientist, especially in an area like New York City with so many talented data scientists.
Having finished my first semester as a full-time graduate student, I was casting about for the way forward. Since moving to NYC in July, I’ve participated in a number of interesting and enlightening forums on data and data science, including Ignite NYC, DataGotham, and the Strata Conference + Hadoop World. Each only opened up new avenues to which I could devote my time and energy. I’m not one to have a path dictated to me, but I’ve come to realize I need to have some important landmarks to guide my progress, otherwise I’m likely to go around in circles.
Today I came across a blog post by Hilary Mason (the high priestess of data science and cheeseburgers) that helped give me those landmarks. Likely to help stave off the flood of emails she must get from people like me, she lays out what it takes in “Getting Started with Data Science.”
What resonated for me were the three fundamentals: math, code, and communication. For the first, that means reviewing the statistics I took as an undergraduate and taking the linear algebra course I’ve been eyeing at Coursera. So did her exhortation to get out and do data science. Too often I’ve let other things get in the way. I’ve signed up for DataKind and Kaggle, but haven’t had the time to start working on my first project. I don’t intend to win any competitions in the case of Kaggle, but the availability of data sets makes me salivate, especially after the disappointment I felt trying to use data from NYC Open Data.
In terms of the second fundamental, I have a foundation in C++ and Java, which is helpful as I learn R through Coursera and slowly teach myself Python. The issue isn’t a lack of resources. There are plenty of books, blogs, websites, and Meetups teaching a myriad of languages, techniques, and approaches. What I experience is a lack of time, forcing me to cultivate a discriminating eye towards the various tasks that make their way onto my several “to do” lists. What is more important than getting your career going in a meaningful way?
As for the third fundamental, communication, I’m first trying to blog more. I honestly don’t think most people care to read my thoughts even if I’m selective about what I share, but I find value in bringing signal to the noise of my musings. If I’m lucky, someone will make an insightful comment putting to shame all my words and help me along in this process. Secondly, I’m trying to get more exposure at Meetups and online forums because in engaging those around me, I break out of the bubble and learn from those doing the work I’d like to do. Lastly, I’m working on creating a website to establish a more permanent beachhead in cyberspace. I plan on posting more of my work there so a wider audience can peruse it and provide comments, insightful and otherwise, and so I have something to show people when they ask, “So, what do you do?”
In the same vein, Hilary Mason also posted “Interview Questions for Data Scientists.” Having been a technical interviewer in my past life, I’m sensitive to this idea of how you demonstrate your competency in the job you’re applying for. Her first question (“What was the last thing you made for fun?”) stood out to me. What would I build for fun? Well, I just happen to have a short list (in no particular order):
- A color randomizer for R. There are a lot of colors in R, some good, some not so good, but it’s a bit of a pain to go through and choose a color when I’m creating graphics. A simple script could randomly produce one for me when I needed one. Even better would be to generate a palette of colors like this that someone could use to create something resembling color coordination. I love making my life easier. If I can also make my work aesthetically pleasing, all the better.
- A Sudoku solver. I love Sudoku. I love the challenge of deducing the underlying logic necessary in solving puzzles, particularly the more advanced puzzles requiring elaborate inferences to solve. After I started grad school, I started thinking about how a computer would solve a Sudoku puzzle. While my approach worked for me, a computer program would have to approach the problem differently. I’m still not certain how to implement it. I know others have and I’m happy to learn how they did it, but I still want to do it myself. The next step is to find out how sparse a puzzle can be while still being solvable. At some point, there must be too much ambiguity in the puzzle for it to be solvable, even by a computer. Where is that point? Does anyone know? I’d like to know.
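For the Sudoku idea, here is a minimal sketch of one way a program might go about it. It uses plain backtracking search, which is only one common approach and not necessarily the one I will end up with. I wrote it in Python since that is the language I am teaching myself, and every name in it (`parse`, `candidates`, `most_constrained`, `solve`, `count_solutions`) is my own invention for illustration. It assumes the given clues do not already contradict each other. The `count_solutions` helper is a rough way to poke at the sparsity question: a puzzle with too few clues shows up as one with more than one solution.

```python
from typing import List, Optional, Set, Tuple

Grid = List[List[int]]  # 9x9 grid of ints; 0 marks an empty cell


def parse(puzzle: str) -> Grid:
    """Turn 81 characters (digits, with 0 or . for blanks) into a grid."""
    cells = [0 if ch in "0." else int(ch) for ch in puzzle if not ch.isspace()]
    if len(cells) != 81:
        raise ValueError("expected 81 cells, got %d" % len(cells))
    return [cells[r * 9:(r + 1) * 9] for r in range(9)]


def candidates(grid: Grid, r: int, c: int) -> Set[int]:
    """Digits not yet used in the row, column, or 3x3 box of cell (r, c)."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return set(range(1, 10)) - used


def most_constrained(grid: Grid) -> Optional[Tuple[int, int, Set[int]]]:
    """Return the empty cell with the fewest candidates, or None if full."""
    best = None
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                opts = candidates(grid, r, c)
                if best is None or len(opts) < len(best[2]):
                    best = (r, c, opts)
    return best


def solve(grid: Grid) -> bool:
    """Fill the grid in place by backtracking; return False if impossible."""
    cell = most_constrained(grid)
    if cell is None:
        return True  # no empty cells left, so the grid is complete
    r, c, opts = cell
    for digit in sorted(opts):
        grid[r][c] = digit
        if solve(grid):
            return True
        grid[r][c] = 0  # undo the guess and try the next digit
    return False


def count_solutions(grid: Grid, limit: int = 2) -> int:
    """Count solutions, stopping at `limit`; the grid is left unchanged."""
    cell = most_constrained(grid)
    if cell is None:
        return 1
    r, c, opts = cell
    total = 0
    for digit in sorted(opts):
        grid[r][c] = digit
        total += count_solutions(grid, limit - total)
        grid[r][c] = 0
        if total >= limit:
            break
    return total


if __name__ == "__main__":
    # A widely reproduced example puzzle, used here only as sample input.
    sample = (
        "530070000" "600195000" "098000060"
        "800060003" "400803001" "700020006"
        "060000280" "000419005" "000080079"
    )
    grid = parse(sample)
    print("unique solution:", count_solutions(grid) == 1)
    if solve(grid):
        for row in grid:
            print("".join(str(d) for d in row))
```

Always guessing at the cell with the fewest remaining candidates is a design choice that keeps the search small. It is also loosely how I work a puzzle by hand, starting wherever the options are most limited.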
I’m not particularly fond of New Year’s resolutions, but these are the landmarks for how I see my professional career progressing in the coming year. So it has been written, so hopefully it will be done.