Kaggle me a Data Scientist

2017-05-12

I feel like calling oneself a data scientist brings to mind someone with a Ph.D. in machine learning or statistics who has spent years honing their craft. This is not me, but in the last week I have learned enough to reach the top 4% of a data science competition on Kaggle that has a prize pool of $1,200,000 USD.

My journey started as a way to upskill while I search for work in Europe. I was already quite familiar with Python, which I mainly used to run little scripts to analyze data: nothing too complex or large, and generally just for plotting scientific results.

To learn data science, I had previously tried a few online courses, but they were either too simple or focused too much on theory. Nothing I found was practical enough to get me up and running quickly and solving real-world problems.

Then I rediscovered Kaggle. For those of you not aware, Kaggle is a data science site that runs competitions on datasets released by companies. The companies have real-world problems that they want the data science community to help solve. Often there is prize money involved, and you can compete with some of the best data scientists in the world. Kaggle was also recently acquired by Google.

A week ago I started competing in my first Kaggle competition. The competition involved predicting real estate sale prices using features from each of the properties. Each property had 58 features (columns), and we had a training data set of 160,000 properties (rows). From this, I had to predict the sale value for approximately 3,000,000 homes.

For my first submission, I placed in the top 66% of the competition. Initially, I was satisfied with the result. Only a very small margin separated me from the leaders: a difference of approximately 0.004 in our scores.

But this sparked an interest in understanding what my submission was missing. What else could I do to improve my score? So I began tinkering with my code and exploring other options. This led me down a rabbit hole of blog posts, YouTube videos, and the documentation of the software libraries that I was using. In the space of a week, I had progressed from the top 66% to the top 4% of the competition.

The immediate feedback from competing with other data scientists across the globe also kept me motivated to keep improving. There is also an extremely helpful community that discusses the latest technologies and the methods they are employing to achieve some of the higher scores.

Currently, I have a few ideas that I would like to test out to see if I can crack into the top 1%. Unfortunately, my 6-year-old laptop has met its performance limit. Most of my models take 3+ hours to run, and I need to get back into job-searching mode. I had considered setting up an AWS instance, but the competition ends in a few days.

Data science is a constantly evolving field that requires continuous development and dedication. There is still a wealth of information that I am looking forward to acquiring. Using Kaggle makes it a lot more entertaining.

My advice to anyone looking to upskill in data analytics: dive straight into a Kaggle competition. There is a wealth of information and a helpful community, and both will make you a better analyst.

If anyone has other resources or opportunities that would help budding data scientists improve their data science game, let us know!